Home » Researchers uncover new LLM attack method

Researchers uncover new LLM attack method

Researchers have discovered a new technique that could allow attackers to bypass the safety measures of large language models (LLMs) and generate harmful or malicious content. The method, called “Bad Likert Judge,” was identified by cybersecurity experts at Palo Alto Networks Unit 42. The Bad Likert Judge technique involves asking the LLM to act as a judge and score the harmfulness of a given response using the Likert scale, a rating system commonly used in surveys.

The attacker then prompts the LLM to generate responses that align with different scores on the scale. The reaction with the highest Likert score may contain the desired harmful content. “The technique asks the target LLM to act as a judge scoring the harmfulness of a given response using the Likert psychometric scale, a rating scale used to measure a respondent’s agreement or disagreement with a statement,” explained the Unit 42 team, consisting of researchers Yongzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao, and Danny Tsechansky.

“It then asks the LLM to generate responses that align with the scales. The highest Likert scale example can potentially contain the harmful content.”

The researchers tested the Bad Likert Judge technique against six state-of-the-art text-generation LLMs from major tech companies, including Amazon Web Services, Google, Meta, Microsoft, OpenAI, and NVIDIA. The tests covered various categories, such as hate speech, harassment, self-harm, sexual content, weapons, illegal activities, malware generation, and system prompt leakage.

Bad Likert Judge attack discovered

The results showed that the Bad Likert Judge method could increase the attack success rate (ASR) by more than 60% on average compared to using plain attack prompts. This highlights the potential for adversaries to manipulate LLMs and bypass their safety guardrails.

By leveraging the LLM’s understanding of harmful content and its ability to evaluate responses, this technique can significantly increase the chances of successfully bypassing the model’s safety guardrails,” the researchers said. To mitigate the risk of such attacks, the researchers emphasize the importance of implementing comprehensive content filtering when deploying LLMs in real-world applications. Their findings indicate that content filters can reduce the ASR by an average of 89.2 percentage points across all tested models.

The discovery of the Bad Likert Judge technique comes shortly after a report from The Guardian revealed another vulnerability in OpenAI’s models. The report showed that the models could be tricked into generating misleading summaries by asking them to summarize web pages containing hidden content. As AI advances rapidly, developing new adversarial techniques underscores the need for robust cybersecurity measures and ongoing efforts to refine safety protocols.

This will help safeguard against the potential misuse of artificial intelligence and ensure its responsible deployment in various applications.

Rashan Dixon

Rashan is a seasoned technology journalist and visionary leader serving as the Editor-in-Chief of DevX.com, a leading online publication focused on software development, programming languages, and emerging technologies. With his deep expertise in the tech industry and her passion for empowering developers, Rashan has transformed DevX.com into a vibrant hub of knowledge and innovation. Reach out to Rashan at [email protected]

About Our Editorial Process

At DevX, we’re dedicated to tech entrepreneurship. Our team closely follows industry shifts, new products, AI breakthroughs, technology trends, and funding announcements. Articles undergo thorough editing to ensure accuracy and clarity, reflecting DevX’s style and supporting entrepreneurs in the tech sphere.

See our full editorial policy.

Researchers uncover new LLM attack method

Bad Likert Judge attack discovered

Rashan Dixon

About Our Editorial Process

How To Get More Storage on iPhone: Buy iCloud+, Free Up Space & Manage Files (2026)

How To Turn Off iPhone: Power Down, Force Shutdown & Scheduled Restart (2026)

How To Close Apps on iPhone: Force Quit Frozen & Background Apps (2026)

How To AirDrop: Send Files Between iPhone, iPad & Mac Instantly (2026)

How To Delete Apps on iPhone: Remove, Offload & Hide Apps (2026)

How To Update Apps on iPhone: App Store, Auto-Updates & Troubleshooting (2026)

How To Take a Screenshot on iPhone: Every Method for Every Model (2026)

AirPods Not Connecting? How To Fix Pairing Issues on iPhone, Android & Mac (2026)

How To Block a Number on iPhone: Calls, Texts & Unknown Callers (2026)

How To Connect AirPods: iPhone, Android, Mac, PC & Laptop Pairing Guide (2026)

How To Factory Reset MacBook Air & MacBook Pro: Erase and Reinstall macOS (2026)

Nvidia Debuts NemoClaw Agent Stack

How To Restart iPhone: Force Restart a Frozen Phone & Soft Reboot (2026)

How To Factory Reset iPhone: Erase All Data or Reset Settings Only (2026)

How To Screen Record on iPhone: Built-In Recorder & Audio Capture (2026)

How To Clear Cache on iPhone: Safari, Apps & System Cache (2026)

How To Free Up Space on Android: 10 Ways to Clear Storage (2026)

How To Update Apps on Android: Auto-Update and Manual Methods (2026)

How To Find Downloads on Android: Locate Downloaded Files Easily (2026)

How To Track a Phone Location for Free With Google Maps (2026)

How To Use Samsung Smart View to Mirror Your Phone to TV (2026)

How To Block Your Number When Calling on Android: Hide Caller ID (2026)

Omidyar Philanthropy Appoints New Leader

Caching Strategies for High-Traffic Web Applications

AI Agent Self-Edits And Learns

Researchers uncover new LLM attack method

Bad Likert Judge attack discovered

Related Posts

About Our Editorial Process