Researchers have discovered a new technique that could allow attackers to bypass the safety measures of large language models (LLMs) and generate harmful or malicious content. The method, called “Bad Likert Judge,” was identified by cybersecurity experts at Palo Alto Networks Unit 42. The Bad Likert Judge technique involves asking the LLM to act as a judge and score the harmfulness of a given response using the Likert scale, a rating system commonly used in surveys.
The attacker then prompts the LLM to generate responses that align with different scores on the scale. The reaction with the highest Likert score may contain the desired harmful content. “The technique asks the target LLM to act as a judge scoring the harmfulness of a given response using the Likert psychometric scale, a rating scale used to measure a respondent’s agreement or disagreement with a statement,” explained the Unit 42 team, consisting of researchers Yongzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao, and Danny Tsechansky.
“It then asks the LLM to generate responses that align with the scales. The highest Likert scale example can potentially contain the harmful content.”
The researchers tested the Bad Likert Judge technique against six state-of-the-art text-generation LLMs from major tech companies, including Amazon Web Services, Google, Meta, Microsoft, OpenAI, and NVIDIA. The tests covered various categories, such as hate speech, harassment, self-harm, sexual content, weapons, illegal activities, malware generation, and system prompt leakage.
Bad Likert Judge attack discovered
The results showed that the Bad Likert Judge method could increase the attack success rate (ASR) by more than 60% on average compared to using plain attack prompts. This highlights the potential for adversaries to manipulate LLMs and bypass their safety guardrails.
By leveraging the LLM’s understanding of harmful content and its ability to evaluate responses, this technique can significantly increase the chances of successfully bypassing the model’s safety guardrails,” the researchers said. To mitigate the risk of such attacks, the researchers emphasize the importance of implementing comprehensive content filtering when deploying LLMs in real-world applications. Their findings indicate that content filters can reduce the ASR by an average of 89.2 percentage points across all tested models.
The discovery of the Bad Likert Judge technique comes shortly after a report from The Guardian revealed another vulnerability in OpenAI’s models. The report showed that the models could be tricked into generating misleading summaries by asking them to summarize web pages containing hidden content. As AI advances rapidly, developing new adversarial techniques underscores the need for robust cybersecurity measures and ongoing efforts to refine safety protocols.
This will help safeguard against the potential misuse of artificial intelligence and ensure its responsible deployment in various applications.
Rashan is a seasoned technology journalist and visionary leader serving as the Editor-in-Chief of DevX.com, a leading online publication focused on software development, programming languages, and emerging technologies. With his deep expertise in the tech industry and her passion for empowering developers, Rashan has transformed DevX.com into a vibrant hub of knowledge and innovation. Reach out to Rashan at [email protected]























