Researchers discover ‘Deceptive Delight’ technique to jailbreak LLMs

Cybersecurity researchers have discovered a new technique that could allow attackers to jailbreak large language models during conversations. The method, called Deceptive Delight, has an average success rate of 64.6% within just three turns of interaction. Deceptive Delight works by gradually leading the AI model to generate unsafe or harmful content.

It exploits the model’s limited attention span, which makes it hard for the AI to fully understand the context when responses blend harmless and potentially dangerous material. Researchers tested eight AI models using 40 unsafe topics in categories like hate, harassment, self-harm, sexual content, violence, and dangerous information. They found the highest success rates for topics related to violence.

Jailbreaking AI with deceptive techniques

The average harmfulness and quality of responses increased significantly from the second to the third turn of conversation. To help protect against Deceptive Delight, experts recommend using strong defenses such as content filtering, carefully designing prompts, and clearly defining the acceptable range of inputs and outputs for the AI models.
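One way to picture the output-side checks described above is a filter that scores each part of a response separately, so that harmless material cannot dilute a harmful passage. The following is a minimal sketch, not the researchers' method: the `Turn` type, the function names, the threshold, and the toy keyword scorer are all assumptions made for illustration, and a real deployment would substitute an actual content-safety classifier.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Turn:
    """One round of a conversation (hypothetical structure for this sketch)."""
    prompt: str
    response: str


def max_segment_score(text: str, score: Callable[[str], float]) -> float:
    """Score each blank-line-separated segment and return the highest score."""
    segments = [s.strip() for s in text.split("\n\n") if s.strip()]
    return max((score(s) for s in segments), default=0.0)


def first_flagged_turn(
    turns: List[Turn],
    score: Callable[[str], float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the research
) -> Optional[int]:
    """Return the 1-based number of the first turn whose response contains
    a segment scoring at or above the threshold, or None if none does."""
    for number, turn in enumerate(turns, start=1):
        if max_segment_score(turn.response, score) >= threshold:
            return number
    return None


def toy_score(text: str) -> float:
    """Placeholder scorer: fraction of words found in a tiny flagged set.
    Stands in for a real safety classifier, which this sketch does not provide."""
    flagged = {"flaggedword"}
    words = text.lower().split()
    return sum(w in flagged for w in words) / len(words) if words else 0.0
```

Scoring segments individually is a design choice aimed at the blending behavior the article describes: a response scored as a whole can fall under the threshold even when one paragraph on its own would not.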

However, the researchers caution that it may not be possible to make language models completely immune to jailbreaking and hallucinations.

Recent studies have also shown that AI models are vulnerable to “package confusion” attacks. Code-generating models sometimes recommend software packages that do not exist. Malicious actors can register those hallucinated names, hide malware in the packages, and upload them to open-source code repositories, where developers who trust the model’s suggestion may install them.

According to the researchers, “The average percentage of hallucinated packages is at least 5.2% for commercial models and 21.7% for open-source models, including a staggering 205,474 unique examples of hallucinated package names.”
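A simple precaution against hallucinated package names is to compare every model-suggested dependency with a list of packages the team has already vetted before anything is installed. The sketch below assumes a Python ecosystem and a hand-maintained allowlist; the function names and the example package names are hypothetical. The name normalization follows the rule published in PEP 503.

```python
import re
from typing import Iterable, List, Set


def normalize(name: str) -> str:
    """Normalize a package name as PEP 503 describes: runs of '-', '_'
    and '.' become a single '-', and the result is lowercased."""
    return re.sub(r"[-_.]+", "-", name).lower()


def unvetted_packages(suggested: Iterable[str], allowlist: Set[str]) -> List[str]:
    """Return the suggested names that are not on the vetted allowlist,
    preserving the order in which they were suggested."""
    vetted = {normalize(name) for name in allowlist}
    return [name for name in suggested if normalize(name) not in vetted]


if __name__ == "__main__":
    # Hypothetical allowlist and model output, for illustration only.
    allowlist = {"requests", "flask-login"}
    suggested = ["requests", "Flask_Login", "totally-made-up-pkg"]
    for name in unvetted_packages(suggested, allowlist):
        print(f"Review before installing: {name}")
```

A name that fails this check is not necessarily malicious, only unreviewed, so the sketch reports it for a human to inspect instead of deciding on its own.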

As AI systems become more widely used, identifying and fixing weaknesses like those revealed by Deceptive Delight will be essential. Cybersecurity professionals will need to stay alert and proactively develop strong protection strategies to keep AI technologies secure.

Rashan Dixon

Rashan is a seasoned technology journalist and visionary leader serving as the Editor-in-Chief of DevX.com, a leading online publication focused on software development, programming languages, and emerging technologies. With deep expertise in the tech industry and a passion for empowering developers, Rashan has transformed DevX.com into a vibrant hub of knowledge and innovation.
