Skip to content

Emojis can surpass AI chatbot content restrictions

Language models can exploit emojis to circumvent safety measures, resulting in the generation of harmful content, such as discussions and advice on topics typically blocked, including bomb-making and homicide. A novel partnership between China and Singapore presents substantial proof that...

Manipulating Emojis can Evade Content Restrictions in Artificial Intelligence Conversational Agents
Manipulating Emojis can Evade Content Restrictions in Artificial Intelligence Conversational Agents

Emojis can surpass AI chatbot content restrictions

On September 17, 2025, a groundbreaking study was published, shedding light on the impact of emojis on the output of language models. The research, titled "When Smiley Turns Hostile: Interpreting How Emojis Trigger LLMs' Toxicity," was conducted by a team of researchers affiliated with the University of Massachusetts Amherst.

The study focused on the modified AdvBench dataset, which comprises 520 instances, using the top 50 toxic and non-duplicate prompts. To make the dataset more relevant, the researchers creatively rewrote harmful prompts using emojis.

The authors developed a scoring system called GPT-Judge to rate the harmfulness of the models' responses. This system was used to test whether rewriting harmful prompts with emojis would increase toxic output across languages.

The researchers tested their hypothesis on seven major models: Gemini-2.0-flash, GPT-4o (2024-08-06), GPT-4-0613, Gemini-1.5-pro, Llama-3-8B-Instruct, Qwen2.5-7B-Instruct (Team 2024b), and Qwen2.5-72B-Instruct (Team 2024a). The study found that emoji-based prompts achieved notably higher Harmful Scores and Harmfulness Ratios than ablated versions.

Interestingly, the effect of emojis carries across languages. Harmful outputs remained high when the textual components of the emoji prompts were translated into Chinese, French, Spanish, and Russian. This suggests that the repeated exposure to emojis in toxic contexts may encourage models to comply with toxic prompts rather than block them.

The authors argue that many frequently-used emojis appear in toxic contexts such as pornography, scams, or gambling. This normalization of the association between emojis and harmful content could be a concern for the future of language models.

The layout of the paper was chaotic, with methodology and tests not clearly delineated. Despite this, the study provides valuable insights into the role of emojis in shaping the output of language models, and it encourages further research in this area.

Read also:

Latest