Study exposes AI’s failure to block offensive text in generated images

CO-EDP, VisionRI | Updated: 26-02-2025 15:43 IST | Created: 26-02-2025 15:43 IST

Artificial intelligence has rapidly transformed image generation, producing highly realistic and visually stunning content. However, alongside this progress, a new and concerning issue has emerged: AI-generated images often contain embedded offensive text, including racial slurs, sexually explicit language, and other harmful content. While much research has gone into preventing inappropriate visual content, the problem of NSFW (Not Safe For Work) text within generated images remains largely unaddressed.

A recent study titled "Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images," conducted by Aditya Kumar, Tom Blanchard, Adam Dziedzic, and Franziska Boenisch from the CISPA Helmholtz Center for Information Security and the University of Toronto, exposes how vulnerable state-of-the-art Diffusion Models (DMs) and Vision Auto-Regressive Models (VARs) are to generating toxic text. The researchers show that while AI-generated images have become impressively sophisticated, they often include harmful embedded text, a serious oversight with significant ethical and social implications.

Why AI models generate offensive text in images

The ability of AI models to incorporate text within images has dramatically improved, enabling the generation of signs, captions, and logos that mimic real-world typography. However, this technological leap comes with unintended consequences. The study reveals that leading AI models, including Stable Diffusion 3 (SD3), DeepFloyd IF, FLUX, and Infinity, often generate offensive or inappropriate text when given benign prompts or seemingly neutral input requests. This issue arises due to underlying biases in the training data, lack of effective filtering mechanisms, and the inability of current models to distinguish harmful text from safe content.

Existing NSFW detection techniques focus primarily on visual elements, such as explicit imagery or violent content. These solutions, which include filtering training datasets, safety fine-tuning, and modifying loss functions, have been somewhat successful in reducing inappropriate image generation. However, they fail when it comes to identifying and preventing offensive text within images, highlighting a critical blind spot in AI safety research. This gap allows AI-generated content to inadvertently include slurs, offensive jokes, or other toxic phrases, making it difficult to control the output effectively.

For instance, the study presents multiple real-world failures where AI-generated images feature offensive language on street signs, billboards, and product labels. Because these words are embedded within images, they bypass standard text filtering mechanisms, making them harder to detect and censor. This raises serious concerns for public-facing AI applications, as businesses, educators, and social media platforms increasingly rely on AI-generated images.

Why current NSFW mitigation methods fail

The researchers tested existing NSFW mitigation strategies, including pre-processing text prompts, post-processing image detection, and safety fine-tuning techniques, but found that they were ineffective in preventing AI models from generating harmful text. One major issue is that optical character recognition (OCR)-based approaches struggle to detect inappropriate words when AI models introduce spelling distortions or creative fonts. This allows offensive words to evade detection while remaining recognizable to human viewers.

A naive approach to filtering NSFW text involves blocking offensive words at the prompt level before an image is generated. However, the study found that even when offensive words were removed from input prompts, AI models still generated toxic text based on learned associations from training data. In some cases, models "hallucinated" offensive words, producing unexpected harmful content that was not explicitly requested.
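
To make the first approach concrete, the sketch below shows roughly what a prompt-level blocklist looks like in practice; the word list and example prompt are placeholder stand-ins rather than anything from the study. As the researchers note, a model can still render toxic text learned from its training data even after the prompt has been scrubbed.

```python
# Minimal sketch of prompt-level filtering, assuming a small illustrative
# blocklist; the entries here are placeholders, not a real word list.
import re

BLOCKLIST = {"slur1", "slur2", "explicitword"}  # placeholder tokens

def sanitize_prompt(prompt: str) -> str:
    """Drop blocklisted words from a prompt before it reaches the model."""
    tokens = re.findall(r"\w+|\W+", prompt)  # split into word / non-word runs
    kept = [t for t in tokens if t.lower() not in BLOCKLIST]
    return "".join(kept)

print(sanitize_prompt("a street sign that says slur1 in bold letters"))
# -> "a street sign that says  in bold letters"
# The study's point: even a scrubbed prompt can still yield toxic rendered text,
# because the associations live in the model's training data, not the prompt.
```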

Another common method, censoring offensive text post-generation, involves using OCR systems to scan AI-generated images and flag harmful language. However, this approach has significant weaknesses. The study found that current OCR-based toxicity filters failed to detect offensive words in up to 50% of cases due to misspellings, stylized fonts, and background noise in images. Even on the FLUX model, where OCR accuracy reached about 90%, roughly 9% of harmful text samples still went undetected, leaving a considerable portion of offensive content unfiltered.
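
The gap between exact matching and distorted spellings is easy to illustrate. The sketch below assumes OCR has already been run on a generated image and returned a string; the blocklist entry and the distorted spelling are placeholders. Fuzzy matching narrows the gap somewhat, but heavier distortion or stylized glyphs can still slip through, which is consistent with the failure rates the study reports.

```python
# Sketch of a post-generation text check on OCR output. Exact matching misses
# distorted spellings; fuzzy matching helps but offers no guarantee.
from difflib import SequenceMatcher

BLOCKLIST = {"offensiveword"}  # placeholder entry

def exact_hit(ocr_text: str) -> bool:
    return any(w in BLOCKLIST for w in ocr_text.lower().split())

def fuzzy_hit(ocr_text: str, threshold: float = 0.8) -> bool:
    return any(
        SequenceMatcher(None, w, bad).ratio() >= threshold
        for w in ocr_text.lower().split()
        for bad in BLOCKLIST
    )

ocr_output = "billboard reading 0ffensivew0rd in a stylized font"
print(exact_hit(ocr_output))  # False: the distorted spelling evades exact matching
print(fuzzy_hit(ocr_output))  # True here, but heavier distortion can still get past
```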

Perhaps the most significant limitation is that existing NSFW mitigation techniques degrade the quality of benign text generation. The researchers observed that when models were fine-tuned to suppress offensive words, they also lost their ability to generate harmless text accurately, causing unintended distortions in otherwise neutral words. This trade-off makes it difficult to maintain high-quality AI-generated text while ensuring safety, further complicating efforts to control this problem.

A new approach: ToxicBench and CLIP-based safety fine-tuning

To address these challenges, the researchers developed ToxicBench, an open-source benchmark specifically designed to evaluate how readily AI models generate NSFW text in images. ToxicBench provides a curated dataset of harmful prompts, new evaluation metrics, and a testing pipeline that systematically measures both NSFW text generation and overall text-image quality.
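
The benchmark's exact interface is not reproduced here, but an evaluation of this kind typically loops over harmful prompts, generates an image for each, reads back the rendered text, and scores both toxicity and text fidelity. The sketch below is a hypothetical outline of such a loop; the generate_image, run_ocr, toxicity_score, and cer helpers are assumptions standing in for whatever components a given setup uses, not ToxicBench's real API.

```python
# Hypothetical outline of a benchmark loop of the kind described: generate an
# image per prompt, read back the rendered text, and score it on two axes.
# generate_image, run_ocr, toxicity_score, and cer are stand-ins, not real APIs.
from dataclasses import dataclass

@dataclass
class Result:
    prompt: str
    rendered_text: str
    toxicity: float         # e.g. from an off-the-shelf text toxicity classifier
    char_error_rate: float  # fidelity of the rendered text vs. the requested text

def evaluate(prompts, generate_image, run_ocr, toxicity_score, cer):
    results = []
    for prompt, requested_text in prompts:
        image = generate_image(prompt)  # diffusion or VAR model under test
        text = run_ocr(image)           # OCR pass over the generated image
        results.append(Result(prompt, text, toxicity_score(text),
                              cer(requested_text, text)))
    # Two axes, as in the study: how often toxic text appears,
    # and whether benign text quality is preserved.
    n = max(len(results), 1)
    nsfw_rate = sum(r.toxicity > 0.5 for r in results) / n
    mean_cer = sum(r.char_error_rate for r in results) / n
    return nsfw_rate, mean_cer
```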

One of the most promising solutions explored in the study is safety fine-tuning of the CLIP text encoder, a key component in many diffusion models. The researchers created a customized dataset that maps NSFW words to syntactically similar but benign alternatives, training the CLIP encoder to replace harmful text with safe variations while preserving image quality. Unlike previous methods, this approach reduces NSFW text generation without significantly affecting benign text output.
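
One way such an objective could be set up is sketched below, using the Hugging Face CLIP text encoder and PyTorch: prompts containing an NSFW word are pulled toward the frozen embedding of the same prompt with a benign substitute, while benign prompts are anchored to their original embeddings. The word pair, model checkpoint, step count, and loss weighting are assumptions for illustration, not the study's actual training recipe.

```python
# Sketch of an embedding-redirection objective for the CLIP text encoder.
# The word pairs, checkpoint, and loss weight below are illustrative assumptions.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

name = "openai/clip-vit-large-patch14"  # text encoder used by many diffusion models
tokenizer = CLIPTokenizer.from_pretrained(name)
student = CLIPTextModel.from_pretrained(name)         # the copy being fine-tuned
teacher = CLIPTextModel.from_pretrained(name).eval()  # frozen reference
for p in teacher.parameters():
    p.requires_grad_(False)

def embed(model, texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return model(**batch).pooler_output

pairs = [("a sign that says badword", "a sign that says birdword")]  # placeholder pair
benign = ["a sign that says welcome"]

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
for _ in range(100):  # illustrative number of steps
    toxic_prompts, safe_prompts = zip(*pairs)
    # Map prompts containing NSFW words onto the embedding of their benign counterpart.
    redirect = torch.nn.functional.mse_loss(
        embed(student, list(toxic_prompts)), embed(teacher, list(safe_prompts))
    )
    # Keep benign prompts where the original encoder put them, to preserve quality.
    preserve = torch.nn.functional.mse_loss(
        embed(student, benign), embed(teacher, benign)
    )
    loss = redirect + preserve  # equal weighting is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```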

Through extensive testing, the researchers found that safety fine-tuning of the text encoder reduced NSFW text occurrences while maintaining high-quality image generation. This suggests that focusing on the text embedding process within AI models may be a more effective way to mitigate toxic text than relying solely on external filtering mechanisms. However, this approach is still in its early stages and does not fully prevent all instances of offensive text generation, indicating the need for further refinements.

Future of AI safety in text-to-image generation

The findings from this study highlight a major challenge in AI safety research: visual generation models are no longer just about images; they also create embedded text, which is much harder to regulate. As AI-generated imagery spreads through creative industries, businesses, and social media platforms, developers and policymakers must take proactive steps to address this issue.

One potential direction is the development of multi-modal safety filters that combine text-based and image-based AI moderation techniques. Instead of treating text and image safety as separate problems, future AI models could use real-time content filtering that assesses both visual and textual elements simultaneously. Additionally, policy regulations may need to evolve to ensure that AI developers implement stronger safety mechanisms before deploying generative models to the public.
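
A minimal version of the multi-modal filtering idea is sketched below: a release gate that screens both the pixels and any text read back out of the image before the result is published. The scoring helpers and thresholds are placeholders for whatever moderation models a deployment actually uses.

```python
# Sketch of a combined gate: check the image and its embedded text together.
# The scoring callables and thresholds are placeholders, not specific products.
from typing import Callable

def release_gate(
    image,
    ocr: Callable,                  # image -> extracted text
    image_nsfw_score: Callable,     # image -> score in [0, 1]
    text_toxicity_score: Callable,  # text  -> score in [0, 1]
    image_threshold: float = 0.5,
    text_threshold: float = 0.5,
) -> bool:
    """Return True only if both the visual and the textual check pass."""
    if image_nsfw_score(image) >= image_threshold:
        return False
    extracted = ocr(image)
    if extracted and text_toxicity_score(extracted) >= text_threshold:
        return False
    return True
```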

The study’s introduction of ToxicBench and CLIP-based safety fine-tuning represents an important first step toward making AI-generated content safer. However, as the capabilities of Diffusion Models and Vision Auto-Regressive Models continue to evolve, ongoing research and refinement will be crucial to preventing AI from unintentionally spreading harmful language.

In the coming years, AI researchers, developers, and policymakers must work together to ensure that generative AI remains a tool for creativity rather than a vehicle for misinformation, bias, or offensive content. If left unchecked, the silent spread of toxic AI-generated text could undermine the very benefits these technologies are designed to offer.

First published in: Devdiscourse