How LLMs Improve Fairness and Visual Diversity in AI Image Generation Through Prompting
A study by Hof University shows that using large language models to revise prompts can significantly reduce bias in AI-generated images. This approach enhances diversity and image quality without modifying the image generators themselves.

In a study led by René Peinl at Hof University of Applied Sciences, which also draws on related work from TU Darmstadt and the Australian Academy of the Humanities, researchers have demonstrated that large language models (LLMs) like ChatGPT, Claude, Mistral, and Llama can play a critical role in reducing biases in AI-generated images. Rather than altering the image generators themselves, the study proposes using these LLMs to modify user prompts before they are fed into diffusion models like Stable Diffusion 3.5, Dreamshaper XL, and Flux. The findings indicate that this method significantly enhances the diversity of outputs across dimensions such as gender, ethnicity, age, and disability, providing a practical pathway for more ethical and accurate AI imagery.
Rewriting Prompts to Rebalance Representation
The core concept is deceptively simple: when AI systems are asked to generate images from neutral prompts like “a doctor” or “a happy couple,” they tend to fall back on stereotypical defaults. These include overwhelmingly white, male, or able-bodied characters, regardless of actual population demographics. Peinl defines this kind of bias as an unfair deviation from statistical norms given a neutral prompt. The study challenges previous research that treats representational equality as an absolute goal across all contexts, emphasizing instead that demographic baselines must be applied with nuance: women, for instance, make up about 50% of doctors in Germany but only around 3% of professional firefighters.
By inserting LLMs into the workflow to revise user prompts with additional demographic context and visual detail, the researchers were able to double the representation of people of color and substantially improve gender balance. For example, a prompt for “a firefighter” that previously produced only men began returning images with mixed gender representation, while prompts for “a poor person” or “a model” became more inclusive of different body types, ages, and ethnicities.
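The wiring for such a pipeline is straightforward. The sketch below is only an illustration of the general idea, not the study's exact setup: it assumes the OpenAI Python client and a Hugging Face diffusers pipeline as stand-ins for the LLMs and image generators evaluated, and the rewriting instructions and model IDs are placeholder choices.

```python
# Minimal sketch of the prompt-revision workflow: an LLM rewrites the user's
# prompt with demographic and visual detail before the diffusion model runs.
# System-prompt wording, model names, and model IDs are illustrative assumptions.
import torch
from openai import OpenAI
from diffusers import DiffusionPipeline

REWRITE_INSTRUCTIONS = (
    "Rewrite the user's image prompt so that the depicted people vary plausibly "
    "in gender, ethnicity, age, and disability, while preserving the user's "
    "original intent and any named characters. Return only the rewritten prompt."
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rewrite_prompt(user_prompt: str) -> str:
    """Ask an LLM to enrich a neutral prompt before image generation."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed stand-in for the LLMs compared in the study
        messages=[
            {"role": "system", "content": REWRITE_INSTRUCTIONS},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content.strip()

# Any text-to-image pipeline can sit downstream; this model ID is an assumption.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16
).to("cuda")

revised = rewrite_prompt("a photo of a firefighter")
image = pipe(prompt=revised).images[0]
image.save("firefighter.png")
```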
LLMs Add More Than Just Fairness
An unexpected and welcome finding from the study is that diversity-enhanced prompts didn’t just improve representation; they often led to more visually compelling images. When prompted with "an African house," the original outputs were repetitive and stereotypical, often limited to brown huts with minimal variation. Once LLMs like ChatGPT modified the prompts to include regional specificity or cultural richness, the resulting images showcased a broader spectrum of architectural styles and colors. This added a layer of creativity and cultural depth that the original prompts lacked.
ChatGPT and Claude were particularly effective at generating nuanced and visually coherent prompts, often enhancing user intent with clear, descriptive language. In contrast, Llama and Mistral sometimes struggled, adding vague references to diversity without producing clear visual distinctions. However, even their outputs improved representation when compared to the unmodified prompts.
Challenges of Specificity: When Good Prompts Go Wrong
Despite the successes, the study uncovered some important limitations. The LLMs occasionally failed to respect the user's original intent, particularly in detailed or adversarial prompts. For instance, in responding to prompts like “Doctor Strange holding a magical relic,” some LLMs removed key character traits or inserted group diversity where an individual was intended. In one case, ChatGPT’s modified prompt led to images of people with bird heads because the language became too abstract. Similarly, Mistral once produced four female kings in response to a prompt for “the king of England,” blending diversity goals with historical inaccuracy.
This reveals a tension between ethical prompting and instruction fidelity. When LLMs are asked to avoid bias, they sometimes overcorrect or dilute the user's creative vision. This is especially problematic for prompts tied to history, religion, or specific fictional characters, where accuracy is as important as fairness.
Speed Bumps and Technical Constraints
The method, while powerful, isn’t without practical drawbacks. Incorporating an LLM to revise prompts adds latency: with faster diffusion models like Dreamshaper XL, image generation times grew by up to 143%. While this lag is less pronounced with slower models such as Flux, it remains a barrier to real-time applications. Peinl suggests training smaller LLMs like Phi-4 or Mistral Small on high-quality outputs from ChatGPT and Claude to preserve effectiveness while reducing processing time. However, early tests show these lightweight models lack the bias sensitivity of their larger counterparts, often resorting to generic "diverse" wording without meaningful visual outcomes.
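One way to gauge that overhead on a given setup is simply to time generation with and without the rewriting step. The short harness below reuses the hypothetical rewrite_prompt() and pipe objects from the earlier sketch; the measured percentage will naturally vary with the LLM, the diffusion model, and the hardware.

```python
# Rough timing harness for the latency overhead discussed above.
# Assumes rewrite_prompt() and pipe exist as in the earlier illustrative sketch.
import time

def timed_generation(prompt: str, use_llm: bool) -> float:
    """Return wall-clock seconds for one image, with or without prompt rewriting."""
    start = time.perf_counter()
    final_prompt = rewrite_prompt(prompt) if use_llm else prompt
    pipe(prompt=final_prompt)
    return time.perf_counter() - start

baseline = timed_generation("a doctor", use_llm=False)
with_llm = timed_generation("a doctor", use_llm=True)
print(f"overhead: {(with_llm / baseline - 1) * 100:.0f}%")
```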
Token limits on models like CLIP also caused issues when LLMs produced overly long prompts. In such cases, key parts of the prompt were truncated, leading to incomplete or flawed image results. While this was mitigated by manual intervention during the study, it would require automation for real-world scalability.
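The CLIP text encoders used by many diffusion models accept at most 77 tokens, so anything beyond that window is silently dropped. One possible automated safeguard, sketched below, is to count tokens with the CLIP tokenizer and ask the LLM to condense its own output when the prompt overflows; this shorten-and-retry loop is an illustrative assumption, not the mitigation used in the study.

```python
# Guard against CLIP's 77-token window truncating an overly long rewritten prompt.
# Assumes rewrite_prompt() from the earlier sketch; the retry strategy is illustrative.
from transformers import CLIPTokenizer

clip_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
MAX_CLIP_TOKENS = 77  # includes the start and end tokens

def fits_clip(prompt: str) -> bool:
    """Return True if the prompt fits within CLIP's text-encoder window."""
    return len(clip_tokenizer(prompt).input_ids) <= MAX_CLIP_TOKENS

def rewrite_within_limit(user_prompt: str, max_tries: int = 3) -> str:
    """Rewrite the prompt, asking the LLM to condense it if it is too long."""
    revised = rewrite_prompt(user_prompt)
    for _ in range(max_tries):
        if fits_clip(revised):
            return revised
        revised = rewrite_prompt(
            "Shorten this image prompt to under 60 words, keeping the "
            f"demographic and visual detail: {revised}"
        )
    return revised  # fall back to the last attempt even if still too long
```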
A Vision for Ethical and Engaging Generative AI
Peinl concludes that LLM-driven prompt engineering offers a low-cost, effective, and ethically responsible way to reduce bias in AI-generated imagery. While the approach is most successful with under-specified prompts and older models like SDXL, it remains useful even for more advanced systems like Flux and SD35, which still show bias in raw outputs. In many cases, the LLMs created more thoughtful, varied visual interpretations that users might find more satisfying than the repetitive images typically produced by default settings.
Going forward, Peinl suggests integrating region-specific demographic defaults, enabling users to customize system prompts, and even adding reflexive features where image models signal to users when outputs may still reflect bias. Rather than lecturing users about fairness, the AI could gently acknowledge its limitations, enhancing trust and transparency. As AI-generated visuals become increasingly common in media, design, and communication, this research offers a timely, practical solution that blends technological sophistication with ethical foresight. It shows that better language leads to better pictures, and that the pathway to fairer AI may start not in the code, but in the prompt.
First published in: Devdiscourse