AI visualization bias: Generative models reinforce gender, age and racial stereotypes in outputs
In the age of generative AI, the promise of democratized creativity comes with a caveat: bias coded into the very systems that shape our perceptions. A new study has cast a spotlight on the stereotyping tendencies of popular AI systems, revealing how ChatGPT and DALL·E collectively reinforce outdated societal norms through visual representations.
Conducted by Dirk H.R. Spennemann of Charles Sturt University, the study, titled “Who Is to Blame for the Bias in Visualizations, ChatGPT or DALL-E?” and published in the journal AI, investigates how large language models and text-to-image generators contribute to the propagation of gender, age, and ethnic stereotypes, often without user intent or awareness.
How does bias enter the AI visualization pipeline?
The study explored a dataset of 770 AI-generated images representing librarians and curators, professions that are traditionally gender-neutral but burdened with entrenched stereotypes. ChatGPT-4o was used to create descriptive prompts, which were then fed to DALL·E for image generation, allowing the researchers to examine how each system introduces or amplifies bias. Importantly, user prompts were kept deliberately vague to avoid seeding bias from the outset.
Bias was observed at two critical stages. First, ChatGPT-4o sometimes added inferred gender or age cues even when the user prompt was neutral. For instance, mentioning attire like a “blazer” disproportionately led to male representations. Second, DALL·E’s interpretation of both specific and neutral prompts often defaulted to generating images of young, white, male professionals, even in cases where this contradicted real-world demographics.
The flow of information revealed a pipeline of prejudice: user prompt → ChatGPT-4o-generated detailed prompt → DALL·E-rendered image. Each layer had the potential to shape visual outcomes in ways that deviated from societal or statistical realities.
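To make that two-stage flow concrete, here is a minimal sketch of such a pipeline using the OpenAI Python SDK. The model names, prompt wording and parameters are illustrative assumptions, not the study's exact protocol.

```python
# Minimal sketch of the two-stage visualization pipeline: a vague user prompt is
# expanded by the language model, then rendered by the image model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stage 1: ChatGPT-4o expands a deliberately vague user prompt into a detailed
# descriptive prompt. Any gender, age or ethnicity cues added here propagate downstream.
user_prompt = "Write a short visual description of a librarian at work, for an image generator."
chat = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": user_prompt}],
)
detailed_prompt = chat.choices[0].message.content

# Stage 2: DALL·E renders the expanded prompt; it may add further defaults of its own.
image = client.images.generate(
    model="dall-e-3",
    prompt=detailed_prompt,
    size="1024x1024",
    n=1,
)
print(image.data[0].url)
```

Comparing the original user prompt, the expanded prompt and the rendered image makes it possible to attribute any added demographic cues to a specific stage.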
What do the numbers reveal about demographic distortion?
The study found that 84.9% of gender-neutral prompts ultimately resulted in male images, even though real-world data from countries like the U.S., U.K., and Australia show librarians and curators are predominantly female (ranging from 70% to over 80% female representation). Furthermore, 67.3% of the DALL·E-rendered images portrayed young individuals, even though the median age in these professions typically falls in the late 40s to early 50s.
The issue becomes more acute when prompts make no mention of ethnicity. In such cases, DALL·E defaulted to Caucasian representations 99% of the time. This is especially troubling given the increasing push for diversity in both media representation and professional spaces.
In a supplementary dataset of 200 AI-generated images of women in cultural industries, the pattern was consistent: 92% of ChatGPT-4o prompts lacked an age specification, yet DALL·E rendered 88% of the women as appearing in their 20s. Older women were virtually invisible in the imagery, magnifying a youth-centric bias that erases the contributions of experienced professionals.
Who is accountable for AI-generated stereotyping?
The research emphasizes that both ChatGPT and DALL·E play a role in reinforcing stereotypes, especially when working in tandem. DALL·E misrepresents even specific cues, for instance rendering young faces when asked for middle-aged ones, while ChatGPT-4o introduces its own assumptions by embedding age or gender references in prompts that were supposed to be neutral.
The issue is compounded by the training data used for both systems. DALL·E was trained on over 250 million image-caption pairs, primarily scraped from the internet - a realm rife with stereotypical imagery and Western-centric content. Despite OpenAI’s implementation of red teaming and moderation protocols, the systems continue to mirror the biases present in society and the internet archives that feed them.
OpenAI has acknowledged some of these limitations. According to its DALL·E 2 and DALL·E 3 system cards, the models tend to default to producing white-passing, Western-looking individuals unless otherwise directed. However, the new study goes further, demonstrating that even with vague prompts, these AI systems default to a homogenized and stereotyped vision of professionalism.
The study warns that such algorithmic bias can shape public perceptions and limit inclusivity in visual media. Visualizations produced by generative AI may subconsciously influence hiring, perception of competence, and representation in industries already grappling with underrepresentation of certain demographic groups.
Implications for AI users and developers
This research highlights a fundamental flaw in the current generative AI paradigm: the illusion of neutrality. Users often believe that unbiased prompts will yield unbiased results, but this study debunks that assumption. Even when human input is consciously neutral, the AI fills in the gaps with default assumptions rooted in the historical and societal biases embedded in its training data.
To mitigate these biases, the study recommends that users adopt highly specific prompts, explicitly indicating desired diversity in gender, age, and ethnicity. Additionally, AI developers should consider design changes that offer users demographic customization options by default, or even dual visualizations for comparison, particularly in professional settings.
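As a simple illustration of that recommendation, the contrast below pairs a vague prompt with one that states the desired demographics explicitly; the wording is an assumption for illustration, not taken from the paper.

```python
# Vague prompt: the models fill in gender, age and ethnicity with their defaults.
vague_prompt = "A portrait of a museum curator at work."

# Specific prompt: the desired demographics are stated explicitly, leaving less
# room for stereotyped defaults.
explicit_prompt = (
    "A portrait of a museum curator at work: a woman in her mid-50s "
    "of South Asian descent, photographed in a natural documentary style."
)
```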
More importantly, the paper calls for a revamp of AI training and validation methodologies. During model development, datasets must be curated with awareness of inherent biases, and evaluation should include large-scale testing to uncover systemic stereotyping tendencies that might be missed in small-sample red-teaming exercises.
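One way to read that evaluation recommendation is as an audit loop: generate many images from a single neutral prompt, label the apparent demographics, and compare the tallies against a reference distribution. The sketch below assumes a hypothetical `label_demographics` classifier (human annotation or a vision model); it is not an existing API.

```python
from collections import Counter

def audit_prompt(generate_image, label_demographics, prompt, n=500):
    """Tally apparent gender/age/ethnicity labels over n generations of one prompt."""
    tallies = Counter()
    for _ in range(n):
        image = generate_image(prompt)       # e.g. a DALL·E call
        labels = label_demographics(image)   # e.g. ("female", "50s", "South Asian")
        tallies[labels] += 1
    return tallies

# A heavily skewed tally for a gender-neutral prompt (say, more than 80% of one
# gender) would flag the kind of systemic stereotyping that small-sample
# red-teaming exercises can miss.
```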
- FIRST PUBLISHED IN: Devdiscourse