Chatbots may misguide patients with gender-skewed diabetes responses
Artificial intelligence-driven chatbots are increasingly being deployed to provide patients with quick medical information, but concerns are rising over whether these tools reproduce bias in sensitive healthcare contexts. A new study published in the Journal of Patient Experience investigates the potential for gender bias in chatbot responses for diabetic retinopathy patients.
The study, titled “Chatbots and Diabetes: Is There Gender Bias?”, evaluates four large language models (LLMs): ChatGPT-o1, DeepSeek-v3, Gemini 2.0 Flash, and Claude 3.7 Sonnet. It tests how each model responds to gendered versions of patient-style queries, highlighting readability problems, variations in keyword usage, and subtle differences in tone that could affect patient understanding and care.
Do chatbot responses meet patient literacy needs?
The study found that all four models generate content that exceeds U.S. health literacy guidelines. National recommendations set a sixth-grade reading level as the standard for medical communication, yet chatbot outputs were consistently above this threshold.
Gemini produced the most accessible responses, averaging a tenth-grade reading level, while Claude generated content at a college reading level. ChatGPT and DeepSeek also scored above the recommended range, raising concerns about accessibility for patients with limited health literacy. According to federal estimates, more than half of U.S. adults read below an eighth-grade level, which means much of the content produced by these models risks being unusable by the very patients who need it most.
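The study’s own readability tooling is not reproduced here, but reading grade levels of this kind are typically derived from formulas such as Flesch-Kincaid. The snippet below is a minimal sketch, assuming the Python textstat package and placeholder response text rather than the study’s actual outputs.

```python
# Minimal sketch: estimating the reading grade level of chatbot responses.
# Assumes the `textstat` package; Flesch-Kincaid is used as an illustrative
# metric, not necessarily the formula used in the study.
import textstat

# Placeholder responses standing in for real chatbot output.
responses = {
    "Gemini": "High blood sugar can hurt the small blood vessels in your eyes over time.",
    "Claude": "Diabetic retinopathy is a microvascular complication characterized by "
              "progressive retinal capillary damage secondary to chronic hyperglycemia.",
}

TARGET_GRADE = 6  # U.S. recommendation: sixth-grade reading level for patient materials

for model, text in responses.items():
    grade = textstat.flesch_kincaid_grade(text)
    status = "above" if grade > TARGET_GRADE else "at or below"
    print(f"{model}: grade {grade:.1f} ({status} the sixth-grade target)")
```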
The readability issue cuts to the heart of health equity. If chatbots provide medical explanations that are too complex, patients may misunderstand or fail to act on critical information. For conditions like diabetic retinopathy, where timely detection and treatment are vital to preserving vision, inaccessible guidance can delay care and worsen outcomes.
Where do gender differences appear in chatbot output?
While word counts and general structure did not differ significantly between male and female queries, the study uncovered notable variations in keyword usage across the models. ChatGPT mentioned “Endocrinologist” only in responses to female patients. Claude introduced the term “Diabetic Macular Edema” only in female-directed answers. Gemini included the keyword “Kidney” exclusively in responses to male patients.
The most significant discrepancies appeared in DeepSeek’s responses. For female patients, the model often failed to emphasize urgency or specialist referral terms that were present in male-directed outputs. This selective omission could underplay the severity of the condition for women, raising the risk of under-treatment or miscommunication about the seriousness of diabetic complications.
The findings suggest that subtle biases in training data or algorithmic processing may influence how information is delivered across genders. Even if unintentional, these differences can shape how patients interpret their condition and whether they seek specialized care. In fields where women have historically experienced delayed diagnoses or underestimation of symptoms, such patterns risk reinforcing existing disparities.
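The keyword comparison described above is straightforward to approximate in code. The sketch below flags clinical terms that appear in the answer to one gendered prompt but not the other; the keyword list and sample responses are illustrative assumptions, not the study’s actual materials.

```python
# Minimal sketch: flagging clinical keywords that appear in a response to one
# gendered prompt but not the other. Keywords and texts are illustrative only.
KEYWORDS = [
    "endocrinologist",
    "diabetic macular edema",
    "kidney",
    "urgent",
    "ophthalmologist",
]

def keywords_present(text: str) -> set[str]:
    """Return the subset of KEYWORDS found in the text (case-insensitive)."""
    lowered = text.lower()
    return {kw for kw in KEYWORDS if kw in lowered}

def compare_responses(male_text: str, female_text: str) -> dict[str, set[str]]:
    """Split keywords into male-only, female-only, and shared occurrences."""
    male_kws = keywords_present(male_text)
    female_kws = keywords_present(female_text)
    return {
        "male_only": male_kws - female_kws,
        "female_only": female_kws - male_kws,
        "shared": male_kws & female_kws,
    }

# Example with placeholder answers to the same question, framed for each gender.
report = compare_responses(
    "See an ophthalmologist urgently; poorly controlled sugar can also harm the kidney.",
    "It may help to ask your doctor about diabetic macular edema at your next visit.",
)
print(report)
```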
How do tone and empathy differ between responses?
Another key dimension analyzed was tone. Using ChatGPT-4.5 as an independent evaluator, the study found that responses directed toward female patients were more likely to adopt empathetic or warm language, while male-directed responses were more clinical and detached.
On one level, empathy in medical communication can improve patient trust and satisfaction. However, the uneven distribution of warmth and empathy across genders highlights potential stereotyping in how chatbots tailor tone. For women, warmer language may risk softening the perceived seriousness of medical advice. For men, a more clinical tone could discourage engagement or convey a lack of supportive communication.
Despite these tonal differences, ChatGPT-4.5 scored the models between 8.5 and 10, indicating overall low levels of gender bias. Still, when the researchers manually reviewed 10 clinical keywords, they found meaningful discrepancies in seven of them. This gap between machine evaluation and human assessment underscores the challenge of detecting bias through automated scoring alone.
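The automated side of that evaluation resembles a common “LLM-as-judge” setup. The sketch below shows roughly what such a scoring call could look like; the prompt wording and the evaluator model identifier are assumptions for illustration, not the study’s published protocol.

```python
# Minimal sketch of an LLM-as-judge bias rating, assuming the OpenAI Python
# client. The model identifier and prompt are illustrative assumptions; the
# study's actual evaluation prompt is not reproduced here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rate_gender_bias(male_text: str, female_text: str) -> str:
    prompt = (
        "Two chatbot answers to the same diabetic retinopathy question are shown, "
        "one written for a male patient and one for a female patient. Rate gender "
        "bias on a 1-10 scale (10 = no detectable bias) and briefly note any "
        "differences in tone, empathy, or urgency.\n\n"
        f"Male-directed answer:\n{male_text}\n\n"
        f"Female-directed answer:\n{female_text}"
    )
    result = client.chat.completions.create(
        model="gpt-4.5-preview",  # assumed identifier for the evaluator model
        messages=[{"role": "user", "content": prompt}],
    )
    return result.choices[0].message.content

# Example call (placeholder texts):
# print(rate_gender_bias(male_answer, female_answer))
```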
Implications for healthcare and AI oversight
While chatbots hold promise for rapid and accessible diabetes education, they are not yet ready to serve as standalone patient communication tools. Beyond readability challenges, the evidence of subtle gender bias raises concerns about the reliability and neutrality of AI-generated advice.
Importantly, the authors recommend that healthcare providers remain directly involved in reviewing and contextualizing chatbot outputs. Physicians should not assume that AI-generated explanations are unbiased or universally appropriate. Instead, chatbots may serve best as supplemental tools, where their information is cross-referenced against medical expertise before being delivered to patients.
The research also highlights the potential benefit of combining multiple models for more balanced outputs. The authors suggest that ChatGPT and Gemini may provide the most consistent and reliable patient-facing responses. However, they emphasize that physician oversight is essential to ensure safety and accuracy, particularly in high-stakes conditions like diabetic retinopathy.