Do language barriers undermine AI’s role in global health communication?


As digital health tools spread worldwide and people increasingly query them in their own languages, the ability of Large Language Models (LLMs) to provide accurate and consistent health information is under scrutiny. The study "Do LLMs Provide Consistent Answers to Health-Related Questions Across Languages?" by Ipek Baris Schlicht, Zhixue Zhao, Burcu Sayin, Lucie Flek, and Paolo Rosso, submitted to arXiv (2025), examines how LLMs respond to health-related questions in English, German, Turkish, and Chinese. The research highlights inconsistencies in LLM outputs that could propagate misinformation and amplify inequities in healthcare communication.

The need for multilingual consistency in healthcare

Access to reliable health information is critical for public health, but the quality of such information often varies significantly across languages. Disparities in national healthcare policies, cultural nuances, and the availability of accurate data in non-English languages exacerbate the issue. The study identifies two core questions: How do LLM responses differ when the same health-related query is posed in English versus other languages? And are there specific disease-related topics where these inconsistencies are more pronounced?

By expanding the HealthFC dataset to include Turkish and Chinese alongside English and German, the researchers investigate the consistency of LLM responses, using a novel evaluation framework to parse and compare long-form answers across languages.

Methodology: Parsing and evaluating consistency

The study employs a prompt-based evaluation workflow to dissect LLM-generated responses into distinct informational units. The framework breaks responses down into key sections such as the Answer Summary (AS), which directly addresses the query, and Health Benefits and Outcomes (HBO), which details the positive effects or results of a medical intervention.

Other sections include Clinical Guidelines and Evidence (CGE) for established recommendations, Individual Considerations/Caveats (ICC) for personalized advice, and Public Health/Professional Advice (PHPA) to emphasize consultation with healthcare professionals. By segmenting responses, the researchers were able to identify not just outright contradictions but also nuanced disparities in depth and relevance across languages.

Each segment of a non-English answer was then labeled as consistent, partially consistent, contradictory, or irrelevant relative to the corresponding segment of the English answer. This approach provided insights into both glaring inconsistencies and subtle differences that could influence user trust in LLM-generated responses.
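To make the labeling scheme concrete, here is a minimal Python sketch of how such per-section judgments could be recorded and tallied. The section abbreviations and the four labels come from the description above; the class names, the `tally_by_section` helper, and the toy judgments are assumptions made for illustration and are not taken from the study's code or results.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Section(Enum):
    AS = "Answer Summary"
    HBO = "Health Benefits and Outcomes"
    CGE = "Clinical Guidelines and Evidence"
    ICC = "Individual Considerations/Caveats"
    PHPA = "Public Health/Professional Advice"


class Label(Enum):
    CONSISTENT = "consistent"
    PARTIALLY_CONSISTENT = "partially consistent"
    CONTRADICTORY = "contradictory"
    IRRELEVANT = "irrelevant"


@dataclass(frozen=True)
class Judgment:
    """One comparison of a non-English answer section against English."""
    question_id: str
    language: str  # language being compared to English, e.g. "de", "tr", "zh"
    section: Section
    label: Label


def tally_by_section(judgments, language):
    """Count how often each label occurs per section for one language."""
    counts = {section: Counter() for section in Section}
    for j in judgments:
        if j.language == language:
            counts[j.section][j.label] += 1
    return counts


# Made-up judgments for illustration only; not results from the study.
toy = [
    Judgment("q1", "tr", Section.AS, Label.CONSISTENT),
    Judgment("q1", "tr", Section.CGE, Label.PARTIALLY_CONSISTENT),
    Judgment("q2", "tr", Section.CGE, Label.CONTRADICTORY),
    Judgment("q1", "zh", Section.AS, Label.CONSISTENT),
]

tr_counts = tally_by_section(toy, "tr")
for section, counter in tr_counts.items():
    if counter:  # skip sections with no judgments in the toy data
        print(section.name, {label.value: n for label, n in counter.items()})
```

Keeping the section and the label as separate fields is what lets a comparison distinguish an outright contradiction in the clinical guidance from a thinner but compatible summary.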

Key findings: Patterns of inconsistency

Varying Depth and Detail Across Languages

English responses generally provided more detailed and comprehensive information, frequently including references to research studies, statistical details, and precise guidelines. By contrast, Turkish and Chinese responses often lacked these elements, offering less depth or omitting critical details altogether. While some omissions were due to cultural adaptations, others highlighted gaps in dataset quality and training for non-English languages.

Disease-Specific Discrepancies

Inconsistencies were particularly pronounced for diseases of the circulatory and digestive systems, as well as endocrine and metabolic disorders. These variations were most evident in sections addressing clinical guidelines and health outcomes. For example, responses in one language might offer only general advice while responses in another provide specific recommendations, leading to disparities in the quality of information shared.
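A breakdown of this kind can be sketched as a simple rate per disease category and answer section. The snippet below is an illustration under stated assumptions: the `inconsistency_rate` function, the choice to count every label other than "consistent" as an inconsistency, and the toy records are all hypothetical and do not reproduce the study's figures.

```python
from collections import defaultdict


def inconsistency_rate(records):
    """Return {(category, section): share of labels other than 'consistent'}.

    Assumption: partially consistent, contradictory, and irrelevant
    labels all count as inconsistent for this rough breakdown.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for category, section, label in records:
        key = (category, section)
        totals[key] += 1
        if label != "consistent":
            flagged[key] += 1
    return {key: flagged[key] / totals[key] for key in totals}


# Made-up (category, section, label) records for illustration only.
toy_records = [
    ("circulatory", "CGE", "partially consistent"),
    ("circulatory", "CGE", "contradictory"),
    ("circulatory", "AS", "consistent"),
    ("digestive", "HBO", "consistent"),
    ("digestive", "HBO", "irrelevant"),
]

rates = inconsistency_rate(toy_records)
# List the category/section pairs from most to least inconsistent.
for key, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(key, f"{rate:.2f}")
```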

Accuracy vs. Context

Although some responses were technically accurate, they often failed to effectively address cultural or linguistic nuances. For instance, dietary advice in Turkish and Chinese responses substituted culturally relevant examples, which sometimes diverged from the original English guidance. This adaptation was helpful in some contexts but occasionally resulted in oversimplifications that reduced the usefulness of the advice.

Contradictions and Omissions

Contradictions between English and non-English responses were observed in areas such as medication effects or dietary recommendations. In some cases, non-English responses provided conflicting advice or omitted key points entirely, particularly in sections like public health guidelines or individualized care. These gaps underscored the challenges of ensuring parity in multilingual outputs.

Implications and recommendations for multilingual AI in healthcare

Inconsistent or incomplete LLM responses pose a significant risk of spreading misinformation and undermining trust in AI systems, particularly in non-English-speaking regions. These inconsistencies are often rooted in biased training datasets that prioritize high-resource languages like English and German over others. The disparities also highlight the ethical challenges of deploying LLMs in sensitive fields like healthcare, where even small inaccuracies can have serious consequences. To ensure LLMs contribute positively to global healthcare, it is essential to address these biases and improve the reliability of their multilingual outputs.

Enhancing multilingual training is crucial for improving the performance of LLMs. Developers need to invest in creating comprehensive datasets that encompass diverse languages, medical systems, and cultural contexts. Cultural adaptation must strike a balance between relevance and accuracy, ensuring that advice retains its depth and precision even when tailored to local norms. Regular audits and independent quality checks of LLM-generated responses can further identify inconsistencies and guide improvements. Additionally, integrating human oversight into the decision-making process can ensure that AI-generated advice is both accurate and contextually appropriate, reinforcing user trust in these systems.

First published in: Devdiscourse
