Can AI maintain accuracy in low-resource languages?

To assess the accuracy of truthfulness evaluations, the study compares three methods: human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. Human evaluation involved manual assessments of model responses across different languages, determining whether the responses were truthful and informative.


CO-EDP, VisionRICO-EDP, VisionRI | Updated: 18-02-2025 10:36 IST | Created: 18-02-2025 10:36 IST
Can AI maintain accuracy in low-resource languages?
Representative Image. Credit: ChatGPT

The rise of large language models (LLMs) has transformed natural language processing, enabling machines to generate human-like text. However, truthfulness remains a critical concern, particularly in a multilingual landscape where misinformation can propagate across linguistic barriers.

A new study titled "Truth Knows No Language: Evaluating Truthfulness Beyond English" by Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria De Dios Flores, and Rodrigo Agerri, published in arXiv (2025), addresses this challenge by extending the TruthfulQA benchmark to four additional languages: Basque, Catalan, Galician, and Spanish. Their research evaluates how well LLMs maintain truthfulness across different languages and provides a framework for assessing misinformation risks in multilingual AI applications.

Assessing truthfulness beyond English

Most truthfulness evaluations of LLMs have focused exclusively on English, largely due to the availability of benchmarks such as TruthfulQA. However, the study highlights a crucial limitation: AI-generated misinformation can impact global audiences, making it imperative to assess LLM performance in underrepresented languages. To bridge this gap, the researchers professionally translated TruthfulQA into Basque, Catalan, Galician, and Spanish, ensuring linguistic accuracy beyond machine translation. They then evaluated 12 state-of-the-art LLMs, including instruction-tuned and base models from the Llama and Gemma families, using multiple evaluation techniques: human evaluation, multiple-choice metrics, and an LLM-as-a-Judge framework.

Findings reveal that LLMs generally perform best in English and worst in Basque, which is the lowest-resourced language in the study. However, the gap between languages was smaller than expected, suggesting that even in lower-resourced languages, LLMs maintain a baseline level of truthfulness. Notably, larger models outperform smaller ones, contrary to some prior research suggesting that increased model size leads to more hallucinations. This indicates that modern instruction-tuned LLMs benefit from improved post-training alignment, making them more reliable across languages.

Evaluating the effectiveness of LLM-as-a-judge

To assess the accuracy of truthfulness evaluations, the study compares three methods: human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. Human evaluation involved manual assessments of model responses across different languages, determining whether the responses were truthful and informative. Meanwhile, multiple-choice metrics used automated scoring based on predefined correct and incorrect answers, a common method in existing benchmarks. However, the study finds that multiple-choice metrics alone are insufficient for evaluating truthfulness, as they fail to capture nuanced explanations and reasoning within responses.

Instead, the LLM-as-a-Judge framework - where another AI model evaluates responses for accuracy - demonstrates stronger correlation with human evaluations across all languages. The researchers trained multiple LLMs to act as judges, testing them against various model-generated responses. The results suggest that LLM-as-a-Judge is a more reliable automatic evaluation method than multiple-choice scoring, especially when evaluating multilingual outputs. Importantly, the study also finds no significant bias when using LLM-as-a-Judge across different model families, reinforcing its robustness as an evaluation metric.

The role of informative responses in truthfulness

One critical insight from the study is the relationship between informativeness and truthfulness. Instruction-tuned models tend to generate more informative answers, whereas base models often produce vague or non-committal responses (e.g., "I have no comment"). This pattern skews traditional truthfulness benchmarks, as uninformative responses are often rated as truthful despite their lack of substantive content. The study emphasizes the need for truthfulness assessments to account for both correctness and informativeness, as overly cautious models may artificially inflate truthfulness scores without providing meaningful information.

Additionally, the study explores the impact of universal versus context-dependent questions. Universal knowledge questions - those with stable, widely accepted answers - are generally handled well across languages. However, context-sensitive and time-dependent questions pose greater challenges, as LLMs struggle to adjust for regional variations and evolving facts. The authors argue that truthfulness benchmarks must include a balance of universal and contextual questions to better reflect real-world AI usage scenarios.

Towards a more reliable multilingual AI future

The study concludes that multilingual truthfulness benchmarks are essential for evaluating AI reliability worldwide. While English remains the dominant language for AI evaluation, extending benchmarks like TruthfulQA to multiple languages helps identify potential biases and misinformation risks in non-English AI interactions. Moreover, the research highlights that high-quality machine translation may provide a scalable alternative to professional translation for future multilingual benchmarks, reducing costs while maintaining evaluation accuracy.

Looking ahead, the authors recommend expanding truthfulness assessments to more languages and dialects, incorporating dynamic, real-time updates to address evolving knowledge, and refining LLM-as-a-Judge frameworks for more accurate automated evaluation. By taking these steps, the AI community can work toward developing more transparent, fair, and truthful AI systems across linguistic boundaries.

As AI-generated content continues to shape global discourse, ensuring truthfulness across all languages is a necessary step toward responsible AI deployment. This study represents a significant milestone in multilingual AI research, setting a precedent for future truthfulness evaluations and the broader fight against misinformation in the digital age.

  • FIRST PUBLISHED IN:
  • Devdiscourse
Give Feedback