Language doesn’t matter: AI gets it wrong and people still believe it
Across five languages, users placed the most trust in AI answers phrased with strong certainty, even when those answers were wrong.
A new study has raised alarms about the growing global risks of overconfidence in artificial intelligence language models. The research has found that users consistently overtrust the outputs generated by large language models (LLMs) across diverse languages, with potentially dangerous implications for AI safety worldwide.
The findings are detailed in the study titled “Humans overrely on overconfident language models, across languages”, currently under review and available as a preprint on arXiv. The paper explores how epistemic expressions (linguistic cues that signal levels of certainty) affect human trust in AI-generated content in English, French, German, Japanese, and Mandarin. Despite differences in how certainty is expressed across these languages, the study found that people continue to overrely on AI outputs, particularly when models use confident language, even when the information is incorrect.
Are LLMs truly multilingual or just multilingually overconfident?
The study set out to examine whether the problem of LLM overconfidence, well documented in English, persists across other languages. Using three prominent AI models (GPT-4o, LLaMA 3.1–70B, and LLaMA 3.1–8B), the researchers drew on the Massive Multitask Language Understanding (MMLU) benchmark to test how well each model expresses varying levels of certainty in multiple languages.
Results indicated that overconfidence is not confined to English. GPT-4o, the most accurate model tested, gave incorrect answers with strong expressions of certainty 15 percent of the time. For the LLaMA models, this figure rose dramatically to between 39 and 49 percent, suggesting a systemic issue in the design and calibration of these models.
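To make that kind of measurement concrete, the minimal sketch below shows one way confidently phrased but incorrect answers could be tallied for a benchmark such as MMLU. It is an illustration only, not the authors’ code: the marker lists, record fields, and function name are assumptions.

```python
# Illustrative sketch only: counts how often a model pairs a wrong answer
# with strong certainty language. The marker lists and record fields are
# hypothetical, not taken from the study's actual implementation.

STRONG_MARKERS = {
    "en": ["I am confident", "certainly", "definitely"],
    "de": ["ich bin sicher", "definitiv"],
    "ja": ["間違いなく", "確実に"],
}

def confident_but_wrong_rate(records, lang):
    """records: iterable of dicts with 'text' and 'is_correct' keys."""
    markers = STRONG_MARKERS.get(lang, [])
    total = confident_wrong = 0
    for rec in records:
        total += 1
        is_confident = any(m.lower() in rec["text"].lower() for m in markers)
        if is_confident and not rec["is_correct"]:
            confident_wrong += 1
    return confident_wrong / total if total else 0.0

# A rate of 0.15 from such a tally would correspond to the 15 percent
# figure reported for GPT-4o in the article.
```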
Although the models reflected linguistic norms in their output, producing more uncertainty markers in Japanese, where hedging is culturally normative, and more strengtheners in German and Mandarin, they still regularly misrepresented their epistemic certainty: these adjustments did not translate into fewer confidently phrased incorrect answers. In short, the models were linguistically adaptive but functionally unreliable.
How do users interpret AI confidence across languages?
To assess the human side of the equation, the researchers turned to a behavioral framework known as REL-A.I., developed to measure user reliance on AI-generated text. They recruited bilingual participants fluent in English and one of the target languages to gauge whether reliance on model-generated content varied depending on linguistic and cultural context.
Participants were shown trivia questions with AI-generated answers featuring different levels of epistemic certainty: weak (e.g., “I think”), moderate (e.g., “It is likely”), strong (e.g., “I am confident”), and plain (e.g., “It is”). They were then asked whether they would trust the AI’s answer or search for the correct information themselves.
Across all languages, the pattern was clear: users consistently relied more on answers containing strong expressions of certainty. On average, 65 percent of users trusted confident responses without verification. When this was combined with the high error rates in such responses, the result was a notable risk of overreliance.
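As a rough illustration of how such reliance rates could be aggregated, the hypothetical sketch below groups participant decisions by language and by the certainty level of the AI answer. The data schema and field names are assumptions for illustration, not the study’s REL-A.I. implementation.

```python
from collections import defaultdict

# Illustrative sketch: each trial records the language, the certainty level
# of the AI answer ("weak", "moderate", "strong", "plain"), and whether the
# participant accepted the answer without checking it themselves.
# The schema is assumed for illustration, not taken from the study.

def reliance_rates(trials):
    counts = defaultdict(lambda: [0, 0])  # (lang, level) -> [relied, total]
    for t in trials:
        key = (t["lang"], t["certainty_level"])
        counts[key][1] += 1
        if t["relied_on_ai"]:
            counts[key][0] += 1
    return {key: relied / total for key, (relied, total) in counts.items()}

example_trials = [
    {"lang": "ja", "certainty_level": "strong", "relied_on_ai": True},
    {"lang": "ja", "certainty_level": "weak", "relied_on_ai": True},
    {"lang": "de", "certainty_level": "strong", "relied_on_ai": False},
]
print(reliance_rates(example_trials))
```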
Surprisingly, Japanese users exhibited the highest overreliance rate, even though the models used more hedging in Japanese responses. This points to a cultural disposition to trust AI answers even when they carry uncertainty markers, effectively negating the assumed safety benefit of increased hedging. The study found that Japanese users were 1.5 times more likely than English speakers to trust overconfident, incorrect AI outputs.
In contrast, German and French users were more skeptical of confident language, relying less on answers phrased with strengtheners or plain assertions. Nevertheless, overreliance risks remained non-negligible in those languages, especially with lower-performing models like LLaMA 3.1–8B.
Can calibration alone solve the overreliance problem?
The researchers argue that efforts to improve AI calibration must be more nuanced than current strategies allow. While previous work focused on aligning model confidence with output accuracy, this study emphasizes the critical role of human interpretation in determining AI safety. The misalignment between how LLMs express confidence and how users interpret that confidence poses a dual-layered threat.
This issue is compounded by the nature of transfer learning in multilingual models. Most LLMs are primarily trained on English datasets and then generalized to other languages. This raises the risk of an “English epistemic bias” seeping into multilingual generations, where culturally inappropriate levels of confidence may be expressed in languages that traditionally employ hedging or ambiguity.
The study also suggests that the perception of uncertainty is deeply tied to cultural and linguistic norms, and not merely to the frequency or form of epistemic markers. For instance, while Japanese generations included more hedging, users were still more likely to trust those hedges than their English equivalents. This undermines the assumption that simply generating more uncertainty markers leads to safer model behavior.
Therefore, the authors call for culturally contextualized, user-centered evaluation methods that go beyond linguistic output to consider actual user behavior in real-world contexts. They stress the need for AI developers to build safety metrics that incorporate cross-linguistic and cross-cultural trust dynamics, rather than assuming uniform user responses to model outputs.
Global implications for AI safety
The findings paint a concerning picture of AI deployment in a multilingual world. While LLMs are often touted as globally capable, their epistemic behavior does not scale safely across languages. The interaction between model expression and human interpretation varies widely and unpredictably, rendering blanket safety measures insufficient.
The researchers argue that existing model evaluations must be expanded to include multilingual and multicultural dimensions, especially as AI systems are increasingly used in high-stakes domains like healthcare, education, and governance. A confident but incorrect model response in one language may be harmless, while in another it could trigger misplaced trust with serious consequences.
FIRST PUBLISHED IN: Devdiscourse

