English-centric AI raises equity concerns in multilingual classrooms

CO-EDP, VisionRI | Updated: 29-04-2025 18:16 IST | Created: 29-04-2025 18:16 IST

Classrooms worldwide are rapidly adopting artificial intelligence (AI), but questions about equity, accuracy, and accessibility are growing harder to ignore. A new study shines a critical spotlight on the multilingual capabilities of large language models (LLMs) used in education. Titled “Multilingual Performance Biases of Large Language Models in Education” and published on arXiv, the research evaluates how well popular AI models like GPT-4o, Gemini, Claude, Llama, and Mistral perform educational tasks across six languages beyond English. The findings reveal both surprising strengths and alarming gaps, particularly when it comes to supporting non-English-speaking students.

How well do LLMs perform educational tasks across different languages?

The study systematically benchmarks six LLMs on four critical educational tasks: identifying student misconceptions, providing targeted feedback, interactive tutoring, and translation grading. These tasks were evaluated in Hindi, Arabic, Farsi, Telugu, Ukrainian, and Czech, alongside English. Despite claims of multilingual training, LLMs demonstrated a persistent English-language bias.
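
To make the evaluation design concrete, the benchmark can be pictured as a grid of models, tasks, and languages, with every combination scored against a task rubric. The Python sketch below is only an illustration of that setup under assumed names: the model identifiers, task labels, and the query_model and score_output helpers are hypothetical placeholders, not the study's actual harness.

```python
from itertools import product
import random

# Hypothetical placeholders: model identifiers, task labels, and the two
# helper functions stand in for the study's actual harness and rubrics.
MODELS = ["gpt-4o", "gemini-2.0", "claude", "llama", "mistral", "command-a"]
TASKS = ["misconceptions", "feedback", "tutoring", "translation_grading"]
LANGUAGES = ["en", "hi", "ar", "fa", "te", "uk", "cs"]

def query_model(model: str, task: str, lang: str) -> str:
    """Stand-in for an API call; returns a dummy output here."""
    return f"{model} output for {task} in {lang}"

def score_output(task: str, output: str) -> float:
    """Stand-in for rubric-based scoring; returns a random score here."""
    return random.random()

# Score every model on every task in every language, then average per
# language to surface cross-lingual gaps such as the English advantage.
scores = {
    (m, t, l): score_output(t, query_model(m, t, l))
    for m, t, l in product(MODELS, TASKS, LANGUAGES)
}
per_language = {
    lang: sum(v for (_, _, l), v in scores.items() if l == lang)
          / (len(MODELS) * len(TASKS))
    for lang in LANGUAGES
}
print(per_language)
```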

English outputs achieved the highest average score at 70.9%, whereas languages like Telugu (49.7%) and Czech (55.3%) scored far lower. GPT-4o and Gemini-2.0 emerged as the most reliable models across languages, maintaining consistently high performance, while models like Claude and Mistral struggled significantly, especially in low-resource languages.

Importantly, while models performed reasonably well in major languages like Arabic and Farsi, the gaps compared to English outputs were non-trivial. The research highlights that English-centric pretraining continues to dominate model behavior, reinforcing linguistic inequalities even within advanced AI systems. This has direct implications for their real-world deployment in multicultural and multilingual educational settings.

What are the main challenges in using LLMs for multilingual education?

The research uncovers several key challenges. First, prompt language played a significant role. Contrary to intuition, prompts in English led to better outcomes across most languages compared to translated prompts. Even with manual translation checks, translated prompts often introduced complexities and degraded performance, suggesting that English should remain the primary language for prompt design in multilingual deployments.
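
As a concrete illustration of that finding, the two prompt variants below contrast an English instruction that merely requests a response in the target language with a fully translated instruction; the wording, and the rough Hindi rendering, are illustrative examples rather than the study's actual prompts.

```python
# Variant A (in line with the findings): keep the instruction in English
# and only switch the language requested for the response.
english_prompt = (
    "You are a math tutor. Identify the misconception in the student's "
    "answer below and explain it briefly. Respond in {language}.\n\n"
    "Student answer: {student_answer}"
)

# Variant B: the whole instruction is translated into the target language
# (rough illustrative Hindi), which the study found tends to degrade results.
translated_prompt_hi = (
    "आप एक गणित शिक्षक हैं। नीचे दिए गए छात्र के उत्तर में ग़लतफ़हमी पहचानें "
    "और संक्षेप में समझाएँ।\n\n"
    "छात्र का उत्तर: {student_answer}"
)

prompt = english_prompt.format(language="Hindi",
                               student_answer="1/2 + 1/3 = 2/5")
```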

Second, task complexity and language typology introduced major hurdles. For instance, feedback selection tasks revealed the greatest weaknesses across all models, especially in non-English languages. Moreover, tutoring tasks exposed large inconsistencies: success rates were significantly lower for Czech and Telugu due to cultural and linguistic nuances that models failed to handle.

Third, translation quality affected results. Although Azure Translate was used to create datasets for each language, minor translation errors, especially in domain-specific terminology and syntax, introduced noise. This means part of the observed underperformance might be due to translation artifacts rather than model flaws alone, complicating the attribution of errors.
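
One common way to flag such artifacts, though not necessarily the procedure the authors used, is a round-trip check: translate each item into the target language and back into English, then compare the result with the original. The sketch below assumes a generic translate wrapper, since the exact Azure Translate integration is not reproduced here, and the 0.8 threshold is an arbitrary illustrative cutoff.

```python
from difflib import SequenceMatcher

def translate(text: str, source: str, target: str) -> str:
    """Hypothetical wrapper around a translation service such as Azure
    Translate; the real API call and credentials are omitted here."""
    raise NotImplementedError

def round_trip_similarity(english_item: str, target_lang: str) -> float:
    """Translate an English item to the target language and back, then
    measure how much of the original wording survives (0.0-1.0)."""
    forward = translate(english_item, source="en", target=target_lang)
    back = translate(forward, source=target_lang, target="en")
    return SequenceMatcher(None, english_item.lower(), back.lower()).ratio()

# Items falling below this illustrative threshold would be queued for
# manual review before being included in the benchmark.
SUSPECT_THRESHOLD = 0.8
```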

Moreover, models like Mistral and Command-A showed high variability across languages and tasks, making them unreliable choices for multilingual educational applications without extensive pre-deployment testing.

What recommendations does the study offer for developers and educators?

The study offers critical recommendations. Developers and educational practitioners are advised not to assume that a model performing well in English will automatically deliver comparable quality in other languages. Before deploying LLMs in multilingual classrooms, it is essential to validate model performance in each target language and for each specific educational task.
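
In practice, that validation could take the form of a simple pre-deployment gate: score the candidate model on a held-out set for every target language and task, and enable a language only if it clears a minimum bar on all tasks. The sketch below is a hypothetical illustration; the evaluate helper and the 0.65 threshold are assumptions, not the study's procedure.

```python
MIN_ACCEPTABLE_SCORE = 0.65  # illustrative threshold, not from the study

def evaluate(model: str, task: str, lang: str) -> float:
    """Stand-in for running the model on a held-out set and scoring it."""
    raise NotImplementedError

def approved_languages(model: str, tasks: list[str], langs: list[str]) -> list[str]:
    """Return only the languages where the model clears the bar on every task."""
    return [
        lang for lang in langs
        if all(evaluate(model, task, lang) >= MIN_ACCEPTABLE_SCORE
               for task in tasks)
    ]
```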

Another key takeaway is that keeping prompts in English might offer a more reliable baseline, simplifying monitoring, debugging, and ensuring quality across multilingual applications. The study also advocates for greater investment in multilingual pretraining and fine-tuning, especially for low-resource languages that are currently underserved by LLMs.

Finally, the researchers encourage open benchmarking and the use of multilingual evaluation frameworks. They released their own multilingual educational dataset, comprising 91,000 model outputs, to facilitate further research and help develop better models that can equitably support diverse student populations worldwide.

  • FIRST PUBLISHED IN: Devdiscourse