Language inequality deepens as AI development favors dominant tongues
Artificial intelligence (AI) thrives on data, and in the race to build ever-larger models, developers have gravitated toward languages with abundant digital content. This strategy has accelerated innovation but has also created structural imbalances that mirror historical patterns of linguistic dominance.
The paper "Artificial intelligence is creating a new global linguistic hierarchy" provides evidence that AI language resources are intensifying inequality across 6,003 global languages. The findings challenge the widely held assumption that digital technologies naturally diffuse over time, reaching marginalized communities and minority language speakers. Instead, the study shows that AI language resources follow a power-law distribution, in which a small number of languages dominate the ecosystem and accumulate increasing advantages.
A power-law divide in global language AI
The researchers analyzed the availability of AI language models and datasets, focusing on repositories such as Hugging Face that serve as major hubs for open-source AI development. Their longitudinal dataset spans from 2020 to 2024, a period marked by the rise of large language models and conversational AI systems.
The results reveal that AI resources are heavily concentrated in a few languages, including English, Mandarin Chinese, French, and Spanish. The distribution follows a pattern consistent with Zipf’s Law, in which frequency declines sharply after the top-ranked items. The authors describe this intensification of inequality as “Zipfianisation,” a process in which early advantages become magnified over time.
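To make that pattern concrete, a minimal sketch follows of how such a claim can be checked: under Zipf's Law, the resource count of the language at rank r decays roughly as f(r) ≈ C · r^(−s), so a straight-line fit in log-log space recovers the exponent s. The counts below are invented for illustration and are not the study's data.

```python
# Illustrative sketch (not the paper's code): testing whether ranked
# resource counts are consistent with Zipf's Law, f(r) ~ C * r^(-s).
# The counts below are synthetic placeholders, not the study's data.
import numpy as np

counts = np.array([120_000, 30_000, 14_000, 8_000, 5_200,
                   3_600, 2_700, 2_100, 1_700, 1_400], dtype=float)
ranks = np.arange(1, len(counts) + 1, dtype=float)

# Under a power law, log f = log C - s * log r, so a linear fit
# in log-log space recovers the exponent s as the negated slope.
slope, intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
print(f"estimated Zipf exponent s ~ {-slope:.2f}")
```

A steep exponent of this kind is what "Zipfianisation" describes: the gap between rank 1 and rank 10 is already an order of magnitude, and it widens as the process feeds on itself.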
English stands out as an extreme outlier. Its growth in model and dataset representation far exceeds what would be expected even under a standard power-law distribution. As the dominant language of global academia, technology development, and internet content, English benefits from both historical and structural advantages. These advantages have been amplified during the large language model era, when vast amounts of web text were used to train foundational systems.
At the other extreme, thousands of languages remain severely underrepresented. Some languages with millions of speakers, including Nigerian Pidgin, Chittagonian, and Wu Chinese, receive minimal AI development relative to their population size. The researchers find that speaker population explains only a modest share of the variance in AI resource distribution, indicating that other structural factors are at work.
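The phrase "explains only a modest share of the variance" refers to the R² of a regression of resources on speakers. The sketch below shows the standard log-log version of that check; every number in it is hypothetical and chosen only to illustrate the mechanics.

```python
# Minimal sketch of the kind of check behind "speaker population explains
# only a modest share of variance": regress log(resource count) on
# log(speaker population) and read off R^2. All figures are hypothetical.
import numpy as np

speakers = np.array([1.5e9, 1.1e9, 3.0e8, 8.0e7, 4.0e7, 1.2e7, 5.0e6])
resources = np.array([500_000, 40_000, 60_000, 300, 9_000, 50, 400])

x, y = np.log(speakers), np.log(resources)
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
r_squared = 1 - residuals.var() / y.var()
print(f"R^2 = {r_squared:.2f}")  # a low value means population alone predicts little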
The study also uncovers surprising imbalances. Certain European languages, including dead languages such as Latin and Ancient Greek, enjoy disproportionate AI representation compared to living languages spoken by larger populations. Historical prestige and academic interest appear to drive some of this imbalance, further skewing resource allocation.
Geographic patterns and diffusion dynamics
Languages in Sub-Saharan Africa, South Asia, and parts of the Middle East are significantly underrepresented in AI development. However, the gap does not map neatly onto a Global North versus Global South narrative.
In technologically advanced countries such as the United States and Australia, English dominates AI resources, while most non-English languages spoken within those borders receive little to no support. This pattern underscores that linguistic inequality exists even within high-income nations.
The researchers also identify counterintuitive cases. Nigeria, for example, has a relatively high proportion of languages with at least one AI model compared to some developed countries. Yet the overall depth and sophistication of these models remain limited, and the vast majority of local languages still lack meaningful AI integration.
Traditional technologies such as mobile phones and personal computers often follow S-shaped adoption curves, beginning slowly, accelerating as infrastructure improves, and eventually saturating markets. Language AI systems do not follow this pattern.
Instead, AI diffusion appears top-down and highly concentrated. Early industrial investment in high-resource languages created a lock-in effect. As companies built models using readily available English and other dominant language data, these languages gained further advantages. Subsequent improvements built on these foundations, reinforcing a cumulative “rich-get-richer” cycle.
Unlike grassroots technologies that spread through community networks, language models depend on large-scale data aggregation and centralized training pipelines. This structure limits opportunities for marginalized languages to catch up without intentional intervention.
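For context, the S-shaped adoption curve described above corresponds to a logistic function, N(t) = K / (1 + e^(−r(t − t0))): slow uptake, rapid acceleration once infrastructure matures, then saturation. The sketch below uses illustrative parameters; the paper's point is that language AI resource growth does not trace this shape.

```python
# Sketch of the S-shaped (logistic) adoption curve that traditional
# technologies follow: N(t) = K / (1 + exp(-r * (t - t0))).
# K, r, and t0 are illustrative values, not fitted parameters.
import math

K, r, t0 = 1.0, 0.9, 5.0  # saturation level, growth rate, midpoint (assumed)

def logistic_adoption(t: float) -> float:
    """Fraction of eventual adopters reached by time t."""
    return K / (1 + math.exp(-r * (t - t0)))

for t in range(0, 11):
    print(t, f"{logistic_adoption(t):.2f}")  # slow start, rapid middle, plateau
```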
Introducing the Language AI Readiness Index
The authors introduce the Language AI Readiness Index, known as EQUATE, an open-access framework designed to evaluate the readiness of all attested languages for AI development.
The index integrates three key dimensions. The first is AI resource availability, including the presence of language models, datasets, and digital content. The second is digital infrastructure, covering internet connectivity, network performance, and technological access. The third is socioeconomic capacity, including education levels, GDP per capita, research investment, and broader human development indicators.
Through principal component analysis and hierarchical mixed-effects modeling, the researchers show that AI resource distribution and infrastructure readiness are distinct but complementary factors. Some languages exist in countries with strong infrastructure and educational systems yet remain underrepresented in AI resources. These cases represent underutilized potential, where targeted investment could yield rapid improvements.
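The paper's exact modeling pipeline is not reproduced here, but a generic sketch of a PCA-based composite index, in the spirit of EQUATE's three dimensions, illustrates the idea. The sample values, the column choices, and the use of scikit-learn are assumptions for illustration only.

```python
# Generic sketch of a PCA-based composite index in the spirit of EQUATE.
# The three columns mirror the paper's dimensions (AI resources, digital
# infrastructure, socioeconomic capacity); all values are invented, and
# this is not the authors' actual pipeline.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# rows: languages; columns: [ai_resources, infrastructure, socioeconomic]
X = np.array([
    [9.5, 9.0, 8.8],   # high-resource language
    [0.2, 8.5, 8.0],   # infrastructure-ready but resource-poor
    [0.1, 2.0, 1.5],   # underrepresented on every dimension
    [3.0, 4.0, 3.5],
])

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)

# The first component can serve as an overall readiness score (its sign is
# arbitrary and may need flipping); a second component that separates
# resource availability from infrastructure would echo the finding that
# the two factors are distinct but complementary.
print(pca.explained_variance_ratio_)
print(scores[:, 0])  # composite score per language
```

In a sketch like this, the second row (strong infrastructure, almost no AI resources) is exactly the "underutilized potential" case the authors highlight.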
Conversely, a few languages with relatively small speaker bases, such as Esperanto, exhibit high AI resource representation due to concentrated community interest or academic engagement. These examples illustrate that population size alone does not determine AI visibility.
By ranking languages across dimensions, EQUATE provides policymakers and developers with a tool to identify leverage points. Governments can assess whether their languages are infrastructure-ready but resource-poor, signaling a need for dataset creation and model development. Technology firms can use the index to prioritize inclusive expansion strategies.
Structural implications for the AI era
Languages without AI support risk exclusion from digital services. The absence of models limits access to translation tools, voice assistants, automated customer support, and information retrieval systems.
This exclusion may have cascading effects. If educational resources, administrative systems, and digital interfaces prioritize a narrow set of languages, speakers of marginalized languages may face barriers to participation in digital economies. Over time, the hierarchy may influence language preservation and cultural continuity.
The authors caution that without deliberate corrective action, AI will deepen existing linguistic inequalities. The structural features of large-scale model training, including data aggregation practices and investment incentives, favor dominant languages. As companies optimize for return on investment, high-resource languages attract disproportionate attention.
The research calls for coordinated intervention. Public funding mechanisms could support dataset creation for underrepresented languages. International organizations could promote multilingual benchmarks and inclusion standards. Academic institutions could expand research on low-resource language modeling techniques.
The findings also suggest a need for transparency in reporting language coverage. Many AI systems advertise multilingual capability without disclosing uneven performance levels across languages. Clearer metrics could improve accountability and guide equitable development.
First published in: Devdiscourse