Language inequality in AI: Tokenization system favors English, marginalizes others
Artificial intelligence (AI) systems that power chatbots and translation tools are quietly embedding economic inequality into language itself, according to new research by Paolo Caffoni of the Karlsruhe University of Arts and Design. The study argues that the way AI breaks language into computational units is not just a technical process but a system that assigns uneven economic value to different languages, reshaping how linguistic labor is measured and monetized.
Published in AI & Society, the study titled “The cost of language: tokenization as a metric of labor” reframes tokenization as a core mechanism linking artificial intelligence to global labor structures, rather than a neutral engineering step.
Tokenization and the hidden cost of language in AI systems
For the unversed, tokenization is the process by which AI systems divide text into smaller units known as tokens. These tokens form the basis of how models process, predict, and generate language. However, the research finds that tokenization does not treat all languages equally.
Languages that use non-Latin scripts, such as Telugu, Arabic, or Chinese, are often broken into more tokens than English. This means users of these languages effectively pay more to use AI systems that charge based on token counts. The disparity is not marginal. In some cases, Telugu speakers may pay up to five times more than English users for equivalent outputs, while other languages show similarly uneven ratios.
The root of this inequality lies in how tokenization algorithms are trained. Most large language models rely heavily on English-dominated datasets, leading to more efficient encoding for English and less efficient representation for other languages. As a result, languages with different scripts or structures are fragmented into longer token sequences, increasing both computational load and financial cost.
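One reason the fragmentation is so uneven: many modern tokenizers operate on UTF-8 bytes, and scripts poorly covered by the learned merge vocabulary fall back toward raw bytes. An ASCII letter is one byte, while a Telugu character is three. The sketch below uses raw UTF-8 byte counts as a crude worst-case proxy for this effect; real tokenizers merge some bytes back together, so actual counts sit between one token per word and one token per byte.

```python
# Crude proxy for byte-level tokenizer fragmentation: with no learned
# merges for a script, every UTF-8 byte becomes its own token.

def worst_case_tokens(text: str) -> int:
    """Token count for a byte-level tokenizer with no merges applied."""
    return len(text.encode("utf-8"))

english = "hello"      # 5 ASCII characters -> 1 byte each
telugu = "నమస్కారం"     # 8 Telugu code points -> 3 bytes each

print(worst_case_tokens(english))  # 5
print(worst_case_tokens(telugu))   # 24
```

Even before any model-specific vocabulary enters the picture, the Telugu greeting starts from nearly five times as many byte-level units as an English word of comparable length.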
The study highlights that this issue goes beyond pricing. Tokenization shapes performance, accuracy, and accessibility. Languages that are over-tokenized not only cost more but may also suffer from reduced model efficiency and quality. The research argues that tokens function as measurable units of linguistic value. This shifts the discussion from engineering to economics, positioning tokenization as a system that quantifies language in ways that mirror labor valuation.
From telegraphy to AI: the historical roots of linguistic inequality
The study places modern tokenization within a longer history of communication technologies that have reshaped language through economic and technical constraints. It draws a direct comparison to the telegraph era, when early communication systems favored alphabetic languages and imposed additional labor costs on non-alphabetic scripts.
In nineteenth-century China, for example, telegraph systems required Chinese characters to be converted into numerical codes before transmission. This double encoding process increased both time and labor costs, effectively making communication more expensive and complex compared to English-based systems.
This historical precedent introduces what the study describes as “alphabetic labor time,” a concept that links efficiency standards to specific linguistic structures. Technologies like Morse code were built around alphabetic assumptions, reinforcing a hierarchy where certain languages aligned better with technical systems and others faced structural disadvantages.
The research argues that modern AI systems reproduce similar dynamics. While tokenization differs from Morse code in being more flexible and data-driven, it still reflects underlying biases rooted in training data, infrastructure, and design choices.
The continuity is not accidental. Both telegraphy and tokenization emerged from the need to compress and transmit language efficiently. Yet both also embedded economic hierarchies into communication systems, privileging certain languages while marginalizing others. The study suggests that linguistic inequality in AI is not a new problem but a continuation of long-standing patterns shaped by global technological development.
Language as labor: how AI reshapes the global division of work
The research claims that tokenization is redefining language as a form of labor within digital economies. Tokens are not just units of text but units of work that can be measured, priced, and optimized.
Based on theories from political economy and linguistics, the research argues that language has always been tied to labor. Communication enables coordination, production, and exchange. In AI systems, this relationship becomes explicit as language is converted into quantifiable units that drive machine learning processes.
AI systems do not just process language; they reorganize it into workflows that resemble industrial production. Tasks are segmented, measured, and optimized based on statistical patterns rather than traditional human-defined roles.
The study highlights how this logic extends beyond language models into digital platforms. Services like ride-hailing, streaming, and logistics increasingly break human activity into discrete, measurable units similar to tokens. Each action becomes part of a larger system of prediction and optimization.
In this context, tokenization represents a broader shift toward what the study describes as the “automation of linguistic value.” Language is no longer just a medium of communication but a resource that can be quantified and integrated into economic systems.
The cost and structure of language itself may influence access to technology, participation in digital economies, and even global labor distribution. The study also raises questions about fairness and policy. Efforts to achieve “tokenization parity” across languages have been proposed, aiming to ensure that different languages are encoded with similar efficiency. However, the research suggests that technical fixes alone may not address deeper structural inequalities rooted in data, infrastructure, and global power dynamics.
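One common way to quantify the gap that "tokenization parity" proposals target is fertility: the average number of tokens a tokenizer produces per word, compared across languages on parallel text. The token and word counts below are invented for illustration, not measurements from the study.

```python
# Fertility (tokens per word) and parity relative to English,
# computed over a hypothetical parallel corpus. All counts are assumptions.

samples = {
    # language: (token_count, word_count) on the same underlying content
    "English": (1000, 800),
    "Arabic":  (1900, 800),
    "Telugu":  (3200, 800),
}

fertility = {lang: tokens / words for lang, (tokens, words) in samples.items()}
parity = {lang: f / fertility["English"] for lang, f in fertility.items()}

for lang in samples:
    print(f"{lang}: fertility={fertility[lang]:.2f}, parity={parity[lang]:.2f}")
```

A parity value of 1.0 would mean a language is encoded as efficiently as English; values above 1.0 measure exactly the overhead the study describes, which is why parity alone cannot capture the upstream inequalities in data and infrastructure.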
First published in: Devdiscourse

