Data colonialism could undermine AI efforts to preserve endangered languages


A new study has warned that artificial intelligence (AI) could either support a new era of Indigenous language revitalization or deepen old patterns of extraction, depending on who controls the data, tools and decisions behind language technologies. The review finds that large language models (LLMs) and AI translation systems may help document, teach and expand endangered languages, but they also risk turning community knowledge into raw material for corporate or state systems without consent, ownership or fair benefit.

The study, titled "Data colonialism and indigenous languages in AI: a critical review of existing initiatives and their struggles with data sovereignty," was published in AI & Society. It examines AI initiatives involving Irish Gaelic, Māori, Guaraní and Inuktitut, using the concepts of data colonialism and Indigenous data sovereignty to assess whether these projects empower communities or reproduce extractive digital power structures.

AI language tools offer promise but carry colonial risks

A large share of the world's languages is endangered, and many Indigenous languages face shrinking intergenerational transmission, limited digital visibility and pressure from dominant national or global languages. AI offers potential support by making these languages more usable in modern digital life.

If developed responsibly, AI could help produce educational content, transcribe oral histories, support speakers in public services, power speech tools, assist translation and give younger generations more ways to use endangered languages online. For communities fighting language loss, digital visibility is not a symbolic concern. It can shape whether a language is available in schools, phones, government portals, media systems and everyday communication.

AI's technical promise cannot be separated from power, the study stresses. LLMs are usually built from enormous corpora of scraped text, audio and other data. When Indigenous language materials are collected without proper consent, community control or benefit-sharing, the process mirrors older colonial patterns of extraction. Land, labor and resources were once seized by colonial powers; today, linguistic data, oral histories, cultural expressions and community archives can be absorbed into proprietary AI systems.

The author uses the concept of data colonialism to describe this process. The concern is not that any single community legally owns an abstract language as code. Rather, the issue is control over the material forms that make a language computable: texts, recordings, annotations, datasets, aligned translations, trained models and digital infrastructures. These assets can contain collective knowledge and cultural meaning. When external actors control them, communities may lose authority over how their language is represented, reused or commercialized.

This problem is particularly severe when Big Tech platforms add Indigenous languages to translation systems or AI models without robust co-governance. Publicly available language data may be treated as free training material, even if the resulting systems are closed, commercial or controlled outside the community. The language becomes visible on a platform, but the community may have little say over quality, cultural framing, data reuse or economic value.

The review compares this extractive model with Indigenous data sovereignty. Under this framework, Indigenous peoples should control the collection, ownership and application of data connected to their communities, cultures, lands and languages. The author highlights the CARE principles, which stand for Collective Benefit, Authority to Control, Responsibility and Ethics. These principles challenge the mainstream open-data logic that prioritizes findability, access and reuse, arguing that data use must also ask who benefits, who decides and whether the use is culturally and ethically appropriate.

For AI language projects, this means consent cannot be an afterthought. Communities should help define goals, curate data, set licensing terms, decide access rules, review outputs and share benefits. Success should not be measured only by model accuracy. It should also be judged by community acceptance, cultural integrity, data governance and long-term empowerment.

Gaelic, Māori, Guaraní and Inuktitut show sharply different governance paths

The review uses four language contexts to show how AI initiatives can follow different ethical trajectories. Irish Gaelic is presented as a case where state support, academic research and public licensing create important opportunities, while also raising questions about dependence on Anglophone AI infrastructure and potential cultural mismatch.

Irish Gaelic has strong symbolic, legal and policy status in Ireland, but daily fluent use remains limited compared with the number of people who report some ability in the language. The review notes that Ireland has launched digital language initiatives to ensure Irish keeps pace with major European languages in the digital age. Academic work such as UCCIX, an open-source Irish large language model, shows that focused AI models can perform well even for low-resource languages when training is carefully adapted. The state-backed ArdIntleacht na Gaeilge project aims to build an Irish-language AI assistant for public services, including speech recognition, speech synthesis, translation and dialect-sensitive communication.

These projects show how public investment can support language revitalization. Open-source releases and public-benefit licensing help guard against a scenario in which Irish-language data is locked into private systems. However, the author also warns that the use of base models trained in dominant-language environments may carry Anglophone assumptions into Gaelic AI tools. A system may be technically fluent yet culturally misaligned if its conversational norms, values and representations are shaped by English-language internet culture. Surveillance, dependency on outside vendors and appropriation of culturally important content remain risks.

The Māori case offers one of the strongest examples of community-led governance. Te Hiku Media, a Māori-owned organization, built speech recognition and language tools using Māori-controlled data and consent-first methods. Rather than surrendering decades of Māori audio archives to outside technology firms, it developed tools under Māori terms, including licensing practices designed to protect data sovereignty. The review presents this as a model of Indigenous AI development anchored in guardianship, community control and cultural purpose.

The Guaraní case shows a different challenge. Guaraní has millions of speakers and official status in Paraguay, yet digital support remains uneven. Its inclusion in major translation platforms represents technical progress and public visibility, but the author warns that platform-led adoption can be tokenistic if community governance is limited. When corporate systems rely on public Guaraní text without shared authority or clear benefit flows, the language may be included as a feature while communities remain outside decision-making. Smaller academic projects offer promise, but the review argues that national investment and community-led infrastructure are needed to prevent Guaraní from becoming a checkbox in global AI systems.

Inuktitut shows both the power of public policy and the risks of platform enclosure. Canada's bilingual public records helped generate a substantial Inuktitut-English corpus, enabling translation and language technology projects. Government research initiatives that worked with Indigenous advisory structures and avoided claiming ownership over Indigenous language data are presented as positive examples. Yet when large platforms incorporate Inuktitut into proprietary translation systems, communities may benefit from wider access while still lacking direct control over the model, its training choices and future uses. Dialect variation, script differences, accuracy problems and privacy concerns remain live issues.

Across these examples, the author finds that who leads a project shapes its outcomes. Government-led projects can provide funding, scale and public-service pathways, but they must avoid centralized control. Big Tech can provide computational power and visibility, but it often raises the sharpest concerns over data extraction and proprietary enclosure. Community-led projects may face resource constraints, yet they are more likely to protect sovereignty, cultural representation and local priorities.

Community agency must decide whether AI preserves or exploits languages

Indigenous-language AI should be built around community agency rather than technical solutionism. AI tools can support language preservation, but only when communities define the purpose, govern the data and retain meaningful authority over outputs. The author argues that language data should not be treated as an open resource detached from people. For many Indigenous communities, language is tied to identity, land, memory, oral tradition, cultural law and collective survival. When datasets are extracted from that context, AI systems can distort meaning, flatten dialects, misrepresent cultural knowledge or commercialize expressions that communities regard as sacred or collectively held.

The study also stresses the difference between relational and extractive ethics. A relational approach treats language data as part of a living community relationship, requiring consent, reciprocity, accountability and benefit. An extractive approach treats data as a commodity, useful mainly because it improves a model, expands a platform or creates a marketable product.

Policy has a major role in shaping which path AI follows. Governments can support Indigenous language technologies through funding, open public corpora, public-service mandates and legal protections. But they must work in partnership with Indigenous communities, not simply on their behalf. The review points to the need for principles such as Free, Prior and Informed Consent before community language data is used in AI systems. It also calls for oversight structures that include the communities whose language is being modeled.

The paper also identifies some practical barriers. Many Indigenous communities lack the funding, computational infrastructure, high-quality datasets or technical teams needed to build their own large-scale AI tools. Open-source models and shared methods can help, but they do not automatically solve the sovereignty problem. A dataset can be open in a technical sense while still violating community expectations if consent and cultural governance are absent.

First published in: Devdiscourse
