The Need for Multilingual AI in Developing Countries
How English Dominance in AI Could Widen Global Inequalities and What We Can Do About It
Do you know how much I’ve missed writing? It’s been five long weeks since I last reached out to you. I underwent significant surgery last month, and while I’m still in recovery, I’m more eager than ever to dive back into our discussions.
Today, I want to explore an issue that has been on my mind for some time: the dominance of English in AI and its implications for the world. This isn’t just about technological advancement; it’s about ensuring that technology resonates with everyone, regardless of where they are or the language they speak.
“The limits of my language mean the limits of my world.”
So wrote the Austrian philosopher Ludwig Wittgenstein, and today his idea is more relevant than ever. The language we speak shapes our interaction with technology, and when digital tools are confined to a single dominant language, equitable access to these emerging technologies, and equitable use of them, is limited.
AI’s reliance on English means that it can only 'think, act, and communicate' like a subset of humanity. This selective representation marginalizes non-English speakers and cultures, exacerbating global social and ethical issues. But beyond this global concern lies another critical issue: the impact of English dominance in AI on developing countries.
Some argue that prioritizing English in AI is a practical necessity, given its status as the world’s lingua franca. Indeed, English serves as a bridge language, connecting people across different cultures and geographies. From a purely pragmatic standpoint, it might seem logical to focus on English to maximize the reach and utility of AI tools.
However, this approach has significant downsides. The inability of developing nations to fully benefit from AI—due to language barriers—could widen existing development gaps and hinder progress. As AI-driven economies advance rapidly, there is a real risk that middle and upper-middle-income countries could be increasingly sidelined, potentially leading to their gradual marginalization in the global economic landscape.
Digital Language Divide
Out of the approximately 7,000 languages spoken globally, only around 20 are considered "high-resource" due to their extensive digital presence. Yet, even within these high-resource languages, there is a considerable divide.
Take, for instance, the most spoken languages in 2024, which include Telugu and Turkish, both ranking within the top 20 globally.
Despite their widespread use, their digital footprints are disproportionately small. Turkish, for example, is the 18th most spoken language worldwide, yet it accounts for only 1.8% of internet content and a mere 0.8% of the data used in AI training datasets. This categorizes Turkish as a medium-resource language, significantly underrepresented in the digital world.
This disparity becomes even more apparent when we look at languages like Hindi and Chinese. Hindi, the third most spoken language globally, has a digital presence that is often overshadowed by English. Many Hindi speakers, being proficient in English, contribute to the digital footprint in English rather than in Hindi. Similarly, Chinese, despite its vast number of speakers, suffers from underrepresentation in global AI datasets due to the unique political and technological landscape in China, where internet access is controlled and limited. Telugu, meanwhile, is among the most spoken languages with only a small footprint in Common Crawl, a point we will return to below.
AI Training Data Explained
There is a clear and significant gap between the languages spoken globally, the languages used on the internet, and those represented in AI training data.
In Common Crawl, shown in the second graph, English makes up almost 50% of the corpus, with other (mostly related) European languages accounting for another 38%. That leaves only about 12% for every other language in the world. English dominates AI training datasets even more heavily: as seen above, it makes up a staggering 93% of GPT-3's training data.
This disparity is striking, especially considering that the GPT-3 family is the foundation of the GPT-3.5 model that ChatGPT launched with, and that ChatGPT is widely used by millions of people globally, many of whom rely on the free version of the tool.
This imbalance of resources creates a "digital language divide," reflecting and exacerbating the existing disparities. Whether a language is classified as high-resource or not, the dominance of English on the internet means that other languages suffer from underrepresentation in AI training datasets. This underrepresentation is part of a broader issue where having a large number of speakers does not necessarily translate into a significant digital footprint. This underscores the challenges that even widely spoken languages face in gaining adequate representation in the digital landscape, which in turn affects their visibility and efficacy in AI systems.
You May Say ChatGPT Is Multilingual
It’s true that with advancements like GPT-4, ChatGPT officially supports 58 languages. On the surface, this suggests a level of multilingualism that could bridge language gaps and provide access to AI’s capabilities across diverse linguistic communities. However, the reality is more complex.
While ChatGPT can process and generate text in multiple languages, its performance in non-English languages often falls short. This is primarily because the quality and quantity of data available for these languages are not as robust as those for English. (Before checking academic papers, I looked on Reddit to see how ChatGPT handles non-English tasks, and saw how poorly it performs. Honestly, I’ve noticed this firsthand: using it in Turkish can be frustrating. It often fumbles even intermediate tasks, which makes me hesitant to rely on it in my native language.)
Here is a summary of one research study on how ChatGPT performs across languages:
Adequate for Simple Tasks: The model performs relatively well on basic NLP tasks like Part-of-Speech tagging, even in lower-resource languages.
Competent in Medium-Resource Languages: While not perfect, ChatGPT shows a reasonable level of competence in medium-resource languages like Turkish, performing better than in low-resource languages.
Performance Gap:
Dependence on English Prompts: The model often requires English task descriptions to perform optimally in non-English languages. This dependency on English highlights a significant limitation, as it suggests that ChatGPT’s understanding and processing of non-English languages are less robust.
Suboptimal Zero-Shot Learning: In zero-shot learning scenarios, where the model encounters tasks it hasn’t been specifically trained on, ChatGPT underperforms compared to task-specific models.
Underperformance in Complex Tasks: ChatGPT struggles with high-level reasoning and semantic understanding in non-English languages, such as Common Sense Reasoning or Named Entity Recognition.
Inconsistent Performance Across Versions: Different versions of ChatGPT might yield varying results in multilingual tasks.
Recall Telugu, mentioned above:
ChatGPT-4 scored 85% on a common question-and-answer test in English. When taking the same test in Telugu, an Indian language spoken by nearly 100 million people, it scored just 62%, even though OpenAI says Telugu is officially supported.
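To make the size of this gap concrete, here is a tiny Python sketch that works out the absolute and relative drop between the two scores. The variable names are my own; only the 85% and 62% figures come from the text above.

```python
# Scores quoted above (percent correct on the same Q&A test).
english_score = 85
telugu_score = 62

# Absolute gap, in percentage points.
absolute_gap = english_score - telugu_score

# Relative drop: the share of English-level performance that is lost.
relative_drop = absolute_gap / english_score * 100

print(f"Absolute gap: {absolute_gap} percentage points")
print(f"Relative drop: {relative_drop:.1f}%")
```

In other words, the Telugu score is 23 points lower, which is roughly a 27% drop relative to the English result.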
So while ChatGPT’s multilingual capabilities are a step in the right direction, they are far from perfect. The uneven quality and performance across different languages underscore the ongoing challenge of creating AI systems that truly serve a global audience. To achieve genuine multilingualism in AI, much more needs to be done to ensure that all languages are represented with the same depth and accuracy as English.
Why Multilingual AI Is a Priority for Developing Countries
As AI becomes increasingly integrated into various systems worldwide, the divide between developed and developing nations threatens to widen. While AI promises tremendous advancements in economic growth, public services, and technological innovation, the benefits of these advancements are not evenly distributed. For developing countries, the urgency of addressing the language barriers in AI is not just a matter of inclusion—it is a critical necessity for their future development.
1. Economic Growth: The Leapfrog Opportunity
GenAI has emerged as a powerful catalyst for economic growth and innovation, with the potential to significantly boost global GDP. According to EY, generative AI could increase global GDP by $1.7 to $3.4 trillion over the next decade, influencing over half of the global workforce. However, the realization of this potential largely depends on the ability of nations to adopt and integrate AI technologies effectively.
For many developing countries, AI presents a unique opportunity to leapfrog traditional stages of economic development. For instance:
Agriculture: AI-powered tools can help farmers increase crop yields through precision agriculture, better resource management, and efficient supply chain operations. However, for these technologies to be truly effective, they need to be accessible in local languages. Without localized AI tools, the agricultural sector in developing countries may fail to capitalize on these advancements, leading to a widening productivity gap.
Industrial and Service Sectors: In industries that form the backbone of many developing economies, such as manufacturing and services, AI can streamline operations, improve customer service, and reduce costs. Again, the effectiveness of these technologies hinges on their ability to operate in the languages spoken by the workforce. Without multilingual AI, these sectors risk falling behind, unable to compete with counterparts in developed nations.
2. Public Services: Transforming Education and Healthcare
Public services are critical to a country’s development and the well-being of its citizens. In developing countries, AI has the potential to revolutionize these services, particularly in education and healthcare:
Education: AI-powered tools can personalize learning, addressing the educational needs of students at various levels and reducing the teacher-student gap prevalent in many underserved areas.
Globally, an estimated 58 million additional teachers are needed to meet current demands. In countries with high pupil-teacher ratios, AI that provides personalized instruction in a student’s native language could drastically improve educational outcomes.
Healthcare: WHO recommends at least 45 doctors, nurses, and midwives for every 10,000 people. However, many developing countries only have a fraction of this number. AI-powered diagnostics and telemedicine can break down barriers to healthcare access, offering remote communities a chance at quality services previously beyond reach.
But for these AI-driven healthcare solutions to be effective, they must be able to understand and interact in the local languages of the patients they serve.
The dilemma is clear: AI holds the potential to significantly reduce disparities, yet the very challenges that make AI adoption essential in developing regions—such as inadequate infrastructure and insufficient language-specific data—also impede these countries from fully utilizing AI. This lack of representation in AI training data limits the effectiveness of AI tools, putting these nations at risk of falling further behind in the global technological race. Therefore, developing countries should prioritize building local NLP applications to contribute to global multilingual efforts, ensuring their languages and cultures are adequately represented.
3. Preventing the Deepening of Global Inequalities
When AI models are predominantly trained in English, they inherently favor English-speaking cultures.
Language as a Cultural Right: Language is more than just a means of communication; it's a fundamental aspect of cultural identity.
When AI models are used to create content in languages or cultures they are not adequately trained on, they can inadvertently misrepresent or trivialize complex cultural narratives.
They risk marginalizing entire communities, effectively erasing their cultural expressions from the digital world. The cultural practices and expressions tied to other languages may diminish, leading to a loss of cultural diversity and, ultimately, to a homogenization of culture.
The Risk of Digital Colonialism: English-centric AI can lead to a form of cultural imperialism, where local languages and knowledge systems are undervalued or ignored. Ensuring that AI respects and includes diverse languages is crucial to preventing this new form of inequality.
Exclusion from AI Benefits: If AI systems remain predominantly English-centric, the vast potential of AI to meet basic human needs such as education, healthcare, and economic opportunity may not reach the people who need it most.
Global Efforts to Make AI Multilingual
Global efforts are essential to creating ethical, multilingual AI that serves all communities equitably. Many governments recognize this need and are investing in initiatives to broaden AI’s linguistic capabilities, understanding that true inclusivity requires local involvement and oversight. It’s crucial for nations to actively participate in bridging the digital gap by ensuring AI systems are developed and managed by those who understand the local languages and cultural contexts. This approach will help ensure that AI technologies are not only multilingual but also ethically sound and representative of diverse communities.
Türkiye: TÜBİTAK BİLGEM has initiated the development of a "Turkish Large Language Model."
The model is being developed using a comprehensive dataset of Turkish texts gathered from the internet and other digital sources. This process includes a specialized preprocessing phase to account for the nuances of the Turkish language and the creation of a Turkish-specific tokenizer, which enhances the model's effectiveness in natural language processing tasks such as question answering, summarization, text generation, and classification.
India: The Indian government is developing Bhashini, an AI translation system to improve human-machine interaction in its 22 official languages, as seen in its Technology Development for Indian Languages (TDIL) initiative.
UAE: Launched in 2023, “Jais” AI is an Arabic language model developed by G42, capable of generating high-quality text in Arabic, including regional dialects. The development of Jais involved digitizing vast amounts of Arabic text, addressing the scarcity of high-quality digital resources in the language.
New Zealand: Te Hiku Media is using AI to aid in the preservation and revitalization of the Māori language, achieving a 92% accuracy rate in automatic speech recognition models developed with Nvidia's assistance.
Africa: Masakhane, a grassroots organization, is working to enhance NLP research in African languages, addressing the underrepresentation of the continent's roughly 2,000 languages in technology.
Nigeria: The Nigerian government has launched its first multilingual LLM, which is being trained on five low-resource languages and accented English to strengthen language representation in AI solutions.
EU: The Common European Language Data Space provides a platform to facilitate the exchange of language data and resources among stakeholders, driving innovation across diverse linguistic contexts.
Denmark: The government has committed EUR 4 million to enhance Danish language resources, as detailed by the Ministry of Foreign Affairs of Denmark.
These efforts, as highlighted in a 2023 OECD report, illustrate the growing recognition of the need for multilingual AI. However, the report also points out that the high costs and resource demands associated with developing and deploying language models remain significant barriers, particularly for minority languages.
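A quick aside on why a language-specific tokenizer, like the one in the Turkish project above, can matter. Byte-level tokenizers, such as the byte-level BPE used by the GPT family, start from raw UTF-8 bytes, and letters outside basic ASCII take more than one byte each. The Python sketch below only illustrates that byte-level effect. The helper name and sample sentences are my own, and it does not reproduce TÜBİTAK's tokenizer or any real model's vocabulary.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per character of text."""
    return len(text.encode("utf-8")) / len(text)


# Illustrative samples (my own, not from any benchmark).
samples = {
    "English": "The weather is very nice today.",
    "Turkish": "Bugün hava çok güzel.",
    "Telugu": "తెలుగు",
}

for language, sentence in samples.items():
    ratio = utf8_bytes_per_char(sentence)
    print(f"{language}: {ratio:.2f} bytes per character")
```

ASCII-only English text needs exactly one byte per character, Turkish needs slightly more because of letters like ü and ç, and every character in the Telugu script needs three. A tokenizer whose vocabulary was not built with a language in mind tends to split such text into more pieces, which is one reason projects like these train their own.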
Actionable Recommendations: Moving from Theory to Practice
For developing countries, the stakes in creating multilingual AI are particularly high. Addressing the challenges of linguistic diversity in AI is not just about inclusion; it's about ensuring that these nations can fully participate in and benefit from the global digital economy. To bridge the existing gaps and empower developing countries through AI, the following priority actions are essential:
1. Resource Allocation: Prioritizing Local Innovation
In developing countries, resource constraints often limit the ability to invest in cutting-edge AI research. Governments and international organizations should prioritize funding specifically earmarked for multilingual AI research and development within these regions. This includes grants and subsidies aimed at fostering local innovation, building research capacity, and supporting startups focused on AI solutions tailored to local languages and contexts.
2. Dataset Creation: Building Indigenous Language Resources
Creating high-quality datasets for under-resourced languages is critical for the effectiveness of AI in developing countries. Governments should work closely with local communities, educational institutions, and international partners to collect and curate data in indigenous languages. Engaging native speakers in this process ensures that the datasets are accurate and culturally relevant. These datasets can then be shared across borders, enabling broader collaboration and use in global AI models.
3. Safety and Multilingual Approaches: Ensuring Cultural Relevance
Developing countries face unique cultural and social challenges that must be reflected in AI systems. Therefore, it's essential to develop safety evaluation tools that consider the specific linguistic and cultural contexts of these nations. By ensuring that AI systems are safe and relevant in local languages, we can prevent the marginalization of communities and reduce the risk of digital exclusion.
4. Knowledge Sharing and Capacity Building: Fostering Local Expertise
To build sustainable AI ecosystems, developing countries need to invest in knowledge sharing and capacity building. This can be achieved through partnerships with international organizations, academic institutions, and the private sector. These partnerships should focus on training local researchers and developers, providing them with the skills and resources needed to create AI systems that reflect the linguistic and cultural diversity of their countries. Additionally, fostering a culture of transparency and collaboration will help build trust and ensure that AI development benefits all sectors of society.