By smaller languages, I mean languages that are spoken by fewer people and have less digital and printed content available in them, also called low-resource languages in LLM jargon.

LLMs and Smaller Languages?

Large language models are essentially (very good) probabilistic machines that predict the next token given a set of input tokens. They are able to do so (besides the research and engineering work) because they have been trained on trillions of human-generated tokens produced throughout our existence. However, the less of a particular type of training data an LLM has seen, the worse it is at reasonably predicting the next token for that kind of text. This means that LLMs are naturally better at popular languages than at low-resource ones, assuming the distribution of the training set is representative of the distribution of existing content across languages.
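To make the "seen less, predicted worse" point concrete, here is a minimal sketch (not from the original experiment) that compares how surprised an English-heavy open model is by an English sentence versus a Nepali one. It assumes the Hugging Face transformers library and the openly available GPT-2 model, which was trained almost entirely on English; the sentences are arbitrary examples of my own.

```python
# Minimal sketch: per-token perplexity of GPT-2 (trained almost entirely on
# English) on an English vs. a Nepali sentence. Model choice and sentences
# are illustrative assumptions, not from the original post.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # Perplexity = exp(mean cross-entropy); lower means the model finds the
    # text more predictable, i.e. it has "seen" more text like it in training.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

print(perplexity("The weather is nice today."))   # English: low perplexity
print(perplexity("आज मौसम राम्रो छ।"))               # Nepali: far higher perplexity
```

The exact numbers don't matter; the gap between the two is the point, and it mirrors the gap in how much of each language the model saw during training.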

According to w3techs, English dominates the web with almost 50% of all websites, with Spanish and German following at around 6% each. It's hard to know whether this matches the distribution of the data actually used to train LLMs, or whether the methodology is truly representative of the underlying distribution, but it's easy to imagine that only a handful of languages dominate the digital content that's available.

The skew might be even more pronounced in training datasets. GPT-3, for instance, had 93% of its training dataset documents in English. Llama 3's training mix had around 5% non-English data covering 30 languages (Meta says over 5% but doesn't say how much more). BLOOM is a more transparent effort at including multiple languages, with curated data for smaller languages, but it still can't get around the basic problem of there not being enough training content. The scale needed to train LLMs is huge, both in training data and in compute, and smaller languages will naturally be at a disadvantage.

So LLMs unsurprisingly perform worse in smaller languages, with numerous studies and multilingual benchmarks showing this. According to MMLU-ProX, LLMs perform markedly worse in low-resource languages, with gaps of up to 24%. The performance disparity is obvious even anecdotally when I ask my friends and colleagues about their LLM use. Even a simple test (ironically, I asked Claude for the right benchmark) already shows the difference. I did the following with Claude Sonnet 4.5:

[Screenshot: llm-small-lang-en]

And then asked the same question but in my native language (Nepali):

[Screenshot: llm-small-lang-nepali]

Anyone who speaks Nepali would immediately understand that it's the trophy that is too big.

Similarly, here is a question that requires deeper knowledge and analytical capability, asked in English:

[Screenshot: ktm-sea-en]

And the same question in Nepali (I then asked it to translate the answer into English):

[Screenshot: ktm-sea-nepali]

The answer in English clearly shows more depth, nuance, and reasoning, whereas the answer in Nepali stays at a surface level with generic lists.
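If you want to reproduce this kind of side-by-side comparison outside the chat interface, a rough sketch with the anthropic Python SDK could look like the following. The model alias and the prompts here are my own assumptions for illustration, not the exact ones used in the screenshots above.

```python
# Rough sketch: ask the same question in English and in Nepali via the API
# and compare the answers. Model alias and prompts are assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

prompts = {
    "English": "The trophy doesn't fit in the suitcase because it is too big. "
               "What is too big?",
    "Nepali": "ट्रफी सुटकेसमा अटाएन किनभने त्यो धेरै ठूलो थियो। के धेरै ठूलो थियो?",
}

for language, prompt in prompts.items():
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumed alias for Claude Sonnet 4.5
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {language} ---")
    print(response.content[0].text)
```

Running the same prompt in both languages, ideally across a larger set of questions, makes the gap easier to quantify than eyeballing individual chats.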

Now, with LLM adoption growing at such a fast pace across industries and among consumers, how much will low-resource languages and their speakers be hurt, especially as knowledge concentrates more and more in the larger languages? LLMs will be used to produce ever more content, and for low-resource languages that LLM-produced content will be worse, and it will then feed back into future training data, compounding the quality gap. People might abandon low-resource languages altogether, especially if they become economically unviable. Languages embed culture, and the distillation of low-resource languages by LLMs could deepen the cultural distillation that globalization has already set in motion.

I am not yet aware of a viable solution to these potential outcomes. AI companies have no clear economic incentive to dedicate research, resources, and compute to low-resource languages, and the languages themselves have an inherent scarcity problem that rapid LLM adoption might make worse. The same analogy applies to programming languages: niche programming languages suffer the same fate, making it hard for small existing languages and for new ones to flourish when every developer has an LLM assisting them.

There is a potential counterargument worth considering: LLMs’ translation capabilities might actually help preserve smaller languages. High-quality translation could make content in low-resource languages more accessible to the wider world, and LLMs could help document and preserve endangered languages by assisting linguists and communities in creating dictionaries, grammar guides, and educational materials more efficiently. If translation technology continues to improve, it might reduce the need for everyone to converge on a handful of dominant languages for economic participation.