LLM: A Dying Language Saviour?

In the excitement around LLMs there’s been a focused and fascinating discussion about Low Resource Languages (LRL).

The interest in LRLs is driven primarily by the monetization opportunity: LLM vendors want to expand chatbot and image generation capabilities into large markets where many people speak the language but little of it is written on the Internet.

Dying Languages vs Low Resource

There’s a difference between an LRL and a dying language. Both have little written text on the Internet, and that text is what’s needed to build the models. The striking difference is the number of speakers: a dying or endangered language may have only a handful of fluent speakers left.

For example, Thai is considered an LRL even though it has about 20 million native speakers. Compare that to Munsee, which had only two elderly speakers as of 2018.

So from a business perspective the opportunity is clear. If an LLM vendor can extend its model to languages that are widely spoken but, compared to English, thinly represented on the Internet, such as Vietnamese, Swahili, Hindi, Thai, Urdu and Bengali, it can widen its opportunities for monetization.

But where does this leave us with endangered or dying languages?

AI for Good – Protecting Language

This is an opportunity for the leading LLM vendors to invest for the good of humankind. We have the technology and the ability to preserve these endangered indigenous languages for posterity, keeping a part of our human history alive for future generations.

Real investment is required. We would need to fine-tune an existing multilingual model, for example BLOOM from the Hugging Face-led BigScience project, on a set of texts that includes stories, songs and other cultural artifacts, to create a model that can generate new content in the language.
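Before any fine-tuning can happen, those stories and songs have to be gathered into a clean corpus. Here is a minimal sketch of that first step, using only the Python standard library. The `Artifact` type, the `build_corpus` function and the 20% held-out fraction are hypothetical names and values chosen for illustration, not part of BLOOM or any library. With so little text available, removing duplicate copies and setting aside a few texts for evaluation matters more than it would for a large language.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """One cultural text. Hypothetical type, for illustration only."""
    kind: str  # e.g. "story" or "song" (assumed labels)
    text: str


def build_corpus(artifacts, holdout_fraction=0.2, seed=0):
    """Deduplicate texts, then split them into training and held-out sets."""
    seen = set()
    unique = []
    for item in artifacts:
        # Normalise whitespace so trivially different copies count once.
        key = " ".join(item.text.split())
        if key and key not in seen:
            seen.add(key)
            unique.append(Artifact(item.kind, key))

    # A fixed seed keeps the split reproducible between runs.
    rng = random.Random(seed)
    rng.shuffle(unique)

    # Hold out at least one text whenever there are two or more to split.
    if len(unique) > 1:
        n_holdout = max(1, int(len(unique) * holdout_fraction))
    else:
        n_holdout = 0
    return unique[n_holdout:], unique[:n_holdout]
```

The training portion would then go to whatever fine-tuning tooling is chosen, while the held-out texts give fluent speakers a fixed set against which to judge what the model generates.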

Transformer models can help revitalize endangered and dying languages. For example, by training these models on recordings of speakers, it’s possible to create text-to-speech and speech-to-text applications that not only power chatbots but also help learners practice their language skills.

Taken together, these efforts can do more than save and protect endangered and dying languages. We in the AI industry can also support indigenous communities by creating applications that improve cross-lingual communication.