How large language models process multilingual data differently
ChatGPT has taken the world by storm and re-ignited interest in Large Language Models (LLMs). While ChatGPT is free as a demo, models that are ready for general use, like GPT-3, charge by usage. Usage is measured in tokens, the units a model breaks text into when processing it. When typing a phrase, you can preview how many tokens it uses on the tokenizer page.
Let’s type in a phrase in English and see how many tokens it uses.
Now in French.
How about simplified Chinese?
That’s quite the variance! But why is the equivalent sentence using such a range of tokens?
Tokenization, parsing language into byte-sized pieces
Tokenization is a way to group characters and words together into common patterns. There are many techniques to do so, each with their benefits and drawbacks. Tokenizers can be shared across various models, but typically are specialized to the task a researcher is trying to optimize.
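One plausible reason token counts diverge between scripts: GPT-2-style tokenizers start from UTF-8 bytes before merging them into larger pieces, and characters outside basic Latin take more bytes each. The sketch below only counts characters and bytes with the Python standard library. It is not a tokenizer, and the sample words are illustrative rather than taken from any dataset.

```python
from typing import Tuple


def utf8_stats(text: str) -> Tuple[int, int]:
    """Return (number of characters, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


# Illustrative words only (assumption: not drawn from the article's data).
samples = {
    "English": "hello",
    "Chinese": "你好",
    "Malayalam": "നന്ദി",
}

for language, word in samples.items():
    chars, nbytes = utf8_stats(word)
    print(f"{language}: {chars} characters, {nbytes} bytes")
```

A byte-level tokenizer that has learned few merges for a script can fall back to something close to one token per byte, which is where the gap tends to open up.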
Above we saw an anecdotal example of the number of tokens used for a sentence, so let’s try a more comprehensive dataset. We can look at MASSIVE, a dataset released by Amazon. MASSIVE contains 1 million phrases, or more precisely, utterances (commands for accomplishing a task). The same utterances are translated into 51 languages, making it a prime candidate for our experiment.
Below we use 8 different tokenizers from common language models and visualize how many tokens all these utterances use.
Let’s walk through the plots. On the X axis we have the name of the tokenizer and on the Y axis we have the number of tokens used. We can see that GPT and Facebook’s OPT models have the most variance and seem to be optimized for English. Other models do a better job of balancing their token usage across languages.
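The counting step can be sketched as a small helper that sums token counts per language for any tokenizer. To keep it runnable offline, the example uses a whitespace splitter as a stand-in. The helper name `total_tokens_by_language` and the sample utterances are hypothetical and do not come from the original notebook.

```python
from typing import Callable, Dict, List, Sequence


def total_tokens_by_language(
    tokenize: Callable[[str], Sequence],
    utterances: Dict[str, List[str]],
) -> Dict[str, int]:
    """Sum the number of tokens produced for each language's utterances."""
    return {
        language: sum(len(tokenize(text)) for text in texts)
        for language, texts in utterances.items()
    }


# Stand-in tokenizer (assumption): splits on whitespace only.
# A real run might instead pass something like
#   lambda text: hf_tokenizer(text)["input_ids"]
# with hf_tokenizer loaded via transformers.AutoTokenizer.from_pretrained(...).
whitespace_tokenize = str.split

# Made-up sample utterances, not taken from MASSIVE.
sample = {
    "en-US": ["wake me up at nine", "play some jazz"],
    "fr-FR": ["réveille-moi à neuf heures", "joue du jazz"],
}

counts = total_tokens_by_language(whitespace_tokenize, sample)
print(counts)
```

Passing the tokenizer in as a callable keeps the aggregation identical across all tokenizers being compared, so only the tokenization itself varies.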
If we look at the ratio between the largest and smallest token numbers we can start to get an idea of how much cost can become a factor.
More than 15.77x for GPT — that’s where the title came from!
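The ratio itself is just the largest per-language token total divided by the smallest. A minimal sketch, using made-up totals rather than the measured ones:

```python
def token_ratio(totals):
    """Return (max/min ratio, language with most tokens, language with fewest)."""
    most = max(totals, key=totals.get)
    least = min(totals, key=totals.get)
    return totals[most] / totals[least], most, least


# Hypothetical totals for one tokenizer (assumption: placeholder numbers).
hypothetical_totals = {"lang_a": 1_000, "lang_b": 4_500, "lang_c": 12_000}

ratio, most, least = token_ratio(hypothetical_totals)
print(f"{most} uses {ratio:.2f}x the tokens of {least}")
```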
Now that we have some data to work through, let’s see how much it would cost to do a task. If we just ran the utterances through GPT-3 with no prompt, how much would it cost for each language? GPT-3’s pricing is openly available, and the version of GPT-3 that is commonly used is DaVinci.
Multiplying our token cost by the number of tokens we get $27.98 for the most tokenized language vs $1.76 for the cheapest. That’s quite a difference. Now assume we added a prompt to each of the utterances to accomplish a task, such as “rewrite the following sentence into a nicer tone”. We also need to account for the response, since that’s part of the token count.
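The arithmetic behind these figures is the token count divided by 1,000, times the price per 1,000 tokens. In the sketch below the price and every token count are placeholder assumptions, not the values used in the experiment, so check the current pricing page before relying on any number. It also shows how an instruction prefix and the response add to the billed total.

```python
def request_cost(utterance_tokens, price_per_1k, prompt_tokens=0, response_tokens=0):
    """Estimated cost in dollars for one request; all billed tokens are summed."""
    billed = utterance_tokens + prompt_tokens + response_tokens
    return billed / 1000 * price_per_1k


PRICE_PER_1K = 0.02  # assumed price in USD per 1,000 tokens

# Hypothetical token counts for the same utterance in two languages.
for language, tokens in {"language A": 10, "language B": 150}.items():
    bare = request_cost(tokens, PRICE_PER_1K)
    # Assumption: the rewritten response is about as long as the input utterance.
    with_task = request_cost(
        tokens, PRICE_PER_1K, prompt_tokens=12, response_tokens=tokens
    )
    print(f"{language}: ${bare:.5f} bare, ${with_task:.5f} with prompt and response")
```

Because the response is written in the same language as the input, the multiplier applies to both sides of the request, not only to the prompt.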
For this experiment, we use the first 51 utterances in the test portion of MASSIVE for English and Malayalam. The resulting usage shows a 15.69x difference, in line with our initial tokenization experiment.
Implications beyond cost
As LLMs become more widely used, the disparity between English and non-English writing will only grow. Accuracy has been a standard concern, as a smaller corpus of text is used for these languages and most benchmarks measure English performance. Bias and hate speech have been other concerns, with fewer native speakers available to read through training data and confirm its validity for use.
If we put accuracy aside and look purely at increased token usage, we get four additional impacts: higher costs, longer wait times, less expressive prompts and more limited responses. Many underrepresented languages are spoken and written in the Global South, and with token usage currently pegged to the US Dollar, LLM API access will be financially inaccessible in many parts of the world. This likely means an inability to benefit from developments in the space until costs come down. For the same reason, startups that prompt users in a local language will likely be undercut by those that prompt users in English, French, Spanish or Chinese.
Secondly, certain tasks will be infeasible due to the time it takes to generate extra tokens. GPT-based models predict one token at a time, meaning that if many additional tokens need to be generated, the responses will be much slower. Certain tasks like real-time search or chatbot support will be too slow in these languages, where an application that takes 200ms now might take 3 seconds.
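As a rough illustration of that slowdown: if generation time grows about linearly with the number of tokens produced, the 15.77x token ratio measured earlier turns a 200 ms response into roughly 3 seconds. The linear model and the function name are simplifying assumptions. Real latency also depends on prompt processing, batching and network overhead.

```python
def scaled_latency_ms(baseline_ms, token_ratio):
    """Estimate latency when the same reply needs token_ratio times more tokens."""
    return baseline_ms * token_ratio


# 200 ms baseline and the 15.77x ratio are the figures quoted in the article.
estimate = scaled_latency_ms(200, 15.77)
print(f"{estimate / 1000:.1f} seconds")
```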
Thirdly, elaborate prompts will be impossible given token limits. Currently GPT-3 is limited to 2048 tokens for prompts. Given that prompt lengths are limited for GPT-based models, tasks that require longer prompts, like summarization, are greatly affected.
Finally, response limitations are also at play, with GPT-3 only able to return up to 4000 tokens for the prompt + response. In this specific example it is equivalent to generating a tweet in one language and a medium-sized blog post in another.
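To make the tweet-versus-blog-post comparison concrete, the sketch below estimates how many words of output fit in whatever is left of a shared prompt-plus-response window. The tokens-per-word figures and prompt sizes are invented for illustration, and the helper name is hypothetical.

```python
def words_that_fit(context_limit, prompt_tokens, tokens_per_word):
    """Approximate number of response words that fit after the prompt."""
    remaining = max(context_limit - prompt_tokens, 0)
    return int(remaining // tokens_per_word)


LIMIT = 4000  # combined prompt + response limit quoted above

# Same request in two languages; per-word token costs are assumptions.
# The prompt itself is also larger in the more heavily tokenized language.
cheap = words_that_fit(LIMIT, prompt_tokens=100, tokens_per_word=1.5)
costly = words_that_fit(LIMIT, prompt_tokens=1500, tokens_per_word=15.0)
print(f"lightly tokenized language: about {cheap} words")
print(f"heavily tokenized language: about {costly} words")
```

The heavily tokenized language loses twice: its prompt consumes more of the window, and each remaining token buys less text.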
Conclusion — Why is tokenization optimized for English?
Now given the five implications described above, why is tokenization still so focused on English? The answer lies in the contents of the internet, which these models are trained on. The goal of the tokenizer is to create expressive patterns for the model that compress text down into small chunks and allow the model to be more accurate with subtle differences. Unfortunately, most benchmarks and training data are in English, which leads to English-based optimizations. Other models, however, do a better job of having a more representative tokenizer for handling multilingual tasks. Out of the eight models we saw in the experiment, five had relatively narrow spreads in their tokenizers.
All is not lost: research and engineering continue to bring us closer to more equal outcomes. One of the models listed above, NLLB (No Language Left Behind), has been open sourced by Facebook and supports translation across 200 languages. Unsurprisingly, it has the best tokenization ratio mentioned in figure 3.
Costs for language models have also come down greatly, with a 66% cut coming this year from OpenAI. Commercial and open source models continue to get better at handling longer forms of text, and are becoming easier to run independently. The hardware for running these models also continues to become faster and cheaper. In addition to economic tailwinds, there should be a concerted effort to create models that are more accessible in every language.
- If you’re interested in exploring the charts and data more, check out this Colab notebook where you can run the code yourself.
- All tokenizers are downloaded from Hugging Face.
- The models include: “xlm-roberta-base”, “bert-base-uncased”, “sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2”, “google/mt5-base”, “facebook/nllb-200-distilled-600M”