
A surprising language beats English and Chinese in LLM tests, based on new academic study

According to the study in question, Polish leads all languages with an average accuracy of 88% at long-context scales. English ranks at number 6 on that scale. Pictured - a stock image of ChatGPT mobile. (Image source: Zulfugar Karimov on Unsplash)
A new multilingual benchmark shows Polish outperforming English and Chinese in long-context LLM tests, highlighting how script and tokenization shape accuracy. The results suggest that language structure matters far more as context windows grow.

A new multilingual study that evaluates how large language models handle long documents has produced an unexpected result: Polish, not English or Chinese, shows the highest accuracy when context windows stretch to 64,000 tokens and beyond. The findings come from the OneRuler benchmark introduced in a COLM 2025 paper, which tested 26 languages across retrieval and aggregation tasks.

The researchers compared model accuracy at multiple context lengths and found a clear shift once sequences became longer. According to the results chart (on page 6), Polish leads all languages with an average accuracy of 88% at long-context scales. English drops to sixth place, and Chinese ranks among the bottom four.

(Image source: One ruler to measure them all / COLM 2025)

The study suggests that the disparity may be tied to tokenization efficiency and script-based differences rather than simply training data volume. Languages using Latin-based scripts - such as Polish, French and Spanish - consistently performed better than those using logographic or abugida writing systems. Chinese, Korean, Tamil and others showed only moderate accuracy even at shorter contexts, and their accuracy deteriorated further as sequences grew longer. This reversal of the expected rankings is striking, because most widely deployed LLMs are trained primarily on English-heavy datasets. Yet the paper’s results indicate that once models must search, recall or summarize information buried deep inside long documents, structural aspects of the language take precedence over dataset prevalence.
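To make the tokenization argument concrete: the same sentence can consume very different numbers of tokens depending on the language, which directly affects how much text fits into a fixed context window. The Python sketch below is purely illustrative - it uses OpenAI's open-source tiktoken library with the cl100k_base encoding, which is not necessarily what the models in the study use - but it demonstrates the kind of per-language token-count gap the paper points to.

```python
# Illustrative only: compares token counts for the same sentence across
# languages using tiktoken's cl100k_base encoding. The models tested in
# the OneRuler study may use different tokenizers with different ratios.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "The needle is hidden somewhere in this very long document.",
    "Polish":  "Igła jest ukryta gdzieś w tym bardzo długim dokumencie.",
    "Chinese": "针藏在这份很长的文件中的某个地方。",
}

for lang, text in samples.items():
    # Fewer tokens per sentence means more usable text per context window.
    print(f"{lang:8s} {len(enc.encode(text)):3d} tokens")
```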

Other findings in the benchmark also support this interpretation. The performance gap between the strongest and weakest languages grows sharply as the context expands - from 11% at 8,000 tokens to 34% at 128,000 tokens. Another detail from the study shows how sensitive these tests can be to small instruction changes. For example, simply allowing the model to answer "none" if a target string is absent caused accuracy in English to drop by 32% at 128k tokens, as visible on page 2.
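For a sense of how small that wording change is, here is a hypothetical sketch of the two instruction variants - the exact prompt text used by OneRuler may differ - showing that a single added sentence separates the two test conditions.

```python
# Hypothetical sketch of the two needle-retrieval instruction variants
# described above; the exact OneRuler prompt wording may differ.
def build_prompt(haystack: str, allow_none: bool) -> str:
    instruction = "Find the special magic number mentioned in the document."
    if allow_none:
        # This single extra sentence is the change associated with the
        # 32% accuracy drop in English at 128k tokens.
        instruction += ' If no such number exists, answer "none".'
    return f"{instruction}\n\n{haystack}\n\nAnswer:"

print(build_prompt("... long document text ...", allow_none=True))
```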

The benchmark also compares model families, but its broader takeaway is that long-context evaluation cannot rely solely on English testing, and that performance generalizations across languages may be misleading if script and tokenization effects are ignored. As context windows get larger, linguistic differences grow more important, not less - and English’s dominance in LLM benchmarks may no longer be representative once sequence lengths climb into the tens of thousands.

(Image source: One ruler to measure them all / COLM 2025)

Anubhav Sharma, 2025-11-23 (Update: 2025-11-23)