On October 17, 2024, the OpenGPT-X Team released machine-translated versions of five well-known benchmarks in 20 European languages, enabling consistent and comparable evaluation of large language models (LLMs).
Using these benchmarks, the team evaluated 40 state-of-the-art models across these languages, providing valuable insights into their performance.
The OpenGPT-X Team highlighted the challenges of evaluating LLM performance consistently across languages. According to the researchers, “evaluating LLM performance in a consistent and meaningful way […] remains challenging, especially due to the scarcity of language-parallel multilingual benchmarks.”
They also noted the high costs and time required to create custom benchmarks for each language, which has led to a “fragmented understanding of model performance” across different languages. “Without comprehensive multilingual evaluations, comparisons between languages are often constrained,” they explained, particularly for languages beyond the widely supported English, German, and French.
To tackle this, the team employed machine-translated versions of widely used datasets, aiming to assess whether such translations could provide scalable and uniform evaluation results.
Machine-Translated Benchmarks as a Reliable Proxy
Specifically, they translated five well-known datasets — ARC for scientific reasoning, HellaSwag for commonsense reasoning, TruthfulQA for factual accuracy, GSM8K for mathematical reasoning and problem-solving abilities, and MMLU for general knowledge and language understanding — from English into 20 European languages using DeepL.
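To illustrate what such a translation step can look like in practice, the sketch below runs a benchmark's text fields through the DeepL API using its official Python client. The article does not detail the team's actual pipeline, so the dataset identifier, field layout, and helper function here are illustrative assumptions only.

```python
# Minimal sketch: translating a multiple-choice benchmark's text fields with DeepL.
# Assumes the official `deepl` Python client and the Hugging Face `datasets` library.
# The dataset name ("allenai/ai2_arc") and field handling are illustrative, not the
# OpenGPT-X team's actual tooling.
import deepl
from datasets import load_dataset

translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")  # assumption: a valid API key

def translate_arc_sample(target_lang: str, limit: int = 5) -> list[dict]:
    """Translate the question and answer choices of a few ARC-Challenge items."""
    arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")
    translated = []
    for item in arc.select(range(limit)):
        question = translator.translate_text(
            item["question"], source_lang="EN", target_lang=target_lang
        ).text
        choices = [
            translator.translate_text(c, source_lang="EN", target_lang=target_lang).text
            for c in item["choices"]["text"]
        ]
        # The answer key is language-independent, so it is carried over unchanged.
        translated.append(
            {"question": question, "choices": choices, "answerKey": item["answerKey"]}
        )
    return translated

# Example: a handful of ARC-Challenge items rendered in German.
print(translate_arc_sample("DE")[0])
```

In a full pipeline this loop would run over every split and every target language, with the translated records written back out in the original benchmark format so existing evaluation code can consume them unchanged.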
“Our goal is to determine the effectiveness of these translated benchmarks and assess whether they can substitute manually generated ones,” the team stated.
Their findings suggest that machine-translated benchmarks can serve as a “reliable proxy” for human evaluation in various languages.
Top Performers and Language Trends
Using the translated datasets, along with the multilingual FLORES-200 benchmark for translation tasks, the team evaluated 40 models across 21 European languages.
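For readers unfamiliar with how benchmarks such as ARC, HellaSwag, or MMLU are typically scored, the sketch below shows a common approach to multiple-choice evaluation: the model's log-likelihood of each answer choice, conditioned on the question, is compared and the highest-scoring choice is taken as the prediction. The model name, prompt format, and example item are assumptions for illustration; the article does not specify the team's evaluation harness.

```python
# Minimal sketch of log-likelihood scoring for multiple-choice benchmarks.
# Model and prompt format are placeholders, not the team's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def choice_logprob(question: str, choice: str) -> float:
    """Sum of token log-probabilities of `choice` conditioned on `question`."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Positions 0..n-2 predict tokens 1..n-1; score only the choice tokens.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    choice_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(
        log_probs[pos, full_ids[0, pos + 1]].item() for pos in choice_positions
    )

def predict(question: str, choices: list[str]) -> int:
    """Return the index of the answer choice the model finds most likely."""
    scores = [choice_logprob(question, c) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)

# Example item in German, as it might appear in a machine-translated benchmark.
idx = predict(
    "Welches Organ pumpt Blut durch den Körper?",
    ["Die Lunge", "Das Herz", "Die Leber", "Die Niere"],
)
print(idx)  # expected: 1 ("Das Herz")
```

Because the scoring only compares likelihoods of fixed answer strings, the same procedure works unchanged for any language the translated benchmarks cover, which is what makes a language-parallel evaluation across dozens of models tractable.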
They identified Meta’s Llama-3.1-70B-Instruct and Google’s Gemma-2-27b-Instruct as the top-performing models across multiple tasks. Llama-3.1-70B-Instruct stood out in knowledge-based tasks, such as answering general questions (MMLU) and solving math problems (GSM8K), as well as in commonsense reasoning (HellaSwag) and translation. Meanwhile, Gemma-2-27b-Instruct excelled in scientific reasoning (ARC) and giving factually accurate answers (TruthfulQA).
Smaller models like Gemma-2-9b-Instruct, though consistent on general tasks, struggled in specialized domains. The researchers noted, “the capacity of small models might not allow for reliable performance on all languages and specialized knowledge.”
Additionally, high-resource languages like English, German, and French consistently saw better results, while medium-resource languages, such as Polish and Romanian, displayed weaker performance across tasks.
The results are publicly available through the European LLM Leaderboard, a multilingual evaluation platform.
The team emphasized the broader impact of their work: “By ensuring that LLMs can perform well in languages beyond English or other high-resource languages, we contribute to a more equitable digital landscape.”
To encourage further research, the team has made the machine-translated datasets available to the NLP community. “We aim to foster further research and development in multilingual LLM evaluation, driving improvements in cross-lingual NLP applications,” they concluded.