In a March 6, 2025 paper, researchers from China-based institutions the Shanghai AI Laboratory, Westlake University (Hangzhou), and Northeastern University (Shenyang) demonstrated that large language models (LLMs) still suffer from “translationese” — overly literal and unnatural translations that deviate from native linguistic norms.
They explained that while previous research has explored translationese in traditional machine translation (MT) systems, there has been limited work on whether this issue persists in LLMs.
Given that LLMs are trained on vast corpora of native-language text, one might expect them to be less susceptible to translationese and more capable of producing natural translations. However, their study reveals the opposite: LLMs still produce “unexpected” unnatural translations, and translationese remains a “persistent challenge” for AI translation.
The researchers evaluated various LLMs, including GPT-4, GPT-3.5, ALMA-7B/13B, and Mistral-7B, in the English-Chinese and German-English language pairs. They found that all LLMs exhibit “significant translationese errors” in both language pairs.
Specifically, more than 40% of GPT-4’s translations contained translationese errors, while Mistral-7B had the highest rate at 76% for the English-Chinese language pair. Additionally, larger models produced more natural translations than smaller ones.
“Polishing” Helps
The researchers first explored whether prompting strategies could reduce translationese. In addition to a standard translation prompt (“Please translate the following {source_language} text to {target_language}”), they tested two alternatives: a “specified” and a “polishing” prompt.
The specified prompt adds explicit requirements intended to improve naturalness, while the polishing prompt instructs the model to refine its own output in a two-step process: first generating a translation, then improving it.
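To make the three strategies concrete, the sketch below shows how they might be implemented against an OpenAI-style chat API. The standard prompt is the template quoted above; the wording of the specified and polishing prompts, the model name, and the helper functions are illustrative assumptions for this example, not the exact phrasing or setup used in the paper.

```python
# Sketch of the three prompting strategies (standard, specified, polishing).
# Assumes the openai Python client (v1+) and an API key in the environment.
from openai import OpenAI

client = OpenAI()


def translate(prompt: str, model: str = "gpt-4") -> str:
    """Send a single prompt to the model and return its text response."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def standard_translation(text: str, src: str, tgt: str) -> str:
    # Standard prompt, as quoted in the article.
    return translate(f"Please translate the following {src} text to {tgt}: {text}")


def specified_translation(text: str, src: str, tgt: str) -> str:
    # "Specified" prompt: naturalness requirements stated up front (hypothetical wording).
    return translate(
        f"Please translate the following {src} text to {tgt}. "
        f"Make the translation natural and fluent, avoiding overly literal phrasing: {text}"
    )


def polishing_translation(text: str, src: str, tgt: str) -> str:
    # "Polishing" prompt: two steps, first translate, then ask the model to refine
    # its own draft (hypothetical wording).
    draft = standard_translation(text, src, tgt)
    return translate(
        f"Please polish the following {tgt} translation of a {src} text so that it "
        f"reads naturally to a native speaker, keeping the meaning unchanged: {draft}"
    )
```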
Interestingly, the researchers found that merely specifying naturalness requirements in prompts did not reliably reduce translationese — and in some cases, made translations worse. For example, under specified prompts, GPT-4 exhibited an increase in translationese errors.
Conversely, asking LLMs to refine their own translations proved more effective. In particular, GPT-4 reduced translationese from 43% to 25% when it was instructed to polish its outputs.