Overview
Production Readiness
0.6
Novelty Score
0.2
Cost Impact Score
0.6
Citation Count
313
Why It Matters For Business
Large LMs like ChatGPT can replace or augment translation stacks for many high-resource language needs. Using a stronger engine (GPT-4) or pivoting through a major language improves coverage for low-resource and distant pairs. This lowers integration time for prototyping and can cut reliance on commercial APIs for some
Summary TLDR
This empirical study tests ChatGPT (GPT-3.5) and GPT-4 on public MT benchmarks. With the default ChatGPT engine, translations are competitive on high-resource European pairs but weaker on low-resource or distant languages and on domain/noisy text. Two practical fixes improve results: (1) pivot prompting (translate via a high-resource language) and (2) using GPT-4 as the engine, which brings quality close to commercial systems on tested directions.
Problem Statement
Can ChatGPT serve as a practical machine translator? If not, what helps it compete with commercial systems across languages and domains?
Main Contribution
A focused evaluation of ChatGPT (GPT-3.5) on multilingual translation and robustness using Flores-101 and WMT robustness/biomedical test sets.
Comparison versus three commercial translators (Google Translate, DeepL, Tencent TranSmart) using automatic metrics (BLEU, ChrF++, TER) and human annotation.
Two practical improvements tested: (a) pivot prompting via a high-resource language and (b) re-running with GPT-4; both show measurable gains.
Key Findings
Prompt wording affects translation quality, but only modestly.
On high-resource European pairs, ChatGPT is close to commercial systems.
ChatGPT lags badly on some low-resource directions.
ChatGPT is weaker on domain-specific and noisy text but better on spoken-language transcripts.
Pivot prompting noticeably improves distant-language translation.
Switching to GPT-4 gives a broad quality boost and often reaches commercial levels.
Human annotation shows GPT-4 makes fewer translation errors than ChatGPT and Google.
Results
Metrics reported: BLEU (eight entries) and human ranking. The values are not preserved here.
Who Should Care
What To Try In 7 Days
Run your most-used language pairs through ChatGPT/GPT-4 and compare BLEU or a small human review sample.
If a pair is low-resource or distant, try pivot prompting via English: ask the model to output the pivot translation first, then the target translation.
Adopt the TP3 prompt template: 'Please provide the [TGT] translation for these sentences:' and sample 50 examples to spot common errors.
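The two prompting styles above can be sketched as small string builders. This is a minimal illustration, not the study's tooling: the function names are hypothetical, the direct prompt reuses the TP3 wording quoted above, and the pivot wording is an assumption written to match the "pivot first, then target" instruction. Sending the prompt to a model is left out on purpose, since it depends on the API client in use.

```python
# TP3 wording as quoted in this summary; [TGT] becomes a format field.
TP3_TEMPLATE = "Please provide the {tgt} translation for these sentences:"


def build_direct_prompt(sentences, tgt):
    """Build a direct translation prompt from the TP3 template."""
    header = TP3_TEMPLATE.format(tgt=tgt)
    return header + "\n" + "\n".join(sentences)


def build_pivot_prompt(sentences, tgt, pivot="English"):
    """Build a pivot prompt: pivot-language translation first, then target.

    The wording below is an assumed phrasing for illustration only.
    """
    header = (
        f"Please provide the {pivot} translation first and then "
        f"the {tgt} translation for these sentences one by one:"
    )
    return header + "\n" + "\n".join(sentences)


if __name__ == "__main__":
    batch = ["Guten Morgen.", "Wie geht es dir?"]
    print(build_direct_prompt(batch, "Japanese"))
    print()
    print(build_pivot_prompt(batch, "Japanese"))
```

Keeping the pivot language as a parameter makes it easy to compare English against another high-resource pivot on the same sample.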
Reproducibility
Data Urls
- https://github.com/facebookresearch/flores
- https://github.com/mjpost/sacrebleu
- WMT test sets (WMT19 Bio, WMT20 Rob2/Rob3)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Small sample sizes: only 50 sentences sampled per test set due to web access constraints.
- Results can vary across repeated queries; reported numbers are from single runs or limited versions.
- Scope limited to multilingual quality and robustness; no document-level, constrained decoding, or production latency analysis.
- Evaluation focused on a subset of languages and domains; not exhaustive.
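Because the samples are small and outputs vary between runs, it helps to fix the sample and report the spread across repeated runs. The sketch below assumes parallel lists of source and reference sentences. The helper names, the seed, and the example scores are illustrative and do not come from the study.

```python
import random
import statistics


def sample_pairs(sources, references, k=50, seed=0):
    """Draw a reproducible sample of (source, reference) pairs."""
    if len(sources) != len(references):
        raise ValueError("sources and references must have the same length")
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    k = min(k, len(sources))
    indices = sorted(rng.sample(range(len(sources)), k))
    return [(sources[i], references[i]) for i in indices]


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of scores from repeated runs."""
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, spread


if __name__ == "__main__":
    src = [f"source sentence {i}" for i in range(1000)]
    ref = [f"reference sentence {i}" for i in range(1000)]
    pairs = sample_pairs(src, ref, k=50, seed=0)
    print(len(pairs), "pairs sampled")
    # Hypothetical scores from three repeated runs on the same sample.
    print(summarize_runs([30.0, 32.0, 34.0]))
```

Reusing the same seed for every system under comparison keeps the sample identical, so score differences reflect the systems and not the draw.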
When Not To Use
- Do not rely on vanilla ChatGPT (GPT-3.5) for critical biomedical translation or high-stakes legal/medical text without expert review.
- Avoid using ChatGPT for low-resource/distant language pairs without pivoting or extra validation.
Failure Modes
- Hallucinations and mis-translations (extra or invented content).
- Over-translation (adds content) and under-translation (omits content).
- Inconsistent outputs across repeated runs leading to evaluation variance.
- Errors on short sentences where the output uses an abbreviation and the reference uses the full name (or vice versa), which hurts BLEU.
Core Entities
Models
- ChatGPT (GPT-3.5)
- GPT-4
- Google Translate
- DeepL
- Tencent TranSmart
Metrics
- BLEU
- ChrF++
- TER
- SacreBLEU
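To make the BLEU entry concrete, here is a simplified, dependency-free sketch of corpus-level BLEU with one reference per sentence. It assumes whitespace tokenization and no smoothing, so its numbers will not match SacreBLEU. For reportable scores, use SacreBLEU itself. The function names are illustrative.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU (0-100): clipped n-gram precision plus brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Counter intersection keeps the minimum count, i.e. clipped matches.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any empty n-gram order gives zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    refs = ["the cat sat on the mat"]
    print(simple_corpus_bleu(["the cat sat on the rug"], refs))
```

The zero score for any missing n-gram order is why very short sentences are penalized so heavily under unsmoothed BLEU.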
Datasets
- Flores-101
- WMT19 Biomedical (Bio)
- WMT20 Robustness set2 (Rob2)
- WMT20 Robustness set3 (Rob3)
Benchmarks
- Flores-101
- WMT19 Bio
- WMT20 Rob2
- WMT20 Rob3

