Overview
The study uses standard benchmarks and human checks but is limited to small random samples (50 sentences per test set) and to web-accessed ChatGPT; its results are indicative, not definitive.
Citations: 313
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 7/7
Findings with evidence refs: 7/7
Results with explicit delta: 9/9
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 20%
Why It Matters For Business
Large LMs like ChatGPT can replace or augment translation stacks for many high-resource language needs. Using a stronger engine (GPT-4) or pivoting through a major language improves coverage for low-resource and distant pairs. This lowers integration time for prototyping and can cut reliance on commercial APIs for some
Summary TLDR
This empirical study tests ChatGPT (GPT-3.5) and GPT-4 on public MT benchmarks. With the default ChatGPT engine, translations are competitive on high-resource European pairs but weaker on low-resource or distant languages and on domain-specific or noisy text. Two practical fixes improve results: (1) pivot prompting (translating via a high-resource language) and (2) using GPT-4 as the engine, which brings quality close to commercial systems on the tested directions.
Problem Statement
Can ChatGPT serve as a practical machine translator? If not, what helps it compete with commercial systems across languages and domains?
Main Contribution
A focused evaluation of ChatGPT (GPT-3.5) on multilingual translation and robustness using Flores-101 and WMT robustness/biomedical test sets.
Comparison versus three commercial translators (Google Translate, DeepL, Tencent TranSmart) using automatic metrics (BLEU, ChrF++, TER) and human annotation.
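For readers unfamiliar with the automatic metrics named above, the following is a minimal sketch of corpus-level BLEU. It assumes whitespace tokenization, a single reference per sentence, and no smoothing, so its scores are not comparable with published numbers; the function names `ngrams` and `corpus_bleu` are illustrative and not taken from the study.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU (0-100): clipped n-gram precision + brevity penalty.

    Assumptions: whitespace tokenization, one reference per hypothesis,
    no smoothing (any n-gram order with zero matches yields a score of 0).
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)
```

For reportable scores, an established metric implementation is preferable to this sketch, since tokenization and smoothing choices change BLEU noticeably.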
Key Findings
Prompt wording matters but has only modest effect.
On high-resource European pairs, ChatGPT is close to commercial systems.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| BLEU | ChatGPT w/ TP3 24.73 | Google 31.66 | -6.93 | Flores Zh⇒En (sample of 50) | Table 3: prompt comparison | Table 3 |
| BLEU | ChatGPT De⇒En 43.71 | Google De⇒En 45.04 | -1.33 | Flores-101 (selected directions) | Table 4 (multilingual) | Table 4 |
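The Delta column is the ChatGPT score minus the baseline score. A small sketch that recomputes it from the two rows above (the tuple layout and variable names are illustrative):

```python
# (direction, ChatGPT BLEU, Google BLEU), copied from the table rows above.
rows = [
    ("Zh=>En", 24.73, 31.66),
    ("De=>En", 43.71, 45.04),
]

# Delta = system score minus baseline; negative means ChatGPT trails Google.
deltas = [round(chatgpt - google, 2) for _, chatgpt, google in rows]

for (direction, _, _), delta in zip(rows, deltas):
    print(f"{direction}: {delta:+.2f} BLEU")
```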
What To Try In 7 Days
Run your most-used language pairs through ChatGPT/GPT-4 and compare BLEU or a small human review sample.
If a pair is low-resource or distant, try pivot prompting via English: ask the model to output the pivot translation first, then the target translation.
Adopt the TP3 prompt template: 'Please provide the [TGT] translation for these sentences:' and sample 50 examples to spot common errors.
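The steps above can be sketched as a few helpers that sample 50 sentences and build the prompts. The TP3 wording is the template quoted above; the pivot prompt wording, the function names, and the fixed seed are assumptions for illustration, and sending the prompt to a model is left out because it depends on your client.

```python
import random

# TP3 template as quoted in the list above.
TP3_TEMPLATE = "Please provide the {tgt} translation for these sentences:"

# Hypothetical wording for pivot prompting: pivot language first, then target.
PIVOT_TEMPLATE = (
    "Please provide the {pivot} translation first and then the {tgt} "
    "translation for these sentences one by one:"
)


def sample_sentences(sentences, k=50, seed=0):
    """Draw up to k sentences without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(k, len(sentences)))


def build_tp3_prompt(tgt, sentences):
    """Direct translation prompt: instruction line, then one sentence per line."""
    return TP3_TEMPLATE.format(tgt=tgt) + "\n" + "\n".join(sentences)


def build_pivot_prompt(pivot, tgt, sentences):
    """Pivot prompt for low-resource or distant pairs (assumed wording)."""
    header = PIVOT_TEMPLATE.format(pivot=pivot, tgt=tgt)
    return header + "\n" + "\n".join(sentences)


if __name__ == "__main__":
    corpus = [f"Sentence number {i}." for i in range(200)]  # placeholder data
    batch = sample_sentences(corpus)
    print(build_tp3_prompt("German", batch[:3]))
    print(build_pivot_prompt("English", "Romanian", batch[:3]))
```

Fixing the seed keeps the 50-sentence sample stable across runs, which helps because model outputs themselves can vary between repeated queries.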
Risks & Boundaries
Limitations
Small sample sizes: only 50 sentences sampled per test set due to web access constraints.
Results can vary across repeated queries; reported numbers are from single runs or limited versions.
When Not To Use
Do not rely on vanilla ChatGPT (GPT-3.5) for critical biomedical translation or high-stakes legal/medical text without expert review.
Avoid using ChatGPT for low-resource/distant language pairs without pivoting or extra validation.
Failure Modes
Hallucinations and mis-translations (extra or invented content).
Over-translation (adds content) and under-translation (omits content).

