ChatGPT can match commercial translators for well-resourced languages; GPT-4 and 'pivot prompting' fix many weaknesses.

Overview

Decision Snapshot: Needs Validation

Study uses standard benchmarks and human checks but is limited to small random samples (50 each) and web-accessed ChatGPT; results are indicative but not definitive.

Citations: 313

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 9/9

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 20%

Authors

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, Shuming Shi, Zhaopeng Tu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Large LMs like ChatGPT can replace or augment translation stacks for many high-resource language needs. Using a stronger engine (GPT-4) or pivoting through a major language improves coverage for low-resource and distant pairs. This lowers integration time for prototyping and can cut reliance on commercial APIs for some

Who Should Care

Product Manager, ML Engineer, Founder, Data Scientist

Summary TLDR

This empirical study tests ChatGPT (GPT-3.5) and GPT-4 on public MT benchmarks. With the default ChatGPT engine, translations are competitive on high-resource European pairs but weaker on low-resource or distant languages and on domain/noisy text. Two practical fixes improve results: (1) pivot prompting (translate via a high-resource language) and (2) using GPT-4 as the engine, which brings quality close to commercial systems on tested directions.

Problem Statement

Can ChatGPT serve as a practical machine translator? If not, what helps it compete with commercial systems across languages and domains?

Main Contribution

A focused evaluation of ChatGPT (GPT-3.5) on multilingual translation and robustness using Flores-101 and WMT robustness/biomedical test sets.

Comparison versus three commercial translators (Google Translate, DeepL, Tencent TranSmart) using automatic metrics (BLEU, ChrF++, TER) and human annotation.

Key Findings

Prompt wording matters but has only modest effect.

Numbers: Best prompt (TP3) BLEU=24.73 vs TP1=23.25 (Table 3).

Practical Use: Use a clear, explicit translation prompt (TP3 style). Expect small but consistent gains from prompt choice.

Evidence Ref: Table 3

On high-resource European pairs, ChatGPT is close to commercial systems.

Numbers: De⇒En: Google 45.04 vs ChatGPT 43.71 BLEU (Table 4).

Practical Use: For well-resourced European language pairs, ChatGPT (GPT-3.5) is a viable option for prototyping and many applications.

Evidence Ref: Table 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
BLEU	ChatGPT w/ TP3 24.73	Google 31.66	-6.93	Flores Zh⇒En (sample of 50)	Table 3: prompt comparison	Table 3
BLEU	ChatGPT De⇒En 43.71	Google De⇒En 45.04	-1.33	Flores-101 (selected directions)	Table 4 (multilingual)	Table 4
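The Delta column is the system score minus the baseline score. A minimal sketch of that arithmetic for the two rows above; the function name `bleu_delta` and the row labels are illustrative and not taken from the paper:

```python
def bleu_delta(system: float, baseline: float) -> float:
    """Return system minus baseline, rounded to two decimals as in the table."""
    return round(system - baseline, 2)

# (label, system BLEU, baseline BLEU) copied from the Results table.
rows = [
    ("Zh=>En, ChatGPT w/ TP3 vs Google", 24.73, 31.66),
    ("De=>En, ChatGPT vs Google", 43.71, 45.04),
]

deltas = {label: bleu_delta(system, baseline) for label, system, baseline in rows}
for label, delta in deltas.items():
    print(f"{label}: {delta:+.2f}")
```

A negative delta means ChatGPT scored below the commercial baseline on that direction.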

What To Try In 7 Days

Run your most-used language pairs through ChatGPT/GPT-4 and compare the output using BLEU or a small human-reviewed sample.

If a pair is low-resource or distant, try pivot prompting via English: ask the model to output the pivot translation first, then the target translation.

Adopt the TP3 prompt template: 'Please provide the [TGT] translation for these sentences:' and sample 50 examples to spot common errors.
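The steps above can be sketched as a small prompt builder plus a reproducible 50-sentence sampler. This is an illustration under assumptions, not the authors' code: the TP3 wording comes from the template quoted above, while the pivot wording, the function names (`build_prompt`, `sample_sentences`), and the fixed seed are hypothetical choices.

```python
import random

# TP3-style template as quoted in this summary.
TP3_TEMPLATE = "Please provide the {tgt} translation for these sentences:"
# Assumed wording: the summary only says to output the pivot first, then the target.
PIVOT_TEMPLATE = (
    "Please provide the {pivot} translation first and then the {tgt} "
    "translation for these sentences:"
)

def build_prompt(sentences, tgt, pivot=None):
    """Build a direct (TP3-style) prompt, or a pivot prompt when `pivot` is given."""
    template = PIVOT_TEMPLATE if pivot else TP3_TEMPLATE
    header = template.format(tgt=tgt, pivot=pivot)
    return header + "\n" + "\n".join(sentences)

def sample_sentences(sentences, k=50, seed=0):
    """Draw a reproducible random sample of at most k sentences."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(k, len(sentences)))

# Usage with placeholder sentences standing in for a real test set.
corpus = [f"Sentence {i}." for i in range(200)]
batch = sample_sentences(corpus)
direct_prompt = build_prompt(batch[:2], tgt="English")
pivot_prompt = build_prompt(batch[:2], tgt="Chinese", pivot="English")
print(direct_prompt)
print(pivot_prompt)
```

Fixing the seed keeps the 50-sentence sample identical across engines, so score differences reflect the system rather than the draw.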

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/wxjiao/Is-ChatGPT-A-Good-Translator

Data URLs

https://github.com/facebookresearch/flores

https://github.com/mjpost/sacrebleu

WMT test sets (WMT19 Bio, WMT20 Rob2/Rob3)

Risks & Boundaries

Limitations

Small sample sizes: only 50 sentences sampled per test set due to web access constraints.

Results can vary across repeated queries; reported numbers are from single runs or limited versions.
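Because scores can shift between repeated queries, one simple way to gauge that variation in your own tests is to repeat the run and report the spread alongside the mean. The sketch below is illustrative only: the function name `summarize_runs` is hypothetical and the scores are made-up placeholders, not results from the paper.

```python
from statistics import mean, stdev

def summarize_runs(scores):
    """Return (mean, sample standard deviation, max-min range) of repeated-run scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(scores), stdev(scores), max(scores) - min(scores)

# Hypothetical BLEU scores from three repeated runs on the same 50 sentences.
example_scores = [24.1, 24.7, 23.9]
avg, spread, score_range = summarize_runs(example_scores)
print(f"mean={avg:.2f} stdev={spread:.2f} range={score_range:.2f}")
```

If the range across runs is comparable to the gap between two systems, a single-run comparison on 50 sentences should not be treated as conclusive.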

When Not To Use

Do not rely on vanilla ChatGPT (GPT-3.5) for critical biomedical translation or high-stakes legal/medical text without expert review.

Avoid using ChatGPT for low-resource/distant language pairs without pivoting or extra validation.

Failure Modes

Hallucinations and mis-translations (extra or invented content).

Over-translation (adds content) and under-translation (omits content).

Core Entities

Models

ChatGPT (GPT-3.5), GPT-4, Google Translate, DeepL, Tencent TranSmart

Metrics

BLEU, ChrF++, TER, SacreBLEU

Datasets

Flores-101, WMT19 Biomedical (Bio), WMT20 Robustness set2 (Rob2), WMT20 Robustness set3 (Rob3)

Benchmarks

Flores-101, WMT19 Bio, WMT20 Rob2, WMT20 Rob3
