ChatGPT can match commercial translators for well-resourced languages; GPT-4 and 'pivot prompting' fix many weaknesses.

January 20, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

Study uses standard benchmarks and human checks but is limited to small random samples (50 each) and web-accessed ChatGPT; results are indicative but not definitive.

Citations: 313

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 9/9

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 20%

Authors

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, Shuming Shi, Zhaopeng Tu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Large LMs like ChatGPT can replace or augment translation stacks for many high-resource language needs. Using a stronger engine (GPT-4) or pivoting through a major language improves coverage for low-resource and distant pairs. This lowers integration time for prototyping and can cut reliance on commercial APIs for some

Who Should Care

Summary TLDR

This empirical study tests ChatGPT (GPT-3.5) and GPT-4 on public MT benchmarks. With the default ChatGPT engine, translations are competitive on high-resource European pairs but weaker on low-resource or distant languages and on domain/noisy text. Two practical fixes improve results: (1) pivot prompting (translate via a high-resource language) and (2) using GPT-4 as the engine, which brings quality close to commercial systems on tested directions.

Problem Statement

Can ChatGPT serve as a practical machine translator? If not, what helps it compete with commercial systems across languages and domains?

Main Contribution

A focused evaluation of ChatGPT (GPT-3.5) on multilingual translation and robustness using Flores-101 and WMT robustness/biomedical test sets.

Comparison versus three commercial translators (Google Translate, DeepL, Tencent TranSmart) using automatic metrics (BLEU, ChrF++, TER) and human annotation.
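Of the three automatic metrics, TER is the easiest to illustrate by hand: it counts the edits needed to turn a system output into the reference, divided by the reference length. The sketch below is a simplified, assumption-laden version (whitespace tokenization, no shift operation, hypothetical function names); it is not the scoring code used in the study, and its values will differ from a standard TER implementation.

```python
def edit_distance(hyp_tokens, ref_tokens):
    """Word-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(ref_tokens) + 1))
    for i, h in enumerate(hyp_tokens, start=1):
        curr = [i]
        for j, r in enumerate(ref_tokens, start=1):
            cost = 0 if h == r else 1
            curr.append(min(
                prev[j] + 1,         # delete a hypothesis word
                curr[j - 1] + 1,     # insert a reference word
                prev[j - 1] + cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]


def simple_ter(hypothesis, reference):
    """Edits per reference word; lower is better. Shifts are not modeled."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(hyp, ref) / len(ref)
```

Because lower TER is better while higher BLEU and ChrF++ are better, deltas across the three metrics should be read with the direction in mind.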

Key Findings

Prompt wording matters but has only modest effect.

Numbers: Best prompt (TP3) BLEU=24.73 vs TP1=23.25 (Table 3).

Practical Use: Use a clear, explicit translation prompt (TP3 style). Expect small but consistent gains from prompt choice.

Evidence Ref: Table 3
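As a minimal sketch of what a TP3-style prompt looks like in practice, the helper below fills the template wording quoted under "What To Try In 7 Days" and appends the source sentences. The function name and the one-sentence-per-line layout are assumptions made for illustration, not details taken from the study.

```python
# Template wording as quoted in this summary; [TGT] becomes {tgt}.
TP3_TEMPLATE = "Please provide the {tgt} translation for these sentences:"


def build_tp3_prompt(target_language, sentences):
    """Return a TP3-style prompt with one source sentence per line."""
    header = TP3_TEMPLATE.format(tgt=target_language)
    return "\n".join([header, *sentences])


if __name__ == "__main__":
    print(build_tp3_prompt("English", ["Guten Morgen.", "Wie geht es dir?"]))
```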

On high-resource European pairs, ChatGPT is close to commercial systems.

Numbers: De⇒En: Google 45.04 vs ChatGPT 43.71 BLEU (Table 4).

Practical Use: For well-resourced European language pairs, ChatGPT (GPT-3.5) is a viable option for prototyping and many applications.

Evidence Ref: Table 4

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
BLEU | ChatGPT w/ TP3: 24.73 | Google: 31.66 | -6.93 | Flores Zh⇒En (sample of 50) | Table 3: prompt comparison | Table 3
BLEU | ChatGPT De⇒En: 43.71 | Google De⇒En: 45.04 | -1.33 | Flores-101 (selected directions) | Table 4 (multilingual) | Table 4
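The Delta column is the ChatGPT score minus the baseline score, so a negative value means ChatGPT trails the commercial system. The short check below recomputes both deltas from the figures in the table; the variable and function names are illustrative only.

```python
# (label, ChatGPT BLEU, Google BLEU) as reported in the results above.
rows = [
    ("Flores Zh⇒En, TP3 prompt", 24.73, 31.66),
    ("Flores-101 De⇒En", 43.71, 45.04),
]


def bleu_delta(system, baseline):
    """System minus baseline, rounded to two decimals like the table."""
    return round(system - baseline, 2)


deltas = {label: bleu_delta(system, baseline) for label, system, baseline in rows}

for label, delta in deltas.items():
    print(f"{label}: {delta:+.2f} BLEU")
```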

What To Try In 7 Days

Run your most-used language pairs through ChatGPT/GPT-4 and compare BLEU or a small human review sample.

If a pair is low-resource or distant, try pivot prompting via English: ask the model to output the pivot-language translation first, then the target-language translation.

Adopt the TP3 prompt template: 'Please provide the [TGT] translation for these sentences:' and sample 50 examples to spot common errors.
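The sampling and pivot steps above can be sketched as follows. This is a hedged illustration: the pivot prompt wording is an assumption modeled on the TP3 template (this summary does not quote the exact pivot prompt), and the function names, default seed, and line-per-sentence layout are hypothetical.

```python
import random


def sample_sentences(sentences, k=50, seed=0):
    """Draw a reproducible random sample of up to k sentences."""
    rng = random.Random(seed)  # fixed seed so the review set can be rebuilt
    return rng.sample(sentences, min(k, len(sentences)))


def build_pivot_prompt(pivot_language, target_language, sentences):
    """Ask for the pivot translation first, then the target translation.

    The wording is an assumed example, not a quote from the study.
    """
    header = (
        f"Please provide the {pivot_language} translation first and then "
        f"the {target_language} translation for these sentences one by one:"
    )
    return "\n".join([header, *sentences])


if __name__ == "__main__":
    corpus = [f"source sentence {i}" for i in range(200)]  # placeholder data
    batch = sample_sentences(corpus, k=50)
    print(build_pivot_prompt("English", "Romanian", batch[:3]))
```

Fixing the seed keeps the 50-sentence review set stable across prompt variants, so differences in scores reflect the prompt rather than the sample.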

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Small sample sizes: only 50 sentences sampled per test set due to web access constraints.

Results can vary across repeated queries; reported numbers are from single runs or limited versions.
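One way to gauge the run-to-run variation noted above is to score the same sample several times and summarize the spread. The helper below is a minimal sketch under that assumption; the function name is hypothetical and any scores passed in would come from your own repeated runs, not from the study.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Summarize metric scores from repeated runs on the same sample."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variation")
    return {
        "mean": mean(scores),
        "stdev": stdev(scores),            # sample standard deviation
        "range": max(scores) - min(scores),
    }
```

If the spread across runs is comparable to the gap between two systems or prompts, a single-run comparison on 50 sentences should be treated as inconclusive.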

When Not To Use

Do not rely on vanilla ChatGPT (GPT-3.5) for critical biomedical translation or high-stakes legal/medical text without expert review.

Avoid using ChatGPT for low-resource/distant language pairs without pivoting or extra validation.

Failure Modes

Hallucinations and mis-translations (extra or invented content).

Over-translation (adds content) and under-translation (omits content).

Core Entities

Models

ChatGPT (GPT-3.5), GPT-4, Google Translate, DeepL, Tencent TranSmart

Metrics

BLEU, ChrF++, TER, SacreBLEU

Datasets

Flores-101, WMT19 Biomedical (Bio), WMT20 Robustness set2 (Rob2), WMT20 Robustness set3 (Rob3)

Benchmarks

Flores-101, WMT19 Bio, WMT20 Rob2, WMT20 Rob3