ChatGPT can match commercial translators for well-resourced languages; GPT-4 and 'pivot prompting' fix many weaknesses.

January 20, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.2

Cost Impact Score

0.6

Citation Count

313

Authors

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, Shuming Shi, Zhaopeng Tu

Links

Abstract / PDF

Why It Matters For Business

Large LMs like ChatGPT can replace or augment translation stacks for many high-resource language needs. Using a stronger engine (GPT-4) or pivoting through a major language improves coverage for low-resource and distant pairs. This lowers integration time for prototyping and can cut reliance on commercial APIs for some

Summary TLDR

This empirical study tests ChatGPT (GPT-3.5) and GPT-4 on public MT benchmarks. With the default ChatGPT engine, translations are competitive on high-resource European pairs but weaker on low-resource or distant languages and on domain/noisy text. Two practical fixes improve results: (1) pivot prompting (translate via a high-resource language) and (2) using GPT-4 as the engine, which brings quality close to commercial systems on tested directions.

Problem Statement

Can ChatGPT serve as a practical machine translator? If not, what helps it compete with commercial systems across languages and domains?

Main Contribution

A focused evaluation of ChatGPT (GPT-3.5) on multilingual translation and robustness using Flores-101 and WMT robustness/biomedical test sets.

Comparison versus three commercial translators (Google Translate, DeepL, Tencent TranSmart) using automatic metrics (BLEU, ChrF++, TER) and human annotation.

Two practical improvements tested: (a) pivot prompting via a high-resource language and (b) re-running with GPT-4; both show measurable gains.

Key Findings

Prompt wording matters but has only modest effect.

Numbers: Best prompt (TP3) BLEU=24.73 vs TP1=23.25 (Table 3).

On high-resource European pairs, ChatGPT is close to commercial systems.

Numbers: De⇒En: Google 45.04 vs ChatGPT 43.71 BLEU (Table 4).

ChatGPT lags badly on some low-resource directions.

Numbers: En⇒Ro BLEU is 46.4% lower than Google's on the evaluated test set (text and Table 4 discussion).

ChatGPT is weaker on domain-specific and noisy text but better on spoken-language transcripts.

Numbers: WMT19 Bio De⇒En: Google 37.83 vs ChatGPT 33.22 BLEU; WMT20 Rob3 De⇒En: ChatGPT 44.59 vs Google 42.91 BLEU (Table 5).

Pivot prompting noticeably improves distant-language translation.

Numbers: De⇒Zh: Direct new 30.76 → Pivot 34.68 (+3.92 BLEU). Ro⇒Zh: 27.51 → 34.19 (+6.68 BLEU) (Table 7).

Switching to GPT-4 gives a broad quality boost and often reaches commercial levels.

Numbers: GPT-4 De⇒En BLEU 46.00 vs Google 45.04; Zh⇒En GPT-4 28.50 (Table 8).

Human annotation shows GPT-4 makes fewer translation errors than ChatGPT and Google.

Numbers: Human ranking: GPT-4 ranked best 32/50 examples; ChatGPT ranked best 11/50 (Table 12). Mis-translation counts: ChatGPT 2

Results

BLEU

Value: ChatGPT w/ TP3 24.73

Baseline: Google 31.66

BLEU

Value: ChatGPT De⇒En 43.71

Baseline: Google De⇒En 45.04

BLEU

Value: En⇒Ro ChatGPT lower by 46.4%

Baseline: Google En⇒Ro (reference)

BLEU

Value: WMT19 Bio ChatGPT 33.22

Baseline: Google 37.83

BLEU

Value: WMT20 Rob3 ChatGPT 44.59

Baseline: Google 42.91

BLEU

Value: Pivot prompting De⇒Zh 34.68

Baseline: Direct new 30.76

BLEU

Value: Pivot prompting Ro⇒Zh 34.19

Baseline: Direct new 27.51

BLEU

Value: GPT-4 De⇒En 46.00

Baseline: Google 45.04

Human ranking

Value: GPT-4 ranked best 32 / 50

Baseline: ChatGPT ranked best 11 / 50
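
The gaps between each reported value and its baseline can be recomputed with a few lines of Python. This is a minimal sketch: the scores are copied from the list above, while the `Row` type and the helper names are illustrative and not taken from the paper.

```python
from typing import NamedTuple


class Row(NamedTuple):
    label: str
    value: float     # BLEU of the system under test
    baseline: float  # BLEU of the system it is compared against


# BLEU scores as listed in the Results section.
ROWS = [
    Row("ChatGPT w/ TP3 vs Google", 24.73, 31.66),
    Row("De=>En ChatGPT vs Google", 43.71, 45.04),
    Row("WMT19 Bio ChatGPT vs Google", 33.22, 37.83),
    Row("WMT20 Rob3 ChatGPT vs Google", 44.59, 42.91),
    Row("De=>Zh pivot vs direct", 34.68, 30.76),
    Row("Ro=>Zh pivot vs direct", 34.19, 27.51),
    Row("De=>En GPT-4 vs Google", 46.00, 45.04),
]


def delta(row: Row) -> float:
    """Absolute BLEU difference (value minus baseline), rounded to 2 places."""
    return round(row.value - row.baseline, 2)


def relative_delta_pct(row: Row) -> float:
    """Difference as a percentage of the baseline, rounded to 1 place."""
    return round(100.0 * (row.value - row.baseline) / row.baseline, 1)


for row in ROWS:
    print(f"{row.label:30s} {delta(row):+6.2f} BLEU {relative_delta_pct(row):+6.1f}%")
```

A negative delta means the system under test scored below its baseline. The relative figure is the same kind of ratio as the "46.4% lower" statement for En⇒Ro, whose raw scores are not listed here.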

Who Should Care

What To Try In 7 Days

Run your most-used language pairs through ChatGPT/GPT-4 and compare the outputs using BLEU or a small human-reviewed sample.

If a pair is low-resource or distant, try pivot prompting via English: ask the model to output the pivot translation first, then the target translation.

Adopt the TP3 prompt template: 'Please provide the [TGT] translation for these sentences:' and sample 50 examples to spot common errors.
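
The prompt-related steps above can be sketched as plain string building plus a reproducible sample for review. The TP3 wording is the one quoted above; the pivot wording, the function names, and the default seed are assumptions for illustration. No model API is called here, so the resulting prompt can be sent through whichever client you use.

```python
import random

# TP3 wording as quoted in the text above.
TP3_TEMPLATE = "Please provide the {tgt} translation for these sentences:"

# Assumed wording for a pivot prompt; adjust it to whatever works in your tests.
PIVOT_TEMPLATE = (
    "Please provide the {pivot} translation first and then the "
    "{tgt} translation for these sentences one by one:"
)


def build_prompt(sentences, tgt, pivot=None):
    """Return a direct (TP3) prompt, or a pivot prompt when `pivot` is given."""
    if pivot:
        header = PIVOT_TEMPLATE.format(pivot=pivot, tgt=tgt)
    else:
        header = TP3_TEMPLATE.format(tgt=tgt)
    return header + "\n" + "\n".join(sentences)


def sample_for_review(sentences, k=50, seed=0):
    """Pick up to k sentences with a fixed seed so the review set is repeatable."""
    pool = list(sentences)
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))


if __name__ == "__main__":
    print(build_prompt(["Guten Morgen.", "Wie geht es dir?"], "Chinese", pivot="English"))
```

Fixing the seed keeps the 50-sentence review set identical across engines, so differences in the error tally come from the model and not from the sample.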

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Small sample sizes: only 50 sentences sampled per test set due to web access constraints.
  • Results can vary across repeated queries; reported numbers are from single runs or limited versions.
  • Scope limited to multilingual quality and robustness; no document-level, constrained decoding, or production latency analysis.
  • Evaluation focused on a subset of languages and domains; not exhaustive.
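
Because scores can shift between repeated queries, a single-run comparison can be misleading. The sketch below summarizes several runs of the same test set and applies a crude noise check. The function names and the threshold `k` are assumptions, the rule is a rough heuristic and not a significance test, and the example numbers are made up, not results from the paper.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Mean and sample standard deviation of one metric over repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return {"n": len(scores), "mean": mean(scores), "stdev": stdev(scores)}


def gap_exceeds_noise(scores_a, scores_b, k=2.0):
    """Crude check: is the gap between means larger than k times the larger stdev?"""
    a, b = summarize_runs(scores_a), summarize_runs(scores_b)
    gap = abs(a["mean"] - b["mean"])
    return gap > k * max(a["stdev"], b["stdev"])


if __name__ == "__main__":
    # Hypothetical BLEU scores from three repeated runs per system.
    system_a = [24.1, 24.9, 24.4]
    system_b = [26.8, 27.3, 27.0]
    print(summarize_runs(system_a))
    print(gap_exceeds_noise(system_a, system_b))
```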

When Not To Use

  • Do not rely on vanilla ChatGPT (GPT-3.5) for critical biomedical translation or high-stakes legal/medical text without expert review.
  • Avoid using ChatGPT for low-resource/distant language pairs without pivoting or extra validation.

Failure Modes

  • Hallucinations and mis-translations (extra or invented content).
  • Over-translation (adds content) and under-translation (omits content).
  • Inconsistent outputs across repeated runs leading to evaluation variance.
  • Errors on short sentences where the output uses an abbreviation and the reference uses the full name, or vice versa (hurts BLEU).

Core Entities

Models

  • ChatGPT (GPT-3.5)
  • GPT-4
  • Google Translate
  • DeepL
  • Tencent TranSmart

Metrics

  • BLEU
  • ChrF++
  • TER
  • SacreBLEU

Datasets

  • Flores-101
  • WMT19 Biomedical (Bio)
  • WMT20 Robustness set2 (Rob2)
  • WMT20 Robustness set3 (Rob3)

Benchmarks

  • Flores-101
  • WMT19 Bio
  • WMT20 Rob2
  • WMT20 Rob3