LLMs (GPT-3.5 / GPT-4) can handle document translation and often beat commercial MT by human judgment

April 5, 20237 min

Overview

Decision SnapshotReady For Pilot

The paper gives consistent human and targeted metric evidence across multiple datasets, but closed-source model updates and possible data contamination limit certainty.

Citations27

Evidence Strength0.80

Confidence0.90

Risk Signals9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 40%

Authors

Longyue Wang, Chenyang Lyu, Tianbo Ji, Zhirui Zhang, Dian Yu, Shuming Shi, Zhaopeng Tu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs (especially GPT-4) can produce more coherent, human-preferred document translations; firms should test LLMs for end-user quality, not just automatic scores.

Who Should Care

Summary TLDR

This paper tests large language models (mainly GPT-3.5 and GPT-4) on document-level machine translation. It compares prompt styles, commercial MT systems, and document-aware NMT methods across multiple benchmarks and domains. Key findings: contextual, multi-turn prompts improve translation; GPT-4 gives much better discourse-aware translations and explanations than GPT-3.5; human raters prefer GPT-4 outputs over commercial MT despite mixed automatic scores; supervised fine-tuning and RLHF appear to improve discourse skills. The authors release datasets, outputs, and annotations for reproducibility.

Problem Statement

Sentence-level MT ignores document context and produces inconsistent or incoherent translations on real documents. The paper asks: can large language models model discourse-level phenomena (consistency, zero-pronoun recovery, deixis, ellipsis) and how do prompts, training techniques, and existing document-NMT methods compare?

Main Contribution

Systematic evaluation of GPT-3.5 and GPT-4 on document-level MT across several benchmarks and domains.

Analysis of prompt designs showing multi-turn full-document prompts improve discourse handling.

Key Findings

Human raters prefer GPT-4 outputs over commercial MT systems on document translation.

NumbersHuman average (general/discourse): GPT-4 3.0/3.1 vs Google 1.7/1.8 (Table 4)

Practical UseIf final user perception matters, evaluate LLM outputs with human raters — GPT-4 can yield noticeably better perceived fluency and coherence than many commercial MT systems on tested domains.

Evidence RefTable 4

Automatic metrics give mixed results; commercial systems sometimes match or beat LLMs on n-gram overlap.

Numbersd-BLEU average: Tencent 26.0 vs GPT-4 25.5 (Table 4)

Practical UseDon’t rely solely on BLEU/d-BLEU. Use targeted discourse metrics and human evaluation when measuring document-level quality.

Evidence RefTable 4

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Human (Average general/discourse)GPT-4 3.0/3.1, GPT-3.5 2.8/2.8, Google 1.7/1.8Google TranslateGPT-4 +1.3 (general) vs GoogleAverage over News, Social, Fiction, Q&A (Table 4)Table 4 human average columnsTable 4
Document BLEU (d-BLEU avg)Tencent 26.0, GPT-4 25.5, GPT-3.5 24.9Tencent TranSmartGPT-4 -0.5 vs TencentAverage over Zh→En domains (Table 4)Table 4 d-BLEU columnsTable 4

What To Try In 7 Days

Run P3-style multi-turn document prompts on representative documents and compare outputs to your current MT.

Collect quick human judgments on fluency and discourse (20–50 docs) rather than only BLEU scores.

Probe a few failure modes (deixis, pronoun recovery, repeated terms) with contrastive examples to find weak spots.

Reproducibility

Code AvailableYes
Data AvailableYes
Open Source StatusPartial
LicenseUnknown

Risks & Boundaries

Limitations

Results come from a limited set of datasets and domains; findings may not generalize to all languages or domains.

Closed-source, evolving LLMs reduce reproducibility; model updates can change outcomes.

When Not To Use

When strict, repeatable automatic ranking is required (LLMs can be unstable for fine-grained ranking).

On languages or domains absent from the model training data where contamination or domain bias may affect outputs.

Failure Modes

Omissions or copying behavior in long documents.

Instability across runs and sensitivity to prompt wording.

Core Entities

Models

GPT-3GPT-3.5GPT-4InstructGPTCodexGPTMCNG-TransSent2SentMR-Doc2DocMR-Doc2SentCADecDocRepairGoogle TranslateDeepLTencent TranSmart

Metrics

BLEUd-BLEUTERCOMETCTT (consistency of terminology translation)AccuracyHuman general/discourse scores

Datasets

mZPRT (Zh→En)WMT2022 (Zh→En)IWSLT2015 (Zh→En / En→De)IWSLT2017News Commentary v11 (En→De)Europarl v7 (En→De)OpenSub2018 (En→Ru)Contrastive testset (Voita et al., 2019b)

Benchmarks

GuoFeng (zero pronoun benchmark referenced)Contrastive discourse probe (Voita et al., 2019b)