LLMs (GPT-3.5 / GPT-4) can handle document translation and often beat commercial MT by human judgment

Overview

Decision SnapshotReady For Pilot

The paper gives consistent human and targeted metric evidence across multiple datasets, but closed-source model updates and possible data contamination limit certainty.

Citations27

Evidence Strength0.80

Confidence0.90

Risk Signals9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 40%

Authors

Longyue Wang, Chenyang Lyu, Tianbo Ji, Zhirui Zhang, Dian Yu, Shuming Shi, Zhaopeng Tu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs (especially GPT-4) can produce more coherent, human-preferred document translations; firms should test LLMs for end-user quality, not just automatic scores.

Who Should Care

Product Manager ML Engineer Data Scientist

Summary TLDR

This paper tests large language models (mainly GPT-3.5 and GPT-4) on document-level machine translation. It compares prompt styles, commercial MT systems, and document-aware NMT methods across multiple benchmarks and domains. Key findings: contextual, multi-turn prompts improve translation; GPT-4 gives much better discourse-aware translations and explanations than GPT-3.5; human raters prefer GPT-4 outputs over commercial MT despite mixed automatic scores; supervised fine-tuning and RLHF appear to improve discourse skills. The authors release datasets, outputs, and annotations for reproducibility.

Problem Statement

Sentence-level MT ignores document context and produces inconsistent or incoherent translations on real documents. The paper asks: can large language models model discourse-level phenomena (consistency, zero-pronoun recovery, deixis, ellipsis) and how do prompts, training techniques, and existing document-NMT methods compare?

Main Contribution

Systematic evaluation of GPT-3.5 and GPT-4 on document-level MT across several benchmarks and domains.

Analysis of prompt designs showing multi-turn full-document prompts improve discourse handling.

Key Findings

Human raters prefer GPT-4 outputs over commercial MT systems on document translation.

NumbersHuman average (general/discourse): GPT-4 3.0/3.1 vs Google 1.7/1.8 (Table 4)

Practical UseIf final user perception matters, evaluate LLM outputs with human raters — GPT-4 can yield noticeably better perceived fluency and coherence than many commercial MT systems on tested domains.

Evidence RefTable 4

Automatic metrics give mixed results; commercial systems sometimes match or beat LLMs on n-gram overlap.

Numbersd-BLEU average: Tencent 26.0 vs GPT-4 25.5 (Table 4)

Practical UseDon’t rely solely on BLEU/d-BLEU. Use targeted discourse metrics and human evaluation when measuring document-level quality.

Evidence RefTable 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Human (Average general/discourse)	GPT-4 3.0/3.1, GPT-3.5 2.8/2.8, Google 1.7/1.8	Google Translate	GPT-4 +1.3 (general) vs Google	Average over News, Social, Fiction, Q&A (Table 4)	Table 4 human average columns	Table 4
Document BLEU (d-BLEU avg)	Tencent 26.0, GPT-4 25.5, GPT-3.5 24.9	Tencent TranSmart	GPT-4 -0.5 vs Tencent	Average over Zh→En domains (Table 4)	Table 4 d-BLEU columns	Table 4

What To Try In 7 Days

Run P3-style multi-turn document prompts on representative documents and compare outputs to your current MT.

Collect quick human judgments on fluency and discourse (20–50 docs) rather than only BLEU scores.

Probe a few failure modes (deixis, pronoun recovery, repeated terms) with contrastive examples to find weak spots.

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Code URLs

https://github.com/longyuewangdcu/Document-MT-LLM

Data URLs

https://github.com/longyuewangdcu/Document-MT-LLM

Risks & Boundaries

Limitations

Results come from a limited set of datasets and domains; findings may not generalize to all languages or domains.

Closed-source, evolving LLMs reduce reproducibility; model updates can change outcomes.

When Not To Use

When strict, repeatable automatic ranking is required (LLMs can be unstable for fine-grained ranking).

On languages or domains absent from the model training data where contamination or domain bias may affect outputs.

Failure Modes

Omissions or copying behavior in long documents.

Instability across runs and sensitivity to prompt wording.

Core Entities

Models

GPT-3GPT-3.5GPT-4InstructGPTCodexGPTMCNG-TransSent2SentMR-Doc2DocMR-Doc2SentCADecDocRepairGoogle TranslateDeepLTencent TranSmart

Metrics

BLEUd-BLEUTERCOMETCTT (consistency of terminology translation)AccuracyHuman general/discourse scores

Datasets

mZPRT (Zh→En)WMT2022 (Zh→En)IWSLT2015 (Zh→En / En→De)IWSLT2017News Commentary v11 (En→De)Europarl v7 (En→De)OpenSub2018 (En→Ru)Contrastive testset (Voita et al., 2019b)

Benchmarks

GuoFeng (zero pronoun benchmark referenced)Contrastive discourse probe (Voita et al., 2019b)

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Human raters prefer GPT-4 outputs over commercial MT systems on document translation.

Automatic metrics give mixed results; commercial systems sometimes match or beat LLMs on n-gram overlap.

Results

What To Try In 7 Days

Reproducibility

Code URLs

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

Hallucinations in LLMs are diverse, theoretically inevitable, and must be managed with grounding and human oversight

Key finding

ThaiSafetyBench: 1,954 Thai malicious prompts reveal cultural blind spots in LLM safety

Key finding

SciIG: a benchmark that asks LLMs to draft research-paper introductions from title, abstract, and related work

Key finding

PersonaLens: a large benchmark and LLM-based user+judge agents to measure personalization in task-oriented assistants

Key finding

Use simple entropy-based reweighting to make cheap model judges match human preferences.

Key finding