Overview
The paper gives consistent human and targeted metric evidence across multiple datasets, but closed-source model updates and possible data contamination limit certainty.
Citations27
Evidence Strength0.80
Confidence0.90
Risk Signals9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
LLMs (especially GPT-4) can produce more coherent, human-preferred document translations; firms should test LLMs for end-user quality, not just automatic scores.
Who Should Care
Summary TLDR
This paper tests large language models (mainly GPT-3.5 and GPT-4) on document-level machine translation. It compares prompt styles, commercial MT systems, and document-aware NMT methods across multiple benchmarks and domains. Key findings: contextual, multi-turn prompts improve translation; GPT-4 gives much better discourse-aware translations and explanations than GPT-3.5; human raters prefer GPT-4 outputs over commercial MT despite mixed automatic scores; supervised fine-tuning and RLHF appear to improve discourse skills. The authors release datasets, outputs, and annotations for reproducibility.
Problem Statement
Sentence-level MT ignores document context and produces inconsistent or incoherent translations on real documents. The paper asks: can large language models model discourse-level phenomena (consistency, zero-pronoun recovery, deixis, ellipsis) and how do prompts, training techniques, and existing document-NMT methods compare?
Main Contribution
Systematic evaluation of GPT-3.5 and GPT-4 on document-level MT across several benchmarks and domains.
Analysis of prompt designs showing multi-turn full-document prompts improve discourse handling.
Key Findings
Human raters prefer GPT-4 outputs over commercial MT systems on document translation.
Automatic metrics give mixed results; commercial systems sometimes match or beat LLMs on n-gram overlap.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Human (Average general/discourse) | GPT-4 3.0/3.1, GPT-3.5 2.8/2.8, Google 1.7/1.8 | Google Translate | GPT-4 +1.3 (general) vs Google | Average over News, Social, Fiction, Q&A (Table 4) | Table 4 human average columns | Table 4 |
| Document BLEU (d-BLEU avg) | Tencent 26.0, GPT-4 25.5, GPT-3.5 24.9 | Tencent TranSmart | GPT-4 -0.5 vs Tencent | Average over Zh→En domains (Table 4) | Table 4 d-BLEU columns | Table 4 |
What To Try In 7 Days
Run P3-style multi-turn document prompts on representative documents and compare outputs to your current MT.
Collect quick human judgments on fluency and discourse (20–50 docs) rather than only BLEU scores.
Probe a few failure modes (deixis, pronoun recovery, repeated terms) with contrastive examples to find weak spots.
Reproducibility
Risks & Boundaries
Limitations
Results come from a limited set of datasets and domains; findings may not generalize to all languages or domains.
Closed-source, evolving LLMs reduce reproducibility; model updates can change outcomes.
When Not To Use
When strict, repeatable automatic ranking is required (LLMs can be unstable for fine-grained ranking).
On languages or domains absent from the model training data where contamination or domain bias may affect outputs.
Failure Modes
Omissions or copying behavior in long documents.
Instability across runs and sensitivity to prompt wording.

