Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.5
Citation Count
27
Why It Matters For Business
LLMs (especially GPT-4) can produce more coherent, human-preferred document translations; firms should test LLMs for end-user quality, not just automatic scores.
Summary TLDR
This paper tests large language models (mainly GPT-3.5 and GPT-4) on document-level machine translation. It compares prompt styles, commercial MT systems, and document-aware NMT methods across multiple benchmarks and domains. Key findings: contextual, multi-turn prompts improve translation; GPT-4 gives much better discourse-aware translations and explanations than GPT-3.5; human raters prefer GPT-4 outputs over commercial MT despite mixed automatic scores; supervised fine-tuning and RLHF appear to improve discourse skills. The authors release datasets, outputs, and annotations for reproducibility.
Problem Statement
Sentence-level MT ignores document context and produces inconsistent or incoherent translations on real documents. The paper asks: can large language models model discourse-level phenomena (consistency, zero-pronoun recovery, deixis, ellipsis) and how do prompts, training techniques, and existing document-NMT methods compare?
Main Contribution
Systematic evaluation of GPT-3.5 and GPT-4 on document-level MT across several benchmarks and domains.
Analysis of prompt designs showing multi-turn full-document prompts improve discourse handling.
Probing method and benchmark for measuring discourse knowledge (prediction + explanation).
Comparison with commercial MT and document-aware NMT methods using automatic and human evaluations.
Public release of instruction-based benchmark, system outputs, and human annotations.
Key Findings
Human raters prefer GPT-4 outputs over commercial MT systems on document translation.
Automatic metrics give mixed results; commercial systems sometimes match or beat LLMs on n-gram overlap.
A multi-turn, whole-document prompt (P3) improves translation and discourse metrics versus sentence-level prompts.
GPT-4 is substantially better than GPT-3.5 at probing and explaining discourse phenomena.
Training techniques like supervised finetuning, code pretraining, and RLHF improve document translation and discourse probing.
Results
Human (Average general/discourse)
Document BLEU (d-BLEU avg)
Prompt impact (BLEU)
Accuracy
Who Should Care
What To Try In 7 Days
Run P3-style multi-turn document prompts on representative documents and compare outputs to your current MT.
Collect quick human judgments on fluency and discourse (20–50 docs) rather than only BLEU scores.
Probe a few failure modes (deixis, pronoun recovery, repeated terms) with contrastive examples to find weak spots.
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Results come from a limited set of datasets and domains; findings may not generalize to all languages or domains.
- Closed-source, evolving LLMs reduce reproducibility; model updates can change outcomes.
- Human evaluation criteria and granularity have room for refinement, especially in discourse scoring.
When Not To Use
- When strict, repeatable automatic ranking is required (LLMs can be unstable for fine-grained ranking).
- On languages or domains absent from the model training data where contamination or domain bias may affect outputs.
- When you need full reproducibility of the exact model version without archives.
Failure Modes
- Omissions or copying behavior in long documents.
- Instability across runs and sensitivity to prompt wording.
- Domain bias from training data causing poor performance on specialized registers.
Core Entities
Models
- GPT-3
- GPT-3.5
- GPT-4
- InstructGPT
- CodexGPT
- MCN
- G-Trans
- Sent2Sent
- MR-Doc2Doc
- MR-Doc2Sent
- CADec
- DocRepair
- Google Translate
- DeepL
- Tencent TranSmart
Metrics
- BLEU
- d-BLEU
- TER
- COMET
- CTT (consistency of terminology translation)
- Accuracy
- Human general/discourse scores
Datasets
- mZPRT (Zh→En)
- WMT2022 (Zh→En)
- IWSLT2015 (Zh→En / En→De)
- IWSLT2017
- News Commentary v11 (En→De)
- Europarl v7 (En→De)
- OpenSub2018 (En→Ru)
- Contrastive testset (Voita et al., 2019b)
Benchmarks
- GuoFeng (zero pronoun benchmark referenced)
- Contrastive discourse probe (Voita et al., 2019b)

