LLMs (GPT-3.5 / GPT-4) can handle document translation and often beat commercial MT by human judgment

April 5, 20237 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.5

Citation Count

27

Authors

Longyue Wang, Chenyang Lyu, Tianbo Ji, Zhirui Zhang, Dian Yu, Shuming Shi, Zhaopeng Tu

Links

Abstract / PDF

Why It Matters For Business

LLMs (especially GPT-4) can produce more coherent, human-preferred document translations; firms should test LLMs for end-user quality, not just automatic scores.

Summary TLDR

This paper tests large language models (mainly GPT-3.5 and GPT-4) on document-level machine translation. It compares prompt styles, commercial MT systems, and document-aware NMT methods across multiple benchmarks and domains. Key findings: contextual, multi-turn prompts improve translation; GPT-4 gives much better discourse-aware translations and explanations than GPT-3.5; human raters prefer GPT-4 outputs over commercial MT despite mixed automatic scores; supervised fine-tuning and RLHF appear to improve discourse skills. The authors release datasets, outputs, and annotations for reproducibility.

Problem Statement

Sentence-level MT ignores document context and produces inconsistent or incoherent translations on real documents. The paper asks: can large language models model discourse-level phenomena (consistency, zero-pronoun recovery, deixis, ellipsis) and how do prompts, training techniques, and existing document-NMT methods compare?

Main Contribution

Systematic evaluation of GPT-3.5 and GPT-4 on document-level MT across several benchmarks and domains.

Analysis of prompt designs showing multi-turn full-document prompts improve discourse handling.

Probing method and benchmark for measuring discourse knowledge (prediction + explanation).

Comparison with commercial MT and document-aware NMT methods using automatic and human evaluations.

Public release of instruction-based benchmark, system outputs, and human annotations.

Key Findings

Human raters prefer GPT-4 outputs over commercial MT systems on document translation.

NumbersHuman average (general/discourse): GPT-4 3.0/3.1 vs Google 1.7/1.8 (Table 4)

Automatic metrics give mixed results; commercial systems sometimes match or beat LLMs on n-gram overlap.

Numbersd-BLEU average: Tencent 26.0 vs GPT-4 25.5 (Table 4)

A multi-turn, whole-document prompt (P3) improves translation and discourse metrics versus sentence-level prompts.

NumbersP3 vs Base (News BLEU): 26.5 vs 25.5; (Fiction BLEU): 14.4 vs 12.4 (Table 2)

GPT-4 is substantially better than GPT-3.5 at probing and explaining discourse phenomena.

NumbersDeixis prediction: GPT-4 85.9% vs GPT-3.5 57.9%; explanation accuracy (deixis): GPT-4 93% vs GPT-3.5 18% (Tables 7,8)

Training techniques like supervised finetuning, code pretraining, and RLHF improve document translation and discourse probing.

NumbersInstructGPT+FeedME-1 d-BLEU 14.1 → +PPO 17.2; GPT-4 d-BLEU 18.8 (Table 9)

Results

Human (Average general/discourse)

ValueGPT-4 3.0/3.1, GPT-3.5 2.8/2.8, Google 1.7/1.8

BaselineGoogle Translate

Document BLEU (d-BLEU avg)

ValueTencent 26.0, GPT-4 25.5, GPT-3.5 24.9

BaselineTencent TranSmart

Prompt impact (BLEU)

ValueP3 News 26.5 vs Base 25.5

BaselineBase (sentence-level)

Accuracy

ValueGPT-4 85.9% vs GPT-3.5 57.9%

BaselineSent2Sent 51.1%

Who Should Care

What To Try In 7 Days

Run P3-style multi-turn document prompts on representative documents and compare outputs to your current MT.

Collect quick human judgments on fluency and discourse (20–50 docs) rather than only BLEU scores.

Probe a few failure modes (deixis, pronoun recovery, repeated terms) with contrastive examples to find weak spots.

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Results come from a limited set of datasets and domains; findings may not generalize to all languages or domains.
  • Closed-source, evolving LLMs reduce reproducibility; model updates can change outcomes.
  • Human evaluation criteria and granularity have room for refinement, especially in discourse scoring.

When Not To Use

  • When strict, repeatable automatic ranking is required (LLMs can be unstable for fine-grained ranking).
  • On languages or domains absent from the model training data where contamination or domain bias may affect outputs.
  • When you need full reproducibility of the exact model version without archives.

Failure Modes

  • Omissions or copying behavior in long documents.
  • Instability across runs and sensitivity to prompt wording.
  • Domain bias from training data causing poor performance on specialized registers.

Core Entities

Models

  • GPT-3
  • GPT-3.5
  • GPT-4
  • InstructGPT
  • CodexGPT
  • MCN
  • G-Trans
  • Sent2Sent
  • MR-Doc2Doc
  • MR-Doc2Sent
  • CADec
  • DocRepair
  • Google Translate
  • DeepL
  • Tencent TranSmart

Metrics

  • BLEU
  • d-BLEU
  • TER
  • COMET
  • CTT (consistency of terminology translation)
  • Accuracy
  • Human general/discourse scores

Datasets

  • mZPRT (Zh→En)
  • WMT2022 (Zh→En)
  • IWSLT2015 (Zh→En / En→De)
  • IWSLT2017
  • News Commentary v11 (En→De)
  • Europarl v7 (En→De)
  • OpenSub2018 (En→Ru)
  • Contrastive testset (Voita et al., 2019b)

Benchmarks

  • GuoFeng (zero pronoun benchmark referenced)
  • Contrastive discourse probe (Voita et al., 2019b)