Prompt LLMs to list and count major/minor translation errors to get human-like MT evaluations

March 24, 20237 min

Overview

Decision SnapshotReady For Pilot

EAPrompt is practically deployable for system-level evaluation and interpretable error reports; expect extra API cost, occasional unstable/invalid responses, and lower segment-level accuracy than supervised metrics.

Citations9

Evidence Strength0.80

Confidence0.88

Risk Signals10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 60%

Novelty: 60%

Authors

Qingyu Lu, Baopu Qiu, Liang Ding, Kanjian Zhang, Tom Kocmi, Dacheng Tao

Links

Abstract / PDF / Code / Data

Why It Matters For Business

EAPrompt makes LLM-based MT evaluation more interpretable and improves system-level ranking, letting teams replace some costly human MQM checks with cheaper automated analysis while keeping per-sentence caveats in mind.

Who Should Care

Summary TLDR

The paper introduces Error Analysis Prompting (EAPrompt): a two-step, one-shot prompt that asks an LLM to (1) identify itemized major/minor translation errors and (2) count them, then scores the translation by weighting errors. On the WMT22 test set (106k segments, 54 systems) EAPrompt improves system-level ranking over prior LLM prompting (GEMBA) and other metrics for GPT-3.5-Turbo (system pairwise accuracy 91.2%). EAPrompt also improves segment-level agreement with human MQM judgments in most cases, produces interpretable error lists, and can use regex counting to cut inference cost with small performance loss.

Problem Statement

Modern LLMs can rank MT systems well but perform poorly and uninterpretabily at the sentence (segment) level. The gap: LLM outputs often lack explicit error analyses that resemble human MQM judgments. The paper asks: can prompting LLMs to emulate human error analysis (major vs minor errors) yield explainable, human-like MT evaluation at system and segment levels?

Main Contribution

Proposes Error Analysis Prompting (EAPrompt): combine chain-of-thought style reasoning with itemized error identification and a separate counting step to emulate MQM.

Large-scale evaluation on WMT22 (106,758 segments, 54 systems) showing EAPrompt raises system-level pairwise accuracy and improves segment-level agreement versus GEMBA prompting.

Key Findings

EAPrompt raises system-level pairwise accuracy for GPT-3.5-Turbo.

NumbersSystem-level acc 91.2% (EAPrompt) vs 86.5% (GEMBA), +4.7

Practical UseIf you need better system-level ranking, use EAPrompt with GPT-3.5-Turbo instead of basic zero-shot prompts.

Evidence RefTable 2; §3.5

EAPrompt improves segment-level agreement with human MQM in nearly all cases versus GEMBA.

NumbersGPT-3.5-Turbo segment Acc En-De 56.7% vs 55.2% (+1.5); EAPrompt wins 8/9 scenarios

Practical UseFor per-sentence quality estimation, EAPrompt is generally better than GEMBA, but it still often trails supervised metrics fine-tuned on human data.

Evidence RefTable 2, Table 9; §3.5

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Accuracy91.2%GEMBA 86.5%+4.7WMT22 (All 3 language pairs)Table 2; §3.5
Segment-level pairwise acc (En-De) - GPT-3.5-Turbo56.7%GEMBA 55.2%+1.5WMT22 En-DeTable 2; §3.5

What To Try In 7 Days

Run EAPrompt (2-step, itemized) on your MT outputs using GPT-3.5 to compare system rankings with current metrics.

Replace the second LLM counting query with a regex parser of itemized bullets to cut API cost.

Tune the major-error weight (start at reported defaults) and compare segment-level agreement to human labels on a small sample.

Reproducibility

Code AvailableYes
Data AvailableYes
Open Source StatusPartial
LicenseUnknown

Data URLs

WMT22 metrics shared task (used as test set; MQM human annotations referenced in paper)

Risks & Boundaries

Limitations

Possible test-set contamination despite using WMT22; training data leakage may remain.

Evaluation on a limited set of LLMs and prompt choices due to budget constraints.

When Not To Use

When you need the highest possible per-sentence correlation with human MQM (use supervised metrics trained on human labels).

If you cannot afford extra LLM queries and cannot implement regex post-processing.

Failure Modes

LLM gives invalid or BLEU-like answers instead of itemized errors.

Input-order bias when evaluating multiple translations in one prompt.

Core Entities

Models

GPT-3.5-TurboLlama2-70b-ChatMixtral-8x7b-InstructGPT-4

Metrics

EAPromptGEMBABLEURT20COMET22COMET-QEUniTEMetricX-XXLMaTESe-QE

Datasets

WMT22

Benchmarks

WMT22 metrics shared task (MQM)