Prompt LLMs to list and count major/minor translation errors to get human-like MT evaluations

Overview

Decision SnapshotReady For Pilot

EAPrompt is practically deployable for system-level evaluation and interpretable error reports; expect extra API cost, occasional unstable/invalid responses, and lower segment-level accuracy than supervised metrics.

Citations9

Evidence Strength0.80

Confidence0.88

Risk Signals10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 60%

Novelty: 60%

Authors

Qingyu Lu, Baopu Qiu, Liang Ding, Kanjian Zhang, Tom Kocmi, Dacheng Tao

Links

Abstract / PDF / Code / Data

Why It Matters For Business

EAPrompt makes LLM-based MT evaluation more interpretable and improves system-level ranking, letting teams replace some costly human MQM checks with cheaper automated analysis while keeping per-sentence caveats in mind.

Who Should Care

Product Manager ML Engineer Data Scientist Engineering Lead CTO

Summary TLDR

The paper introduces Error Analysis Prompting (EAPrompt): a two-step, one-shot prompt that asks an LLM to (1) identify itemized major/minor translation errors and (2) count them, then scores the translation by weighting errors. On the WMT22 test set (106k segments, 54 systems) EAPrompt improves system-level ranking over prior LLM prompting (GEMBA) and other metrics for GPT-3.5-Turbo (system pairwise accuracy 91.2%). EAPrompt also improves segment-level agreement with human MQM judgments in most cases, produces interpretable error lists, and can use regex counting to cut inference cost with small performance loss.

Problem Statement

Modern LLMs can rank MT systems well but perform poorly and uninterpretabily at the sentence (segment) level. The gap: LLM outputs often lack explicit error analyses that resemble human MQM judgments. The paper asks: can prompting LLMs to emulate human error analysis (major vs minor errors) yield explainable, human-like MT evaluation at system and segment levels?

Main Contribution

Proposes Error Analysis Prompting (EAPrompt): combine chain-of-thought style reasoning with itemized error identification and a separate counting step to emulate MQM.

Large-scale evaluation on WMT22 (106,758 segments, 54 systems) showing EAPrompt raises system-level pairwise accuracy and improves segment-level agreement versus GEMBA prompting.

Key Findings

EAPrompt raises system-level pairwise accuracy for GPT-3.5-Turbo.

NumbersSystem-level acc 91.2% (EAPrompt) vs 86.5% (GEMBA), +4.7

Practical UseIf you need better system-level ranking, use EAPrompt with GPT-3.5-Turbo instead of basic zero-shot prompts.

Evidence RefTable 2; §3.5

EAPrompt improves segment-level agreement with human MQM in nearly all cases versus GEMBA.

NumbersGPT-3.5-Turbo segment Acc En-De 56.7% vs 55.2% (+1.5); EAPrompt wins 8/9 scenarios

Practical UseFor per-sentence quality estimation, EAPrompt is generally better than GEMBA, but it still often trails supervised metrics fine-tuned on human data.

Evidence RefTable 2, Table 9; §3.5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	91.2%	GEMBA 86.5%	+4.7	WMT22 (All 3 language pairs)	Table 2; §3.5	—
Segment-level pairwise acc (En-De) - GPT-3.5-Turbo	56.7%	GEMBA 55.2%	+1.5	WMT22 En-De	Table 2; §3.5	—

What To Try In 7 Days

Run EAPrompt (2-step, itemized) on your MT outputs using GPT-3.5 to compare system rankings with current metrics.

Replace the second LLM counting query with a regex parser of itemized bullets to cut API cost.

Tune the major-error weight (start at reported defaults) and compare segment-level agreement to human labels on a small sample.

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Code URLs

https://github.com/Coldmist-Lu/ErrorAnalysis_Prompt

Data URLs

WMT22 metrics shared task (used as test set; MQM human annotations referenced in paper)

Risks & Boundaries

Limitations

Possible test-set contamination despite using WMT22; training data leakage may remain.

Evaluation on a limited set of LLMs and prompt choices due to budget constraints.

When Not To Use

When you need the highest possible per-sentence correlation with human MQM (use supervised metrics trained on human labels).

If you cannot afford extra LLM queries and cannot implement regex post-processing.

Failure Modes

LLM gives invalid or BLEU-like answers instead of itemized errors.

Input-order bias when evaluating multiple translations in one prompt.

Core Entities

Models

GPT-3.5-TurboLlama2-70b-ChatMixtral-8x7b-InstructGPT-4

Metrics

EAPromptGEMBABLEURT20COMET22COMET-QEUniTEMetricX-XXLMaTESe-QE

Datasets

WMT22

Benchmarks

WMT22 metrics shared task (MQM)

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

EAPrompt raises system-level pairwise accuracy for GPT-3.5-Turbo.

EAPrompt improves segment-level agreement with human MQM in nearly all cases versus GEMBA.

Results

What To Try In 7 Days

Reproducibility

Code URLs

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

Make LLM judges reliable and cheaper: EIF and tuned PPI give tighter, valid evaluation with few labels

Key finding

Model rankings from LLM judges become less biased and get valid confidence intervals when you down-weight unreliable judges

Key finding

Large Llama models beat smaller ones on iCliniq medical QA; LLM judges help measure clinical quality

Key finding

A judge-free, n-gram benchmark that approximates GPT-4 judging for Japanese Q&A

Key finding

Professional multilingual TruthfulQA shows truth gaps across languages but smaller than expected

Key finding