Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.3
Citation Count
9
Why It Matters For Business
EAPrompt makes LLM-based MT evaluation more interpretable and improves system-level ranking, letting teams replace some costly human MQM checks with cheaper automated analysis. Segment-level scores still trail supervised metrics, so treat per-sentence judgments with caution.
Summary TLDR
The paper introduces Error Analysis Prompting (EAPrompt): a two-step, one-shot prompt that asks an LLM to (1) identify itemized major/minor translation errors and (2) count them, then scores the translation by weighting errors. On the WMT22 test set (106k segments, 54 systems) EAPrompt improves system-level ranking over prior LLM prompting (GEMBA) and other metrics for GPT-3.5-Turbo (system pairwise accuracy 91.2%). EAPrompt also improves segment-level agreement with human MQM judgments in most cases, produces interpretable error lists, and can use regex counting to cut inference cost with small performance loss.
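The weighted scoring step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names `ErrorCounts` and `ea_score` are hypothetical, and the default weights (5 for major, 1 for minor) are placeholder assumptions loosely following a common MQM weighting convention, not values taken from this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorCounts:
    """Number of major and minor errors found in one translation."""
    major: int
    minor: int


def ea_score(counts: ErrorCounts,
             major_weight: float = 5.0,   # assumed placeholder weight
             minor_weight: float = 1.0) -> float:  # assumed placeholder weight
    """Return a penalty-style score: 0 is error-free, more negative is worse."""
    return -(major_weight * counts.major + minor_weight * counts.minor)


if __name__ == "__main__":
    # One major and two minor errors under the assumed weights.
    print(ea_score(ErrorCounts(major=1, minor=2)))
```

Under these assumptions a translation with no errors scores 0, and each major error costs as much as five minor ones.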
Problem Statement
Modern LLMs can rank MT systems well, but at the sentence (segment) level they perform poorly and their scores are hard to interpret. The gap: LLM outputs often lack explicit error analyses that resemble human MQM judgments. The paper asks: can prompting LLMs to emulate human error analysis (major vs. minor errors) yield explainable, human-like MT evaluation at both system and segment levels?
Main Contribution
- Proposes Error Analysis Prompting (EAPrompt), which combines chain-of-thought style reasoning with itemized error identification and a separate counting step to emulate MQM.
- Large-scale evaluation on WMT22 (106,758 segments, 54 systems) showing that EAPrompt raises system-level pairwise accuracy and improves segment-level agreement versus GEMBA prompting.
- Practical variants and cost tweaks: recommends a 2-step itemized prompt and shows that regex-based counting can replace a second LLM call with little loss.
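For readers unfamiliar with system-level pairwise accuracy, the sketch below shows one simple way to compute it: the share of system pairs that the metric orders the same way as the human scores. The function name and the choice to skip pairs tied in the human scores are assumptions for illustration; the shared task's exact protocol may differ.

```python
from itertools import combinations
from typing import Mapping


def system_pairwise_accuracy(metric: Mapping[str, float],
                             human: Mapping[str, float]) -> float:
    """Fraction of system pairs ranked in the same order by metric and human."""
    if set(metric) != set(human):
        raise ValueError("metric and human scores must cover the same systems")
    agree = total = 0
    for a, b in combinations(sorted(metric), 2):
        human_delta = human[a] - human[b]
        if human_delta == 0:
            continue  # assumption: pairs tied in the human scores are skipped
        total += 1
        if human_delta * (metric[a] - metric[b]) > 0:
            agree += 1
    return agree / total if total else float("nan")


if __name__ == "__main__":
    # Toy scores for three hypothetical systems (higher is better).
    metric_scores = {"sysA": -1.0, "sysB": -3.0, "sysC": -2.0}
    human_scores = {"sysA": -0.5, "sysB": -2.0, "sysC": -2.5}
    print(system_pairwise_accuracy(metric_scores, human_scores))
```

In the toy example the metric agrees with the human ordering on two of the three pairs.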
Key Findings
- EAPrompt raises system-level pairwise accuracy for GPT-3.5-Turbo.
- EAPrompt improves segment-level agreement with human MQM in nearly all cases versus GEMBA.
- EAPrompt produces human-like error distributions and discriminates major from minor errors.
- Replacing the counting query with regex matching cuts inference cost with a small performance loss.
- The best-performing prompt design is two-step with an itemized error demonstration.
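The regex-based counting mentioned above might look like the sketch below. It assumes the LLM response lists errors under "Major errors:" and "Minor errors:" headers, with one bulleted or numbered item per error and "None" for an empty section. That layout, like the name `count_errors`, is an assumption; adapt the patterns to whatever format your prompt actually elicits.

```python
import re

# Assumed response layout: a header line per severity, then one item per error.
HEADER = re.compile(r"^\s*(major|minor)\s+errors?\b", re.IGNORECASE)
ITEM = re.compile(r"^\s*(?:[-*•]|\(?\d+[.)])\s+(.+)$")
EMPTY = re.compile(r"^\s*(?:none|no errors?|n/a)\b", re.IGNORECASE)


def count_errors(response: str) -> dict:
    """Count itemized major and minor errors in an LLM error-analysis response."""
    counts = {"major": 0, "minor": 0}
    section = None
    for line in response.splitlines():
        header = HEADER.match(line)
        if header:
            section = header.group(1).lower()
            continue
        item = ITEM.match(line)
        # Count only items inside a section that are not "None"-style fillers.
        if section and item and not EMPTY.match(item.group(1)):
            counts[section] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Major errors:\n"
        "(1) 'bank' mistranslated as riverbank\n"
        "(2) Negation omitted\n"
        "Minor errors:\n"
        "(1) None\n"
    )
    print(count_errors(sample))
```

Because the counts come from parsing the first response, no second LLM call is needed for this step.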
Results
Accuracy
Segment-level pairwise acc (En-De) - GPT-3.5-Turbo
Segment-level pairwise acc (En-Ru) - GPT-3.5-Turbo
Segment-level pairwise acc (Zh-En) - GPT-3.5-Turbo
Who Should Care
What To Try In 7 Days
- Run EAPrompt (2-step, itemized) with GPT-3.5-Turbo on your MT outputs and compare the resulting system rankings with those from your current metrics.
- Replace the second LLM counting query with a regex parser over the itemized bullets to cut API cost.
- Tune the major-error weight (start at the reported defaults) and compare segment-level agreement with human labels on a small sample.
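The weight-tuning step could be prototyped along these lines. This is a rough sketch under stated assumptions: the candidate weights are arbitrary placeholders, the minor-error weight is fixed at 1, and `pairwise_agreement` is a simplified concordance measure that ignores pairs tied in the human scores, not the shared task's official segment-level statistic.

```python
from itertools import combinations
from typing import Sequence, Tuple


def pairwise_agreement(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Share of segment pairs ordered the same way by pred and gold."""
    agree = total = 0
    for i, j in combinations(range(len(gold)), 2):
        gold_delta = gold[i] - gold[j]
        if gold_delta == 0:
            continue  # assumption: skip pairs tied in the human scores
        total += 1
        if (pred[i] - pred[j]) * gold_delta > 0:
            agree += 1
    return agree / total if total else 0.0


def tune_major_weight(counts: Sequence[Tuple[int, int]],
                      human: Sequence[float],
                      candidates: Sequence[float] = (1, 2, 5, 10, 25)):
    """Return (weight, agreement) for the best candidate major-error weight."""
    best = None
    for weight in candidates:
        # Minor-error weight is fixed at 1 in this sketch.
        pred = [-(weight * major + minor) for major, minor in counts]
        score = pairwise_agreement(pred, human)
        if best is None or score > best[1]:
            best = (weight, score)
    return best


if __name__ == "__main__":
    # (major, minor) counts per segment and toy human scores (higher is better).
    counts = [(0, 0), (0, 3), (1, 0), (2, 1)]
    human = [0.0, -3.0, -5.0, -11.0]
    print(tune_major_weight(counts, human))
```

On a small labeled sample the result can overfit, so it is worth checking the chosen weight on held-out segments before adopting it.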
Reproducibility
Data Urls
- WMT22 metrics shared task (used as test set; MQM human annotations referenced in paper)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Test-set contamination remains possible even with WMT22; its data may have leaked into LLM training sets.
- Evaluation on a limited set of LLMs and prompt choices due to budget constraints.
- EAPrompt improves but does not fully match supervised, fine-tuned metrics at segment level.
When Not To Use
- When you need the highest possible per-sentence correlation with human MQM (use supervised metrics trained on human labels).
- If you cannot afford extra LLM queries and cannot implement regex post-processing.
- When your test data may be present in LLM training sets and contamination would bias results.
Failure Modes
- LLM gives invalid or BLEU-like answers instead of itemized errors.
- Input-order bias when evaluating multiple translations in one prompt.
- Response variability if sampling temperature is not controlled.
- Miscounting or misclassifying errors when prompts are too verbose.
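One way to guard against the first failure mode is to check each response for the expected structure before scoring it. The sketch below assumes a valid response contains both a "Major errors" and a "Minor errors" header; the function name and that format are hypothetical, and what to do with rejected responses (retry, drop, or fall back to another metric) is left to the caller.

```python
import re

# Assumed format: each severity section starts with its own header line.
SECTION = re.compile(r"^\s*(major|minor)\s+errors?\b",
                     re.IGNORECASE | re.MULTILINE)


def is_itemized_response(response: str) -> bool:
    """True if the response contains both major and minor error sections."""
    found = {match.group(1).lower() for match in SECTION.finditer(response)}
    return found == {"major", "minor"}


if __name__ == "__main__":
    print(is_itemized_response("Score: 0.85"))
    print(is_itemized_response("Major errors:\n(1) x\nMinor errors:\nNone"))
```

For the variability failure mode, fixing the sampling temperature in the API request (for example, to 0) is the usual mitigation.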
Core Entities
Models
- GPT-3.5-Turbo
- Llama2-70b-Chat
- Mixtral-8x7b-Instruct
- GPT-4
Metrics
- EAPrompt
- GEMBA
- BLEURT20
- COMET22
- COMET-QE
- UniTE
- MetricX-XXL
- MaTESe-QE
Datasets
- WMT22
Benchmarks
- WMT22 metrics shared task (MQM)

