Prompt LLMs to list and count major/minor translation errors to get human-like MT evaluations

March 24, 20237 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.3

Citation Count

9

Authors

Qingyu Lu, Baopu Qiu, Liang Ding, Kanjian Zhang, Tom Kocmi, Dacheng Tao

Links

Abstract / PDF

Why It Matters For Business

EAPrompt makes LLM-based MT evaluation more interpretable and improves system-level ranking, letting teams replace some costly human MQM checks with cheaper automated analysis while keeping per-sentence caveats in mind.

Summary TLDR

The paper introduces Error Analysis Prompting (EAPrompt): a two-step, one-shot prompt that asks an LLM to (1) identify itemized major/minor translation errors and (2) count them, then scores the translation by weighting errors. On the WMT22 test set (106k segments, 54 systems) EAPrompt improves system-level ranking over prior LLM prompting (GEMBA) and other metrics for GPT-3.5-Turbo (system pairwise accuracy 91.2%). EAPrompt also improves segment-level agreement with human MQM judgments in most cases, produces interpretable error lists, and can use regex counting to cut inference cost with small performance loss.

Problem Statement

Modern LLMs can rank MT systems well but perform poorly and uninterpretabily at the sentence (segment) level. The gap: LLM outputs often lack explicit error analyses that resemble human MQM judgments. The paper asks: can prompting LLMs to emulate human error analysis (major vs minor errors) yield explainable, human-like MT evaluation at system and segment levels?

Main Contribution

Proposes Error Analysis Prompting (EAPrompt): combine chain-of-thought style reasoning with itemized error identification and a separate counting step to emulate MQM.

Large-scale evaluation on WMT22 (106,758 segments, 54 systems) showing EAPrompt raises system-level pairwise accuracy and improves segment-level agreement versus GEMBA prompting.

Practical variants and cost tweaks: recommend a 2-step itemized prompt and show regex-based counting can replace a second LLM call with little loss.

Key Findings

EAPrompt raises system-level pairwise accuracy for GPT-3.5-Turbo.

NumbersSystem-level acc 91.2% (EAPrompt) vs 86.5% (GEMBA), +4.7

EAPrompt improves segment-level agreement with human MQM in nearly all cases versus GEMBA.

NumbersGPT-3.5-Turbo segment Acc En-De 56.7% vs 55.2% (+1.5); EAPrompt wins 8/9 scenarios

EAPrompt produces human-like error distributions and discriminates major from minor errors.

NumbersChosen major-error weights w*major: GPT-3.5=6, Llama2=10, Mixtral=10; performance drops when wmajor < 3

Replacing the counting query with regex matching cuts inference cost with small performance loss.

NumbersGPT-3.5-Turbo system acc: 90.1% (regex) vs 91.2% (LLM count), −1.1

Best prompt design is two-step + itemized error demo.

Numbers2-step itemized EAPrompt achieves top system Acc across tested LLMs (see Table 3)

Results

Accuracy

Value91.2%

BaselineGEMBA 86.5%

Segment-level pairwise acc (En-De) - GPT-3.5-Turbo

Value56.7%

BaselineGEMBA 55.2%

Segment-level pairwise acc (En-Ru) - GPT-3.5-Turbo

Value53.4%

BaselineGEMBA 49.5%

Segment-level pairwise acc (Zh-En) - GPT-3.5-Turbo

Value50.0%

BaselineGEMBA 48.2%

Who Should Care

What To Try In 7 Days

Run EAPrompt (2-step, itemized) on your MT outputs using GPT-3.5 to compare system rankings with current metrics.

Replace the second LLM counting query with a regex parser of itemized bullets to cut API cost.

Tune the major-error weight (start at reported defaults) and compare segment-level agreement to human labels on a small sample.

Reproducibility

Data Urls

  • WMT22 metrics shared task (used as test set; MQM human annotations referenced in paper)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Possible test-set contamination despite using WMT22; training data leakage may remain.
  • Evaluation on a limited set of LLMs and prompt choices due to budget constraints.
  • EAPrompt improves but does not fully match supervised, fine-tuned metrics at segment level.

When Not To Use

  • When you need the highest possible per-sentence correlation with human MQM (use supervised metrics trained on human labels).
  • If you cannot afford extra LLM queries and cannot implement regex post-processing.
  • When your test data may be present in LLM training sets and contamination would bias results.

Failure Modes

  • LLM gives invalid or BLEU-like answers instead of itemized errors.
  • Input-order bias when evaluating multiple translations in one prompt.
  • Response variability if sampling temperature is not controlled.
  • Miscounting or misclassifying errors when prompts are too verbose.

Core Entities

Models

  • GPT-3.5-Turbo
  • Llama2-70b-Chat
  • Mixtral-8x7b-Instruct
  • GPT-4

Metrics

  • EAPrompt
  • GEMBA
  • BLEURT20
  • COMET22
  • COMET-QE
  • UniTE
  • MetricX-XXL
  • MaTESe-QE

Datasets

  • WMT22

Benchmarks

  • WMT22 metrics shared task (MQM)