Compress MT evaluation prompts to cut tokens ~2.37× while keeping evaluation quality

December 20, 2024 · 6 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

0

Authors

Daniil Larionov, Steffen Eger

Why It Matters For Business

Compress evaluation prompts to cut LLM token bills by roughly 2.4× while keeping metric quality, making large-scale or repeated MT evaluations more affordable.

Summary TLDR

PromptOptMe trains a small model to compress the inputs of LLM-based MT evaluators (like GEMBA-MQM) in two stages: supervised fine-tuning to preserve error spans, then preference optimization (ORPO) to prefer compressions whose LLM scores match those of uncompressed prompts. On a 16k-example test set the best setup cuts token use by up to 2.37× (19M → ≈8M tokens) while keeping system-level pairwise accuracy close to the uncompressed baseline (0.7736 vs. 0.7789) and slightly improving segment-level Kendall τ on the reported language pairs. Task-agnostic compressors (LLMLingua-2) break metric quality, which shows the need to preserve error spans.
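The summary stresses that a useful compression must keep the annotated error spans. A minimal diagnostic for that property might look like the sketch below. The function name `span_retention` and the verbatim-substring criterion are assumptions made for illustration; they are not the paper's training objective.

```python
def span_retention(compressed: str, error_spans: list[str]) -> float:
    """Fraction of annotated error spans that still appear verbatim
    in the compressed text (hypothetical diagnostic, not the paper's loss)."""
    if not error_spans:
        return 1.0  # nothing to preserve
    kept = sum(1 for span in error_spans if span in compressed)
    return kept / len(error_spans)


# Toy example: two annotated error spans in a translation.
spans = ["the the", "his"]
task_aware = "cat sat on the the mat, ate his dinner"   # keeps both spans
task_agnostic = "cat sat on mat and ate dinner"          # drops both spans

retention_aware = span_retention(task_aware, spans)
retention_agnostic = span_retention(task_agnostic, spans)
```

A compressor that shortens the text but scores low on this check has probably removed exactly the evidence the evaluator needs.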

Problem Statement

LLM-based MT metrics align well with human judgments, but their prompts are long and expensive (≈1100–1200 tokens per example). This paper asks: can a small model compress prompt inputs to cut token costs while preserving evaluation quality?

Main Contribution

A two-stage prompt compression pipeline: supervised fine-tuning to preserve MQM error spans, then ORPO preference optimization to prefer compressions whose LLM scores match uncompressed inputs.

An MT-focused compression model (PROMPTOPTME) based on LLaMA-3.2 1B/3B that reduces token use up to 2.37× on a 16k test set while retaining or improving evaluated metric quality.

Empirical comparison showing task-aware compression preserves metric quality, while task-agnostic compressors (LLMLingua-2) can catastrophically damage quality.
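The second stage needs preference pairs in which the preferred compression yields an evaluator score close to the one obtained from the uncompressed input. One way to build such pairs is sketched below. The `Candidate` class, the tie-break on token count, and the choice of the farthest candidate as the rejected one are assumptions for illustration; this summary does not specify the paper's exact selection rule.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str        # compressed input
    score: float     # evaluator score obtained with this compressed input
    n_tokens: int    # length of the compressed input in tokens


def build_preference_pair(reference_score: float, candidates: list[Candidate]):
    """Return (chosen, rejected) for preference optimization.

    chosen: score closest to the uncompressed reference, shorter wins ties.
    rejected: score farthest from the reference.
    (Hypothetical rule, assumed for this sketch.)
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidate compressions")
    ranked = sorted(
        candidates,
        key=lambda c: (abs(c.score - reference_score), c.n_tokens),
    )
    return ranked[0], ranked[-1]


# Toy example: the uncompressed prompt received an MQM-style score of -5.0.
cands = [
    Candidate("long, faithful", score=-5.0, n_tokens=40),
    Candidate("short, faithful", score=-5.0, n_tokens=25),
    Candidate("short, lossy", score=-1.0, n_tokens=10),
]
chosen, rejected = build_preference_pair(-5.0, cands)
```

Pairs of this shape (prompt, chosen, rejected) are the usual input format for preference-optimization trainers, including ORPO implementations.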

Key Findings

Up to 2.37× reduction in input tokens for MT metric evaluation on the evaluated 16k test set.

Numbers: 19M → 8.07M tokens (reduction 2.37×)

Prompt compression can preserve or improve measured metric quality on several language pairs.

Numbers: En-Ru τ: 0.4365 → 0.4455; En-De τ: 0.395 → 0.4065; Zh-En τ: 0.3692 → 0.3738

Task-agnostic prompt compressors can break MT metric quality.

Numbers: LLMLingua-2 @50%: pairwise accuracy 0.4736 vs. baseline 0.7789; segment-level τ near 0

A larger compressor model improves both compression and accuracy over a smaller one.

Numbers: PROMPTOPTME-3B vs. 1B: 2.37× vs. 2.15× token reduction; pairwise accuracy 0.7736 vs. 0.7644
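The token figures above translate into savings with simple arithmetic, sketched here. The helper names are hypothetical. Note that dividing the rounded figures (19M and 8.07M) gives a factor of about 2.35; the reported 2.37× presumably comes from the unrounded token counts. The calculation also ignores the compressor's own inference cost, which the summary lists as not fully accounted for.

```python
def reduction_factor(baseline_tokens: float, compressed_tokens: float) -> float:
    """How many times fewer tokens the compressed run uses."""
    return baseline_tokens / compressed_tokens


def fraction_saved(factor: float) -> float:
    """Share of the original token bill that disappears at a given factor."""
    return 1.0 - 1.0 / factor


# Rounded figures from this summary.
factor_from_rounded = reduction_factor(19e6, 8.07e6)   # about 2.35

# Savings implied by the reported 2.37x and 2.15x reductions.
saved_3b = fraction_saved(2.37)   # about 58% of input tokens
saved_1b = fraction_saved(2.15)   # about 53% of input tokens
```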

Results

Token Usage

Value: 19M → 8.07M

Baseline: 19M (GPT-4o ref)

Accuracy

Value: 0.7736 (PROMPTOPTME-3B with GPT-4o lite)

Baseline: 0.7789 (GPT-4o ref)

Segment-level Kendall's τ (En-Ru)

Value: 0.4455 (PROMPTOPTME-3B)

Baseline: 0.4365 (GPT-4o ref)

Baseline compressor failure

Value: LLMLingua-2 @50%, pairwise accuracy 0.4736

Baseline: 0.7789 (GPT-4o ref)

What To Try In 7 Days

Measure token usage and quality of your current LLM MT metric on a small test slice.

Swap full prompt for the paper's simplified prompt and re-run a sample evaluation.

Train a small compressor (LoRA on LLaMA-3.2 3B) on MQM data and test token savings vs quality on 1k examples.
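For the first and last steps, a small harness that measures token reduction on a sample can be written independently of any particular tokenizer or compressor. The sketch below takes both as plain callables; `whitespace_count` and `drop_short_words` are toy stand-ins assumed for illustration, to be replaced with the evaluator's real tokenizer and the trained compressor.

```python
from typing import Callable, Sequence


def token_reduction(
    inputs: Sequence[str],
    compress: Callable[[str], str],
    count_tokens: Callable[[str], int],
) -> float:
    """Total tokens before compression divided by total tokens after."""
    before = sum(count_tokens(x) for x in inputs)
    after = sum(count_tokens(compress(x)) for x in inputs)
    if after == 0:
        raise ValueError("compressed inputs contain no tokens")
    return before / after


def whitespace_count(text: str) -> int:
    """Toy token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


def drop_short_words(text: str) -> str:
    """Toy compressor (assumption): keep only words longer than three characters."""
    return " ".join(w for w in text.split() if len(w) > 3)


samples = [
    "the quick brown fox jumps over the lazy dog",
    "a translation with an error in it",
]
ratio = token_reduction(samples, drop_short_words, whitespace_count)
```

Running the same slice through the evaluator with and without compression then gives the quality side of the trade-off.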

Optimization Features

Token Efficiency

  • Prompt Compression
  • Context Compression
  • Token Budgeting

Model Optimization

  • LoRA

Training Optimization

  • SFT
  • Preference optimization (ORPO)

Inference Optimization

  • Prompt input compression
  • Fixed simplified instruction template

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluated only on machine translation using MQM-annotated WMT data; generalization to other NLG tasks is untested.
  • Preference data and evaluation rely on GPT-4o; learned compressor may encode evaluator-specific bias.
  • Only one openly available compressor backbone family (LLaMA-3.2) was tested; broader model validation is missing.
  • Training cost: full finetuning + ORPO used about 186 GPU hours; inference cost of the compressor is not fully accounted.

When Not To Use

  • When no error-span annotations (MQM-like) are available for supervision.
  • When strict adherence to full MQM error-typology is required for interpretability.
  • If you cannot afford the one-time fine-tuning GPU budget or do not want an extra inference step.

Failure Modes

  • Task-agnostic compressors can severely reduce metric quality (near-zero segment τ).
  • Compressed prompts may omit fine-grained MQM categories, hurting error-category analyses.
  • Compressor trained on one evaluator (GPT-4o) may not generalize to other evaluators.

Core Entities

Models

  • PROMPTOPTME-3B (LLaMA-3.2 3B finetuned)
  • PROMPTOPTME-1B (LLaMA-3.2 1B finetuned)
  • GPT-4o (evaluator)
  • GPT-4o mini (evaluator)
  • LLaMA 3.2-90B (evaluator baseline)

Metrics

  • Accuracy
  • Segment-level Kendall's τ
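Both metrics compare orderings rather than raw values. The sketch below shows simplified versions: pairwise accuracy as the share of system pairs ranked the same way by metric and humans, and Kendall's τ as (concordant − discordant) / (concordant + discordant) with tied pairs skipped. The tie handling is an assumption made for brevity; WMT evaluations use specific τ variants that may differ.

```python
from itertools import combinations


def _sign(x: float) -> int:
    return (x > 0) - (x < 0)


def pairwise_accuracy(human: dict[str, float], metric: dict[str, float]) -> float:
    """Share of system pairs where metric and human agree on which is better."""
    pairs = list(combinations(sorted(human), 2))
    agree = sum(
        1 for a, b in pairs
        if _sign(human[a] - human[b]) == _sign(metric[a] - metric[b])
    )
    return agree / len(pairs)


def kendall_tau(human: list[float], metric: list[float]) -> float:
    """Simplified Kendall tau over segments; pairs tied in either list are skipped."""
    if len(human) != len(metric):
        raise ValueError("score lists must have the same length")
    concordant = discordant = 0
    for i, j in combinations(range(len(human)), 2):
        s = _sign(human[i] - human[j]) * _sign(metric[i] - metric[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0


# Toy data: three systems, four segments.
acc = pairwise_accuracy(
    {"A": 0.9, "B": 0.5, "C": 0.1},
    {"A": 0.8, "B": 0.2, "C": 0.3},
)
tau = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
```

In the toy data the metric misorders one of three system pairs and one of six segment pairs.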

Datasets

  • WMT Metrics MQM annotations (2020–2022)
  • WMT22 Metrics Challenge test set (mentioned; on the order of 60k examples)
  • Test subset: ~16k news-domain examples (held-out)

Benchmarks

  • GEMBA-MQM (LLM-based MT metric)