Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
Compressing evaluation prompts cuts LLM input-token usage by up to roughly 2.4× while preserving metric quality, which makes large-scale or repeated MT evaluations more affordable.
Summary TLDR
PromptOptMe trains a small model to compress the inputs of LLM-based MT evaluators (like GEMBA-MQM) in two stages: supervised fine-tuning to preserve error spans, then preference optimization (ORPO) to prefer compressions whose LLM scores match uncompressed prompts. On a 16k test set the best setup cuts token use up to 2.37× (19M → ≈8M tokens) while keeping system-level accuracy and often improving segment Kendall τ on evaluated language pairs. Task-agnostic compressors (LLMLingua-2) break metric quality, showing the need to preserve error spans.
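The token figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers from this summary (16k examples, 19M tokens, 2.37× compression); the price per million tokens is a hypothetical placeholder, not a figure from the paper, and the estimate ignores output tokens and the compressor's own inference cost.

```python
# Figures from the summary; the price is an invented placeholder.
TEST_EXAMPLES = 16_000
UNCOMPRESSED_TOKENS = 19_000_000
COMPRESSION_RATIO = 2.37
PRICE_PER_MILLION_TOKENS = 2.50  # hypothetical USD price, not from the paper


def compressed_tokens(total_tokens: float, ratio: float) -> float:
    """Token count left after compressing by the given ratio."""
    return total_tokens / ratio


def input_cost(tokens: float, price_per_million: float) -> float:
    """Input-token cost at a flat per-million-token price."""
    return tokens / 1_000_000 * price_per_million


tokens_per_example = UNCOMPRESSED_TOKENS / TEST_EXAMPLES
tokens_after = compressed_tokens(UNCOMPRESSED_TOKENS, COMPRESSION_RATIO)
savings = input_cost(UNCOMPRESSED_TOKENS - tokens_after, PRICE_PER_MILLION_TOKENS)

print(f"tokens per example (uncompressed): {tokens_per_example:.1f}")
print(f"tokens after compression: {tokens_after / 1e6:.2f}M")
print(f"estimated input-cost savings: ${savings:.2f}")
```

The per-example figure lands inside the ≈1100–1200 token range reported for these prompts, and the compressed total is consistent with the ≈8M tokens stated above.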
Problem Statement
LLM-based MT metrics align well with human judgments, but their prompts are long and expensive (≈1100–1200 tokens per example). This paper asks: can a small model compress prompt inputs to cut token costs while preserving evaluation quality?
Main Contribution
A two-stage prompt compression pipeline: supervised fine-tuning to preserve MQM error spans, then ORPO preference optimization to prefer compressions whose LLM scores match uncompressed inputs.
An MT-focused compression model (PromptOptMe), based on LLaMA-3.2 1B/3B, that reduces token use up to 2.37× on a 16k test set while retaining or improving evaluated metric quality.
Empirical comparison showing task-aware compression preserves metric quality, while task-agnostic compressors (LLMLingua-2) can catastrophically damage quality.
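To make the second stage concrete, here is a minimal sketch of how preference pairs might be assembled from candidate compressions. It assumes that the preferred compression is the one whose evaluator score lies closest to the score obtained with the uncompressed input; the exact selection rule used in the paper is not given in this summary. The `Candidate` class, the `build_preference_pair` helper, and the `chosen`/`rejected` field names are all illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    text: str     # a candidate compressed input
    score: float  # evaluator score obtained when using this compressed input


def build_preference_pair(
    reference_score: float, candidates: list[Candidate]
) -> Optional[dict[str, str]]:
    """Pick the closest and farthest candidates relative to the uncompressed score.

    Assumption: distance to the uncompressed score is the preference signal.
    Returns None when no meaningful preference can be formed.
    """
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: abs(c.score - reference_score))
    chosen, rejected = ranked[0], ranked[-1]
    if abs(chosen.score - reference_score) == abs(rejected.score - reference_score):
        return None  # all candidates are equally close, so there is nothing to prefer
    return {"chosen": chosen.text, "rejected": rejected.text}


# Toy example with invented scores (more negative = worse translation).
pair = build_preference_pair(
    reference_score=-5.0,
    candidates=[
        Candidate("compression A", -5.0),
        Candidate("compression B", -1.0),
        Candidate("compression C", -25.0),
    ],
)
print(pair)
```

Pairs in this shape, together with the original prompt, are the usual input to preference-tuning tooling, though the field names may need adapting to whichever trainer is used.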
Key Findings
Up to 2.37× reduction in input tokens for MT metric evaluation on the evaluated 16k test set.
Prompt compression can preserve or improve measured metric quality on several language pairs.
Task-agnostic prompt compressors can break MT metric quality.
The larger compressor (3B) improves both compression and accuracy relative to the smaller one (1B).
Results
Token Usage
Accuracy
Segment-level Kendall's τ (En-Ru)
Baseline compressor failure
Who Should Care
What To Try In 7 Days
Measure token usage and quality of your current LLM MT metric on a small test slice.
Swap full prompt for the paper's simplified prompt and re-run a sample evaluation.
Train a small compressor (LoRA on LLaMA-3.2 3B) on MQM data and test token savings vs quality on 1k examples.
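For the measurement steps above, a small harness that compares segment-level agreement with human scores before and after compression may be enough to start. The sketch below implements plain Kendall τ-a in pure Python, which may differ from the τ variant used in the paper; all scores are invented toy values.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total pairs; ties count as neither."""
    if len(x) != len(y):
        raise ValueError("score lists must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two segments")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Invented toy scores for five segments (more negative = worse translation).
human = [0, -1, -5, -10, -25]
metric_full_prompt = [0, -1, -5, -5, -25]
metric_compressed = [0, -5, -1, -10, -25]

tau_full = kendall_tau_a(human, metric_full_prompt)
tau_compressed = kendall_tau_a(human, metric_compressed)
print(f"tau with full prompts:       {tau_full:.2f}")
print(f"tau with compressed prompts: {tau_compressed:.2f}")
```

Running the same comparison alongside a token count for each prompt variant gives the savings-versus-quality trade-off on your own slice.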
Optimization Features
Token Efficiency
- Prompt Compression
- Context Compression
- Token Budgeting
Model Optimization
- LoRA
Training Optimization
- SFT
- Preference optimization (ORPO)
Inference Optimization
- Prompt input compression
- Fixed simplified instruction template
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluated only on machine translation using MQM-annotated WMT data; generalization to other NLG tasks is untested.
- Preference data and evaluation rely on GPT-4o; learned compressor may encode evaluator-specific bias.
- Only one openly available compressor backbone (LLaMA-3.2) was tested; broader model validation is missing.
- Training cost: full finetuning + ORPO used about 186 GPU hours; the inference cost of the compressor is not fully accounted for.
When Not To Use
- When no error-span annotations (MQM-like) are available for supervision.
- When strict adherence to full MQM error-typology is required for interpretability.
- If you cannot afford the one-time fine-tuning GPU budget or do not want an extra inference step.
Failure Modes
- Task-agnostic compressors can severely reduce metric quality (near-zero segment τ).
- Compressed prompts may omit fine-grained MQM categories, hurting error-category analyses.
- Compressor trained on one evaluator (GPT-4o) may not generalize to other evaluators.
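One cheap guard against the first two failure modes is to check how many annotated error spans survive compression before sending anything to the evaluator. The sketch below assumes spans are kept verbatim, so a case-insensitive substring match is sufficient; a compressor that paraphrases spans would need a looser match. The `span_retention` helper and the example strings are hypothetical.

```python
def span_retention(error_spans: list[str], compressed_text: str) -> float:
    """Fraction of annotated error spans found verbatim in the compressed text.

    Assumption: spans are copied unchanged, so case-insensitive substring
    matching is an adequate test.
    """
    if not error_spans:
        return 1.0  # nothing to preserve
    haystack = compressed_text.casefold()
    kept = sum(1 for span in error_spans if span.casefold() in haystack)
    return kept / len(error_spans)


# Invented example: one of two annotated spans survives compression.
spans = ["red house", "yesterday"]
compressed = "src: das blaue Haus | mt: the Red House"
retention = span_retention(spans, compressed)
print(f"error spans retained: {retention:.0%}")
```

A low retention rate on a sample of compressed inputs is an early warning that metric quality is likely to degrade.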
Core Entities
Models
- PROMPTOPTME-3B (LLaMA-3.2 3B finetuned)
- PROMPTOPTME-1B (LLaMA-3.2 1B finetuned)
- GPT-4o (evaluator)
- GPT-4o mini (evaluator)
- LLaMA 3.2-90B (evaluator baseline)
Metrics
- Accuracy
- Segment-level Kendall's τ
Datasets
- WMT Metrics MQM annotations (2020–2022)
- WMT22 Metrics Challenge test set (mentioned in the paper; on the order of 60k examples)
- Test subset: ~16k news-domain examples (held-out)
Benchmarks
- GEMBA-MQM (LLM-based MT metric)

