Compress MT evaluation prompts to cut tokens ~2.37× while keeping evaluation quality

Overview

Decision Snapshot: Ready For Pilot

The method shows strong empirical gains on a 16k MT test set and uses standard components (LoRA, ORPO), but is validated only on MT and on preferences collected with GPT-4o, so wider production readiness requires more cross-model and task checks.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.87

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Daniil Larionov, Steffen Eger

Links

Abstract / PDF

Why It Matters For Business

Compress evaluation prompts to cut LLM token bills by roughly 2.4× while keeping metric quality, making large-scale or repeated MT evaluations more affordable.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

PromptOptMe trains a small model to compress the inputs of LLM-based MT evaluators (such as GEMBA-MQM) in two stages: supervised fine-tuning to preserve error spans, then preference optimization (ORPO) to favor compressions whose LLM scores match those of the uncompressed prompts. On a 16k test set, the best setup cuts token use by up to 2.37× (19M → ≈8M tokens) while keeping system-level accuracy comparable and often improving segment-level Kendall τ on the evaluated language pairs. Task-agnostic compressors (LLMLingua-2) break metric quality, which shows the need to preserve error spans.

Problem Statement

LLM-based MT metrics align well with human judgments, but their prompts are long and expensive (≈1100–1200 tokens per example). This paper asks: can a small model compress prompt inputs to cut token costs while preserving evaluation quality?

Main Contribution

A two-stage prompt compression pipeline: supervised fine-tuning to preserve MQM error spans, then ORPO preference optimization to prefer compressions whose LLM scores match uncompressed inputs.

An MT-focused compression model (PROMPTOPTME) based on LLaMA-3.2 1B/3B that reduces token use by up to 2.37× on a 16k test set while retaining or improving metric quality on the evaluated language pairs.
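The second stage depends on preference pairs in which the preferred compression is the one whose evaluator score stays closest to the score of the uncompressed prompt. The following is a minimal sketch of one way such pairs could be built. The selection rule (closest score as chosen, farthest as rejected), the `Candidate` type, and the example scores are all assumptions for illustration, since this summary does not describe the paper's exact procedure.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str         # a compressed version of the evaluator input (hypothetical)
    llm_score: float  # score the evaluator LLM returned for this compression


def build_preference_pair(reference_score, candidates):
    """Return (chosen_text, rejected_text), or None if there is no usable contrast.

    Assumed rule: the candidate whose score is closest to the uncompressed
    (reference) score is "chosen"; the one farthest away is "rejected".
    """
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: abs(c.llm_score - reference_score))
    chosen, rejected = ranked[0], ranked[-1]
    # Skip pairs where both candidates are equally far from the reference.
    if abs(chosen.llm_score - reference_score) == abs(rejected.llm_score - reference_score):
        return None
    return chosen.text, rejected.text


# Hypothetical MQM-style penalty scores for three compressions of one example.
candidates = [
    Candidate("compression A", -5.0),
    Candidate("compression B", -1.0),
    Candidate("compression C", -6.0),
]
pair = build_preference_pair(reference_score=-5.0, candidates=candidates)
print(pair)
```

Pairs in this (chosen, rejected) form are the usual input to preference-optimization trainers; how the candidates are sampled is left open here.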

Key Findings

Up to 2.37× reduction in input tokens for MT metric evaluation on the evaluated 16k test set.

Numbers: 19M → 8.07M tokens (reduction 2.37×)

Practical Use: Expect roughly 2.4× lower token bills when using PROMPTOPTME-3B on similar MT evaluation workloads.

Evidence Ref: Table 1; Section 5
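To translate the token counts into a budget estimate, the arithmetic can be sketched as below. The price per million tokens is a placeholder assumption, not a figure from the paper. Note that the rounded counts quoted here (19M and 8.07M) give a ratio of about 2.35×, slightly under the reported 2.37×, presumably because the 19M baseline is itself rounded.

```python
def reduction_factor(baseline_tokens, compressed_tokens):
    """How many times fewer input tokens the compressed setup uses."""
    return baseline_tokens / compressed_tokens


def savings_fraction(baseline_tokens, compressed_tokens):
    """Share of input tokens (and input-token cost) that is saved."""
    return 1.0 - compressed_tokens / baseline_tokens


def input_cost(tokens, usd_per_million_tokens):
    """Input-token cost for a given price per million tokens."""
    return tokens / 1_000_000 * usd_per_million_tokens


BASELINE_TOKENS = 19_000_000    # rounded figure quoted in this summary
COMPRESSED_TOKENS = 8_070_000   # figure quoted in this summary
PRICE_PER_MILLION = 2.50        # hypothetical USD price; substitute your provider's rate

factor = reduction_factor(BASELINE_TOKENS, COMPRESSED_TOKENS)
saved = savings_fraction(BASELINE_TOKENS, COMPRESSED_TOKENS)
cost_before = input_cost(BASELINE_TOKENS, PRICE_PER_MILLION)
cost_after = input_cost(COMPRESSED_TOKENS, PRICE_PER_MILLION)

print(f"reduction factor: {factor:.2f}x")
print(f"input tokens saved: {saved:.1%}")
print(f"input cost: ${cost_before:.2f} -> ${cost_after:.2f}")
```

Only input tokens are covered; output tokens and the cost of running the compressor itself would need to be added for a full comparison.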

Prompt compression can preserve or improve measured metric quality on several language pairs.

Numbers: En-Ru τ: 0.4365→0.4455; En-De τ: 0.395→0.4065; Zh-En τ: 0.3692→0.3738

Practical Use: You can compress inputs without losing segment-level agreement with human judgments on evaluated language pairs.

Evidence Ref: Table 1; PROMPTOPTME-3B with GPT-4o lite
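For readers who want to reproduce a segment-level agreement number on their own data, a rank correlation between human and metric scores can be sketched as follows. This computes the standard Kendall τ-b in plain Python. WMT-style evaluations use specific τ variants, and this summary does not say which one the paper reports, so treat the choice of τ-b and the example scores as assumptions.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall tau-b between two equally long score lists (O(n^2) pair scan)."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue              # tied in both lists: excluded from every count
        elif dx == 0:
            ties_x += 1           # tied only in xs
        elif dy == 0:
            ties_y += 1           # tied only in ys
        elif dx * dy > 0:
            concordant += 1       # both lists order the pair the same way
        else:
            discordant += 1       # the lists disagree on the order
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical per-segment scores (higher is better in both lists).
human_scores = [-5.0, -1.0, 0.0, -10.0, -1.0]
metric_scores = [-4.0, -2.0, 0.0, -6.0, -3.0]
tau = kendall_tau_b(human_scores, metric_scores)
print(f"segment-level tau-b: {tau:.4f}")
```

Running the same function on scores from compressed and uncompressed prompts gives a before/after comparison in the style of the En-Ru, En-De, and Zh-En figures above.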

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Token Usage	19M → 8.07M	19M (GPT-4o ref)	2.37× reduction	16k test set (news domain)	Table 1: GPT-4o ref vs PROMPTOPTME-3B with GPT-4o lite	Section 5; Table 1
Accuracy	0.7736 (PROMPTOPTME-3B with GPT-4o lite)	0.7789 (GPT-4o ref)	≈ -0.0053 absolute (comparable)	16k test set	Table 1; Section 5	Table 1
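The Accuracy row is a system-level figure. A common way to compute such a number in MT metric evaluation is pairwise ranking accuracy: the share of system pairs that the metric orders the same way as the human scores. The sketch below assumes that definition, which this summary does not confirm, and uses made-up system names and scores.

```python
from itertools import combinations


def pairwise_accuracy(human, metric):
    """Fraction of system pairs ranked in the same order by humans and the metric.

    `human` and `metric` map system name -> system-level score.
    Pairs that humans score as exactly tied are skipped.
    """
    systems = sorted(human.keys() & metric.keys())
    agree = total = 0
    for a, b in combinations(systems, 2):
        human_delta = human[a] - human[b]
        metric_delta = metric[a] - metric[b]
        if human_delta == 0:
            continue
        total += 1
        if human_delta * metric_delta > 0:
            agree += 1
    return agree / total if total else float("nan")


# Hypothetical system-level scores (higher is better).
human = {"sysA": -2.0, "sysB": -3.5, "sysC": -1.0}
metric = {"sysA": -2.2, "sysB": -2.0, "sysC": -0.9}
accuracy = pairwise_accuracy(human, metric)
print(f"pairwise accuracy: {accuracy:.4f}")
```

In this toy example the metric gets two of the three system pairs right. The table's 0.7736 versus 0.7789 comparison is a difference of about half a percentage point on whatever accuracy definition the paper uses.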

What To Try In 7 Days

Measure token usage and quality of your current LLM MT metric on a small test slice.

Swap full prompt for the paper's simplified prompt and re-run a sample evaluation.

Train a small compressor (LoRA on LLaMA-3.2 3B) on MQM data and test token savings vs quality on 1k examples.
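The three steps above can be tied together with a small pilot report that compares token counts and segment-level τ before and after compression. This is a rough sketch: `count_tokens` is a whitespace stand-in for a real tokenizer, and the pass thresholds (`min_reduction`, `max_tau_drop`) are hypothetical values to be set per team, not recommendations from the paper.

```python
def count_tokens(text):
    """Stand-in token counter (whitespace split).

    Assumption: replace this with the tokenizer of the evaluator model you bill against.
    """
    return len(text.split())


def pilot_report(full_prompts, compressed_prompts, tau_full, tau_compressed,
                 min_reduction=2.0, max_tau_drop=0.01):
    """Summarize token savings and quality change on a small test slice."""
    full_tokens = sum(count_tokens(p) for p in full_prompts)
    compressed_tokens = sum(count_tokens(p) for p in compressed_prompts)
    reduction = full_tokens / compressed_tokens if compressed_tokens else float("inf")
    tau_drop = tau_full - tau_compressed  # negative means quality improved
    return {
        "full_tokens": full_tokens,
        "compressed_tokens": compressed_tokens,
        "reduction": reduction,
        "tau_drop": tau_drop,
        "passes": reduction >= min_reduction and tau_drop <= max_tau_drop,
    }


# Toy slice: two prompts before and after compression (placeholder strings).
full_prompts = ["a b c d e f", "g h i j"]
compressed_prompts = ["a c", "g j"]
# tau values here reuse the En-De figures quoted in Key Findings.
report = pilot_report(full_prompts, compressed_prompts,
                      tau_full=0.395, tau_compressed=0.4065)
print(report)
```

On a real slice, the τ inputs would come from scoring the same examples with full and compressed prompts and correlating each set with human judgments.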

Optimization Features

Token Efficiency

Prompt Compression, Context Compression, Token Budgeting

Model Optimization

LoRA

Training Optimization

SFT, Preference optimization (ORPO)

Inference Optimization

Prompt input compression, Fixed simplified instruction template

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Evaluated only on machine translation using MQM-annotated WMT data; generalization to other NLG tasks is untested.

Preference data and evaluation rely on GPT-4o; learned compressor may encode evaluator-specific bias.

When Not To Use

When no error-span annotations (MQM-like) are available for supervision.

When strict adherence to full MQM error-typology is required for interpretability.

Failure Modes

Task-agnostic compressors can severely reduce metric quality (near-zero segment τ).

Compressed prompts may omit fine-grained MQM categories, hurting error-category analyses.

Core Entities

Models

PROMPTOPTME-3B (LLaMA-3.2 3B finetuned), PROMPTOPTME-1B (LLaMA-3.2 1B finetuned), GPT-4o (evaluator), GPT-4o mini (evaluator), LLaMA 3.2-90B (evaluator baseline)

Metrics

Accuracy, Segment-level Kendall's τ

Datasets

WMT Metrics MQM annotations (2020–2022), WMT22 Metrics Challenge test set (mentioned; 60k example scale), Test subset: ~16k news-domain examples (held-out)

Benchmarks

GEMBA-MQM (LLM-based MT metric)

