Compress MT evaluation prompts to cut tokens ~2.37× while keeping evaluation quality

December 20, 2024 · 6 min

Overview

Decision Snapshot: Ready For Pilot

The method shows strong empirical gains on a 16k MT test set and uses standard components (LoRA, ORPO), but is validated only on MT and on preferences collected with GPT-4o, so wider production readiness requires more cross-model and task checks.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.87

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Daniil Larionov, Steffen Eger

Why It Matters For Business

Compress evaluation prompts to cut LLM token bills by roughly 2.4× while keeping metric quality, making large-scale or repeated MT evaluations more affordable.

Summary TLDR

PromptOptMe trains a small model to compress the inputs of LLM-based MT evaluators (like GEMBA-MQM) in two stages: supervised fine-tuning to preserve error spans, then preference optimization (ORPO) to favor compressions whose evaluator scores match those obtained with uncompressed prompts. On a 16k test set the best setup cuts token use by up to 2.37× (19M → ≈8M tokens) while keeping system-level accuracy and often improving segment-level Kendall τ on the evaluated language pairs. Task-agnostic compressors (LLMLingua-2) break metric quality, which shows the need to preserve error spans.

Problem Statement

LLM-based MT metrics align well with human judgments, but their prompts are long and expensive (≈1100–1200 tokens per example). This paper asks: can a small model compress prompt inputs to cut token costs while preserving evaluation quality?

Main Contribution

A two-stage prompt compression pipeline: supervised fine-tuning to preserve MQM error spans, then ORPO preference optimization to favor compressions whose evaluator scores match those obtained with uncompressed inputs.

An MT-focused compression model (PROMPTOPTME) based on LLaMA-3.2 1B/3B that reduces token use by up to 2.37× on a 16k test set while retaining or improving the evaluated metric quality.
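The preference-optimization stage can be pictured as building (chosen, rejected) pairs from candidate compressions. The following is a minimal sketch, assuming each candidate has already been scored by the evaluator LLM. The `Candidate` type, the `build_preference_pair` helper, and the use of absolute score difference as the ranking criterion are illustrative assumptions, not details taken from the paper.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Candidate(NamedTuple):
    """Hypothetical container for one candidate compression."""
    text: str     # compressed input produced by the compressor
    score: float  # evaluator score obtained with this compressed input


def build_preference_pair(
    reference_score: float, candidates: Sequence[Candidate]
) -> Optional[Tuple[Candidate, Candidate]]:
    """Pick the candidate closest to the uncompressed score as 'chosen'
    and the one furthest from it as 'rejected'."""
    if len(candidates) < 2:
        return None  # a pair needs at least two candidates
    ranked = sorted(candidates, key=lambda c: abs(c.score - reference_score))
    chosen, rejected = ranked[0], ranked[-1]
    if abs(chosen.score - reference_score) == abs(rejected.score - reference_score):
        return None  # all candidates are equally close: no preference signal
    return chosen, rejected


# Toy usage: the uncompressed prompt scored -1.0 (values are made up).
candidates = [
    Candidate("compression A", -5.0),
    Candidate("compression B", -1.0),
    Candidate("compression C", -10.0),
]
pair = build_preference_pair(-1.0, candidates)
```

Pairs built this way could then be fed to a preference-optimization trainer; how the paper samples candidates and filters pairs may differ from this sketch.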

Key Findings

Up to 2.37× reduction in input tokens for MT metric evaluation on the evaluated 16k test set.

Numbers: 19M → 8.07M tokens (reduction 2.37×)

Practical Use: Expect roughly 2.4× lower token bills when using PROMPTOPTME-3B on similar MT evaluation workloads.

Evidence Ref: Table 1; Section 5
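To sanity-check the token figures, the reduction factor and the fraction of tokens saved can be computed directly. This is a small sketch using the rounded counts quoted above; the helper names are hypothetical. With these rounded inputs the factor comes out near 2.35×, so the small gap to the reported 2.37× is presumably due to rounding in the quoted counts.

```python
def reduction_factor(baseline_tokens: float, compressed_tokens: float) -> float:
    """How many times fewer tokens the compressed setup uses."""
    if compressed_tokens <= 0:
        raise ValueError("compressed_tokens must be positive")
    return baseline_tokens / compressed_tokens


def fraction_saved(baseline_tokens: float, compressed_tokens: float) -> float:
    """Share of the baseline token budget that is no longer spent."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1 - compressed_tokens / baseline_tokens


# Rounded figures from the summary: 19M baseline, 8.07M compressed.
baseline = 19_000_000
compressed = 8_070_000

factor = reduction_factor(baseline, compressed)
saved = fraction_saved(baseline, compressed)
print(f"{factor:.2f}x fewer tokens, {saved:.1%} of tokens saved")
```

The same two functions can be applied to token counts measured on your own evaluation slice to compare against the reported savings.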

Prompt compression can preserve or improve measured metric quality on several language pairs.

Numbers: En-Ru τ: 0.4365 → 0.4455; En-De τ: 0.395 → 0.4065; Zh-En τ: 0.3692 → 0.3738

Practical Use: You can compress inputs without losing segment-level agreement with human judgments on the evaluated language pairs.

Evidence Ref: Table 1; PROMPTOPTME-3B with GPT-4o lite
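Segment-level Kendall τ measures how often the metric orders pairs of segments the same way human scores do. Below is a minimal sketch of the generic tau-b variant in plain Python. WMT-style evaluations use specific τ variants and groupings, so this is an assumption for illustration and not necessarily the exact computation behind the numbers above.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau_b(human: Sequence[float], metric: Sequence[float]) -> float:
    """Kendall tau-b between human scores and metric scores (O(n^2))."""
    if len(human) != len(metric):
        raise ValueError("score lists must have the same length")
    concordant = discordant = tied_human = tied_metric = 0
    for (h1, m1), (h2, m2) in combinations(zip(human, metric), 2):
        dh, dm = h1 - h2, m1 - m2
        if dh == 0 and dm == 0:
            continue  # tied in both rankings: counted in neither term
        if dh == 0:
            tied_human += 1   # tied only in the human scores
        elif dm == 0:
            tied_metric += 1  # tied only in the metric scores
        elif dh * dm > 0:
            concordant += 1   # both rankings agree on the order
        else:
            discordant += 1   # rankings disagree on the order
    # Pairs not tied in human scores times pairs not tied in metric scores.
    denom = sqrt(
        (concordant + discordant + tied_metric)
        * (concordant + discordant + tied_human)
    )
    return (concordant - discordant) / denom if denom else 0.0


# Toy usage with made-up scores: one swapped pair out of six.
tau = kendall_tau_b([1, 2, 3, 4], [1, 3, 2, 4])
```

Computing τ once with uncompressed prompts and once with compressed prompts on the same segments gives the kind of before/after comparison reported above.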

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Token Usage | 19M → 8.07M | 19M (GPT-4o ref) | 2.37× reduction | 16k test set (news domain) | Table 1: GPT-4o ref vs PROMPTOPTME-3B with GPT-4o lite | Section 5; Table 1
Accuracy | 0.7736 (PROMPTOPTME-3B with GPT-4o lite) | 0.7789 (GPT-4o ref) | ≈ -0.0053 absolute (comparable) | 16k test set | Table 1; Section 5 | Table 1

What To Try In 7 Days

Measure token usage and quality of your current LLM MT metric on a small test slice.

Swap full prompt for the paper's simplified prompt and re-run a sample evaluation.

Train a small compressor (LoRA on LLaMA-3.2 3B) on MQM data and test token savings vs quality on 1k examples.

Optimization Features

Token Efficiency
Prompt Compression, Context Compression, Token Budgeting
Model Optimization
LoRA
Training Optimization
SFT, Preference optimization (ORPO)
Inference Optimization
Prompt input compression, Fixed simplified instruction template

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluated only on machine translation using MQM-annotated WMT data; generalization to other NLG tasks is untested.

Preference data and evaluation rely on GPT-4o; learned compressor may encode evaluator-specific bias.

When Not To Use

When no error-span annotations (MQM-like) are available for supervision.

When strict adherence to full MQM error-typology is required for interpretability.

Failure Modes

Task-agnostic compressors can severely reduce metric quality (near-zero segment τ).

Compressed prompts may omit fine-grained MQM categories, hurting error-category analyses.

Core Entities

Models

PROMPTOPTME-3B (LLaMA-3.2 3B finetuned)

PROMPTOPTME-1B (LLaMA-3.2 1B finetuned)

GPT-4o (evaluator)

GPT-4o mini (evaluator)

LLaMA 3.2-90B (evaluator baseline)

Metrics

Accuracy

Segment-level Kendall's τ

Datasets

WMT Metrics MQM annotations (2020–2022)

WMT22 Metrics Challenge test set (mentioned; 60k example scale)

Test subset: ~16k news-domain examples (held-out)

Benchmarks

GEMBA-MQM (LLM-based MT metric)