Question-aware prompt compression that speeds up LLMs and often improves accuracy on very long contexts

Overview

Decision Snapshot: Ready For Pilot

The method uses a small LM to pick question-relevant tokens, reorders documents to reduce position bias, and recovers original subsequences in the response. Experiments across multiple public benchmarks back up the claims, but the approach must re-compress the prompt for every question and adds compute overhead.

Citations: 12

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 90%

Production readiness: 70%

Novelty: 60%

Authors

Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you run LLMs on long documents, compressing prompts per question saves API cost and latency while often improving answer quality, so you can serve more queries at lower cost.

Who Should Care

Product Manager, ML Engineer, Founder, Engineering Lead

Summary TLDR

LongLLMLingua is a question-aware, coarse-to-fine prompt compressor that uses a small language model to pick and keep tokens and documents most relevant to a question. It adds document reordering, dynamic compression ratios, and a subsequence-recovery step. Across multi-document QA and long-context benchmarks, compressed prompts often match or beat full prompts while sending 2x–6x fewer tokens. Examples: up to +21.4% accuracy on NaturalQuestions with ~4x fewer tokens, 94% estimated cost drop on LooGLE, and 1.4x–2.6x end-to-end latency speedups on large prompts.

Problem Statement

Long prompts raise API cost and latency, add noisy, irrelevant text that hurts answers, and suffer from LLM position bias (information in the middle of the prompt is used poorly). The paper aims to compress prompts so that LLMs see denser, better-positioned key information while saving cost and time.

Main Contribution

A question-aware coarse-to-fine token and document compression metric that ranks documents and tokens by their relevance to the question.

A document reordering strategy that puts higher-relevance documents near prompt edges to mitigate 'lost-in-the-middle' position bias.
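The edge-placement idea can be illustrated with a minimal sketch. It assumes relevance scores have already been computed per document; the function name `reorder_to_edges` and the alternating front/back placement are illustrative choices, not necessarily the exact ordering used in the paper.

```python
from typing import List, Sequence


def reorder_to_edges(docs: Sequence[str], scores: Sequence[float]) -> List[str]:
    """Place high-scoring documents at the prompt edges and
    low-scoring ones in the middle (hypothetical helper)."""
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    # Indices of documents from most to least relevant.
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    front: List[str] = []
    back: List[str] = []
    # Alternate: best goes first, second best goes last, and so on.
    for rank, idx in enumerate(ranked):
        (front if rank % 2 == 0 else back).append(docs[idx])
    return front + back[::-1]


docs = ["A", "B", "C", "D", "E"]
scores = [0.1, 0.9, 0.3, 0.7, 0.5]
print(reorder_to_edges(docs, scores))  # ['B', 'E', 'A', 'C', 'D']
```

In this toy run the least relevant document ("A") lands in the middle, where position bias is assumed to hurt least.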

Key Findings

Compressed prompts can improve accuracy vs. original long prompts on multi-document QA.

Numbers: NaturalQuestions: up to +21.4% (Abstract; Table 1)

Practical Use: Compress prompts with LongLLMLingua before calling a large LLM; you can get noticeably better answers while sending far fewer tokens.

Evidence Ref: Abstract; Table 1

Significant monetary savings on very long-context benchmarks.

Numbers: LooGLE cost drop ≈ 94% per 1,000 samples (Table 9)

Practical Use: For heavy long-context workloads, run LongLLMLingua to cut API bills dramatically, especially on datasets with extremely long prompts.

Evidence Ref: Table 9

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	+21.4% at 4x fewer tokens (best case)	Original long prompt	+21.4%	NaturalQuestions (ground-truth doc at 10th position)	Abstract; Table 1	Abstract; Table 1
Cost reduction (estimated)	↓94.0% on LooGLE per 1,000 samples	Original prompt cost $93.6 → $5.6	↓94.0%	LooGLE	Table 9 (cost estimates)	Table 9
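The reported cost delta can be checked from the two dollar figures in the table. The short sketch below only redoes that arithmetic; the helper name `percent_reduction` is an illustrative choice.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# Figures from the Results table (USD per 1,000 LooGLE samples).
original_cost = 93.6
compressed_cost = 5.6

print(f"{percent_reduction(original_cost, compressed_cost):.1f}%")  # 94.0%
```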

What To Try In 7 Days

Run LongLLMLingua (or a simple question-aware filter) before API calls on one long-context pipeline and compare cost/latency.

Enable document reordering: put highest-relevance docs near the prompt edges to test position-bias fixes.

Measure end-to-end latency and cost per 1k samples; validate that compressed answers meet your accuracy threshold.
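A pilot measurement along these lines could be set up with a small harness like the sketch below. Everything here is an assumption for illustration: `call_llm` stands in for whatever client your pipeline uses, `count_tokens` for your tokenizer, and `usd_per_1k_tokens` for your provider's input price.

```python
import time
from statistics import mean
from typing import Callable, Dict, Sequence


def evaluate_pipeline(
    prompts: Sequence[str],
    call_llm: Callable[[str], str],      # hypothetical LLM client wrapper
    count_tokens: Callable[[str], int],  # hypothetical tokenizer wrapper
    usd_per_1k_tokens: float,            # assumed input-token price
) -> Dict[str, float]:
    """Measure mean latency and estimated input cost per 1,000 samples."""
    if not prompts:
        raise ValueError("prompts must not be empty")
    latencies = []
    total_tokens = 0
    for prompt in prompts:
        start = time.perf_counter()
        call_llm(prompt)
        latencies.append(time.perf_counter() - start)
        total_tokens += count_tokens(prompt)
    total_cost = total_tokens / 1000 * usd_per_1k_tokens
    return {
        "mean_latency_s": mean(latencies),
        "input_tokens": float(total_tokens),
        "cost_per_1k_samples_usd": total_cost / len(prompts) * 1000,
    }


# Stub usage: whitespace "tokens" and a no-op model, for illustration only.
report = evaluate_pipeline(
    ["a b c d", "e f"],
    call_llm=lambda p: "",
    count_tokens=lambda p: len(p.split()),
    usd_per_1k_tokens=0.5,
)
print(report)
```

Running the same harness once on original prompts and once on compressed prompts gives a like-for-like comparison; note that the compression step's own latency should be timed inside `call_llm` (or added separately) for a true end-to-end figure.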

Optimization Features

Token Efficiency

Contrastive perplexity (question-conditioned) to score tokens

Budget controller to allocate tokens to instruction/question/documents
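One way to read question-conditioned contrastive scoring is sketched below: a token counts as relevant when conditioning on the question makes it much more predictable. The per-token log-probabilities are assumed to come from a small LM run twice (with and without the question); the numbers and the helper names are made up for illustration and are not taken from the paper's code.

```python
from typing import List, Sequence


def contrastive_scores(
    logp_plain: Sequence[float],          # log p(token | preceding context)
    logp_with_question: Sequence[float],  # log p(token | question, preceding context)
) -> List[float]:
    """Gain in log-probability when the question is added to the context."""
    if len(logp_plain) != len(logp_with_question):
        raise ValueError("inputs must have the same length")
    return [lq - lp for lp, lq in zip(logp_plain, logp_with_question)]


def keep_top_tokens(tokens: Sequence[str], scores: Sequence[float], budget: int) -> List[str]:
    """Keep the `budget` highest-scoring tokens, preserving original order."""
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:budget]
    return [tokens[i] for i in sorted(top)]


tokens = ["The", "capital", "of", "France", "is", "Paris"]
logp_plain = [-1.0, -6.0, -0.5, -7.0, -0.4, -8.0]     # toy values
logp_question = [-1.0, -2.0, -0.5, -3.5, -0.4, -1.0]  # toy values

scores = contrastive_scores(logp_plain, logp_question)
print(keep_top_tokens(tokens, scores, budget=3))  # ['capital', 'France', 'Paris']
```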

System Optimization

Subsequence recovery to map compressed output back to original text
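A simplified sketch of the recovery idea follows. It assumes the compressor records which original token positions were kept (`kept_indices`), finds the longest run of response tokens that also appears contiguously in the compressed prompt, and expands that run back to the full original span. The function name, the token-level matching, and the minimum match length of 2 are illustrative assumptions; the paper's actual procedure may differ.

```python
from typing import List, Optional, Sequence, Tuple


def recover_span(
    original_tokens: Sequence[str],
    kept_indices: Sequence[int],   # ascending positions kept by the compressor
    response_tokens: Sequence[str],
    min_match: int = 2,            # assumed threshold to avoid spurious matches
) -> List[str]:
    """Replace the longest response run found in the compressed prompt
    with the corresponding span of the original text."""
    original = list(original_tokens)
    response = list(response_tokens)
    compressed = [original[i] for i in kept_indices]
    best: Optional[Tuple[int, int, int]] = None  # (length, response start, compressed start)
    for r in range(len(response)):
        for c in range(len(compressed)):
            k = 0
            while (r + k < len(response) and c + k < len(compressed)
                   and response[r + k] == compressed[c + k]):
                k += 1
            if k >= min_match and (best is None or k > best[0]):
                best = (k, r, c)
    if best is None:
        return response
    k, r, c = best
    lo, hi = kept_indices[c], kept_indices[c + k - 1]
    return response[:r] + original[lo:hi + 1] + response[r + k:]


original = ["Marie", "Sklodowska", "Curie", "won", "the", "Nobel", "Prize", "in", "1903"]
kept = [0, 2, 3, 5, 6, 8]  # compressed prompt: Marie Curie won Nobel Prize 1903
print(recover_span(original, kept, ["Answer:", "Marie", "Curie"]))
# ['Answer:', 'Marie', 'Sklodowska', 'Curie']
```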

Inference Optimization

Token-level pruning using small LM perplexity

Coarse-to-fine compression to reduce input tokens

Document reordering to reduce position bias

Dynamic per-document compression budgets

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://aka.ms/LongLLMLingua

Data URLs

https://github.com/nelson-liu/lost-in-the-middle (NaturalQuestions setup)

https://github.com/THUDM/LongBench

https://www.zero.scrolls-benchmark.com/

https://github.com/stonybrooknlp/musique

https://github.com/bigai-nlco/LooGLE

Risks & Boundaries

Limitations

Requires re-compression per question, preventing caching of a single compressed context for multiple queries.

Adds extra compute (about twice the cost of LLMLingua) for the small-model scoring step, which can offset gains in some settings.

When Not To Use

When you need to reuse the same compressed context for many different questions (no caching support).

When the relation between context and question is extremely subtle and could be broken by coarse filtering.

Failure Modes

Small LM mis-ranks documents/tokens and drops crucial context, harming answers.

Subtle multi-hop dependencies span tokens removed at coarse level, reducing multi-hop accuracy.

Core Entities

Models

GPT-3.5-Turbo

LongChat-13B-16k

LLaMA-2-7B-Chat

GPT2-small (ablation)

Metrics

F1 / task score

Accuracy

End-to-end latency (sec)

API inference cost ($ per 1k samples)

Datasets

NaturalQuestions (multi-doc QA)

LongBench

ZeroSCROLLS

MuSiQue

LooGLE

Benchmarks

NaturalQuestions

LongBench

ZeroSCROLLS

MuSiQue

LooGLE
