Question-aware prompt compression that speeds up LLMs and often improves accuracy on very long contexts

October 10, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method reuses a small LM to pick question-relevant tokens, reorders documents to reduce position bias, and recovers original subsequences; experiments across multiple public benchmarks back up claims, but the approach requires re-compression per question and adds overhead.

Citations: 12

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 90%

Production readiness: 70%

Novelty: 60%

Authors

Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu

Why It Matters For Business

If you run LLMs on long documents, compressing prompts per question saves API cost and latency while often improving answer quality, so you can serve more queries at lower cost.

Summary TLDR

LongLLMLingua is a question-aware, coarse-to-fine prompt compressor that uses a small language model to pick and keep tokens and documents most relevant to a question. It adds document reordering, dynamic compression ratios, and a subsequence-recovery step. Across multi-document QA and long-context benchmarks, compressed prompts often match or beat full prompts while sending 2x–6x fewer tokens. Examples: up to +21.4% accuracy on NaturalQuestions with ~4x fewer tokens, 94% estimated cost drop on LooGLE, and 1.4x–2.6x end-to-end latency speedups on large prompts.

Problem Statement

Long prompts raise API cost and latency, add noisy, irrelevant text that hurts answers, and expose LLM position bias (information in the middle of the prompt is used poorly). The paper aims to compress prompts so LLMs see denser, better-positioned key information while saving cost and time.

Main Contribution

A question-aware, coarse-to-fine compression method that first ranks documents and then tokens by their relevance to the question.

A document reordering strategy that puts higher-relevance documents near prompt edges to mitigate 'lost-in-the-middle' position bias.
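The reordering idea can be illustrated with a minimal sketch. It assumes each document already carries a relevance score (in the method this would come from the small LM; here the scores are made up), and the alternating front/back placement in `reorder_to_edges` is a hypothetical rule chosen for illustration. The paper's exact ordering may differ.

```python
from typing import List, Tuple


def reorder_to_edges(docs_with_scores: List[Tuple[str, float]]) -> List[str]:
    """Place the most relevant documents at the start and end of the prompt,
    pushing the least relevant ones toward the middle.

    Illustrative rule (an assumption, not the paper's code): walk the documents
    from most to least relevant and alternate between the front and the back.
    """
    ranked = sorted(docs_with_scores, key=lambda pair: pair[1], reverse=True)
    front: List[str] = []
    back: List[str] = []
    for rank, (doc, _score) in enumerate(ranked):
        if rank % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    # Reverse the back half so its most relevant document ends up last.
    return front + back[::-1]


# Made-up relevance scores for five documents.
docs = [("a", 0.1), ("b", 0.9), ("c", 0.5), ("d", 0.7), ("e", 0.3)]
ordered = reorder_to_edges(docs)
print(ordered)  # ['b', 'c', 'a', 'e', 'd']
```

The two highest-scoring documents (`b` and `d`) land at the prompt edges, and the lowest-scoring one (`a`) sits in the middle, where position bias hurts least.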

Key Findings

Compressed prompts can improve accuracy vs. original long prompts on multi-document QA.

Numbers: NaturalQuestions: up to +21.4% (Abstract; Table 1)

Practical Use: Compress prompts with LongLLMLingua before calling a large LLM; you can get noticeably better answers while sending far fewer tokens.

Evidence Ref: Abstract; Table 1

Significant monetary savings on very long-context benchmarks.

Numbers: LooGLE cost drop ≈ 94% per 1,000 samples (Table 9)

Practical Use: For heavy long-context workloads, run LongLLMLingua to cut API bills dramatically, especially on datasets with extremely long prompts.

Evidence Ref: Table 9

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | +21.4% at 4x fewer tokens (best case) | Original long prompt | +21.4% | NaturalQuestions (ground-truth doc at 10th position) | Abstract; Table 1 | Abstract; Table 1
Cost reduction (estimated) | 94.0% on LooGLE per 1,000 samples | Original prompt cost $93.6 (compressed: $5.6) | 94.0% | LooGLE | Table 9 (cost estimates) | Table 9
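The 94.0% figure can be checked with a short calculation. This sketch assumes the $93.6 and $5.6 values in the LooGLE row are the original and compressed costs per 1,000 samples, respectively; `cost_reduction_pct` is a helper name introduced here for illustration.

```python
def cost_reduction_pct(original_cost: float, compressed_cost: float) -> float:
    """Percentage saved when moving from the original to the compressed cost."""
    if original_cost <= 0:
        raise ValueError("original_cost must be positive")
    return (original_cost - compressed_cost) / original_cost * 100.0


# Assumed reading of the LooGLE row: $93.6 original, $5.6 compressed.
loogle_saving = cost_reduction_pct(93.6, 5.6)
print(f"{loogle_saving:.1f}%")  # 94.0%
```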

What To Try In 7 Days

Run LongLLMLingua (or a simple question-aware filter) before API calls on one long-context pipeline and compare cost/latency.

Enable document reordering: put highest-relevance docs near the prompt edges to test position-bias fixes.

Measure end-to-end latency and cost per 1k samples; validate that compressed answers meet your accuracy threshold.
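A pilot along these lines could start from a small comparison harness such as the sketch below. Everything in it is a stand-in: `filtered_prompt` is a crude keyword filter rather than LongLLMLingua, `echo_llm` replaces a real API call, token counts are whitespace splits, and the price per 1,000 tokens is a made-up number. Swap in your own compressor, client, tokenizer, and pricing.

```python
import re
import time
from statistics import mean
from typing import Callable, Dict, List


def evaluate(build_prompt: Callable[[str, str], str],
             call_llm: Callable[[str], str],
             samples: List[Dict[str, str]],
             usd_per_1k_tokens: float) -> Dict[str, float]:
    """Measure accuracy, mean end-to-end latency, and prompt cost per 1k samples."""
    latencies, prompt_tokens, hits = [], 0, 0
    for sample in samples:
        start = time.perf_counter()
        prompt = build_prompt(sample["context"], sample["question"])
        answer = call_llm(prompt)
        latencies.append(time.perf_counter() - start)
        prompt_tokens += len(prompt.split())  # crude whitespace token count
        hits += int(sample["answer"].lower() in answer.lower())
    n = len(samples)
    total_cost = prompt_tokens / 1000 * usd_per_1k_tokens
    return {
        "accuracy": hits / n,
        "mean_latency_s": mean(latencies),
        "cost_per_1k_samples_usd": total_cost / n * 1000,
    }


def _keywords(text: str) -> set:
    """Lowercased words longer than three characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def full_prompt(context: str, question: str) -> str:
    return context + "\n" + question


def filtered_prompt(context: str, question: str) -> str:
    """Stand-in compressor: keep sentences sharing a keyword with the question."""
    wanted = _keywords(question)
    sentences = re.split(r"(?<=[.!?])\s+", context)
    kept = [s for s in sentences if _keywords(s) & wanted]
    return " ".join(kept) + "\n" + question


def echo_llm(prompt: str) -> str:
    """Stand-in for an LLM call: returns the prompt unchanged."""
    return prompt


samples = [{
    "context": "Paris is the capital of France. Bananas are yellow. "
               "The Nile is a river in Africa.",
    "question": "What is the capital of France?",
    "answer": "Paris",
}]

full = evaluate(full_prompt, echo_llm, samples, usd_per_1k_tokens=0.002)
small = evaluate(filtered_prompt, echo_llm, samples, usd_per_1k_tokens=0.002)
print(full["cost_per_1k_samples_usd"], small["cost_per_1k_samples_usd"])
```

Timing wraps both prompt construction and the model call, so compression overhead is counted against the compressed pipeline rather than hidden.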

Optimization Features

Token Efficiency

Contrastive perplexity (question-conditioned) to score tokens

Budget controller to allocate tokens to instruction/question/documents

System Optimization

Subsequence recovery to map compressed output back to original text

Inference Optimization

Token-level pruning using small LM perplexity

Coarse-to-fine compression to reduce input tokens

Document reordering to reduce position bias

Dynamic per-document compression budgets
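A toy sketch of question-conditioned token scoring and pruning follows. It assumes the contrastive score of a token is its log-probability with the question in context minus its log-probability without it, which is one reading of "contrastive perplexity" and not confirmed by this summary. The log-probabilities are hard-coded; in practice they would come from the small LM. `contrastive_scores` and `prune_tokens` are hypothetical helper names.

```python
import math
from typing import List, Sequence


def contrastive_scores(logp_with_question: Sequence[float],
                       logp_without_question: Sequence[float]) -> List[float]:
    """Score each token by how much the question raises its log-probability.

    Assumed definition: score_i = logp(x_i | question, x_<i) - logp(x_i | x_<i).
    """
    if len(logp_with_question) != len(logp_without_question):
        raise ValueError("both sequences must cover the same tokens")
    return [w - wo for w, wo in zip(logp_with_question, logp_without_question)]


def prune_tokens(tokens: Sequence[str], scores: Sequence[float],
                 keep_ratio: float) -> List[str]:
    """Keep the highest-scoring fraction of tokens, preserving original order."""
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]


tokens = ["The", "Eiffel", "Tower", "is", "in", "Paris", "today"]
# Made-up log-probabilities from a hypothetical small LM.
logp_with = [-1.0, -2.0, -1.5, -0.5, -0.6, -0.8, -3.0]
logp_without = [-1.0, -5.0, -3.0, -0.5, -1.1, -4.0, -3.0]

scores = contrastive_scores(logp_with, logp_without)
kept = prune_tokens(tokens, scores, keep_ratio=0.5)
print(kept)  # ['Eiffel', 'Tower', 'in', 'Paris']
```

Tokens whose probability does not change when the question is added (`The`, `is`, `today`) score zero and are dropped first. A per-document budget could be expressed by passing a different `keep_ratio` for each document.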

Risks & Boundaries

Limitations

Requires re-compression per question, preventing caching of a single compressed context for multiple queries.

Adds extra compute (about twice the cost of LLMLingua) for the small-model scoring step, which can offset gains in some settings.

When Not To Use

When you need to reuse the same compressed context for many different questions (no caching support).

When the relation between context and question is extremely subtle and could be broken by coarse filtering.

Failure Modes

Small LM mis-ranks documents/tokens and drops crucial context, harming answers.

Subtle multi-hop dependencies span tokens removed at coarse level, reducing multi-hop accuracy.

Core Entities

Models

GPT-3.5-Turbo, LongChat-13B-16k, LLaMA-2-7B-Chat, GPT2-small (ablation)

Metrics

F1 / task score, Accuracy, End-to-end latency (sec), API inference cost ($ per 1k samples)

Datasets

NaturalQuestions (multi-doc QA), LongBench, ZeroSCROLLS, MuSiQue, LooGLE

Benchmarks

NaturalQuestions, LongBench, ZeroSCROLLS, MuSiQue, LooGLE