Question-aware prompt compression that speeds up LLMs and often improves accuracy on very long contexts

October 10, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.9

Citation Count

12

Authors

Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu

Why It Matters For Business

If you run LLMs on long documents, compressing prompts per question saves API cost and latency while often improving answer quality, so you can serve more queries at lower cost.

Summary TLDR

LongLLMLingua is a question-aware, coarse-to-fine prompt compressor that uses a small language model to pick and keep tokens and documents most relevant to a question. It adds document reordering, dynamic compression ratios, and a subsequence-recovery step. Across multi-document QA and long-context benchmarks, compressed prompts often match or beat full prompts while sending 2x–6x fewer tokens. Examples: up to +21.4% accuracy on NaturalQuestions with ~4x fewer tokens, 94% estimated cost drop on LooGLE, and 1.4x–2.6x end-to-end latency speedups on large prompts.

Problem Statement

Long prompts raise API cost and latency, add irrelevant text that degrades answers, and expose LLM position bias (information in the middle of the context is used poorly). The paper aims to compress prompts so LLMs see denser, better-positioned key information while saving cost and time.

Main Contribution

A question-aware, coarse-to-fine compression method that first ranks documents and then tokens by their relevance to the question.

A document reordering strategy that puts higher-relevance documents near prompt edges to mitigate 'lost-in-the-middle' position bias.

Dynamic per-document compression budgets so more relevant documents keep more tokens.

A subsequence recovery post-process that restores original token subsequences in model outputs to fix truncated entities.

Extensive evaluation across NaturalQuestions, LongBench, ZeroSCROLLS, MuSiQue, and LooGLE showing accuracy, cost, and latency benefits.
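
The coarse, question-aware stage described above can be illustrated with a small sketch. Everything here is an assumption for illustration: `overlap_score` is a toy word-overlap proxy standing in for the small-LM relevance score, whitespace splitting stands in for real tokenization, and `coarse_filter` is a hypothetical helper, not the authors' implementation.

```python
import re


def overlap_score(question: str, document: str) -> float:
    """Toy relevance proxy: fraction of question words found in the document."""
    q_words = set(re.findall(r"\w+", question.lower()))
    d_words = set(re.findall(r"\w+", document.lower()))
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


def coarse_filter(question, documents, token_budget, score=overlap_score):
    """Keep the highest-scoring documents whose combined length fits the budget."""
    ranked = sorted(documents, key=lambda d: score(question, d), reverse=True)
    kept, used = [], 0
    for doc in ranked:
        n_tokens = len(doc.split())  # whitespace tokens stand in for a real tokenizer
        if used + n_tokens <= token_budget:
            kept.append(doc)
            used += n_tokens
    return kept


docs = [
    "The Eiffel Tower is in Paris and opened in 1889.",
    "Bananas are rich in potassium.",
    "Paris is the capital of France.",
]
kept = coarse_filter("When did the Eiffel Tower open?", docs, token_budget=16)
print(kept)
```

In a real pipeline the `score` argument would be replaced by a question-conditioned score from a small language model; the ranking-then-budget structure stays the same.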

Key Findings

Compressed prompts can improve accuracy vs. original long prompts on multi-document QA.

Numbers: NaturalQuestions: up to +21.4% (Abstract; Table 1)

Significant monetary savings on very long-context benchmarks.

Numbers: LooGLE cost drop ≈ 94% per 1,000 samples (Table 9)

End-to-end latency improves with higher compression ratios.

Numbers: Latency speedups of 1.4x–2.6x when compressing ~10k-token prompts at 2x–6x (Abstract; Tables 1–2)

Each proposed module contributes to the gains; removing any one reduces performance.

Numbers: Ablations show drops across tasks (Tables 3–4, 7)

Results

Accuracy

Value: +21.4% at 4x fewer tokens (best case)

Baseline: Original long prompt

Cost reduction (estimated)

Value: ↓94.0% on LooGLE per 1,000 samples

Baseline: Original prompt cost $93.6 → $5.6
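
The 94.0% figure follows directly from the two dollar amounts reported above. A minimal check, where `cost_reduction_pct` is a hypothetical helper name introduced only for this illustration:

```python
def cost_reduction_pct(original_cost: float, compressed_cost: float) -> float:
    """Percentage saved when moving from the original to the compressed cost."""
    return 100.0 * (original_cost - compressed_cost) / original_cost


# Costs per 1,000 LooGLE samples as reported in this summary.
reduction = cost_reduction_pct(93.6, 5.6)
print(f"{reduction:.1f}%")  # 94.0%
```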

Latency speedup

Value: 1.4x–2.6x end-to-end (varies by task & compression)

Baseline: Uncompressed prompt end-to-end latency

Average score on LongBench (2,000 token constraint)

Value: LongLLMLingua AVG = 48.8 vs Original AVG = 44.0 (3k→2k constraint)

Baseline: Original prompt (10,295 tokens)

Who Should Care

What To Try In 7 Days

Run LongLLMLingua (or a simple question-aware filter) before API calls on one long-context pipeline and compare cost/latency.

Enable document reordering: put highest-relevance docs near the prompt edges to test position-bias fixes.

Measure end-to-end latency and cost per 1k samples; validate that compressed answers meet your accuracy threshold.
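
For the reordering experiment, one simple way to place the highest-relevance documents near the prompt edges is to alternate ranked documents between the front and the back. This is a sketch of the "edges" idea as described in this summary; `reorder_to_edges` is a hypothetical helper, the scores are made up, and the exact ordering used by the authors is not reproduced here.

```python
def reorder_to_edges(documents, scores):
    """Place high-scoring documents at the start and end, low-scoring ones in the middle."""
    ranked = [doc for _, doc in sorted(zip(scores, documents),
                                       key=lambda pair: pair[0], reverse=True)]
    front, back = [], []
    for i, doc in enumerate(ranked):
        # Alternate: best goes first, second best goes last, and so on inward.
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]


docs = ["A", "B", "C", "D", "E"]
scores = [0.9, 0.1, 0.5, 0.7, 0.3]  # illustrative relevance scores
ordered = reorder_to_edges(docs, scores)
print(ordered)  # ['A', 'C', 'B', 'E', 'D']
```

The lowest-scoring document ("B") ends up in the middle, where position bias hurts least.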

Optimization Features

Token Efficiency

  • Contrastive perplexity (question-conditioned) to score tokens
  • Budget controller to allocate tokens to instruction/question/documents
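
The question-conditioned scoring idea can be sketched with precomputed numbers. The assumption here is that a token's score is the drop in its per-token surprisal when the question is supplied as context, so tokens that the question makes more predictable rank higher. The surprisal values are invented for illustration; in practice they would come from a small language model. `contrastive_scores` and `keep_top_tokens` are hypothetical names.

```python
def contrastive_scores(surprisal_plain, surprisal_with_question):
    """Score each token by how much the question lowers its surprisal."""
    return [p - q for p, q in zip(surprisal_plain, surprisal_with_question)]


def keep_top_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving original order."""
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]


tokens = ["The", "tower", "opened", "in", "1889", "."]
plain = [2.0, 6.0, 5.0, 1.0, 9.0, 0.5]    # surprisal without the question (made up)
with_q = [1.9, 2.0, 3.0, 0.9, 4.0, 0.5]   # surprisal given the question (made up)

scores = contrastive_scores(plain, with_q)
kept = keep_top_tokens(tokens, scores, keep=3)
print(kept)  # ['tower', 'opened', '1889']
```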

System Optimization

  • Subsequence recovery to map compressed output back to original text
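
A simplified view of subsequence recovery: if the model's answer contains an entity whose tokens were partly dropped during compression, look up the shortest span of the original text that contains those tokens in order and return that span instead. The sketch below works on whitespace tokens and is an assumption-laden simplification, not the authors' algorithm; `shortest_window` and `recover` are hypothetical names.

```python
def shortest_window(original, key):
    """Return (start, end) of the shortest span of `original` containing `key` as a subsequence."""
    if not key:
        return None
    best = None
    for start, tok in enumerate(original):
        if tok != key[0]:
            continue
        pos = start
        for wanted in key[1:]:
            pos += 1
            while pos < len(original) and original[pos] != wanted:
                pos += 1
            if pos >= len(original):
                break  # key cannot be completed from this start
        else:
            if best is None or pos - start < best[1] - best[0]:
                best = (start, pos)
    return best


def recover(original_tokens, response_tokens):
    """Replace a compressed entity with its full span from the original text."""
    window = shortest_window(original_tokens, response_tokens)
    if window is None:
        return response_tokens  # nothing to restore
    start, end = window
    return original_tokens[start:end + 1]


original = "Marie Sklodowska Curie won the Nobel Prize in 1903".split()
response = ["Marie", "Curie"]  # entity as it appeared after compression
recovered = recover(original, response)
print(" ".join(recovered))  # Marie Sklodowska Curie
```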

Inference Optimization

  • Token-level pruning using small LM perplexity
  • Coarse-to-fine compression to reduce input tokens
  • Document reordering to reduce position bias
  • Dynamic per-document compression budgets
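
Dynamic per-document budgets can be sketched as a keep ratio that decays linearly with relevance rank, so the top-ranked document retains the most tokens. The linear schedule and the parameter names `base_ratio` and `delta` are assumptions chosen for illustration; the paper's exact scheduler is not reproduced here.

```python
def keep_ratios(num_docs, base_ratio, delta):
    """Linearly decay the keep ratio from base+delta (rank 0) to base-delta (last rank)."""
    if num_docs == 1:
        return [min(max(base_ratio, 0.0), 1.0)]
    ratios = []
    for rank in range(num_docs):
        shift = delta * (1 - 2 * rank / (num_docs - 1))
        # Clamp so a ratio never leaves the valid [0, 1] range.
        ratios.append(min(max(base_ratio + shift, 0.0), 1.0))
    return ratios


# Five documents already sorted by relevance, most relevant first.
ratios = keep_ratios(num_docs=5, base_ratio=0.5, delta=0.3)
print([round(r, 2) for r in ratios])  # [0.8, 0.65, 0.5, 0.35, 0.2]
```

The average keep ratio stays at `base_ratio` (when no clamping occurs), so the overall token budget is unchanged while its distribution shifts toward relevant documents.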

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires re-compression per question, preventing caching of a single compressed context for multiple queries.
  • Adds extra compute (about twice the cost of LLMLingua) for the small-model scoring step, which can offset gains in some settings.
  • May lose subtle or complex context-question relationships because coarse question-aware ranking is not perfect.

When Not To Use

  • When you need to reuse the same compressed context for many different questions (no caching support).
  • When the relation between context and question is extremely subtle and could be broken by coarse filtering.
  • When the extra small-LM compute overhead outweighs savings for short prompts.

Failure Modes

  • Small LM mis-ranks documents/tokens and drops crucial context, harming answers.
  • Subtle multi-hop dependencies span tokens removed at coarse level, reducing multi-hop accuracy.
  • Hallucination by the small LM increases without the restrictive prompt, as shown in ablations.

Core Entities

Models

  • GPT-3.5-Turbo
  • LongChat-13B-16k
  • LLaMA-2-7B-Chat
  • GPT2-small (ablation)

Metrics

  • F1 / task score
  • Accuracy
  • End-to-end latency (sec)
  • API inference cost ($ per 1k samples)

Datasets

  • NaturalQuestions (multi-doc QA)
  • LongBench
  • ZeroSCROLLS
  • MuSiQue
  • LooGLE

Benchmarks

  • NaturalQuestions
  • LongBench
  • ZeroSCROLLS
  • MuSiQue
  • LooGLE