Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.9
Citation Count
12
Why It Matters For Business
If you run LLMs on long documents, compressing each prompt for the question at hand cuts API cost and latency and often improves answer quality, so the same budget serves more queries.
Summary TLDR
LongLLMLingua is a question-aware, coarse-to-fine prompt compressor that uses a small language model to select the documents and tokens most relevant to a question. It adds document reordering, dynamic compression ratios, and a subsequence-recovery step. Across multi-document QA and long-context benchmarks, compressed prompts often match or beat full prompts while sending 2x–6x fewer tokens. Reported examples: up to +21.4% accuracy on NaturalQuestions with ~4x fewer tokens, an estimated 94% cost reduction on LooGLE, and 1.4x–2.6x end-to-end latency speedups on large prompts.
Problem Statement
Long prompts raise API cost and latency, add irrelevant text that degrades answers, and expose the LLM's position bias (information in the middle of the prompt is used poorly). The paper aims to compress prompts so that the LLM sees denser, better-positioned key information at lower cost and latency.
Main Contribution
A question-aware, coarse-to-fine compression method that ranks documents first and then tokens by their relevance to the question.
A document reordering strategy that puts higher-relevance documents near prompt edges to mitigate 'lost-in-the-middle' position bias.
Dynamic per-document compression budgets so more relevant documents keep more tokens.
A subsequence recovery post-process that restores original token subsequences in model outputs to fix truncated entities.
Extensive evaluation across NaturalQuestions, LongBench, ZeroSCROLLS, MuSiQue, and LooGLE showing accuracy, cost, and latency benefits.
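The reordering idea can be illustrated with a small sketch. This is one possible reading of "near prompt edges", not the paper's confirmed procedure: the helper name `reorder_edges_first` and the alternating front/back placement are assumptions made for illustration, and the relevance scores are made up. A plain descending sort by score is a simpler alternative that only favors the front of the prompt.

```python
from typing import List

def reorder_edges_first(docs: List[str], scores: List[float]) -> List[str]:
    """Place the most relevant documents at the two ends of the prompt.

    Hypothetical helper: rank 1 goes first, rank 2 goes last, rank 3 goes
    second, rank 4 goes second to last, and so on, so the weakest documents
    end up in the middle.
    """
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    # Sort document indices by descending score; ties keep original order.
    ranked = sorted(range(len(docs)), key=lambda i: (-scores[i], i))
    front: List[str] = []
    back: List[str] = []
    for rank, idx in enumerate(ranked):
        # Even ranks fill the prompt from the front, odd ranks from the back.
        (front if rank % 2 == 0 else back).append(docs[idx])
    return front + back[::-1]

# Made-up relevance scores for five documents.
docs = ["A", "B", "C", "D", "E"]
scores = [0.1, 0.9, 0.4, 0.7, 0.2]
ordered = reorder_edges_first(docs, scores)
```

In this toy run the two strongest documents (B and D) sit at the first and last positions, and the weakest (A) lands in the middle.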
Key Findings
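Compressed prompts can improve accuracy over the original long prompts on multi-document QA (up to +21.4% on NaturalQuestions with ~4x fewer tokens).
Substantial monetary savings on very long-context benchmarks (an estimated 94% cost reduction on LooGLE).
End-to-end latency improves with higher compression ratios (1.4x–2.6x speedups on large prompts).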
Each proposed module contributes to the gains; removing any one reduces performance.
Results
Accuracy
Cost reduction (estimated)
Latency speedup
Average score on LongBench (2,000 token constraint)
Who Should Care
What To Try In 7 Days
Run LongLLMLingua (or a simple question-aware filter) before API calls on one long-context pipeline and compare cost/latency.
Enable document reordering: put highest-relevance docs near the prompt edges to test position-bias fixes.
Measure end-to-end latency and cost per 1k samples; validate that compressed answers meet your accuracy threshold.
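For the cost comparison, a back-of-the-envelope calculation may be enough to start with. The sketch below assumes a provider that bills per 1,000 input and output tokens. The function name, the token counts, and the prices are all placeholders, not figures from the paper or from any real price list.

```python
def cost_per_1k_samples(avg_input_tokens: float, avg_output_tokens: float,
                        usd_per_1k_input: float, usd_per_1k_output: float) -> float:
    """USD cost of 1,000 requests, given average token counts per request
    and per-1k-token prices (all inputs are assumptions you supply)."""
    per_request = (avg_input_tokens / 1000.0) * usd_per_1k_input \
        + (avg_output_tokens / 1000.0) * usd_per_1k_output
    return per_request * 1000.0

# Placeholder prices and token counts, chosen only to show the arithmetic.
PRICE_IN, PRICE_OUT = 0.001, 0.002
full = cost_per_1k_samples(12000, 100, PRICE_IN, PRICE_OUT)
compressed = cost_per_1k_samples(3000, 100, PRICE_IN, PRICE_OUT)  # 4x fewer input tokens
saving = 1.0 - compressed / full
print(f"full=${full:.2f} compressed=${compressed:.2f} saving={saving:.1%}")
```

Output tokens are not compressed, so the saving is always somewhat smaller than the input compression ratio alone suggests. The small-LM scoring cost is also not included here and should be added for a fair comparison.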
Optimization Features
Token Efficiency
- Contrastive perplexity (question-conditioned) to score tokens
- Budget controller to allocate tokens to instruction/question/documents
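A toy sketch of question-conditioned token scoring follows. The exact metric is not spelled out in this summary, so this is only one plausible reading: a token counts as important when conditioning the small LM on the question makes it much more likely. The function names and all log-probabilities are invented for illustration. In practice the two log-probability lists would come from a small causal LM evaluated with and without the question.

```python
from typing import List

def contrastive_scores(logp_plain: List[float],
                       logp_with_question: List[float]) -> List[float]:
    """Score each token by how much the question raises its log-probability
    (equivalently, how much the question lowers its surprisal)."""
    if len(logp_plain) != len(logp_with_question):
        raise ValueError("both log-probability lists must align token by token")
    return [w - p for p, w in zip(logp_plain, logp_with_question)]

def keep_top_tokens(tokens: List[str], scores: List[float], budget: int) -> List[str]:
    """Keep the `budget` highest-scoring tokens, preserving original order."""
    if budget <= 0:
        return []
    top = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))[:budget]
    return [tokens[i] for i in sorted(top)]

# Invented numbers: log p(token | context) with and without the question.
tokens = ["The", "capital", "of", "France", "is", "Paris", "today"]
logp_plain = [-1.0, -4.0, -0.5, -5.0, -0.7, -6.0, -3.0]
logp_with_q = [-1.0, -2.0, -0.5, -1.5, -0.7, -1.0, -3.1]

scores = contrastive_scores(logp_plain, logp_with_q)
kept = keep_top_tokens(tokens, scores, budget=3)
```

Tokens whose probability is unaffected by the question (function words here) score near zero and are pruned first.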
System Optimization
- Subsequence recovery to map compressed output back to original text
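Subsequence recovery can be sketched as follows, under simplifying assumptions: text is already tokenized into lists, the compressor records which original positions it kept (the `kept` index list is an assumed bookkeeping detail), and only the single longest matching run is restored. The paper's actual matching procedure may differ, and the example sentence is invented.

```python
from typing import List

def recover_subsequence(original: List[str], kept: List[int],
                        response: List[str]) -> List[str]:
    """Replace the longest run of response tokens that also appears
    contiguously in the compressed prompt with the full original span."""
    compressed = [original[i] for i in kept]
    best_len, best_r, best_c = 0, 0, 0
    for r in range(len(response)):
        for c in range(len(compressed)):
            n = 0
            while (r + n < len(response) and c + n < len(compressed)
                   and response[r + n] == compressed[c + n]):
                n += 1
            if n > best_len:
                best_len, best_r, best_c = n, r, c
    if best_len < 2:
        # A single-token match is too weak to justify rewriting the output.
        return list(response)
    # Original span from the first matched kept position to the last one.
    start, end = kept[best_c], kept[best_c + best_len - 1]
    restored = original[start:end + 1]
    return response[:best_r] + restored + response[best_r + best_len:]

original = ["Acme", "Global", "Holdings", "Ltd", "reported", "record", "profit"]
kept = [0, 3, 4, 6]  # compressed prompt: Acme Ltd reported profit
response = ["Acme", "Ltd", "reported", "growth"]
fixed = recover_subsequence(original, kept, response)
```

Here the truncated entity "Acme Ltd" in the model output is expanded back to "Acme Global Holdings Ltd" because the matched run maps to a contiguous span of the original text.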
Inference Optimization
- Token-level pruning using small LM perplexity
- Coarse-to-fine compression to reduce input tokens
- Document reordering to reduce position bias
- Dynamic per-document compression budgets
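Dynamic per-document budgets can be illustrated with a rank-based schedule. This is a minimal sketch under assumptions, not the paper's scheduler: the keep ratio falls linearly from `hi` for the top-ranked document to `lo` for the lowest-ranked one, and both defaults, the function names, and the example numbers are hypothetical.

```python
from typing import List

def rank_based_keep_ratios(scores: List[float],
                           hi: float = 0.8, lo: float = 0.2) -> List[float]:
    """Fraction of tokens each document keeps, interpolated linearly by rank."""
    n = len(scores)
    if n == 0:
        return []
    order = sorted(range(n), key=lambda i: (-scores[i], i))
    ratios = [0.0] * n
    for rank, idx in enumerate(order):
        frac = rank / (n - 1) if n > 1 else 0.0  # 0 for the best, 1 for the worst
        ratios[idx] = hi - (hi - lo) * frac
    return ratios

def token_budgets(doc_lengths: List[int], scores: List[float],
                  hi: float = 0.8, lo: float = 0.2) -> List[int]:
    """Token budget per document; every document keeps at least one token."""
    if len(doc_lengths) != len(scores):
        raise ValueError("doc_lengths and scores must have the same length")
    ratios = rank_based_keep_ratios(scores, hi, lo)
    return [max(1, round(length * r)) for length, r in zip(doc_lengths, ratios)]

# Three equally long documents with made-up relevance scores.
budgets = token_budgets([1000, 1000, 1000], [0.9, 0.1, 0.5])
```

The most relevant document keeps the most tokens and the least relevant keeps the fewest, which is the behavior the bullet above describes. The overall compression ratio is then set by the choice of `hi` and `lo`.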
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires re-compression per question, preventing caching of a single compressed context for multiple queries.
- Adds extra compute (about twice the cost of LLMLingua) for the small-model scoring step, which can offset gains in some settings.
- May lose subtle or complex context-question relationships because coarse question-aware ranking is not perfect.
When Not To Use
- When you need to reuse the same compressed context for many different questions (no caching support).
- When the relation between context and question is extremely subtle and could be broken by coarse filtering.
- When the extra small-LM compute overhead outweighs savings for short prompts.
Failure Modes
- Small LM mis-ranks documents/tokens and drops crucial context, harming answers.
- Subtle multi-hop dependencies may span tokens removed at the coarse level, reducing multi-hop accuracy.
- Hallucination by the small LM increases without the restrictive prompt, as shown in ablations.
Core Entities
Models
- GPT-3.5-Turbo
- LongChat-13B-16k
- LLaMA-2-7B-Chat
- GPT2-small (ablation)
Metrics
- F1 / task score
- Accuracy
- End-to-end latency (sec)
- API inference cost ($ per 1k samples)
Datasets
- NaturalQuestions (multi-doc QA)
- LongBench
- ZeroSCROLLS
- MuSiQue
- LooGLE
Benchmarks
- NaturalQuestions
- LongBench
- ZeroSCROLLS
- MuSiQue
- LooGLE

