Overview
The method uses a small LM to select question-relevant tokens, reorders documents to reduce position bias, and recovers original subsequences. Experiments on multiple public benchmarks support the claims, but the approach must re-compress the context for every question and adds compute overhead.
Citations: 12
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 90%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
If you run LLMs on long documents, compressing prompts per question cuts API cost and latency and often improves answer quality, so the same budget serves more queries.
Who Should Care
Summary TLDR
LongLLMLingua is a question-aware, coarse-to-fine prompt compressor that uses a small language model to pick and keep tokens and documents most relevant to a question. It adds document reordering, dynamic compression ratios, and a subsequence-recovery step. Across multi-document QA and long-context benchmarks, compressed prompts often match or beat full prompts while sending 2x–6x fewer tokens. Examples: up to +21.4% accuracy on NaturalQuestions with ~4x fewer tokens, 94% estimated cost drop on LooGLE, and 1.4x–2.6x end-to-end latency speedups on large prompts.
Problem Statement
Long prompts raise API cost and latency, add noisy, irrelevant text that hurts answers, and suffer from LLM position bias (information in the middle of the prompt is used poorly). The paper aims to compress prompts so that LLMs see denser, better-positioned key information while saving cost and time.
Main Contribution
A question-aware coarse-to-fine token and document compression metric that ranks documents and tokens by their relevance to the question.
A document reordering strategy that puts higher-relevance documents near prompt edges to mitigate 'lost-in-the-middle' position bias.
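The two contributions above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `overlap_score` is a hypothetical word-overlap stand-in for the small-LM relevance metric, and `reorder_to_edges` shows only one possible way to place high-relevance documents near the prompt edges (the paper's exact ordering may differ).

```python
from typing import Callable, List


def overlap_score(question: str, document: str) -> float:
    """Toy stand-in scorer (assumption): fraction of question words
    that also appear in the document. The paper scores relevance with
    a small language model instead."""
    q_words = set(question.lower().split())
    d_words = set(document.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


def reorder_to_edges(
    question: str,
    documents: List[str],
    score: Callable[[str, str], float] = overlap_score,
) -> List[str]:
    """Rank documents by relevance, then alternate them between the
    front and the back of the prompt so the least relevant ones end
    up in the middle."""
    ranked = sorted(documents, key=lambda d: score(question, d), reverse=True)
    front: List[str] = []
    back: List[str] = []
    for i, doc in enumerate(ranked):
        # Even ranks go to the front, odd ranks to the back.
        (front if i % 2 == 0 else back).append(doc)
    # Reverse the back half so the second-best document is last.
    return front + back[::-1]
```

With five documents ranked r0 (best) to r4 (worst), this sketch yields the order r0, r2, r4, r3, r1: the two most relevant documents sit at the start and end, and the weakest sits in the middle.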
Key Findings
Compressed prompts can improve accuracy vs. original long prompts on multi-document QA (up to +21.4% on NaturalQuestions at ~4x fewer tokens).
Estimated cost drops by 94.0% on LooGLE, a very long-context benchmark ($93.6 → $5.6 per 1,000 samples).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | +21.4% at 4x fewer tokens (best case) | Original long prompt | +21.4% | NaturalQuestions (ground-truth doc at 10th position) | Abstract; Table 1 | Abstract; Table 1 |
| Cost reduction (estimated) | ↓94.0% on LooGLE per 1,000 samples | Original prompt cost $93.6 → $5.6 | ↓94.0% | LooGLE | Table 9 (cost estimates) | Table 9 |
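
The cost delta in the table can be checked from the two dollar figures it reports. The snippet below is a small arithmetic check; the function name is illustrative and the inputs are the per-1,000-sample costs quoted in the LooGLE row.

```python
# Costs in USD per 1,000 samples, taken from the LooGLE row above.
original_cost = 93.6
compressed_cost = 5.6


def cost_reduction_pct(original: float, compressed: float) -> float:
    """Relative cost reduction, in percent."""
    return (original - compressed) / original * 100


reduction = cost_reduction_pct(original_cost, compressed_cost)
print(f"{reduction:.1f}%")  # prints 94.0%
```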
What To Try In 7 Days
Run LongLLMLingua (or a simple question-aware filter) before API calls on one long-context pipeline and compare cost/latency.
Enable document reordering: put highest-relevance docs near the prompt edges to test position-bias fixes.
Measure end-to-end latency and cost per 1k samples; validate that compressed answers meet your accuracy threshold.
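The comparison in the steps above could be tracked with a small harness like the following. It is a sketch under stated assumptions: `RunStats`, `cost_per_1k_samples`, and `compare` are hypothetical names, the token price is a placeholder you would replace with your provider's rate, and the token, latency, and accuracy numbers must come from your own runs.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    prompt_tokens: int  # total prompt tokens sent across all samples
    latency_s: float    # total end-to-end seconds across all samples
    correct: int        # number of answers judged correct
    samples: int        # number of samples evaluated


def cost_per_1k_samples(stats: RunStats, usd_per_1k_tokens: float) -> float:
    """Prompt cost in USD for 1,000 samples at the given token price."""
    mean_tokens = stats.prompt_tokens / stats.samples
    cost_per_sample = mean_tokens / 1000 * usd_per_1k_tokens
    return cost_per_sample * 1000


def compare(baseline: RunStats, compressed: RunStats,
            usd_per_1k_tokens: float, min_accuracy: float) -> dict:
    """Summarize compression ratio, speedup, cost saving, and whether
    the compressed run meets the accuracy threshold."""
    base_cost = cost_per_1k_samples(baseline, usd_per_1k_tokens)
    comp_cost = cost_per_1k_samples(compressed, usd_per_1k_tokens)
    return {
        "compression_ratio": baseline.prompt_tokens / compressed.prompt_tokens,
        "latency_speedup": baseline.latency_s / compressed.latency_s,
        "cost_saving_pct": (base_cost - comp_cost) / base_cost * 100,
        "meets_accuracy": compressed.correct / compressed.samples >= min_accuracy,
    }


# Example with made-up numbers (not results from the paper).
baseline = RunStats(prompt_tokens=1_000_000, latency_s=200.0, correct=70, samples=100)
compressed = RunStats(prompt_tokens=250_000, latency_s=100.0, correct=72, samples=100)
report = compare(baseline, compressed, usd_per_1k_tokens=0.5, min_accuracy=0.70)
```

Latency here should include the compression step itself, since the small-model scoring adds overhead that can offset part of the gain.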
Optimization Features
Token Efficiency
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Requires re-compression per question, preventing caching of a single compressed context for multiple queries.
Adds extra compute (about twice the cost of LLMLingua) for the small-model scoring step, which can offset gains in some settings.
When Not To Use
When you need to reuse the same compressed context for many different questions (no caching support).
When the relation between context and question is extremely subtle and could be broken by coarse filtering.
Failure Modes
Small LM mis-ranks documents/tokens and drops crucial context, harming answers.
Subtle multi-hop dependencies span tokens removed at coarse level, reducing multi-hop accuracy.

