7 papers found

Question-aware prompt compression that speeds up LLMs and often improves accuracy on very long contexts

0.70
0.60
0.90
12

If you run LLMs on long documents, compressing prompts per question saves API cost and latency while often improving answer quality, so you can serve more queries at lower cost.

Key finding

Compressed prompts can improve accuracy vs. original long prompts on multi-document QA.

Numbers: NaturalQuestions: up to +21.4% (Abstract; Table 1)

Compress long prompts into short natural-language 'Capsule Prompts' that cut cost and latency while keeping accuracy

0.60
0.50
0.80
1

Compress prompts into short natural-language pieces to cut API token costs and inference time without retraining target LLMs; this reduces cloud bills and speeds up batch processing.

Key finding

Capsule Prompts reduce input length by about 81.4% on evaluated tasks.

Numbers: 81.4% compression rate (reported in Table 1 and text)

Compress MT evaluation prompts to cut tokens ~2.37× while keeping evaluation quality

0.70
0.60
0.80
0

Compress evaluation prompts to cut LLM token bills by roughly 2.4× while keeping metric quality, making large-scale or repeated MT evaluations more affordable.

Key finding

Up to 2.37× reduction in input tokens for MT metric evaluation on the evaluated 16k test set.

Numbers: 19M → 8.07M tokens (reduction 2.37×)
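
To relate a fold reduction to a percentage saving, here is a minimal Python sketch. The function names are hypothetical, and the inputs are the rounded token counts quoted in this entry (19M before, 8.07M after), so the computed ratio comes out slightly below the reported 2.37×.

```python
def fold_reduction(tokens_before: float, tokens_after: float) -> float:
    """Return how many times smaller the compressed input is (2.0 means half)."""
    if tokens_before <= 0 or tokens_after <= 0:
        raise ValueError("token counts must be positive")
    return tokens_before / tokens_after


def percent_saved(fold: float) -> float:
    """Convert a fold reduction into the percentage of tokens removed."""
    if fold <= 0:
        raise ValueError("fold reduction must be positive")
    return (1.0 - 1.0 / fold) * 100.0


# Rounded figures from this entry: 19M tokens before, 8.07M after.
fold = fold_reduction(19e6, 8.07e6)
print(f"{fold:.2f}x fewer tokens, {percent_saved(fold):.1f}% saved")
```

By the same conversion, a 2.37× reduction corresponds to roughly 57.8% fewer input tokens, which makes it easier to compare against entries that report a percentage instead of a ratio.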

Compress prompts by turning text into relation graphs, keeping readability and model utility

0.70
0.70
0.70
0

Compress prompts into readable information units to cut LLM API cost and latency while often improving downstream accuracy on evaluated tasks.

Key finding

On GSM8K-aug (task-agnostic, 2-shot), Prompt-SAW improved Exact Match (EM) by 10.1% over the best baseline while cutting prompt tokens by 34.9%.

Numbers: EM ↑ 10.1%; tokens 612 → 399 (−34.9%)

Do multi-step math without long traces: refine compact latent anchors and stop when stable

0.60
0.60
0.70
0

AdaAnchor can cut output-token costs by over 90% and halve silent compute iterations on average. That lowers inference bandwidth and token billing for applications that only need final answers (e.g., calculators, automated graders) while preserving or improving accuracy in some cases.

Key finding

Adaptive halting sharply reduces average latent refinement steps compared to a fixed K budget.

Numbers: Avg steps reduced ~48–61% (Table 2; adaptive 3.23–4.12 vs fixed 8)
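
The idea of stopping refinement once the state is stable, rather than always spending a fixed budget of K steps, can be illustrated with a generic fixed-point loop. This is only a sketch under stated assumptions, not AdaAnchor's actual procedure: the `refine` callable, the max-absolute-change stopping rule, the tolerance, and the toy update are all hypothetical stand-ins.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def refine_until_stable(
    state: Vector,
    refine: Callable[[Vector], Vector],
    max_steps: int = 8,
    tol: float = 1e-3,
) -> Tuple[Vector, int]:
    """Apply `refine` repeatedly; halt early once the largest change is below `tol`.

    Returns the final state and the number of refinement steps actually used.
    """
    steps_used = 0
    for step in range(1, max_steps + 1):
        new_state = refine(state)
        # Assumed stability measure: largest per-coordinate change.
        delta = max(abs(a - b) for a, b in zip(new_state, state))
        state = new_state
        steps_used = step
        if delta < tol:
            break
    return state, steps_used


def toy_refine(s: Vector) -> Vector:
    """Toy contraction that converges to 2.0 in every coordinate."""
    return [0.5 * x + 1.0 for x in s]


final_state, steps = refine_until_stable([0.0, 0.0], toy_refine, max_steps=8, tol=0.1)
print(f"halted after {steps} of 8 steps")
```

With a loose tolerance the loop halts before exhausting the budget; with a very tight one it falls back to the fixed cap of `max_steps`, so the cap still bounds worst-case compute.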

Compress prompts by sampling attention-important tokens and sentences with a small RL policy

0.70
0.60
0.60
0

PIS lowers inference time and token costs by using attention-aware compression and a small RL policy, letting teams serve long-context LLM tasks faster while often preserving or improving accuracy.

Key finding

PIS improves task performance at the same compression rate compared to strong baselines.

Numbers: 15% relative performance improvement at equivalent compression ratios (paper claim)

Use the frozen LLM itself to compress over-limit prompts into memory tokens at 1/12 of the original length

0.60
0.65
0.70
0

SelfCP lets you fit much longer context into an existing LLM by training a tiny adapter instead of re-training the whole model, cutting memory and inference costs while often improving output quality on summarization and QA.

Key finding

SelfCP compresses long prompts to 1/12 of their original token count and uses the resulting memory tokens during generation.

Numbers: 12× compression ratio used by default
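
To get a feel for what a 12× ratio means in token terms, here is a small Python sketch. The helper name is hypothetical, and rounding up to a whole token is an assumption made for illustration; the listing does not say how SelfCP handles lengths that are not a multiple of 12.

```python
def memory_token_count(prompt_tokens: int, ratio: int = 12) -> int:
    """Estimate how many memory tokens replace `prompt_tokens` at a given ratio.

    Assumption: partial groups are rounded up to one full memory token.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    # Integer ceiling division avoids floating-point rounding issues.
    return (prompt_tokens + ratio - 1) // ratio


for n in (1200, 6000, 6001):
    print(n, "prompt tokens ->", memory_token_count(n), "memory tokens")
```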