79 papers found

Question-aware prompt compression that speeds up LLMs and often improves accuracy on very long contexts

0.70
0.60
0.90
12

If you run LLMs on long documents, compressing prompts per question saves API cost and latency while often improving answer quality, so you can serve more queries at lower cost.

Key finding

Compressed prompts can improve accuracy vs. original long prompts on multi-document QA.

Numbers: NaturalQuestions: up to +21.4% (Abstract; Table 1)

Cut KV-cache memory up to 5× at test time by storing only the persistently important tokens

0.70
0.60
0.80
10

Shrinking KV cache at inference increases batch size and throughput on fixed-memory GPUs without retraining, cutting hosting cost or enabling longer contexts on the same hardware.

Key finding

KV cache can be several times larger than model weights and becomes the memory bottleneck.

Numbers: KV cache 2.5× larger than weights (Table 1; e.g., OPT-175B weights 325GB vs KV 1152GB at batch 128, seq 2048)
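The memory figures in this entry can be checked with back-of-the-envelope arithmetic. The sketch below assumes fp16 storage (2 bytes per value) and an OPT-175B shape of 96 layers with hidden size 12288. Neither assumption is stated in the summary above, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers, hidden_size, batch, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and token."""
    per_token = 2 * n_layers * hidden_size * bytes_per_value  # 2 = one K plus one V
    return per_token * batch * seq_len

GIB = 2 ** 30
# Assumed OPT-175B shape: 96 layers, hidden size 12288, fp16 values.
kv_gib = kv_cache_bytes(96, 12288, batch=128, seq_len=2048) / GIB
weights_gib = 175e9 * 2 / GIB  # 175B parameters at 2 bytes each

print(f"KV cache: {kv_gib:.0f} GiB, weights: {weights_gib:.0f} GiB")
print(f"KV cache is {kv_gib / weights_gib:.2f}x the weights")
```

Under these assumptions the KV cache comes out at 1152 GiB, which matches the figure quoted from Table 1.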

Learned token-dropping that prunes up to 80% of context to speed and shrink autoregressive Transformers

0.70
0.60
0.75
8

You can cut inference memory and often double throughput on long prompts by fine-tuning a small module, reducing costs for batched long-context services.

Key finding

The fine-tuned model can remove ~80% of prior tokens with almost no perplexity loss.

Numbers: 80.35% sparsity → −0.085 avg perplexity (context=1000) vs dense

Prompt LLMs to propose hyperparameters and training code; they match or beat standard HPO early in search.

0.60
0.60
0.70
7

LLMs can find better hyperparameters faster than random search in low-budget settings, speeding model iteration and cutting compute cost when trials are expensive.

Key finding

GPT-4 Turbo beats random search on HPOBench in the 10-evaluation setting.

Numbers: Beats random 81.25%; median error change 13.70%; mean change 19.83% (Table 1).

Keyformer halves KV cache by keeping only 'key' tokens, doubling token throughput with no fine-tuning

0.80
0.65
0.80
6

Keyformer cuts memory traffic and latency for long-context generation without retraining, lowering inference cost and enabling higher throughput on existing GPU servers.

Key finding

Attention concentrates on a small subset of tokens ("key tokens").

Numbers: ≈90% of attention mass on ~40% of tokens (Fig.3b)
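One way to measure this kind of concentration is to sort a row of attention weights and count how many tokens are needed to reach a target share of the total. The sketch below uses made-up weights and a hypothetical helper, `tokens_needed_for_mass`. It illustrates the measurement only and is not Keyformer's token-selection rule.

```python
def tokens_needed_for_mass(attention_weights, target_mass=0.9):
    """Smallest number of tokens whose attention weights sum to target_mass."""
    total = sum(attention_weights)
    running = 0.0
    # Greedily take the highest-weight tokens first.
    for count, w in enumerate(sorted(attention_weights, reverse=True), start=1):
        running += w
        if running >= target_mass * total:
            return count
    return len(attention_weights)

# Toy attention row over 10 tokens (made-up numbers, sums to 1.0).
row = [0.40, 0.25, 0.15, 0.12, 0.02, 0.02, 0.01, 0.01, 0.01, 0.01]
k = tokens_needed_for_mass(row, target_mass=0.9)
print(f"{k} of {len(row)} tokens cover 90% of the attention mass")
```

In this toy row, 4 of 10 tokens carry 90% of the mass, the same kind of skew the entry describes.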

MiniCache: merge adjacent layers' KV caches to cut memory and speed up LLM inference

0.80
0.60
0.80
3

MiniCache cuts KV cache memory by up to 41% and can raise throughput ~5× without retraining, enabling lower GPU costs, larger batches, and longer contexts for production LLM services.

Key finding

Up to 5.02× KV cache compression when combined with 4-bit KV quantization.

Numbers: 5.02× compression (Table 1, LongBench average)

Compress long contexts into cached activations (beacons) to cut KV memory 8x and speed inference ~2x while keeping quality

0.70
0.60
0.70
2

Cuts serving memory by up to 8x and halves latency on long inputs while keeping task quality, letting teams process far larger documents at lower GPU cost.

Key finding

Compression preserves generation quality on evaluated long-context benchmarks.

Numbers: Single-Doc: Ours 34.9 vs Full-FT 34.8 (LongBench Table 1)

SWiM: a working-memory test exposes 'lost-in-the-middle' and fixes it with cheap medoid voting

0.60
0.60
0.50
2

SWiM finds real-world failure modes that synthetic tests miss; medoid voting is a cheap, production-friendly fix that can raise QA accuracy without retraining.

Key finding

Long-context models commonly perform worse when the answer document appears in the middle of the context window.
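Medoid voting can be sketched as follows, assuming the candidate answers come from several runs of the same question (for example, with the documents reordered in the context). The Jaccard word-overlap distance is a stand-in chosen to keep the example dependency-free; the summary above does not say which similarity measure the paper uses, and both function names are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over lowercase word sets (a stand-in metric)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def medoid_answer(answers, distance=jaccard_distance):
    """Return the answer with the smallest total distance to all others."""
    return min(answers, key=lambda a: sum(distance(a, b) for b in answers))

candidates = [
    "Paris is the capital",
    "the capital is Paris",
    "London",
]
print(medoid_answer(candidates))
```

The outlier answer has the largest total distance to the others, so it is never selected. No retraining is involved, only extra inference runs.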

Cut prompt cost by up to ~68% by keeping only query-relevant sentences and lightly compressing the rest

0.60
0.60
0.80
2

LeanContext lowers pay-per-use LLM input tokens so small teams can run domain QA faster and cheaper while keeping similar answer quality.

Key finding

Adaptive LeanContext reduces prompt tokens and saves cost with little accuracy loss.

Numbers: ArXiv N=4: prompt tokens 521->321, cost savings 37.29%, ROUGE-1 drop 0.3985->0.3844 (-0.0141)
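The "keep only query-relevant sentences" step can be illustrated with a deliberately simple ranking. The sketch below scores sentences by word overlap with the query and keeps the top `k`. This is an assumption made for illustration: the summary above does not specify LeanContext's scoring function or how `k` is chosen, the light compression of the remaining sentences is omitted, and `top_k_sentences` is a hypothetical helper.

```python
import re

def top_k_sentences(context, query, k=2):
    """Keep the k sentences sharing the most words with the query, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    query_words = set(re.findall(r"\w+", query.lower()))

    def overlap(sentence):
        return len(query_words & set(re.findall(r"\w+", sentence.lower())))

    ranked = sorted(range(len(sentences)), key=lambda i: overlap(sentences[i]), reverse=True)
    keep = sorted(ranked[:k])  # restore document order
    return [sentences[i] for i in keep]

context = (
    "The KV cache stores keys and values. "
    "Bananas are yellow. "
    "Cache size grows with sequence length. "
    "The weather was mild."
)
reduced = top_k_sentences(context, "How does the KV cache size grow?", k=2)
print(reduced)
```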

ChunkAttention: share KV cache by chunking prompt prefixes to speed self-attention 3.2–4.8×

0.85
0.70
0.80
1

If many requests reuse the same system prompt, ChunkAttention cuts attention latency and KV memory dramatically, letting you serve more users from the same GPUs or reduce cloud costs.

Key finding

Self-attention kernel becomes 3.2–4.8× faster on A100 when many requests share long prompt prefixes.

Numbers: kernel speedup 3.2–4.8× (ns=1024..4096)
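The sharing idea can be pictured as a prefix tree whose nodes are fixed-size chunks of prompt tokens, so requests with the same prefix reuse the same nodes. The sketch below only counts how many chunks would need fresh KV entries. `ChunkNode`, `insert_prompt`, and the chunk size of 4 are illustrative assumptions, not ChunkAttention's actual data structures or kernel.

```python
class ChunkNode:
    """One node per distinct chunk of prompt tokens; children extend the prefix."""
    def __init__(self):
        self.children = {}

def insert_prompt(root, tokens, chunk_size):
    """Insert a prompt chunk by chunk; return how many new chunks were allocated."""
    node, allocated = root, 0
    for start in range(0, len(tokens), chunk_size):
        chunk = tuple(tokens[start:start + chunk_size])
        if chunk not in node.children:
            node.children[chunk] = ChunkNode()  # KV for this chunk would be computed here
            allocated += 1
        node = node.children[chunk]
    return allocated

root = ChunkNode()
system_prompt = list(range(8))          # 8 shared tokens -> 2 chunks of 4
request_a = system_prompt + [100, 101]  # unique suffix
request_b = system_prompt + [200, 201]

new_a = insert_prompt(root, request_a, chunk_size=4)
new_b = insert_prompt(root, request_b, chunk_size=4)
print(new_a, new_b)
```

The second request allocates only its unique suffix chunk, because the two system-prompt chunks are already in the tree.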

Compress KV cache by low-rank SVD on KV weight matrices with a layerwise progressive rule

0.80
0.60
0.80
1

LoRC halves KV cache memory in many LLaMA deployments with near-zero impact on accuracy, lowering GPU cost and enabling larger batches or longer contexts on the same hardware.

Key finding

LoRC reduces KV cache size by about 55–60% while keeping average performance drop under 1% on evaluated tasks.

Numbers: 55%–60% compression; avg perf drop <1%

LESS: add a tiny constant low-rank state to sparse KV caches and recover much of full-cache quality while cutting memory

0.70
0.60
0.70
1

LESS cuts KV-cache memory needs with tiny extra state while restoring much of full-cache quality, lowering GPU costs and enabling larger batches or longer sequences in production.

Key finding

LESS recovers a substantial fraction of quality lost by sparse caching on summarization.

Numbers: 41.4% of Rouge-1 degradation recovered (Falcon 7B, CNN/DailyMail)

Cut KV cache by >54% and double throughput with layer-wise 'pyramid' selection

0.70
0.60
0.80
1

PyramidInfer lowers GPU memory needs for KV caches and raises throughput, letting you serve larger batches or fewer GPUs for chat workloads and cutting infrastructure cost per token.

Key finding

PyramidInfer halves KV cache and doubles throughput on LLaMA 2-13B.

Numbers: 2.24x throughput; 54.6% KV cache reduction (LLaMA2-13B, A100 80GB).

Compress KV cache by keeping semantic chunks (not single tokens) to save memory and speed up long-context LLMs

0.70
0.60
0.80
1

ChunkKV lowers GPU memory and speeds long-context LLM serving by keeping semantically coherent chunks and reusing indices across layers; this reduces infrastructure cost and improves latency-sensitive applications.

Key finding

Chunk-level compression preserves semantics and reduces accuracy loss versus token-level methods.

Numbers: up to +8.7% precision at same compression ratio (paper abstract)

Learn offline 'cheat-sheets' so a 4k LLaMA2 handles 128k tokens, cutting tokens and latency

0.75
0.60
0.80
1

LLoCO cuts token processing and GPU costs for long-document QA while improving accuracy and latency, letting teams serve very long documents without buying larger models or more GPUs.

Key finding

LLoCO raises average QA performance vs base LLaMA2-7B on evaluated long-doc tasks.

Numbers: Avg score 23.44 -> 30.67 (Table 1; +7.23 pts)

Compress KV cache per layer with a pyramid-shaped budget to cut memory while keeping long‑context performance

0.70
0.65
0.70
1

PyramidKV reduces GPU memory for long-context inference by large factors while keeping retrieval and QA performance, enabling RAG and few‑shot workflows on cheaper hardware.

Key finding

PyramidKV can match full‑KV accuracy in needle-in‑a‑haystack retrieval with tiny caches.

Numbers: LLaMA-3-70B, 8k context, KV=128 → FullKV 100.0% vs PyramidKV 100.0%
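A pyramid-shaped budget can be sketched as a per-layer allocation that shrinks with depth while keeping the average fixed. The linear decay, the `step` parameter, and the helper name `pyramid_budgets` are assumptions for illustration; the summary above does not give PyramidKV's exact allocation rule.

```python
def pyramid_budgets(n_layers, avg_budget, step):
    """Per-layer KV budgets that shrink linearly with depth and average avg_budget."""
    budgets = []
    for layer in range(n_layers):
        # Offset is positive for early layers, negative for late ones, and sums to zero.
        offset = step * (n_layers - 1 - 2 * layer) / 2
        budgets.append(int(avg_budget + offset))
    return budgets

budgets = pyramid_budgets(n_layers=4, avg_budget=128, step=32)
print(budgets, sum(budgets))
```

Because the offsets cancel, the total cache size equals `n_layers * avg_budget`, so the pyramid redistributes a fixed budget across layers rather than enlarging it.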

Use tiny fixed KV caches and learned 1‑D convolutions to compress thousands of tokens with low memory and near-full performance

0.70
0.60
0.70
0

LoCoCo lets you handle much longer documents without buying more GPU memory or changing the model core. That reduces infrastructure cost for long‑context applications and speeds up inference prefill.

Key finding

LoCoCo can compress very long prefill contexts into a tiny KV cache during inference.

Numbers: compressed 3,482 tokens into a 128-size KV cache; accuracy gain vs baseline 0.2791 (reported)

Compress KV caches up to ~80% with no engine changes by aligning per-head important tokens into shared 'composite' positions.

0.80
0.60
0.80
0

KV cache is a major memory and bandwidth cost for long-context LLM serving. KVCompose cuts cache size substantially while keeping accuracy predictable and without engine changes, so teams can reduce infra costs or fit longer contexts on the same hardware.

Key finding

KVCompose reaches higher maximum compression ratios while staying within a fixed accuracy loss tolerance.

Numbers: Avg max compression ratio under ϵ0=20% = 79.8%

Restore LLM context faster by saving hidden states (half the IO, much less recompute)

0.80
0.65
0.70
0

Stateful LLM services suffer long cold-start latencies when context is evicted. HCache reduces first-response latency and host storage needs by using a smaller, fast-to-project representation. That improves user experience for chatbots and RAG apps, lowers storage bills, and eases I/O bottlenecks.

Key finding

Saving hidden states halves the I/O size compared with offloading KV cache.

Numbers: hidden states = 0.5× KV cache
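The 0.5× ratio follows from counting stored values per token per layer, assuming standard multi-head attention in which keys and values each have the same width as the hidden state. That assumption, the width of 4096, and the helper name are illustrative and not taken from the summary above.

```python
def per_token_layer_values(d_model):
    """Values stored per token per layer, assuming keys and values each have width d_model."""
    hidden_state = d_model   # one vector entering the layer
    kv_cache = 2 * d_model   # one key vector plus one value vector
    return hidden_state, kv_cache

hidden, kv = per_token_layer_values(d_model=4096)  # 4096 is an illustrative width
print(hidden / kv)
```

Models that share key/value heads across query heads (grouped-query attention) store a narrower KV cache, so the ratio would differ there.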

Cut KV-cache memory 80–95% with a light fine-tune using low-rank channel shrinking

0.70
0.60
0.85
0

CSKV cuts KV-cache memory by ~80% (95% with QAT), enabling much longer context per GPU and lower serving costs with only a short fine-tune.

Key finding

CSKV reduces KV cache memory by about 80% while preserving long-context accuracy.

Numbers: 80% KV compression → Avg. accuracies ~0.90–0.94 on LongEval subsets (Table 1).

Use off-the-shelf LLMs plus arithmetic coding to losslessly compress gradients

0.40
0.80
0.60
0

LM-GC can cut gradient bytes by ~6%–17% losslessly, lowering network costs in federated or distributed training, but current runtime is slow and needs systems work before production use.

Key finding

LM-GC improves lossless compression vs. best baseline on evaluated datasets.

Numbers: 17.2% improvement (TinyImageNet vs FPZIP); 5.9% (CIFAR-10); 8.8% (MNIST)

Compute directly on 2-bit KV cache to cut network, memory and compute time for disaggregated LLMs

0.70
0.70
0.80
0

If you serve LLMs with separate prefill and decode GPUs (to cut costs), HACK can cut latency and network costs by executing on compressed KV directly. It is most useful for long-context services where KV transfer dominates latency.

Key finding

KV transmission can be a major part of latency in disaggregated setups.

Numbers: KV transmission up to 42.2% of JCT (measured)
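To see why 2-bit KV shrinks the transfer, the sketch below min-max quantizes a list of floats to four levels and packs four codes into each byte. It covers only the storage side: HACK's distinguishing step, computing on the quantized data without dequantizing first, is not shown. The function names and the single shared scale are assumptions for illustration.

```python
def quantize_2bit(values):
    """Min-max quantize floats to 2-bit codes (0..3) and pack four codes per byte."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 or 1.0  # avoid division by zero for constant input
    codes = [round((v - lo) / scale) for v in values]
    packed = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, code in enumerate(codes[i:i + 4]):
            byte |= code << (2 * j)
        packed.append(byte)
    return bytes(packed), lo, scale, len(values)

def dequantize_2bit(packed, lo, scale, n):
    """Unpack 2-bit codes and map them back to approximate floats."""
    codes = [(byte >> (2 * j)) & 0b11 for byte in packed for j in range(4)]
    return [lo + c * scale for c in codes[:n]]

values = [0.0, 1.0, 2.0, 3.0, 3.0, 0.0]
packed, lo, scale, n = quantize_2bit(values)
restored = dequantize_2bit(packed, lo, scale, n)
print(len(packed), restored)
```

Six values fit in 2 bytes here, versus 12 bytes in fp16, before counting the small per-block metadata (`lo` and `scale`).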