Compress KV caches up to ~80% with no engine changes by aligning per-head important tokens into shared 'composite' positions.

Overview

Decision Snapshot: Needs Validation

Evaluated on three open models and a standard long-context benchmark (Ruler-4096). Results include ablations. Claims about engine compatibility are supported by the structured format design, and quantitative gains are shown in multiple tables. External validity outside Ruler-4096 and different workloads requires in-sit

Citations: 0

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 60%

Authors

Dmitry Akulov, Mohamed Sana, Antonio De Domenico, Tareq Si Salem, Nicola Piovesan, Fadhel Ayed


Why It Matters For Business

The KV cache is a major memory and bandwidth cost in long-context LLM serving. KVCompose cuts cache size substantially while keeping accuracy loss predictable and without requiring engine changes, so teams can reduce infrastructure costs or fit longer contexts on the same hardware.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

KVCompose reduces key-value (KV) cache memory while keeping accuracy high. It scores tokens by attention, lets each head pick its own important tokens, then aligns those picks into shared positions (composite tokens). A global allocator gives more slots to informative layers. On the Ruler-4096 benchmark it degrades more gently under aggressive compression than prior structured methods (average AUC 82.3 vs 73.4 for TOVA), and it works with standard inference engines (no custom kernels).
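The global allocator mentioned above can be pictured with a minimal sketch. It assumes, hypothetically, that every layer exposes a list of per-slot importance scores and that the allocator simply keeps the globally highest-scoring slots; the function name `allocate_layer_budgets` and this pooling rule are illustrative assumptions, not the paper's stated procedure.

```python
import heapq


def allocate_layer_budgets(layer_scores, total_budget):
    """Split a global slot budget across layers (illustrative assumption).

    layer_scores[l][j] is the importance of the j-th candidate slot in layer l.
    The top `total_budget` slots are chosen across all layers at once, so
    layers with more high-scoring slots end up with larger budgets.
    """
    # Pool every (score, layer) pair from every layer into one candidate list.
    candidates = [
        (score, layer)
        for layer, scores in enumerate(layer_scores)
        for score in scores
    ]
    # Keep the globally best slots, then count how many fall in each layer.
    kept = heapq.nlargest(total_budget, candidates)
    budgets = [0] * len(layer_scores)
    for _, layer in kept:
        budgets[layer] += 1
    return budgets


# Layer 0 has stronger scores than layer 1, so it receives most of the budget.
example_budgets = allocate_layer_budgets(
    [[0.9, 0.8, 0.7], [0.3, 0.2, 0.1]], total_budget=4
)
print(example_budgets)  # [3, 1]
```

Pooling scores globally is only one way to make the budget layer-adaptive; a production variant might also enforce a minimum number of slots per layer.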

Problem Statement

KV caches grow with context length and depth, making long-context LLM inference memory-heavy and costly. Existing compression methods either break standard tensor layouts, need offline steps or custom kernels, or use rigid heuristics that lose accuracy under heavy compression.

Main Contribution

Attention-guided token scoring that estimates per-layer, per-head token importance from aggregated attention patterns.

Composite tokens: each head independently selects its important positions, then these selections are aligned into a shared per-layer sequence so standard KV tensor shapes stay intact.
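The two contributions above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: attention weights are aggregated over queries by max-pooling, and shared slot j holds each head's j-th highest-scoring token. The function names `score_tokens` and `build_composite_tokens`, and the rank-based slot ordering, are hypothetical and may differ from the paper's exact alignment rule.

```python
def score_tokens(attention):
    """Score context tokens for one head by max-pooling over queries.

    attention[q][t] is the attention weight from query q to context token t.
    """
    num_tokens = len(attention[0])
    return [max(row[t] for row in attention) for t in range(num_tokens)]


def build_composite_tokens(head_scores, keep):
    """Align per-head selections into `keep` shared positions.

    head_scores[h][t] is the importance of token t for head h.
    Returns composite[j][h]: the original token position whose key/value
    head h stores at shared slot j. Every head keeps exactly `keep` entries,
    so the per-layer KV tensor stays rectangular.
    """
    per_head = []
    for scores in head_scores:
        # Rank token positions for this head from most to least important.
        ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
        per_head.append(ranked[:keep])
    num_heads = len(head_scores)
    return [[per_head[h][j] for h in range(num_heads)] for j in range(keep)]


head_scores = [
    [0.1, 0.9, 0.5, 0.2],  # head 0 prefers tokens 1 and 2
    [0.7, 0.1, 0.2, 0.8],  # head 1 prefers tokens 3 and 0
]
print(build_composite_tokens(head_scores, keep=2))  # [[1, 3], [2, 0]]
```

Each composite position mixes different original tokens across heads, which is what lets heads keep different tokens while the sequence dimension stays the same for all of them.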

Key Findings

KVCompose reaches higher maximum compression ratios while staying within a fixed accuracy loss tolerance.

Numbers: Avg max compression ratio under ϵ0=20% = 79.8%

Practical Use: You can shrink KV storage ~4–5× on average (keep ≤20% task loss on Ruler-4096) without changing inference engines.

Evidence Ref: Table 2 (avg perf., ϵ0=20%)
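To make the 79.8% figure concrete, the arithmetic below estimates KV cache size before and after compression. The model shape (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) is an assumption chosen to resemble an 8B-class model with grouped-query attention; it is not taken from the paper, and the helper `kv_cache_bytes` is illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 accounts for storing both a key and a value per token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed model shape (not from the source) at a 4096-token context.
full_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

compression_ratio = 0.798  # average max compression reported at a 20% tolerance
kept_bytes = full_bytes * (1 - compression_ratio)
shrink_factor = 1 / (1 - compression_ratio)

print(f"full cache:  {full_bytes / 2**20:.1f} MiB")  # 512.0 MiB
print(f"compressed:  {kept_bytes / 2**20:.1f} MiB")  # 103.4 MiB
print(f"shrink:      {shrink_factor:.2f}x")           # 4.95x
```

The saving scales linearly with batch size and context length, so the absolute gain is larger for longer contexts than the 4096 tokens used here.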

KVCompose preserves accuracy across compression levels better than prior structured methods.

Numbers: Average AUC (robustness across ratios) = 82.3 (KVCompose) vs 73.4 (TOVA)

Practical Use: Expect gentler accuracy drop as you increase compression. Useful when you need predictable degradation over many compression choices.

Evidence Ref: Table 3 (AUC, avg perf.)
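One plausible way to compute an AUC-style robustness score from your own measurements is the trapezoidal area under the accuracy-versus-compression curve, normalized by the range of ratios tested. The paper's exact normalization is not given in this summary, so the function `robustness_auc` below is an assumption, and the sample data points are made up for illustration.

```python
def robustness_auc(ratios, accuracies):
    """Normalized area under the accuracy-vs-compression curve.

    ratios: compression ratios in [0, 1]; accuracies: task accuracy (0-100)
    measured at each ratio. The result is on the same scale as accuracy,
    so a flat curve at 90 yields 90.
    """
    points = sorted(zip(ratios, accuracies))
    area = 0.0
    # Trapezoidal rule over consecutive measurement points.
    for (r0, a0), (r1, a1) in zip(points, points[1:]):
        area += (r1 - r0) * (a0 + a1) / 2
    span = points[-1][0] - points[0][0]
    return area / span


# Hypothetical measurements, not results from the paper.
auc = robustness_auc([0.0, 0.4, 0.8], [90.0, 85.0, 60.0])
print(round(auc, 1))  # 80.0
```

A higher value means accuracy holds up over more of the compression range, which is the property the AUC comparison above is meant to capture.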

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Max compression ratio	Avg 79.8%	TOVA avg 61.1%	+18.7 pp	Ruler-4096 (avg over 3 models, task-aware & task-agnostic mixes)	Table 2 (avg perf., ϵ0=20%)	Table 2
Max compression ratio	Avg 70.1%	DuoAttention avg 54.0%	+16.1 pp	Ruler-4096 (avg over 3 models)	Table 2 (ϵ0=10%)	Table 2

What To Try In 7 Days

Run KVCompose scoring on one model and dataset to measure achievable compression vs accuracy on your workload (use kvpress for attention patching).

Start with task-aware scoring and max-pooling over task tokens; compare accuracy-compression curves to your current eviction policy.

If you cannot change inference code, prioritize structured KVCompose; if you can modify engines, benchmark an unstructured variant (KVzip) for extra accuracy.
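When comparing accuracy-compression curves as suggested above, a small helper can report the largest compression ratio that stays within a loss tolerance. The sketch assumes the tolerance is relative to uncompressed accuracy (for example, ϵ0=20% means keeping at least 80% of baseline accuracy); whether the paper defines it as relative or absolute is not stated in this summary. The function name and the sample curve are hypothetical.

```python
def max_compression_within_tolerance(curve, baseline_accuracy, tolerance):
    """Largest measured compression ratio whose accuracy stays in tolerance.

    curve: list of (compression_ratio, accuracy) pairs from your own runs.
    tolerance: allowed relative accuracy loss, e.g. 0.2 for 20% (assumption).
    Returns 0.0 if no measured point meets the threshold.
    """
    threshold = baseline_accuracy * (1 - tolerance)
    passing = [ratio for ratio, accuracy in curve if accuracy >= threshold]
    return max(passing, default=0.0)


# Hypothetical measurements on one workload, not results from the paper.
measured_curve = [(0.2, 88.0), (0.4, 86.0), (0.6, 80.0), (0.8, 70.0), (0.9, 50.0)]
best_ratio = max_compression_within_tolerance(
    measured_curve, baseline_accuracy=90.0, tolerance=0.2
)
print(best_ratio)  # 0.6
```

Running the same helper on curves from your current eviction policy and from KVCompose gives a like-for-like number for each.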

Optimization Features

Token Efficiency

retains most informative tokens per head

enables ~70–80% compression while meeting 10–20% loss tolerances on Ruler-4096 (paper experiments)

Infra Optimization

reduces KV memory footprint and inference bandwidth in proportion to the reduction in the sequence dimension

System Optimization

preserves per-layer tensor sequence dimension (engine-compatible)

no custom CUDA kernels required

Inference Optimization

structured KV cache eviction

composite tokens (per-head selection aligned into shared positions)

layer-adaptive global budget allocation

attention-guided token scoring

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://github.com/NVIDIA/kvpress

Risks & Boundaries

Limitations

Evaluation limited to Ruler-4096 and three open models; real-world texts or other tasks may behave differently.

Scoring depends on the task set T and aggregation choices; poor or unrepresentative T can degrade selections.

When Not To Use

When you can change the inference engine and want the highest possible accuracy; unstructured KVzip may be better.

Very short contexts where KV memory is not a bottleneck.

Failure Modes

Misestimating token importance due to an unrepresentative task set T, leading to important-token eviction.

Head score variability causing inconsistent composite alignment if the head-mean stabilization is omitted.

Core Entities

Models

LLaMA-3.1-8B

Qwen2.5-7B-Instruct

Qwen3-14B

Metrics

Accuracy

compression ratio r

max compression at tolerance ϵ0

Datasets

Ruler-4096

kvpress

Benchmarks

Ruler-4096
