Compress KV caches by up to ~80% with no engine changes by aligning each head's important tokens into shared 'composite' positions.

September 5, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

Evaluated on three open models and a standard long-context benchmark (Ruler-4096). Results include ablations. Claims about engine compatibility are supported by the structured format design, and quantitative gains are shown in multiple tables. External validity outside Ruler-4096 and different workloads requires in-sit

Citations: 0

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 60%

Authors

Dmitry Akulov, Mohamed Sana, Antonio De Domenico, Tareq Si Salem, Nicola Piovesan, Fadhel Ayed

Why It Matters For Business

The KV cache is a major memory and bandwidth cost in long-context LLM serving. KVCompose cuts cache size substantially, degrades accuracy predictably, and requires no engine changes, so teams can reduce infrastructure costs or fit longer contexts on the same hardware.

Summary TLDR

KVCompose reduces key-value (KV) cache memory while keeping accuracy high. It scores tokens by attention, lets each head pick its own important tokens, then aligns those picks into shared positions (composite tokens). A global allocator gives more slots to informative layers. Results on the Ruler-4096 benchmark show much higher robustness under aggressive compression than prior structured methods, and it works with standard inference engines (no custom kernels).
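The scoring step can be sketched for a single head as follows. This is a minimal illustration, assuming a token's importance is the maximum attention weight it receives from any task token (max-pooling is the aggregation suggested under What To Try In 7 Days; the paper's exact aggregation may differ). The function name `score_tokens` and the toy attention matrix are hypothetical.

```python
def score_tokens(attn, pool=max):
    """Score each context token for one attention head.

    attn[q][t] is the attention weight that task (query) token q places
    on context token t. The score of token t pools column t over all
    task tokens. Max-pooling is an assumption taken from this summary.
    """
    n_ctx = len(attn[0])
    return [pool(row[t] for row in attn) for t in range(n_ctx)]


# Hypothetical attention weights: 2 task tokens over 4 context tokens.
attn = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.80, 0.10],
]
scores = score_tokens(attn)  # [0.70, 0.10, 0.80, 0.10]

# Positions of the two highest-scoring tokens for this head.
top2 = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)[:2]
print(top2)  # [2, 0]
```

Each head gets its own score vector, which is why two heads in the same layer can end up preferring different positions.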

Problem Statement

KV caches grow with context length and depth, making long-context LLM inference memory-heavy and costly. Existing compression methods either break standard tensor layouts, need offline steps or custom kernels, or use rigid heuristics that lose accuracy under heavy compression.
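To make the growth concrete, here is a back-of-the-envelope size calculation. It assumes a standard dense cache that stores one key and one value vector per token, per KV head, per layer; the model configuration in the example is hypothetical and not taken from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to store keys and values for every layer and KV head."""
    # The factor 2 covers the separate key and value tensors.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical config: 32 layers, 8 KV heads of dimension 128, fp16 storage.
short_ctx = kv_cache_bytes(32, 8, 128, seq_len=4096)
long_ctx = kv_cache_bytes(32, 8, 128, seq_len=131072)

print(short_ctx / 2**20, "MiB")  # 512.0 MiB
print(long_ctx / 2**30, "GiB")   # 16.0 GiB
```

The size is linear in both `seq_len` and `n_layers`, so any method that shortens the stored sequence reduces memory by the same proportion.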

Main Contribution

Attention-guided token scoring that estimates per-layer, per-head token importance from aggregated attention patterns.

Composite tokens: each head independently selects its important positions, then these selections are aligned into a shared per-layer sequence so standard KV tensor shapes stay intact.
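The alignment idea can be illustrated with plain lists. In this sketch, slot j of the compacted cache holds each head's j-th best token, so every head keeps exactly k entries and the cache stays rectangular. Ordering slots by score rank, and the names `build_composite_indices` and `gather`, are assumptions for illustration rather than the authors' implementation.

```python
def build_composite_indices(head_scores, k):
    """Align per-head top-k picks into k shared composite slots.

    head_scores[h][t] is the importance of token t for head h.
    Returns idx where idx[h][j] is the original position that head h
    stores in composite slot j (slot 0 holds each head's best token).
    """
    idx = []
    for scores in head_scores:
        ranked = sorted(range(len(scores)),
                        key=lambda t: scores[t], reverse=True)
        idx.append(ranked[:k])
    return idx


def gather(cache, idx):
    """cache[h][t] is a key (or value) vector; returns a [heads][k] cache."""
    return [[cache[h][t] for t in picks] for h, picks in enumerate(idx)]


head_scores = [
    [0.9, 0.1, 0.5, 0.2],  # head 0 prefers tokens 0 and 2
    [0.1, 0.8, 0.2, 0.7],  # head 1 prefers tokens 1 and 3
]
# Placeholder strings stand in for real key vectors.
keys = [[f"k{h}_{t}" for t in range(4)] for h in range(2)]

idx = build_composite_indices(head_scores, k=2)
compact = gather(keys, idx)
print(idx)      # [[0, 2], [1, 3]]
print(compact)  # [['k0_0', 'k0_2'], ['k1_1', 'k1_3']]
```

Both heads end up with a sequence dimension of 2 even though they kept different original positions, which is what lets a standard engine consume the result as an ordinary, shorter KV tensor.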

Key Findings

KVCompose reaches higher maximum compression ratios while staying within a fixed accuracy loss tolerance.

Numbers: Avg max compression ratio under ϵ0=20% = 79.8%

Practical Use: You can shrink KV storage ~4–5× on average (while keeping task loss ≤20% on Ruler-4096) without changing inference engines.

Evidence Ref: Table 2 (avg perf., ϵ0=20%)
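The link between a compression ratio and a storage shrink factor is simple arithmetic. The sketch below assumes the compression ratio is the fraction of KV entries removed, so the retained fraction is one minus the ratio.

```python
def shrink_factor(compression_ratio):
    """Memory reduction factor when `compression_ratio` of the cache is evicted."""
    if not 0 <= compression_ratio < 1:
        raise ValueError("compression_ratio must be in [0, 1)")
    return 1.0 / (1.0 - compression_ratio)


print(round(shrink_factor(0.798), 2))  # 4.95
print(round(shrink_factor(0.701), 2))  # 3.34
```

Under that assumption, the reported 79.8% average corresponds to roughly a 5× smaller cache, and 70.1% to roughly 3.3×.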

KVCompose preserves accuracy across compression levels better than prior structured methods.

Numbers: Average AUC (robustness across ratios) = 82.3 (KVCompose) vs 73.4 (TOVA)

Practical Use: Expect a gentler accuracy drop as you increase compression. This is useful when you need predictable degradation across many compression settings.

Evidence Ref: Table 3 (AUC, avg perf.)
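One way to compute a robustness score of this kind is the area under the accuracy-versus-compression curve. This summary does not give the paper's exact AUC definition, so the sketch assumes a trapezoidal area normalized by the span of ratios, which keeps the result in accuracy units. The two curves are made-up illustrations, not results from the paper.

```python
def curve_auc(ratios, accuracies):
    """Trapezoidal area under accuracy vs. compression, normalized by ratio span."""
    pts = sorted(zip(ratios, accuracies))
    area = 0.0
    for (r0, a0), (r1, a1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (a0 + a1) / 2
    return area / (pts[-1][0] - pts[0][0])


# Hypothetical curves: same starting accuracy, different degradation.
ratios = [0.0, 0.5, 1.0]
gentle = [100.0, 90.0, 60.0]
steep = [100.0, 70.0, 20.0]

print(curve_auc(ratios, gentle))  # 85.0
print(curve_auc(ratios, steep))   # 65.0
```

A higher value means accuracy holds up across more of the compression range, not just at a single operating point.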

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Max compression ratio | Avg 79.8% | TOVA avg 61.1% | +18.7 pp | Ruler-4096 (avg over 3 models, task-aware & task-agnostic mixes) | Table 2 (avg perf., ϵ0=20%) | Table 2 |
| Max compression ratio | Avg 70.1% | DuoAttention avg 54.0% | +16.1 pp | Ruler-4096 (avg over 3 models) | Table 2 (ϵ0=10%) | Table 2 |
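The values above are maximum compression ratios at a loss tolerance ϵ0. A minimal sketch of how such a number can be read off a measured curve follows; it assumes ϵ0 is a relative accuracy loss against the uncompressed baseline, which this summary does not state explicitly. The curve is hypothetical.

```python
def max_compression_at_tolerance(curve, baseline_acc, eps0):
    """Largest compression ratio whose accuracy stays within eps0 of baseline.

    curve: list of (compression_ratio, accuracy) pairs.
    eps0: tolerated relative accuracy loss, e.g. 0.20 for 20%.
    Treating eps0 as a relative loss is an assumption.
    """
    floor = baseline_acc * (1 - eps0)
    ok = [r for r, acc in curve if acc >= floor]
    return max(ok, default=0.0)


# Hypothetical measurements, not numbers from the paper.
curve = [(0.1, 95.0), (0.3, 93.0), (0.5, 88.0), (0.7, 79.0), (0.9, 40.0)]

print(max_compression_at_tolerance(curve, baseline_acc=96.0, eps0=0.20))  # 0.7
print(max_compression_at_tolerance(curve, baseline_acc=96.0, eps0=0.10))  # 0.5
```

If a measured curve is not monotonic, this picks the largest passing ratio even when a smaller ratio fails; a stricter variant would stop at the first failure.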

What To Try In 7 Days

Run KVCompose scoring on one model and dataset to measure achievable compression vs accuracy on your workload (use kvpress for attention patching).

Start with task-aware scoring and max-pooling over task tokens; compare accuracy-compression curves to your current eviction policy.

If you cannot change inference code, prioritize structured KVCompose; if you can modify engines, benchmark an unstructured variant (KVzip) for extra accuracy.
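A small harness for the comparison in these steps might look like the sketch below. The `evaluate` callables are stand-ins you would replace with real benchmark runs on your workload; they are not part of kvpress or any other library, and the linear stub curves are invented for illustration.

```python
def sweep(evaluate, ratios):
    """Run `evaluate(ratio) -> accuracy` at each ratio and return the curve."""
    return [(r, evaluate(r)) for r in ratios]


def compare(policies, ratios):
    """Sweep every named policy over the same ratios."""
    return {name: sweep(fn, ratios) for name, fn in policies.items()}


# Stub evaluators with made-up linear degradation. Replace with real runs.
def current_policy(ratio):
    return 100.0 - 80.0 * ratio


def candidate_policy(ratio):
    return 100.0 - 40.0 * ratio


ratios = [0.25, 0.5, 0.75]
curves = compare({"current": current_policy, "candidate": candidate_policy}, ratios)

# Accuracy gap (candidate minus current) at each compression ratio.
gaps = [(r, c - b) for (r, c), (_, b) in zip(curves["candidate"], curves["current"])]
print(gaps)  # [(0.25, 10.0), (0.5, 20.0), (0.75, 30.0)]
```

Keeping the ratio grid identical across policies makes the resulting curves directly comparable point by point.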

Optimization Features

Token Efficiency

retains most informative tokens per head

enables ~70–80% compression while meeting 10–20% loss tolerances on Ruler-4096 (paper experiments)

Infra Optimization

reduces KV memory footprint and inference bandwidth in proportion to the reduction in sequence dimension

System Optimization

preserves per-layer tensor sequence dimension (engine-compatible)

no custom CUDA kernels required

Inference Optimization

structured KV cache eviction

composite tokens (per-head selection aligned into shared positions)

layer-adaptive global budget allocation

attention-guided token scoring
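Layer-adaptive global budget allocation can be sketched as a pooled ranking: score every composite slot in every layer, keep the globally best ones, and count how many each layer retains. The slot score used here (assumed to be a mean over heads of the j-th ranked token score) and the function name `allocate_budget` are assumptions for illustration; the paper's allocator may differ in detail.

```python
def allocate_budget(layer_slot_scores, total_budget):
    """Split a global slot budget across layers by pooled ranking.

    layer_slot_scores[l][j] is the score of composite slot j in layer l,
    sorted in descending order within each layer.
    Returns the number of slots kept per layer.
    """
    pooled = [(s, l) for l, scores in enumerate(layer_slot_scores) for s in scores]
    pooled.sort(key=lambda p: p[0], reverse=True)
    keep = [0] * len(layer_slot_scores)
    for _, layer in pooled[:total_budget]:
        keep[layer] += 1
    return keep


# Hypothetical slot scores for two layers with four slots each.
slot_scores = [
    [0.90, 0.80, 0.70, 0.60],  # informative layer
    [0.75, 0.20, 0.10, 0.05],  # less informative layer
]
print(allocate_budget(slot_scores, total_budget=4))  # [3, 1]
```

Because slots are ranked within each layer, keeping a count is the same as keeping a prefix, so every layer still ends with a rectangular cache. A production version would likely also enforce a minimum number of slots per layer.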

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluation limited to Ruler-4096 and three open models; real-world texts or other tasks may behave differently.

Scoring depends on the task set T and aggregation choices; poor or unrepresentative T can degrade selections.

When Not To Use

When you can change the inference engine and want the highest possible accuracy; unstructured KVzip may be better.

Very short contexts where KV memory is not a bottleneck.

Failure Modes

Misestimating token importance due to an unrepresentative task set T, leading to important-token eviction.

Head score variability causing inconsistent composite alignment if the head-mean stabilization is omitted.

Core Entities

Models

LLaMA-3.1-8B, Qwen2.5-7B-Instruct, Qwen3-14B

Metrics

Accuracy, compression ratio r, max compression at tolerance ϵ0

Datasets

Ruler-4096, kvpress

Benchmarks

Ruler-4096