Overview
Evaluated on three open models and a standard long-context benchmark (Ruler-4096). Results include ablations. Claims about engine compatibility are supported by the structured format design, and quantitative gains are shown in multiple tables. External validity outside Ruler-4096 and different workloads requires in-sit
Citations: 0
Evidence Strength: 0.85
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 80%
Novelty: 60%
Why It Matters For Business
KV cache is a major memory and bandwidth cost for long-context LLM serving. KVCompose cuts cache size substantially while keeping accuracy predictable and without engine changes—so teams can reduce infra costs or fit longer contexts on the same hardware.
Summary TLDR
KVCompose reduces key-value (KV) cache memory while keeping accuracy high. It scores tokens by attention, lets each head pick its own important tokens, then aligns those picks into shared positions (composite tokens). A global allocator gives more slots to informative layers. Results on the Ruler-4096 benchmark show much higher robustness under aggressive compression than prior structured methods, and it works with standard inference engines (no custom kernels).
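The global allocation step described above can be pictured with a small sketch. This is an illustration under stated assumptions, not the paper's implementation: it assumes every layer already has one score per composite token, and that the allocator simply keeps the highest-scoring composite tokens across all layers under a single budget. The function name `allocate_slots` and the example scores are hypothetical.

```python
import heapq


def allocate_slots(layer_scores, budget):
    """Return how many composite tokens each layer keeps.

    layer_scores[l] holds one score per composite token in layer l
    (assumed input). The `budget` highest scores overall are kept, so
    layers with more high-scoring tokens receive more slots.
    """
    pooled = [
        (score, layer)
        for layer, scores in enumerate(layer_scores)
        for score in scores
    ]
    kept = heapq.nlargest(budget, pooled)
    counts = [0] * len(layer_scores)
    for _, layer in kept:
        counts[layer] += 1
    return counts


# Hypothetical scores for two layers; layer 0 is more informative.
example = allocate_slots([[0.9, 0.8, 0.1], [0.5, 0.2, 0.05]], budget=3)
print(example)
```

Because the kept tokens in each layer are always that layer's top-scoring ones, a per-layer count is enough to describe the result.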
Problem Statement
KV caches grow with context length and depth, making long-context LLM inference memory-heavy and costly. Existing compression methods break standard tensor layouts, need offline steps or custom kernels, or use rigid heuristics that lose accuracy under heavy compression.
Main Contribution
Attention-guided token scoring that estimates per-layer, per-head token importance from aggregated attention patterns.
Composite tokens: each head independently selects its important positions, then these selections are aligned into a shared per-layer sequence so standard KV tensor shapes stay intact.
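The two contributions above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming that scores are obtained by max-pooling attention over task tokens and that slot j of the shared sequence holds each head's j-th highest-scoring position. The helper names (`score_tokens`, `build_composites`, `gather`) are hypothetical and not taken from the paper or from kvpress.

```python
def score_tokens(attn):
    """Max-pool attention over task tokens.

    attn[h][q][t] is the weight head h assigns from task token q to
    context token t. Returns scores[h][t].
    """
    return [
        [max(row[t] for row in head) for t in range(len(head[0]))]
        for head in attn
    ]


def build_composites(scores, keep):
    """Slot j of the shared sequence holds each head's j-th best position."""
    ranked = [
        sorted(range(len(s)), key=lambda t: s[t], reverse=True)
        for s in scores
    ]
    return [[r[j] for r in ranked] for j in range(keep)]


def gather(cache, composites):
    """Map cache[h][t] to compressed[h][j]; every head keeps `keep` entries."""
    return [
        [cache[h][slot[h]] for slot in composites]
        for h in range(len(cache))
    ]


# Two heads, four context positions, keep two composite tokens.
scores = [[0.1, 0.9, 0.5, 0.2], [0.8, 0.1, 0.3, 0.7]]
composites = build_composites(scores, keep=2)
cache = [["a0", "a1", "a2", "a3"], ["b0", "b1", "b2", "b3"]]
print(gather(cache, composites))
```

Each head picks different positions, yet the compressed cache stays rectangular (heads by `keep`), which is why the standard KV tensor shape is preserved.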
Key Findings
KVCompose reaches higher maximum compression ratios while staying within a fixed accuracy loss tolerance.
KVCompose preserves accuracy across compression levels better than prior structured methods.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | Avg 79.8% | TOVA avg 61.1% | +18.7 pp | Ruler-4096 (avg over 3 models, task-aware & task-agnostic mixes) | Table 2 (avg perf., ϵ0=20%) | Table 2 |
| Accuracy | Avg 70.1% | DuoAttention avg 54.0% | +16.1 pp | Ruler-4096 (avg over 3 models) | Table 2 (ϵ0=10%) | Table 2 |
What To Try In 7 Days
Run KVCompose scoring on one model and dataset to measure achievable compression vs accuracy on your workload (use kvpress for attention patching).
Start with task-aware scoring and max-pooling over task tokens; compare accuracy-compression curves to your current eviction policy.
If you cannot change inference code, prioritize structured KVCompose; if you can modify engines, benchmark an unstructured variant (KVzip) for extra accuracy.
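For comparing accuracy-compression curves as suggested above, one simple summary is the largest compression ratio that stays within a loss tolerance. The sketch below assumes the tolerance is a relative drop from the uncompressed accuracy; the paper's exact definition may differ. The function name and the sample curve are hypothetical, not measured results.

```python
def max_compression_within_tolerance(curve, baseline_acc, tolerance):
    """Largest compression ratio whose accuracy stays within tolerance.

    curve: iterable of (compression_ratio, accuracy) pairs you measured.
    tolerance: allowed relative accuracy loss, e.g. 0.10 for 10%
    (assumed definition).
    Returns None if no measured point meets the tolerance.
    """
    floor = baseline_acc * (1.0 - tolerance)
    within = [ratio for ratio, acc in curve if acc >= floor]
    return max(within) if within else None


# Made-up measurements for illustration only.
curve = [(0.0, 0.90), (0.5, 0.88), (0.7, 0.83), (0.9, 0.60)]
print(max_compression_within_tolerance(curve, baseline_acc=0.90, tolerance=0.10))
```

Running the same function on the curve from your current eviction policy gives a like-for-like number to compare.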
Optimization Features
Token Efficiency
retains most informative tokens per head
enables ~70–80% compression while meeting 10–20% loss tolerances on Ruler-4096 (paper experiments)
Infra Optimization
reduces KV memory footprint and inference bandwidth in proportion to the reduction along the sequence dimension
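The proportionality follows from the standard KV cache size formula: two tensors (keys and values) per layer, each of size KV heads × head dimension × cached tokens. The sketch below works through the arithmetic for an illustrative configuration; the model dimensions are assumptions chosen for the example, not the models evaluated in the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Bytes for keys plus values across all layers (fp16 by default)."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch)


# Illustrative configuration (assumed): 32 layers, 8 KV heads, head_dim 128.
full = kv_cache_bytes(32, 8, 128, seq_len=4096)
compressed = kv_cache_bytes(32, 8, 128, seq_len=1024)  # 75% of tokens dropped

print(full // 2**20, "MiB ->", compressed // 2**20, "MiB")
```

With a global allocator, the number of cached tokens can differ per layer, so `seq_len` here is best read as the average retained length across layers.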
Risks & Boundaries
Limitations
Evaluation limited to Ruler-4096 and three open models; real-world texts or other tasks may behave differently.
Scoring depends on the task set T and aggregation choices; poor or unrepresentative T can degrade selections.
When Not To Use
When you can change the inference engine and want the highest possible accuracy: unstructured KVzip may be better.
Very short contexts where KV memory is not a bottleneck.
Failure Modes
Misestimating token importance due to an unrepresentative task set T, leading to important-token eviction.
Head score variability causing inconsistent composite alignment if the head-mean stabilization is omitted.

