Overview
Production Readiness
0.8
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
The KV cache is a major memory and bandwidth cost in long-context LLM serving. KVCompose cuts cache size substantially while keeping accuracy predictable and requiring no engine changes, so teams can reduce infrastructure costs or fit longer contexts on the same hardware.
Summary TLDR
KVCompose reduces key-value (KV) cache memory while keeping accuracy high. It scores tokens by attention, lets each head pick its own important tokens, then aligns those picks into shared positions (composite tokens). A global allocator gives more slots to informative layers. Results on the Ruler-4096 benchmark show much higher robustness under aggressive compression than prior structured methods, and it works with standard inference engines (no custom kernels).
Problem Statement
KV caches grow with context length and depth, making long-context LLM inference memory-heavy and costly. Existing compression methods either break standard tensor layouts, need offline steps or custom kernels, or use rigid heuristics that lose accuracy under heavy compression.
Main Contribution
Attention-guided token scoring that estimates per-layer, per-head token importance from aggregated attention patterns.
Composite tokens: each head independently selects its important positions, then these selections are aligned into a shared per-layer sequence so standard KV tensor shapes stay intact.
Layer-adaptive global budget allocation that ranks composite-token importance across layers and gives more slots to layers with more informative tokens.
A practical pipeline compatible with existing inference engines (vLLM, Hugging Face) that avoids custom attention kernels.
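The composite-token idea can be illustrated with a minimal NumPy sketch. It assumes that slot i of the compressed cache holds, for every head, that head's i-th most important token, and that a composite token's score is the mean of the per-head scores placed in its slot. The function names, the mean aggregation, and the array layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def build_composite_tokens(scores, budget):
    """Align per-head selections into shared slots (illustrative sketch).

    scores: (num_heads, seq_len) importance of each token for each head.
    Returns:
      indices          (num_heads, budget): original position stored by
                       head h in shared slot i.
      composite_scores (budget,): mean over heads of the scores in slot i
                       (mean aggregation is an assumption).
    """
    # Each head ranks its own tokens, most important first.
    order = np.argsort(-scores, axis=1, kind="stable")
    ranked = np.take_along_axis(scores, order, axis=1)
    indices = order[:, :budget]
    composite_scores = ranked[:, :budget].mean(axis=0)
    return indices, composite_scores

def gather_kv(kv, indices):
    """Select the kept entries per head.

    kv: (num_heads, seq_len, head_dim) -> (num_heads, budget, head_dim).
    Every head keeps the same number of slots, so the tensor stays dense.
    """
    return np.take_along_axis(kv, indices[:, :, None], axis=1)
```

Because every head keeps exactly `budget` entries, the output is an ordinary dense tensor with a shorter sequence dimension, which is the property that lets a standard engine consume it unchanged.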
Key Findings
KVCompose reaches higher maximum compression ratios while staying within a fixed accuracy loss tolerance.
KVCompose preserves accuracy across compression levels better than prior structured methods.
Unstructured methods can get slightly higher accuracy but need custom kernels; KVCompose trades a small accuracy gap for deployability.
Aggregation choices and score stabilization matter for performance.
Results
Accuracy
Who Should Care
What To Try In 7 Days
Run KVCompose scoring on one model and dataset to measure achievable compression vs accuracy on your workload (use kvpress for attention patching).
Start with task-aware scoring and max-pooling over task tokens; compare accuracy-compression curves to your current eviction policy.
If you cannot change inference code, prioritize structured KVCompose; if you can modify engines, benchmark an unstructured variant (KVzip) for extra accuracy.
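As a starting point for the scoring experiment, the sketch below pools the attention that a set of task tokens pays to each context token, per head. The `(heads, task tokens, context tokens)` layout and the `score_tokens` name are assumptions made for illustration. How you extract the attention weights (for example through kvpress) is outside this sketch.

```python
import numpy as np

def score_tokens(attn, pool="max"):
    """Pool attention over task tokens to score context tokens.

    attn: (num_heads, num_task_tokens, num_context_tokens) attention
          weights from task tokens to context tokens (assumed layout).
    Returns: (num_heads, num_context_tokens) per-head importance scores.
    """
    if pool == "max":
        # A context token is important if any task token attends to it.
        return attn.max(axis=1)
    if pool == "mean":
        # Alternative aggregation for comparison.
        return attn.mean(axis=1)
    raise ValueError(f"unknown pool: {pool!r}")
```

Running both pooling modes on the same prompts gives a quick check of how sensitive your accuracy-compression curve is to the aggregation choice.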
Optimization Features
Token Efficiency
- retains most informative tokens per head
- enables ~70–80% compression while meeting 10–20% loss tolerances on Ruler-4096 (paper experiments)
Infra Optimization
- reduces KV memory footprint and memory bandwidth in proportion to the reduction in the sequence dimension
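A back-of-the-envelope calculation shows the scale of the saving. The sketch assumes a dense cache that stores one key and one value vector per layer, per KV head, per token. The example configuration (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is an assumption chosen to resemble an 8B-class model with grouped-query attention, and it is not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, keep_ratio=1.0):
    """Approximate KV cache size for one sequence.

    keep_ratio is the fraction of sequence positions retained after
    compression (1.0 means no compression).
    """
    kept_tokens = int(seq_len * keep_ratio)
    # Factor 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * kept_tokens * bytes_per_elem

full = kv_cache_bytes(32, 8, 128, 4096)                        # 512 MiB
compressed = kv_cache_bytes(32, 8, 128, 4096, keep_ratio=0.25)  # 128 MiB
```

Under these assumptions a 4096-token sequence needs 512 MiB uncompressed, and keeping 25% of positions brings that to 128 MiB per sequence.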
System Optimization
- preserves per-layer tensor sequence dimension (engine-compatible)
- no custom CUDA kernels required
Inference Optimization
- structured KV cache eviction
- composite tokens (per-head selection aligned into shared positions)
- layer-adaptive global budget allocation
- attention-guided token scoring
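The layer-adaptive allocation can be sketched as a global top-k over composite-token scores from all layers. The sketch assumes each layer provides its composite scores sorted in descending order and that scores are directly comparable across layers. Any cross-layer normalization the paper applies is not modeled here, and `allocate_layer_budgets` is a hypothetical name.

```python
import numpy as np

def allocate_layer_budgets(composite_scores, total_budget):
    """Distribute a global slot budget across layers (illustrative sketch).

    composite_scores: list of 1-D arrays, one per layer, each sorted in
                      descending order.
    total_budget:     total number of composite tokens to keep.
    Returns: list of ints, the number of slots granted to each layer.
    """
    num_layers = len(composite_scores)
    layer_ids = np.concatenate(
        [np.full(len(s), i, dtype=np.int64) for i, s in enumerate(composite_scores)]
    )
    flat = np.concatenate(composite_scores)
    # Keep the globally highest-scoring composite tokens.
    keep = np.argsort(-flat, kind="stable")[:total_budget]
    # Count how many kept tokens fall in each layer.
    return np.bincount(layer_ids[keep], minlength=num_layers).tolist()
```

Since each layer's scores are sorted, a layer granted k slots simply keeps its first k composite tokens, so layers with more high-scoring tokens end up with longer caches.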
Reproducibility
Data Urls
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation limited to Ruler-4096 and three open models; real-world texts or other tasks may behave differently.
- Scoring depends on the task set T and aggregation choices; poor or unrepresentative T can degrade selections.
- Unstructured variants give higher accuracy but need custom kernels—KVCompose trades a small accuracy gap for deployability.
- Value-norm weighting is model-dependent and can hurt some models (LLaMA-3.1-8B showed degradation).
When Not To Use
- When you can change the inference engine and want the highest possible accuracy; unstructured KVzip may be the better choice.
- Very short contexts where KV memory is not a bottleneck.
- When you cannot collect representative attention signals for scoring (no task set).
Failure Modes
- Misestimating token importance due to an unrepresentative task set T, leading to important-token eviction.
- Head score variability causing inconsistent composite alignment if the head-mean stabilization is omitted.
- Model-specific quirks: value-norm weighting can sometimes harm accuracy depending on the model.
- Edge cases where rigid tensor alignment still loses information compared to fully unstructured per-head storage.
Core Entities
Models
- LLaMA-3.1-8B
- Qwen2.5-7B-Instruct
- Qwen3-14B
Metrics
- Accuracy
- compression ratio r
- max compression at tolerance ϵ0
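The last metric can be computed from a measured accuracy-compression curve. The sketch below assumes the tolerance is a relative accuracy loss against the uncompressed baseline and returns the largest compression ratio whose accuracy stays within it. Whether the paper uses a relative or absolute tolerance is not stated here, so treat that choice and the function name as assumptions.

```python
def max_compression_at_tolerance(curve, baseline_acc, tolerance):
    """Largest compression ratio whose accuracy is within the tolerance.

    curve:        iterable of (compression_ratio, accuracy) pairs.
    baseline_acc: accuracy of the uncompressed model.
    tolerance:    allowed relative accuracy loss, e.g. 0.1 for 10%
                  (relative loss is an assumption).
    Returns 0.0 if no measured point satisfies the tolerance.
    """
    threshold = baseline_acc * (1.0 - tolerance)
    passing = [ratio for ratio, acc in curve if acc >= threshold]
    return max(passing, default=0.0)
```

Feeding in the curve you measure for your own workload and eviction policy gives a single number per method that is easy to compare.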
Datasets
- Ruler-4096
- kvpress (tooling used for attention patching, not a dataset)
Benchmarks
- Ruler-4096

