Compress KV caches by up to ~80% with no engine changes by aligning each head's important tokens into shared 'composite' positions.

September 5, 2025 · 7 min

Overview

Production Readiness: 0.8

Novelty Score: 0.6

Cost Impact Score: 0.8

Citation Count: 0

Authors

Dmitry Akulov, Mohamed Sana, Antonio De Domenico, Tareq Si Salem, Nicola Piovesan, Fadhel Ayed

Why It Matters For Business

KV cache is a major memory and bandwidth cost for long-context LLM serving. KVCompose cuts cache size substantially while keeping accuracy predictable and without engine changes—so teams can reduce infra costs or fit longer contexts on the same hardware.

Summary TLDR

KVCompose reduces key-value (KV) cache memory while keeping accuracy high. It scores tokens by attention, lets each head pick its own important tokens, then aligns those picks into shared positions (composite tokens). A global allocator gives more slots to informative layers. Results on the Ruler-4096 benchmark show much higher robustness under aggressive compression than prior structured methods, and it works with standard inference engines (no custom kernels).
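The composite-token idea can be illustrated for a single layer. The sketch below is one reading of the description above, not the authors' implementation: each head ranks its own tokens, and composite position j holds every head's j-th ranked token, so all heads keep the same number of entries. The function names, the mean used to score a composite position, and the toy data are assumptions.

```python
def build_composite_tokens(scores, keep):
    """scores[h][t] is the importance of token t for head h (one layer).

    Returns (indices, composite_scores) where indices[h][j] is the position
    of head h's j-th most important token, and composite_scores[j] is the
    mean of those j-th ranked scores over heads (assumed aggregation).
    """
    num_heads = len(scores)
    indices = []
    for head_scores in scores:
        # Rank this head's tokens from most to least important.
        order = sorted(range(len(head_scores)), key=lambda t: -head_scores[t])
        indices.append(order[:keep])
    composite_scores = [
        sum(scores[h][indices[h][j]] for h in range(num_heads)) / num_heads
        for j in range(keep)
    ]
    return indices, composite_scores


def gather(cache, indices):
    """Select each head's chosen entries; every head ends with the same length."""
    return [[cache[h][t] for t in idx] for h, idx in enumerate(indices)]


# Toy layer: 2 heads, 4 cached tokens, keep 2 composite positions.
scores = [[0.1, 0.9, 0.3, 0.7],
          [0.8, 0.2, 0.6, 0.1]]
keys = [["k0_0", "k0_1", "k0_2", "k0_3"],
        ["k1_0", "k1_1", "k1_2", "k1_3"]]
indices, composite_scores = build_composite_tokens(scores, keep=2)
compact_keys = gather(keys, indices)
print(indices)       # head 0 keeps tokens 1 and 3, head 1 keeps tokens 0 and 2
print(compact_keys)
```

Because every head contributes exactly `keep` entries, the compacted cache is still a regular heads-by-positions array even though the heads retained different original tokens.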

Problem Statement

KV caches grow with context length and depth, making long-context LLM inference memory-heavy and costly. Existing compression methods either break standard tensor layouts, need offline steps or custom kernels, or use rigid heuristics that lose accuracy under heavy compression.
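A back-of-the-envelope calculation makes the scaling concrete. The sketch below assumes the published LLaMA-3.1-8B shape (32 layers, 8 KV heads, head dimension 128) with fp16 storage and a 4096-token context; the helper name is illustrative and does not come from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to store keys and values for every layer."""
    # Factor 2 covers the separate key and value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem * batch


full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
kept_fraction = 1 - 0.798           # 79.8% of the cache removed
compressed = full * kept_fraction

print(f"full cache:       {full / 2**20:.1f} MiB per sequence")
print(f"after compression: {compressed / 2**20:.1f} MiB per sequence")
```

The size is linear in both `seq_len` and `num_layers`, which is why long contexts on deep models dominate serving memory once many sequences are batched.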

Main Contribution

Attention-guided token scoring that estimates per-layer, per-head token importance from aggregated attention patterns.

Composite tokens: each head independently selects its important positions, then these selections are aligned into a shared per-layer sequence so standard KV tensor shapes stay intact.

Layer-adaptive global budget allocation that ranks composite-token importance across layers and gives more slots to layers with more informative tokens.

A practical pipeline compatible with existing inference engines (vLLM, Hugging Face) that avoids custom attention kernels.
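The layer-adaptive allocation can be sketched as a global top-B selection over composite-token scores. This is a minimal illustration under assumptions: each layer's composite scores are taken to be sorted in descending order, the function name and toy numbers are hypothetical, and the paper's exact ranking rule may differ.

```python
def allocate_budget(layer_composite_scores, total_budget):
    """Decide how many composite tokens each layer keeps.

    layer_composite_scores[l] lists layer l's composite scores in
    descending order. The total_budget highest scores across all layers
    are kept, so layers with more informative tokens receive more slots.
    """
    candidates = [
        (score, layer)
        for layer, scores in enumerate(layer_composite_scores)
        for score in scores
    ]
    # Stable sort: ties keep their original (layer, rank) order.
    candidates.sort(key=lambda c: -c[0])
    keep = [0] * len(layer_composite_scores)
    for _, layer in candidates[:total_budget]:
        keep[layer] += 1
    return keep


# Toy example: 3 layers with 3 composite tokens each, global budget of 5.
layers = [[0.90, 0.80, 0.10],
          [0.50, 0.40, 0.30],
          [0.95, 0.20, 0.05]]
keep_per_layer = allocate_budget(layers, total_budget=5)
print(keep_per_layer)
```

Since scores within a layer are descending, the tokens selected for a layer always form a prefix of its composite sequence, so each layer can simply truncate to its assigned length.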

Key Findings

KVCompose reaches higher maximum compression ratios while staying within a fixed accuracy loss tolerance.

Numbers: Avg max compression ratio under ϵ0 = 20% is 79.8%
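One way to compute a "maximum compression ratio within tolerance" from a measured curve is sketched below. It assumes the tolerance ϵ0 is a relative accuracy drop against the uncompressed baseline, which this summary does not state explicitly; the curve values are invented for illustration and are not results from the paper.

```python
def max_compression_within_tolerance(curve, eps0):
    """Largest measured compression ratio whose accuracy stays within eps0.

    curve: list of (compression_ratio, accuracy) pairs that includes the
    uncompressed point at ratio 0.0. eps0 is treated as a relative loss.
    """
    baseline = dict(curve)[0.0]
    threshold = (1 - eps0) * baseline
    acceptable = [ratio for ratio, acc in curve if acc >= threshold]
    return max(acceptable)


# Hypothetical accuracy-vs-compression measurements for one model.
curve = [(0.0, 90.0), (0.3, 89.0), (0.5, 86.0),
         (0.7, 78.0), (0.8, 70.0), (0.9, 45.0)]
best_at_20 = max_compression_within_tolerance(curve, eps0=0.2)
best_at_10 = max_compression_within_tolerance(curve, eps0=0.1)
print(best_at_20, best_at_10)
```

The result is limited to the ratios that were actually measured, so a finer sweep near the threshold gives a tighter estimate.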

KVCompose preserves accuracy across compression levels better than prior structured methods.

Numbers: Average AUC (robustness across ratios) = 82.3 (KVCompose) vs 73.4 (TOVA)
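An AUC-style robustness score can be approximated with the trapezoidal rule over the accuracy-versus-compression curve. The normalization by the ratio range (so the score stays on the accuracy scale) is an assumption, and both curves below are made up for illustration.

```python
def curve_auc(curve):
    """Trapezoidal area under (compression_ratio, accuracy) points,
    divided by the ratio range so the result reads like a mean accuracy."""
    pts = sorted(curve)
    area = sum(
        (r1 - r0) * (a0 + a1) / 2
        for (r0, a0), (r1, a1) in zip(pts, pts[1:])
    )
    return area / (pts[-1][0] - pts[0][0])


# Hypothetical curves: one degrades gently, the other collapses early.
robust = [(0.0, 90.0), (0.5, 80.0), (1.0, 40.0)]
fragile = [(0.0, 90.0), (0.5, 50.0), (1.0, 10.0)]
robust_auc = curve_auc(robust)
fragile_auc = curve_auc(fragile)
print(robust_auc, fragile_auc)
```

Two methods with the same uncompressed accuracy can differ widely on this score, which is what makes it useful for comparing eviction policies across the whole compression range.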

Unstructured methods can get slightly higher accuracy but need custom kernels; KVCompose trades a small accuracy gap for deployability.

Numbers: Unstructured KVzip avg = 89.5 vs unstructured KVCompose = 88.4 (both in the unstructured setting)

Aggregation choices and score stabilization matter for performance.

Numbers: Max-pooling for task aggregation improves accuracy; adding head-mean improves robustness (ablation)
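The two aggregation choices can be sketched as follows. This is a hedged interpretation: max-pooling is taken over the task tokens that attend to the context, and "adding head-mean" is read as adding the across-head mean score to each head's score. The exact formula in the paper may differ, and the attention values are toy numbers.

```python
def aggregate_scores(attn, add_head_mean=True):
    """attn[q][h][t]: attention from task token q, in head h, to context token t.

    Returns scores[h][t]: max over task tokens, optionally stabilized by
    adding the mean of the pooled scores across heads (assumed form).
    """
    num_queries = len(attn)
    num_heads = len(attn[0])
    num_tokens = len(attn[0][0])
    pooled = [
        [max(attn[q][h][t] for q in range(num_queries)) for t in range(num_tokens)]
        for h in range(num_heads)
    ]
    if not add_head_mean:
        return pooled
    head_mean = [
        sum(pooled[h][t] for h in range(num_heads)) / num_heads
        for t in range(num_tokens)
    ]
    return [[pooled[h][t] + head_mean[t] for t in range(num_tokens)]
            for h in range(num_heads)]


# Toy attention: 2 task tokens, 2 heads, 3 context tokens.
attn = [
    [[0.50, 0.25, 0.25], [0.00, 0.50, 0.50]],
    [[0.25, 0.50, 0.25], [0.25, 0.25, 0.50]],
]
pooled_only = aggregate_scores(attn, add_head_mean=False)
stabilized = aggregate_scores(attn, add_head_mean=True)
print(pooled_only)
print(stabilized)
```

Max-pooling keeps a token if any task token attends to it strongly, while the shared head-mean term pulls the heads' rankings toward each other.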

Results

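Accuracy (avg max compression ratio at ϵ0 = 20%)

Value: Avg 79.8%

Baseline: TOVA avg 61.1%

Accuracy

Value: Avg 70.1%

Baseline: DuoAttention avg 54.0%

Accuracy (AUC across compression ratios)

Value: 82.3

Baseline: TOVA 73.4

Accuracy (unstructured setting)

Value: KVzip 89.5, Unstructured KVCompose 88.4

Baseline: Ada-variants ~79.2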

Who Should Care

What To Try In 7 Days

Run KVCompose scoring on one model and dataset to measure achievable compression vs accuracy on your workload (use kvpress for attention patching).

Start with task-aware scoring and max-pooling over task tokens; compare accuracy-compression curves to your current eviction policy.

If you cannot change inference code, prioritize structured KVCompose; if you can modify engines, benchmark an unstructured method (KVzip or the unstructured KVCompose variant) for extra accuracy.

Optimization Features

Token Efficiency

  • retains most informative tokens per head
  • enables ~70–80% compression while meeting 10–20% loss tolerances on Ruler-4096 (paper experiments)

Infra Optimization

  • reduces KV memory footprint and memory bandwidth in proportion to the reduction in cached sequence length

System Optimization

  • keeps one shared sequence length for all heads within a layer, so KV tensors stay regular (engine-compatible)
  • no custom CUDA kernels required

Inference Optimization

  • structured KV cache eviction
  • composite tokens (per-head selection aligned into shared positions)
  • layer-adaptive global budget allocation
  • attention-guided token scoring

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation limited to Ruler-4096 and three open models; real-world texts or other tasks may behave differently.
  • Scoring depends on the task set T and aggregation choices; poor or unrepresentative T can degrade selections.
  • Unstructured variants give higher accuracy but need custom kernels—KVCompose trades a small accuracy gap for deployability.
  • Value-norm weighting is model-dependent and can hurt some models (LLaMA-3.1-8B showed degradation).

When Not To Use

  • When you can change the inference engine and want the highest possible accuracy: unstructured KVzip may be better.
  • Very short contexts where KV memory is not a bottleneck.
  • When you cannot collect representative attention signals for scoring (no task set).

Failure Modes

  • Misestimating token importance due to an unrepresentative task set T, leading to important-token eviction.
  • Head score variability causing inconsistent composite alignment if the head-mean stabilization is omitted.
  • Model-specific quirks: value-norm weighting can sometimes harm accuracy depending on the model.
  • Edge cases where rigid tensor alignment still loses information compared to fully unstructured per-head storage.

Core Entities

Models

  • LLaMA-3.1-8B
  • Qwen2.5-7B-Instruct
  • Qwen3-14B

Metrics

  • Accuracy
  • compression ratio r
  • max compression at tolerance ϵ0

Datasets

  • Ruler-4096
  • kvpress (tooling used for attention patching, not a dataset)

Benchmarks

  • Ruler-4096