Overview
Production Readiness
0.4
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Aggressive KV cache compression can save substantial memory, but it sharply raises hallucination risk; production systems must check that routing paths survive, not just how many tokens are kept.
Summary TLDR
The paper reframes key-value (KV) cache compression as a structural disturbance of attention routing, not just memory removal. Using controlled synthetic datasets and new metrics (Global Eviction Ratio, head consensus), the authors show that: moderate pruning often leaves accuracy intact despite degraded internal representations; aggressive pruning (~90% removal) triggers a sharp hallucination cliff tied to global eviction of answer tokens; and different model families (LLaMA vs Qwen) allocate routing across depth differently, producing distinct robustness profiles. Practical takeaway: safe KV compression must preserve routing paths, not just token counts.
Problem Statement
KV cache compression methods report large memory savings, but common evaluations focus on aggregate accuracy and ignore whether compressed caches still allow the model to route evidence to the decoder. The paper asks: does compression merely remove redundant storage, or does it break the token-level routing paths that attention needs to access evidence during generation?
Main Contribution
A physics-inspired framing that treats KV compression as a perturbation to attention-based routing, distinguishing storage, accessibility, and utilization.
A controlled synthetic dataset suite (Base, Knowledge manipulation, Multi presence/entity, Coreference, Long context, Hops) designed to probe routing-sensitive failures.
New structural metrics: Global Eviction Ratio (GER) for task-aware route deletion and head-level consensus for routing flexibility.
Empirical discovery of two failure modes (representational erasure and representational rigidity) and a hallucination "safety cliff" near 90% compression that appears across the evaluated models.
Layerwise analysis showing architectural differences: LLaMA exhibits early consensus then late diversification, Qwen shows early exploration then late consolidation, implying architecture-aware pruning is necessary.
Key Findings
Moderate KV compression degrades internal representations but often leaves task accuracy near baseline.
There is a sharp hallucination "safety cliff" near 90% compression, tied to global erasure of answer tokens.
Two distinct failure modes occur: (i) representational erasure, when answer tokens are evicted, and (ii) representational rigidity, when tokens survive but head-level consensus removes the flexibility to reroute.
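The two failure modes can be told apart from per-head keep masks alone. The sketch below is a minimal illustration under assumptions, not the paper's implementation: the summary does not give exact formulas, so GER is taken here to be the share of answer tokens that no head retains, and head consensus the mean pairwise Jaccard overlap of the heads' kept-token sets. The function names and both thresholds are hypothetical.

```python
from itertools import combinations


def global_eviction_ratio(keep_masks, answer_positions):
    """Assumed definition: share of answer tokens that no head retains."""
    if not answer_positions:
        raise ValueError("answer_positions must not be empty")
    lost = sum(
        1 for pos in answer_positions
        if not any(mask[pos] for mask in keep_masks)
    )
    return lost / len(answer_positions)


def head_consensus(keep_masks):
    """Assumed definition: mean pairwise Jaccard overlap of kept-token sets."""
    kept = [{i for i, keep in enumerate(mask) if keep} for mask in keep_masks]
    pairs = list(combinations(kept, 2))
    if not pairs:
        return 1.0  # a single head trivially agrees with itself
    overlaps = []
    for a, b in pairs:
        union = a | b
        overlaps.append(len(a & b) / len(union) if union else 1.0)
    return sum(overlaps) / len(overlaps)


def classify(keep_masks, answer_positions,
             ger_threshold=0.5, consensus_threshold=0.9):
    """Label a compressed cache with the failure mode it is exposed to.

    Both thresholds are hypothetical and would need tuning per workload.
    """
    if global_eviction_ratio(keep_masks, answer_positions) >= ger_threshold:
        return "representational erasure"
    if head_consensus(keep_masks) >= consensus_threshold:
        return "representational rigidity"
    return "no structural warning"


# Toy cache: 3 heads over 6 token positions; the answer spans positions 4 and 5.
erased = [[1, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0], [0, 1, 1, 0, 0, 0]]
rigid = [[1, 0, 0, 0, 1, 1]] * 3
diverse = [[1, 0, 0, 0, 1, 0], [0, 1, 0, 0, 0, 1], [0, 0, 1, 0, 1, 1]]

for name, masks in [("erased", erased), ("rigid", rigid), ("diverse", diverse)]:
    print(name, "->", classify(masks, answer_positions=[4, 5]))
```

In the toy data, the "rigid" cache keeps every answer token, yet all heads keep exactly the same positions, which is the situation the summary describes as surviving tokens with collapsed routing flexibility.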
Results
Base F1 (example)
Compression safety cliff
Task robustness range
Who Should Care
What To Try In 7 Days
Measure GER on your workloads to estimate route deletion risk under your pruning policy
Run question-aware and question-agnostic pruning on representative queries to compare robustness
Prefer head- and depth-aware pruning (preserve cross-head diversity) before pushing ≥80% KV reduction
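A first pass at these checks could sweep the compression ratio and watch when answer tokens disappear from every head. The following is a rough sketch under stated assumptions: per-head top-k selection by an importance score stands in for a real pruning policy (it is not AdaKV or FINCH), the scores are made up, and GER is again assumed to be the share of answer tokens kept by no head. `top_k_mask` and `ger_sweep` are hypothetical helper names.

```python
def top_k_mask(scores, keep_count):
    """Keep the keep_count highest-scoring positions of one head."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:keep_count])
    return [i in kept for i in range(len(scores))]


def ger_sweep(head_scores, answer_positions, ratios):
    """For each compression ratio, report the share of answer tokens no head keeps."""
    n_tokens = len(head_scores[0])
    results = {}
    for ratio in ratios:
        # Always keep at least one token per head.
        keep_count = max(1, round(n_tokens * (1 - ratio)))
        masks = [top_k_mask(scores, keep_count) for scores in head_scores]
        lost = sum(
            1 for pos in answer_positions
            if not any(mask[pos] for mask in masks)
        )
        results[ratio] = lost / len(answer_positions)
    return results


# Hypothetical per-head importance scores for 10 cached tokens (2 heads).
head_scores = [
    [0.9, 0.1, 0.2, 0.8, 0.3, 0.1, 0.7, 0.2, 0.1, 0.6],
    [0.8, 0.2, 0.1, 0.3, 0.9, 0.1, 0.2, 0.6, 0.1, 0.7],
]
answer_positions = [6, 7]  # where the evidence for the answer sits

sweep = ger_sweep(head_scores, answer_positions, [0.5, 0.7, 0.9])
for ratio, ger in sweep.items():
    print(f"compression {ratio:.0%}: GER {ger:.2f}")
```

On real workloads the scores would come from the pruning method under test, and the same sweep could be run once with question-aware scores and once with question-agnostic scores to compare the two.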
Optimization Features
Token Efficiency
- context compression via eviction and chunking
Infra Optimization
- KVPress pipeline for unified compression experiments
Model Optimization
- architecture-aware pruning
System Optimization
- IO-aware kernels (FlashAttention) mentioned for efficiency
Training Optimization
- training objectives to encourage cross-head redundancy (proposed future work)
Inference Optimization
- KV cache pruning (AdaKV, FINCH)
- KV quantization
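To put the memory savings from pruning and quantization in perspective, the size of a dense KV cache follows from simple arithmetic: one key vector and one value vector per layer, KV head, and token. The sketch below uses an illustrative configuration that is an assumption and not taken from the paper, and it ignores the small overhead of quantization scales and any bookkeeping a pruning method adds.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens,
                   bytes_per_value=2, batch_size=1):
    """Dense KV cache size: keys plus values for every layer, KV head, and token."""
    return (2 * n_layers * n_kv_heads * head_dim
            * n_tokens * bytes_per_value * batch_size)


# Illustrative configuration (assumed, not taken from the paper).
config = dict(n_layers=32, n_kv_heads=8, head_dim=128)
context = 32_768

full = kv_cache_bytes(n_tokens=context, **config)                      # 16-bit values
pruned = kv_cache_bytes(n_tokens=round(context * 0.1), **config)       # 90% of tokens evicted
int8 = kv_cache_bytes(n_tokens=context, bytes_per_value=1, **config)   # 8-bit values

for label, size in [("full fp16", full), ("90% evicted", pruned), ("int8", int8)]:
    print(f"{label:>12}: {size / 2**30:.2f} GiB")
```

The same formula shows why KV-sharing schemes such as GQA and MQA matter: they lower `n_kv_heads`, which shrinks the cache before any eviction takes place.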
Reproducibility
License
- CC-BY 4.0
Code Urls
Open Source Status
- partial
Risks & Boundaries
Limitations
- Synthetic datasets trade realism for causal control; real-world text may behave differently.
- Experiments focus on LLaMA and Qwen families and 3–14B scales; results may vary for much larger or different architectures.
- No public release of the synthetic dataset or experiment code is stated, limiting direct replication.
- The analysis omits systems with external memory or retrieval chains, which may change reachability dynamics.
When Not To Use
- When you cannot tolerate any hallucination risk at extreme compression (e.g., legal or safety-critical text generation)
- If your workload relies heavily on multi-hop relational reasoning or distributed bridging tokens
- When using models or architectures not evaluated here without additional validation
Failure Modes
- Representational erasure: answer tokens are globally evicted across heads
- Representational rigidity: tokens survive, but head-level consensus removes the flexibility to reroute
- Query-conditioned overcommitment: question-aware pruning can force premature commitments and increase confident errors
Core Entities
Models
- LLaMA-3.2 3B Instruct
- LLaMA-3 8B Instruct
- Qwen-2.5 3B Instruct
- Qwen-2.5 7B Instruct
- Qwen-2.5 14B Instruct
Metrics
- F1
- Global Eviction Ratio (GER)
- Eviction Rate
- Head-level consensus
- Hallucination rate
- Probing macro-F1
Datasets
- Synthetic suite (Base, Knowledge manipulation, Multi presence, Multi entity, Long context, Coreference, Hops)
Benchmarks
- LongBench
- RULER
- InfiniteBench
Context Entities
Models
- GQA / MQA KV-sharing mechanisms (discussed)
- Other efficient attention (Longformer, Linformer, Performer) in background
Metrics
- Perplexity (background)
- Latency/Memory (background)
Datasets
- LongBench, RULER referenced as prior benchmarks
Benchmarks
- LongBench
- RULER

