Overview
The paper provides controlled synthetic evidence and metrics linking GER and head consensus to failures, but the results are limited to synthetic data and five model checkpoints, so real-world validation is needed before its deployment guidance can be generalized.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 0/3
Reproducibility
Status: No open assets linked
Open source: Partial
License: CC-BY 4.0
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 70%
Why It Matters For Business
Aggressive KV compression can save large memory but induces a sharp risk of hallucination; production systems must preserve routing paths, not just token counts.
Summary TLDR
The paper reframes key-value (KV) cache compression as a structural disturbance of attention routing, not just memory removal. Using controlled synthetic datasets and new metrics (Global Eviction Ratio, head consensus), the authors show that: moderate pruning often leaves accuracy intact despite degraded internal representations; aggressive pruning (~90% removal) triggers a sharp hallucination cliff tied to global eviction of answer tokens; and different model families (LLaMA vs Qwen) allocate routing across depth differently, producing distinct robustness profiles. Practical takeaway: safe KV compression must preserve routing paths, not just token counts.
Problem Statement
KV cache compression methods report large memory savings, but common evaluations focus on aggregate accuracy and ignore whether compressed caches still allow the model to route evidence to the decoder. The paper asks: does compression merely remove redundant storage, or does it break the token-level routing paths that attention needs to access evidence during generation?
Main Contribution
A physics-inspired framing that treats KV compression as a perturbation to attention-based routing, distinguishing storage, accessibility, and utilization.
A controlled synthetic dataset suite (Base, Knowledge manipulation, Multi presence/entity, Coreference, Long context, Hops) designed to probe routing-sensitive failures.
Key Findings
Moderate KV compression degrades internal representations but often leaves task accuracy near baseline.
There is a sharp hallucination 'safety cliff' near 90% compression, tied to global erasure of answer tokens.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Base F1 (example) | LLaMA-3 8B AGN: 70.05; Qwen-2.5 7B AGN: 68.80 | 0% compression | — | Base task (Table 2) | Table 2 reports base F1 per model and setup | Table 2 |
| Compression safety cliff | Hallucination spike near 90% compression | behavior below ~80% compression | — | Aggregate across synthetic suite | Figures 21–24 show error rates and GER correlation | Figures 22, 24 |
What To Try In 7 Days
Measure GER on your workloads to estimate route deletion risk under your pruning policy
Run question-aware and question-agnostic pruning on representative queries to compare robustness
Prefer head- and depth-aware pruning (preserve cross-head diversity) before pushing ≥80% KV reduction
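The GER measurement in the first item can be sketched as follows. This is a minimal sketch, assuming GER is the fraction of answer-token positions that no attention head retains after pruning; the paper's exact definition may differ. The function name `global_eviction_ratio` and the per-head kept-position sets are hypothetical inputs that you would export from your own pruning policy.

```python
def global_eviction_ratio(kept_per_head, answer_positions):
    """Fraction of answer positions that survive in no head's cache.

    ASSUMPTION: this is one plausible reading of "Global Eviction Ratio",
    not the paper's confirmed formula.

    kept_per_head: iterable of collections, one per attention head,
        holding the token positions that head keeps after pruning.
    answer_positions: token positions that carry the answer evidence.
    """
    answer_positions = list(answer_positions)
    if not answer_positions:
        raise ValueError("answer_positions must not be empty")

    # A position is globally evicted only if every head dropped it.
    kept_anywhere = set().union(*kept_per_head)
    evicted = [p for p in answer_positions if p not in kept_anywhere]
    return len(evicted) / len(answer_positions)


# Toy example: three heads, answer evidence at positions 5, 6 and 7.
kept = [{0, 1, 5}, {0, 2, 6}, {0, 1, 2}]
ger = global_eviction_ratio(kept, [5, 6, 7])
print(f"GER = {ger:.2f}")
```

In the toy example only position 7 is dropped by every head, so one of three answer positions is globally evicted. Tracking this value while sweeping the compression ratio on representative queries gives a rough view of how close a pruning policy is to deleting the answer entirely.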
Risks & Boundaries
Limitations
Synthetic datasets trade realism for causal control; real-world text may behave differently.
Experiments focus on the LLaMA and Qwen families at 3B–14B scale; results may differ for much larger models or different architectures.
When Not To Use
When you cannot tolerate any hallucination risk at extreme compression (e.g., legal or safety-critical text generation)
If your workload relies heavily on multi-hop relational reasoning or distributed bridging tokens
Failure Modes
Representational erasure: answer tokens are globally evicted across heads
Representational rigidity: tokens survive, but head-level consensus collapses the model's ability to reroute
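The two failure modes above can be told apart with a rough diagnostic. This is a minimal sketch, assuming head consensus is measured as the mean pairwise Jaccard overlap of the token sets that each head keeps; that definition, the function names, and the `consensus_threshold` value of 0.9 are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def head_consensus(kept_per_head):
    """Mean pairwise Jaccard overlap of per-head kept-token sets.

    ASSUMPTION: a stand-in for the paper's head-consensus metric.
    """
    sets = [set(k) for k in kept_per_head]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 1.0  # zero or one head: nothing to disagree with
    scores = []
    for a, b in pairs:
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


def label_failure_mode(kept_per_head, answer_positions, consensus_threshold=0.9):
    """Label a pruned cache as 'erasure', 'rigidity-risk' or 'ok'.

    consensus_threshold is a hypothetical cut-off chosen for illustration.
    """
    sets = [set(k) for k in kept_per_head]
    kept_anywhere = set().union(*sets)
    # Erasure: no head retains any answer token.
    if not any(p in kept_anywhere for p in answer_positions):
        return "erasure"
    # Rigidity risk: tokens survive, but all heads keep nearly the same set.
    if head_consensus(sets) >= consensus_threshold:
        return "rigidity-risk"
    return "ok"


print(label_failure_mode([{0, 1}, {0, 2}], [7]))
print(label_failure_mode([{0, 1, 7}, {0, 1, 7}], [7]))
print(label_failure_mode([{0, 7}, {3, 4}], [7]))
```

The labels are only a screening aid: a cache flagged as "rigidity-risk" still holds the answer tokens, so whether generation actually fails has to be confirmed on task outputs.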

