KV-cache compression breaks attention routing: reachability, a 90% safety cliff, and two failure modes

March 2, 2026 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper gives controlled synthetic evidence and metrics linking GER and head consensus to failures, but the results are limited to synthetic data and five model checkpoints, so real-world validation is needed before its deployment guidance can be generalized.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/3

Reproducibility

Status: No open assets linked

Open source: Partial

License: CC-BY 4.0

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 70%

Authors

Samhruth Ananthanarayanan, Ayan Sengupta, Tanmoy Chakraborty

Why It Matters For Business

Aggressive KV compression can save substantial memory but sharply raises the risk of hallucination; production systems must preserve routing paths, not just token counts.

Summary TLDR

The paper reframes key-value (KV) cache compression as a structural disturbance of attention routing, not just memory removal. Using controlled synthetic datasets and new metrics (Global Eviction Ratio, head consensus), the authors show that: moderate pruning often leaves accuracy intact despite degraded internal representations; aggressive pruning (~90% removal) triggers a sharp hallucination cliff tied to global eviction of answer tokens; and different model families (LLaMA vs Qwen) allocate routing across depth differently, producing distinct robustness profiles. Practical takeaway: safe KV compression must preserve routing paths, not just token counts.

Problem Statement

KV cache compression methods report large memory savings, but common evaluations focus on aggregate accuracy and ignore whether compressed caches still allow the model to route evidence to the decoder. The paper asks: does compression merely remove redundant storage, or does it break the token-level routing paths that attention needs to access evidence during generation?

Main Contribution

A physics-inspired framing that treats KV compression as a perturbation to attention-based routing, distinguishing storage, accessibility, and utilization.

A controlled synthetic dataset suite (Base, Knowledge manipulation, Multi presence/entity, Coreference, Long context, Hops) designed to probe routing-sensitive failures.

Key Findings

Moderate KV compression degrades internal representations but often leaves task accuracy near baseline.

Numbers: F1 ~70 at 0% compression; many tasks stable until ~40% compression (Table 2, Figures 2–3)

Practical Use: You can trim KV memory moderately (tens of percent) without a large accuracy loss, but internal features weaken, so probes and downstream tasks sensitive to representation quality may break earlier than accuracy suggests.

Evidence Ref: Table 2; Figures 2–3
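To put "tens of percent" into memory terms, the following is a back-of-the-envelope sketch, not a figure from the paper. It uses the standard KV cache size formula (keys plus values, per layer, per KV head). The configuration values (32 layers, 8 KV heads, head dimension 128, 16-bit values) are assumptions chosen to resemble an 8B-class model with grouped-query attention; check your own model's config. It also assumes batch size 1 and uniform eviction across layers and heads.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Assumed configuration (illustrative, not taken from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
context_len = 8192
compression = 0.4  # fraction of cached tokens evicted

full = kv_cache_bytes(context_len, LAYERS, KV_HEADS, HEAD_DIM)
kept_tokens = round(context_len * (1 - compression))
compressed = kv_cache_bytes(kept_tokens, LAYERS, KV_HEADS, HEAD_DIM)

print(f"full cache: {full / 2**20:.0f} MiB")
print(f"after evicting {compression:.0%} of tokens: {compressed / 2**20:.0f} MiB")
```

The saving scales linearly with the number of evicted tokens, which is why accuracy alone is a tempting but incomplete yardstick for how far to push the ratio.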

There is a sharp hallucination 'safety cliff' near ~90% compression tied to global erasure of answer tokens.

Numbers: Hallucination rates spike around α ≈ 0.9; GER correlates strongly with hallucination (Figure 22, Figure 24)

Practical Use: Avoid extreme KV eviction policies that remove ~90% of entries unless you can guarantee preservation of at least one head-wise route to answer tokens; otherwise hallucinations become common.

Evidence Ref: Figures 22, 24
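The summary does not give the paper's exact formula for the Global Eviction Ratio, so the sketch below uses an assumed working definition: the fraction of answer tokens that are evicted from every attention head. The function name, the per-head set representation, and the toy data are all hypothetical.

```python
from typing import Sequence, Set


def global_eviction_ratio(
    kept_per_head: Sequence[Set[int]],
    answer_positions: Sequence[int],
) -> float:
    """Fraction of answer tokens that no head retains (assumed GER definition)."""
    if not answer_positions:
        raise ValueError("answer_positions must not be empty")
    globally_evicted = [
        pos
        for pos in answer_positions
        if not any(pos in kept for kept in kept_per_head)
    ]
    return len(globally_evicted) / len(answer_positions)


# Toy example: three heads over a 10-token context, answer tokens at 4 and 5.
kept = [{0, 1, 4}, {0, 2, 9}, {1, 3, 9}]
ger = global_eviction_ratio(kept, [4, 5])
print(ger)  # token 4 survives in head 0, token 5 survives nowhere
```

Under this definition a token counts as lost only when every head drops it, which matches the "at least one head-wise route" condition in the practical guidance above.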

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Base F1 (example) | LLaMA-3 8B AGN: 70.05; Qwen-2.5 7B AGN: 68.80 | 0% compression | not reported | Base task (Table 2) | Table 2 reports base F1 per model and setup | Table 2
Compression safety cliff | Hallucination spike near 90% compression | Behavior below ~80% compression | not reported | Aggregate across synthetic suite | Figures 21–24 show error rates and GER correlation | Figures 22, 24

What To Try In 7 Days

Measure GER on your workloads to estimate route deletion risk under your pruning policy

Run question-aware and question-agnostic pruning on representative queries to compare robustness

Prefer head- and depth-aware pruning (preserve cross-head diversity) before pushing ≥80% KV reduction
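One way to read the head-aware suggestion above is as a guard on top of ordinary per-head top-k eviction: after pruning, check that every token you care about still survives in at least one head, and restore it where it scores highest if not. The sketch below is an illustration under that assumption, not the paper's method or the API of any pruning library; `prune_with_route_guard`, the score layout, and the protected set are hypothetical.

```python
from typing import List, Sequence, Set


def prune_with_route_guard(
    scores: Sequence[Sequence[float]],  # scores[head][token], higher = more important
    keep_per_head: int,
    protected: Sequence[int],
) -> List[Set[int]]:
    """Per-head top-k pruning that keeps every protected token in at least one head."""
    kept: List[Set[int]] = []
    for head_scores in scores:
        ranked = sorted(range(len(head_scores)), key=lambda t: head_scores[t], reverse=True)
        kept.append(set(ranked[:keep_per_head]))

    protected_set = set(protected)
    for tok in protected:
        if any(tok in head_kept for head_kept in kept):
            continue  # at least one route already survives
        # Restore the token in the head that scores it highest.
        best_head = max(range(len(scores)), key=lambda h: scores[h][tok])
        # Swap out that head's weakest unprotected token so the budget is unchanged.
        victims = [t for t in kept[best_head] if t not in protected_set]
        if victims:
            weakest = min(victims, key=lambda t: scores[best_head][t])
            kept[best_head].remove(weakest)
        kept[best_head].add(tok)
    return kept


# Toy example: two heads, five tokens, budget of two tokens per head.
scores = [
    [0.9, 0.8, 0.1, 0.20, 0.3],
    [0.7, 0.1, 0.6, 0.25, 0.4],
]
kept = prune_with_route_guard(scores, keep_per_head=2, protected=[3])
print(kept)
```

In the toy example token 3 misses the top two in both heads, so the guard swaps it into head 1 in place of that head's weakest kept token. Which tokens to protect is the hard part in practice; a question-aware policy can use the query to choose them, while a question-agnostic one cannot.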

Optimization Features

Token Efficiency
context compression via eviction and chunking
Infra Optimization
KVPress pipeline for unified compression experiments
Model Optimization
architecture-aware pruning
System Optimization
IO-aware kernels (FlashAttention) mentioned for efficiency
Training Optimization
training objectives to encourage cross-head redundancy (proposed future work)
Inference Optimization
KV cache pruning (AdaKV, FINCH); KV quantization

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: CC-BY 4.0

Risks & Boundaries

Limitations

Synthetic datasets trade realism for causal control; real-world text may behave differently.

Experiments focus on LLaMA and Qwen families and 3–14B scales; results may vary for much larger or different architectures.

When Not To Use

When you cannot tolerate any hallucination risk at extreme compression (e.g., legal or safety-critical text generation)

If your workload relies heavily on multi-hop relational reasoning or distributed bridging tokens

Failure Modes

Representational erasure: answer tokens are globally evicted across heads

Representational rigidity: tokens survive, but head-level consensus collapses the model's ability to reroute
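The two failure modes can be told apart with a small diagnostic, sketched here under loud assumptions. Head consensus is approximated as the mean pairwise Jaccard overlap of the token sets each head keeps, and the 0.8 threshold is an arbitrary placeholder; neither is taken from the paper, and both function names are hypothetical.

```python
from itertools import combinations
from typing import Sequence, Set


def head_consensus(kept_per_head: Sequence[Set[int]]) -> float:
    """Mean pairwise Jaccard overlap of kept-token sets (assumed consensus proxy)."""
    pairs = list(combinations(kept_per_head, 2))
    if not pairs:
        raise ValueError("need at least two heads")

    def jaccard(a: Set[int], b: Set[int]) -> float:
        union = a | b
        return len(a & b) / len(union) if union else 1.0

    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def diagnose(
    kept_per_head: Sequence[Set[int]],
    answer_pos: int,
    consensus_threshold: float = 0.8,  # placeholder value, tune on your own data
) -> str:
    """Label a compressed cache as 'erasure', 'rigidity-risk', or 'ok'."""
    if not any(answer_pos in kept for kept in kept_per_head):
        return "erasure"  # the answer token is gone from every head
    if head_consensus(kept_per_head) >= consensus_threshold:
        return "rigidity-risk"  # token survives, but heads keep nearly identical sets
    return "ok"


print(diagnose([{0, 1}, {0, 2}, {1, 2}], answer_pos=5))
print(diagnose([{0, 1, 5}, {0, 1, 5}, {0, 1, 5}], answer_pos=5))
print(diagnose([{0, 5}, {1, 2}, {3, 4}], answer_pos=5))
```

The erasure check comes first because consensus is only informative once the answer token has survived in at least one head.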

Core Entities

Models

LLaMA-3.2 3B Instruct, LLaMA-3 8B Instruct, Qwen-2.5 3B Instruct, Qwen-2.5 7B Instruct, Qwen-2.5 14B Instruct

Metrics

F1, Global Eviction Ratio (GER), Eviction Rate, Head-level consensus, Hallucination rate, Probing macro-F1

Datasets

Synthetic suite (Base, Knowledge manipulation, Multi presence, Multi entity, Long context, Coreferen

Benchmarks

LongBench, RULER, InfiniteBench

Context Entities

Models

GQA / MQA KV-sharing mechanisms (discussed); other efficient attention (Longformer, Linformer, Performer) in background

Metrics

Perplexity (background), Latency/Memory (background)

Datasets

LongBench, RULER referenced as prior benchmarks

Benchmarks

LongBench, RULER