KV-cache compression breaks attention routing: reachability, a 90% safety cliff, and two failure modes

March 2, 2026 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Samhruth Ananthanarayanan, Ayan Sengupta, Tanmoy Chakraborty

Links

Abstract / PDF

Why It Matters For Business

Aggressive KV compression can save substantial memory, but it sharply raises hallucination risk; production systems must preserve routing paths, not just token counts.

Summary TLDR

The paper reframes key-value (KV) cache compression as a structural disturbance of attention routing, not just memory removal. Using controlled synthetic datasets and new metrics (Global Eviction Ratio, head consensus), the authors show that: moderate pruning often leaves accuracy intact despite degraded internal representations; aggressive pruning (~90% removal) triggers a sharp hallucination cliff tied to global eviction of answer tokens; and different model families (LLaMA vs Qwen) allocate routing across depth differently, producing distinct robustness profiles. Practical takeaway: safe KV compression must preserve routing paths, not just token counts.

Problem Statement

KV cache compression methods report large memory savings, but common evaluations focus on aggregate accuracy and ignore whether compressed caches still allow the model to route evidence to the decoder. The paper asks: does compression merely remove redundant storage, or does it break the token-level routing paths that attention needs to access evidence during generation?

Main Contribution

A physics-inspired framing that treats KV compression as a perturbation to attention-based routing, distinguishing storage, accessibility, and utilization.

A controlled synthetic dataset suite (Base, Knowledge manipulation, Multi presence/entity, Coreference, Long context, Hops) designed to probe routing-sensitive failures.

New structural metrics: Global Eviction Ratio (GER) for task-aware route deletion and head-level consensus for routing flexibility.

Empirical discovery of two failure modes (representational erasure and representational rigidity) and a universal hallucination "safety cliff" near 90% compression across models.

Layerwise analysis showing architectural differences: LLaMA exhibits early consensus then late diversification, Qwen shows early exploration then late consolidation, implying architecture-aware pruning is necessary.
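To make the Global Eviction Ratio concrete, here is a minimal Python sketch. The summary does not give the paper's exact formula, so the definition used here is an assumption: GER is taken to be the fraction of answer tokens that no attention head retains after compression. The function name, the per-head boolean keep masks, and the toy values are all hypothetical.

```python
from typing import Sequence


def global_eviction_ratio(
    keep_masks: Sequence[Sequence[bool]],
    answer_positions: Sequence[int],
) -> float:
    """Fraction of answer tokens evicted from every head.

    ASSUMED definition, not taken from the paper: a token counts as
    globally evicted only when no head kept it in its compressed cache.
    keep_masks[h][t] is True when head h kept token t.
    """
    if not answer_positions:
        raise ValueError("answer_positions must be non-empty")
    globally_evicted = 0
    for pos in answer_positions:
        # The token stays reachable if at least one head kept it.
        if not any(head_mask[pos] for head_mask in keep_masks):
            globally_evicted += 1
    return globally_evicted / len(answer_positions)


# Toy example: 3 heads, 6 context tokens, answer tokens at positions 4 and 5.
keep_masks = [
    [True, True, False, False, False, False],
    [True, False, True, False, True, False],
    [True, True, False, True, False, False],
]
print(global_eviction_ratio(keep_masks, [4, 5]))  # 0.5
```

In the toy example, position 4 survives in one head while position 5 is dropped everywhere, so half of the answer tokens are unreachable. A task-aware check of this kind needs the answer positions, which is what separates it from a plain eviction rate.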

Key Findings

Moderate KV compression degrades internal representations but often leaves task accuracy near baseline.

Numbers: F1 ~70 at 0% compression; many tasks stable until ~40% compression (Table 2, Figures 2–3)

There is a sharp hallucination "safety cliff" near 90% compression tied to global erasure of answer tokens.

Numbers: Hallucination rates spike around α ≈ 0.9; GER correlates strongly with hallucination (Figure 22, Figure 24)

Two distinct failure modes occur: (i) representational erasure when answer tokens are evicted, and (ii) representational rigidity when tokens survive but head consensus collapses routing flexibility.

Numbers: Instances show low GER but high error when head consensus is high; consensus trends differ by model family (Figures 16,

Results

Base F1 (example)

Value: LLaMA-3 8B AGN: 70.05; Qwen-2.5 7B AGN: 68.80

Baseline: 0% compression

Compression safety cliff

Value: Hallucination spike near 90% compression

Baseline: behavior below ~80% compression

Task robustness range

Value: Many tasks remain >90% F1 until ~40% compression (Knowledge manipulation)

Baseline: 0% compression

Who Should Care

What To Try In 7 Days

Measure GER on your workloads to estimate route deletion risk under your pruning policy

Run question-aware and question-agnostic pruning on representative queries to compare robustness

Prefer head- and depth-aware pruning (preserve cross-head diversity) before pushing ≥80% KV reduction
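One way to act on these steps is to sweep the compression ratio on your own queries and look for the steepest rise in hallucination rate. The sketch below is a hypothetical helper, not code from the paper. The sweep values are invented placeholders that only show the input shape; they are not measured results.

```python
from typing import List, Tuple


def find_cliff(sweep: List[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Locate the steepest rise in hallucination rate across a sweep.

    sweep holds (compression_ratio, hallucination_rate) pairs.
    Returns (ratio_before, ratio_after, slope) for the adjacent pair
    with the largest increase per unit of compression.
    """
    ordered = sorted(sweep)
    best = None
    for (a0, h0), (a1, h1) in zip(ordered, ordered[1:]):
        if a1 == a0:
            continue  # skip duplicate ratios to avoid dividing by zero
        slope = (h1 - h0) / (a1 - a0)
        if best is None or slope > best[2]:
            best = (a0, a1, slope)
    if best is None:
        raise ValueError("need at least two distinct compression ratios")
    return best


# Placeholder numbers for illustration only (NOT results from the paper).
toy_sweep = [(0.0, 0.02), (0.4, 0.03), (0.8, 0.06), (0.9, 0.40)]
before, after, _slope = find_cliff(toy_sweep)
print(before, after)  # 0.8 0.9
```

Using slope rather than raw difference keeps unevenly spaced sweeps comparable. A denser grid between the two returned ratios would then narrow down where the cliff sits for a given workload.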

Optimization Features

Token Efficiency

  • context compression via eviction and chunking

Infra Optimization

  • KVPress pipeline for unified compression experiments

Model Optimization

  • architecture-aware pruning

System Optimization

  • IO-aware kernels (FlashAttention) mentioned for efficiency

Training Optimization

  • training objectives to encourage cross-head redundancy (proposed future work)

Inference Optimization

  • KV cache pruning (AdaKV, FINCH)
  • KV quantization
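For readers new to cache pruning, the sketch below shows generic score-based eviction for a single head: keep the highest-scoring tokens and drop the rest. It is a toy illustration and does not reproduce AdaKV, FINCH, or the KVPress API. The function name and the importance scores are hypothetical.

```python
import math
from typing import List


def keep_top_tokens(scores: List[float], compression: float) -> List[bool]:
    """Return a keep mask that evicts a `compression` fraction of tokens.

    scores[t] is a per-token importance score for one head, for example
    accumulated attention weight (an assumption made for this sketch).
    At least one token is always kept.
    """
    if not 0.0 <= compression < 1.0:
        raise ValueError("compression must be in [0, 1)")
    n = len(scores)
    n_keep = max(1, n - math.floor(compression * n))
    # Rank positions by score, highest first; ties go to the earlier token.
    ranked = sorted(range(n), key=lambda i: (-scores[i], i))
    kept = set(ranked[:n_keep])
    return [i in kept for i in range(n)]


scores = [0.30, 0.05, 0.25, 0.10, 0.20, 0.10]
print(keep_top_tokens(scores, 0.5))
# [True, False, True, False, True, False]
```

Running this per head yields one keep mask per head. If every head ranks an answer token low, that token disappears from all caches at once, which is the global eviction scenario the summary describes.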

Reproducibility

License

  • CC-BY 4.0

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Synthetic datasets trade realism for causal control; real-world text may behave differently.
  • Experiments focus on LLaMA and Qwen families and 3–14B scales; results may vary for much larger or different architectures.
  • No public release of the synthetic dataset or experiment code is stated, limiting direct replication.
  • Analysis omits systems with external memory or retrieval chains which may change reachability dynamics.

When Not To Use

  • When you cannot tolerate any hallucination risk at extreme compression (e.g., legal or safety-critical text generation)
  • If your workload relies heavily on multi-hop relational reasoning or distributed bridging tokens
  • When using models or architectures not evaluated here without additional validation

Failure Modes

  • Representational erasure: answer tokens are globally evicted across heads
  • Representational rigidity: tokens survive but head-level consensus collapses rerouting
  • Query-conditioned overcommitment: question-aware pruning can force premature commitments and increase confident errors
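The first two failure modes can be told apart with a simple triage, sketched below in Python. Both the consensus measure (mean pairwise Jaccard overlap of the token sets each head keeps) and the thresholds are assumptions made for illustration; the summary does not specify how the paper computes head-level consensus. Query-conditioned overcommitment is not covered by this sketch.

```python
from itertools import combinations
from typing import List, Set


def head_consensus(kept_sets: List[Set[int]]) -> float:
    """Mean pairwise Jaccard overlap of the kept-token sets (ASSUMED metric).

    1.0 means every head kept the same tokens; 0.0 means no overlap.
    """
    pairs = list(combinations(kept_sets, 2))
    if not pairs:
        raise ValueError("need at least two heads")
    total = 0.0
    for a, b in pairs:
        union = a | b
        total += len(a & b) / len(union) if union else 1.0
    return total / len(pairs)


def triage(ger: float, consensus: float,
           ger_threshold: float = 0.5,
           consensus_threshold: float = 0.8) -> str:
    """Label a failing example; both thresholds are hypothetical defaults."""
    if ger >= ger_threshold:
        return "erasure"   # answer tokens were evicted from every head
    if consensus >= consensus_threshold:
        return "rigidity"  # tokens survive, but heads share one route
    return "other"


identical = [{0, 1, 2}, {0, 1, 2}, {0, 1, 2}]
diverse = [{0, 1}, {2, 3}, {4, 5}]
print(head_consensus(identical), head_consensus(diverse))  # 1.0 0.0
print(triage(ger=0.0, consensus=head_consensus(identical)))  # rigidity
```

Checking GER first reflects the ordering in the list above: if the answer tokens are gone everywhere, routing flexibility no longer matters. The thresholds would need calibrating against observed error rates for a given model.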

Core Entities

Models

  • LLaMA-3.2 3B Instruct
  • LLaMA-3 8B Instruct
  • Qwen-2.5 3B Instruct
  • Qwen-2.5 7B Instruct
  • Qwen-2.5 14B Instruct

Metrics

  • F1
  • Global Eviction Ratio (GER)
  • Eviction Rate
  • Head-level consensus
  • Hallucination rate
  • Probing macro-F1

Datasets

  • Synthetic suite (Base, Knowledge manipulation, Multi presence, Multi entity, Long context, Coreferen

Benchmarks

  • LongBench
  • RULER
  • InfiniteBench

Context Entities

Models

  • GQA / MQA KV-sharing mechanisms (discussed)
  • Other efficient attention (Longformer, Linformer, Performer) in background

Metrics

  • Perplexity (background)
  • Latency/Memory (background)

Datasets

  • LongBench, RULER referenced as prior benchmarks

Benchmarks

  • LongBench
  • RULER