KV-cache compression breaks attention routing: reachability, a 90% safety cliff, and two failure modes

Overview

Decision Snapshot: Needs Validation

The paper gives controlled synthetic evidence and metrics linking GER and head consensus to failures. Results are limited to synthetic data and five model checkpoints, so the deployment guidance needs real-world validation before it can be generalized.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/3

Reproducibility

Status: No open assets linked

Open source: Partial

License: CC-BY 4.0

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 70%

Authors

Samhruth Ananthanarayanan, Ayan Sengupta, Tanmoy Chakraborty

Why It Matters For Business

Aggressive KV compression can save large memory but induces a sharp risk of hallucination; production systems must preserve routing paths, not just token counts.

Who Should Care

ML Engineer, Engineering Lead, CTO, Product Manager

Summary TLDR

The paper reframes key-value (KV) cache compression as a structural disturbance of attention routing, not just memory removal. Using controlled synthetic datasets and new metrics (Global Eviction Ratio, head consensus), the authors show that: moderate pruning often leaves accuracy intact despite degraded internal representations; aggressive pruning (~90% removal) triggers a sharp hallucination cliff tied to global eviction of answer tokens; and different model families (LLaMA vs Qwen) allocate routing across depth differently, producing distinct robustness profiles. Practical takeaway: safe KV compression must preserve routing paths, not just token counts.

Problem Statement

KV cache compression methods report large memory savings, but common evaluations focus on aggregate accuracy and ignore whether compressed caches still allow the model to route evidence to the decoder. The paper asks: does compression merely remove redundant storage, or does it break the token-level routing paths that attention needs to access evidence during generation?

Main Contribution

A physics-inspired framing that treats KV compression as a perturbation to attention-based routing, distinguishing storage, accessibility, and utilization.

A controlled synthetic dataset suite (Base, Knowledge manipulation, Multi presence/entity, Coreference, Long context, Hops) designed to probe routing-sensitive failures.

Key Findings

Moderate KV compression degrades internal representations but often leaves task accuracy near baseline.

Numbers: F1 ~70 at 0% compression; many tasks stable until ~40% compression (Table 2, Figures 2–3)

Practical Use: You can trim KV memory moderately (tens of percent) without big accuracy loss, but internal features weaken, so probes and downstream tasks sensitive to representation quality may break earlier than accuracy suggests.

Evidence Ref: Table 2; Figures 2–3

There is a sharp hallucination 'safety cliff' at roughly 90% compression, tied to global erasure of answer tokens.

Numbers: Hallucination rates spike around α ≈ 0.9; GER correlates strongly with hallucination (Figure 22, Figure 24)

Practical Use: Avoid extreme KV eviction policies that remove ~90% of entries unless you can guarantee preservation of at least one head-wise route to answer tokens; otherwise hallucinations become common.

Evidence Ref: Figures 22, 24
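The summary does not give the paper's formula for GER, so the following is only a minimal sketch under an assumed definition: the fraction of answer-token positions that no attention head retains after compression. The function name, the `keep_masks` layout, and the toy masks are all hypothetical and are not taken from the paper or from KVPress.

```python
from typing import Sequence


def global_eviction_ratio(keep_masks: Sequence[Sequence[bool]],
                          answer_positions: Sequence[int]) -> float:
    """Fraction of answer positions that no head retains.

    ASSUMED definition of GER for illustration only.
    keep_masks[h][t] is True when head h still holds token t after compression.
    """
    if not answer_positions:
        raise ValueError("answer_positions must not be empty")
    globally_evicted = sum(
        1 for t in answer_positions
        if not any(mask[t] for mask in keep_masks)
    )
    return globally_evicted / len(answer_positions)


# Toy example: 3 heads, 6 cached tokens, answer tokens at positions 4 and 5.
masks = [
    [True, False, True, False, False, False],
    [True, True, False, False, True, False],   # only this head keeps token 4
    [False, True, True, False, False, False],
]
ger = global_eviction_ratio(masks, [4, 5])
print(ger)  # token 5 is gone from every head, token 4 survives in one head
```

Under this assumed definition, a token counts as evicted only when every head drops it, which matches the "at least one head-wise route" framing in the practical-use note above.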

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Base F1 (example)	LLaMA-3 8B AGN: 70.05; Qwen-2.5 7B AGN: 68.80	0% compression	—	Base task (Table 2)	Table 2 reports base F1 per model and setup	Table 2
Compression safety cliff	Hallucination spike near 90% compression	behavior below ~80% compression	—	Aggregate across synthetic suite	Figures 21–24 show error rates and GER correlation	Figures 22, 24

What To Try In 7 Days

Measure GER on your workloads to estimate route deletion risk under your pruning policy

Run question-aware and question-agnostic pruning on representative queries to compare robustness

Prefer head- and depth-aware pruning (preserve cross-head diversity) before pushing ≥80% KV reduction
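One way to read the head-aware suggestion is as a guard on top of ordinary per-head top-k eviction: after pruning, make sure every protected position survives in at least one head. The sketch below is an illustration under stated assumptions, not the paper's method and not the API of AdaKV, FINCH, or KVPress. The `scores` input (for example, accumulated attention mass per head and token), the `protected` list, and the swap rule are hypothetical.

```python
from typing import List, Sequence


def evict_with_route_guard(scores: Sequence[Sequence[float]],
                           alpha: float,
                           protected: Sequence[int]) -> List[List[bool]]:
    """Per-head top-k eviction that keeps >= 1 head-wise route to protected tokens.

    scores[h][t]: hypothetical importance of token t for head h.
    alpha: fraction of entries evicted per head (0.9 keeps about 10%).
    Returns keep masks with the same [head][token] layout as scores.
    """
    n_tokens = len(scores[0])
    budget = max(1, round(n_tokens * (1.0 - alpha)))

    # Step 1: plain per-head top-k by score.
    masks: List[List[bool]] = []
    for head_scores in scores:
        ranked = sorted(range(n_tokens), key=lambda t: head_scores[t], reverse=True)
        kept = set(ranked[:budget])
        masks.append([t in kept for t in range(n_tokens)])

    # Step 2: restore a route for any protected token that every head dropped.
    protected_set = set(protected)
    for t in protected:
        if any(mask[t] for mask in masks):
            continue  # some head still holds t
        h = max(range(len(scores)), key=lambda i: scores[i][t])
        # Swap out the weakest kept, unprotected token so the budget is unchanged.
        # If the head holds only protected tokens, it goes one entry over budget.
        victims = [u for u in range(n_tokens) if masks[h][u] and u not in protected_set]
        if victims:
            masks[h][min(victims, key=lambda u: scores[h][u])] = False
        masks[h][t] = True
    return masks


# Toy example: 2 heads, 10 tokens, 90% eviction, token 7 must stay reachable.
scores = [
    [0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.1, 0.1],
    [0.1, 0.8, 0.1, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1, 0.1],
]
masks = evict_with_route_guard(scores, alpha=0.9, protected=[7])
```

In practice the answer positions are unknown at pruning time, so `protected` would have to come from a heuristic such as question-aware scoring; that choice is outside what this summary specifies.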

Optimization Features

Token Efficiency

context compression via eviction and chunking

Infra Optimization

KVPress pipeline for unified compression experiments

Model Optimization

architecture-aware pruning

System Optimization

IO-aware kernels (FlashAttention) mentioned for efficiency

Training Optimization

training objectives to encourage cross-head redundancy (proposed future work)

Inference Optimization

KV cache pruning (AdaKV, FINCH), KV quantization

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: CC-BY 4.0

Code URLs

https://github.com/NVIDIA/kvpress/

Risks & Boundaries

Limitations

Synthetic datasets trade realism for causal control; real-world text may behave differently.

Experiments focus on LLaMA and Qwen families and 3–14B scales; results may vary for much larger or different architectures.

When Not To Use

When you cannot tolerate any hallucination risk at extreme compression (e.g., legal or safety-critical text generation)

If your workload relies heavily on multi-hop relational reasoning or distributed bridging tokens

Failure Modes

Representational erasure: answer tokens are globally evicted across heads

Representational rigidity: tokens survive, but head-level consensus collapses the model's capacity to reroute
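The two failure modes can be told apart from keep masks alone, at least in a toy setting. The sketch below assumes that head-level consensus can be approximated by the mean pairwise Jaccard overlap of the kept-token sets, and that rigidity corresponds to heads agreeing on nearly the same set. Both the consensus definition and the `consensus_threshold` value are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from typing import Sequence


def head_consensus(keep_masks: Sequence[Sequence[bool]]) -> float:
    """Mean pairwise Jaccard overlap of kept-token sets (ASSUMED definition)."""
    kept = [{t for t, k in enumerate(mask) if k} for mask in keep_masks]
    pairs = list(combinations(kept, 2))
    if not pairs:
        return 1.0  # a single head trivially agrees with itself
    overlaps = [len(a & b) / len(a | b) if (a | b) else 1.0 for a, b in pairs]
    return sum(overlaps) / len(overlaps)


def diagnose(keep_masks: Sequence[Sequence[bool]],
             answer_positions: Sequence[int],
             consensus_threshold: float = 0.9) -> str:
    """Label a compressed cache as 'erasure', 'rigidity', or 'ok'."""
    # Erasure: at least one answer token is held by no head at all.
    if any(not any(mask[t] for mask in keep_masks) for t in answer_positions):
        return "erasure"
    # Rigidity (assumed reading): tokens survive, but heads keep almost the
    # same set, leaving little diversity to reroute through.
    if head_consensus(keep_masks) >= consensus_threshold:
        return "rigidity"
    return "ok"


# Toy masks: 2 heads, 3 tokens, answer token at position 2.
erased = [[True, False, False], [True, False, False]]
rigid = [[True, False, True], [True, False, True]]
diverse = [[True, False, True], [False, True, True]]
labels = [diagnose(m, [2]) for m in (erased, rigid, diverse)]
print(labels)
```

A threshold like 0.9 is only a placeholder; it would need calibration against observed hallucination rates on a real workload.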

Core Entities

Models

LLaMA-3.2 3B Instruct, LLaMA-3 8B Instruct, Qwen-2.5 3B Instruct, Qwen-2.5 7B Instruct, Qwen-2.5 14B Instruct

Metrics

F1, Global Eviction Ratio (GER), Eviction Rate, Head-level consensus, Hallucination rate, Probing macro-F1

Datasets

Synthetic suite (Base, Knowledge manipulation, Multi presence, Multi entity, Long context, Coreference, Hops)

Benchmarks

LongBench, RULER, InfiniteBench

Context Entities

Models

GQA / MQA KV-sharing mechanisms (discussed), other efficient attention (Longformer, Linformer, Performer) in background

Metrics

Perplexity (background), Latency/Memory (background)

Datasets

LongBench, RULER referenced as prior benchmarks

Benchmarks

LongBench, RULER
