Overview
Empirical results on multiple Llama and Gemma variants show strong practical gains, but tests are limited to ≤8B models and synthetic long-context tasks.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 3/6
Findings with evidence refs: 6/6
Results with explicit delta: 2/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
Cut KV cache memory during decoding by half or more without retraining, while staying compatible with FlashAttention. This lowers inference cost and eases deployment on bandwidth-limited hardware.
Summary TLDR
The authors find a strong empirical link between the L2 norm of key embeddings and attention weights in decoder-only Transformers: keys with low L2 norm tend to receive high attention. Using this, they compress the KV cache by retaining the keys with the lowest L2 norms (and their values). This simple, training-free heuristic preserves language-model perplexity up to ~50% eviction and keeps near-perfect accuracy on retrieval tests (99% accuracy at 50% eviction on a needle-in-a-haystack task; 100% on passkey retrieval at 90% eviction). The method does not rely on attention scores and is compatible with FlashAttention, easing deployment.
Problem Statement
KV caches grow linearly with context length and can dominate memory and IO during decoding. Existing compression methods either require finetuning or need attention scores (which makes them incompatible with FlashAttention), limiting practical deployment. The paper asks: can we compress KV caches cheaply, without retraining, in a way that plugs into real inference stacks?
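To make the linear growth concrete, here is a rough back-of-the-envelope sketch of KV cache size. The function name `kv_cache_bytes` and the model dimensions in the usage lines are illustrative assumptions, not figures from the paper; the formula itself is just the standard count of one key and one value vector per token, per KV head, per layer.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Approximate KV cache size in bytes.

    The leading factor 2 accounts for storing both a key and a value
    vector per cached token. bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# Hypothetical configuration (assumed, for illustration only).
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)
print(f"full cache:   {full / 2**30:.2f} GiB")

# Evicting 50% of KV pairs is equivalent to caching half as many tokens.
half = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
print(f"50% eviction: {half / 2**30:.2f} GiB")
```

Because every term is multiplicative, doubling the context length doubles the cache, and any eviction ratio translates directly into the same fractional memory saving.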
Main Contribution
Empirical finding: low L2 norm of key embeddings often predicts high attention during decoding.
A zero-training heuristic: compress the KV cache by keeping keys with lowest L2 norms and their values.
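The heuristic can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it treats the cache of a single attention head as parallel lists of key and value vectors, and the names `compress_kv` and `keep_ratio` are hypothetical. Whether eviction is applied per head, per layer, or globally is a design choice the summary above does not specify.

```python
import math


def l2_norm(vec):
    """Euclidean (L2) norm of a vector given as a sequence of floats."""
    return math.sqrt(sum(x * x for x in vec))


def compress_kv(keys, values, keep_ratio):
    """Keep the KV pairs whose keys have the lowest L2 norms.

    keys, values: parallel lists, one entry per cached token
                  (assumed here to belong to a single head).
    keep_ratio:   fraction of pairs to retain, e.g. 0.5 for 50% eviction.
    Returns the retained keys and values in their original token order.
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    if not keys:
        return [], []
    n_keep = max(1, int(len(keys) * keep_ratio))
    # Rank token positions by ascending key norm, then take the lowest n_keep.
    ranked = sorted(range(len(keys)), key=lambda i: l2_norm(keys[i]))
    kept = sorted(ranked[:n_keep])  # restore original token order
    return [keys[i] for i in kept], [values[i] for i in kept]


keys = [[3.0, 4.0], [0.1, 0.0], [1.0, 1.0], [0.0, 0.5]]
values = ["a", "b", "c", "d"]
kept_keys, kept_values = compress_kv(keys, values, keep_ratio=0.5)
print(kept_values)
```

Note that the selection uses only the keys themselves, never the attention matrix, which is why this style of eviction does not conflict with fused attention kernels that never materialize attention scores.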
Key Findings
Keys with low L2 norm tend to receive high attention scores.
Language-model perplexity stays stable when evicting up to 50% of KV pairs by this rule.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Perplexity (language modelling) | No significant change up to 50% KV eviction | no compression | — | Wikipedia | Figure 3 | Figure 3 |
| Accuracy | 99% accuracy at 50% KV eviction | near-100% no compression | ≈ -1 percentage point vs baseline | needle-in-a-haystack (synthetic) | Figure 4a | Figure 4a |
What To Try In 7 Days
Implement L2-norm eviction: compute key L2 norms and, during inference, evict X% of KV pairs, keeping those whose keys have the lowest norms.
Start with X=50% for general LM workloads; verify perplexity on a small Wikipedia slice.
Try aggressive X=80–90% on retrieval-style tasks, verifying accuracy on a few synthetic queries first (passkey/needle tasks).
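One way to organize the verification step is a small sweep that raises the eviction ratio until a quality metric degrades past a tolerance. The sketch below is a hypothetical harness: `max_safe_eviction` and the `evaluate` callable are assumptions introduced for illustration, and `evaluate` stands in for whatever perplexity measurement your own stack provides.

```python
def max_safe_eviction(evaluate, baseline, ratios, tolerance):
    """Return the largest eviction ratio whose metric stays within tolerance.

    evaluate:  user-supplied callable mapping an eviction ratio (0..1) to a
               lower-is-better metric such as perplexity (hypothetical hook).
    baseline:  the metric with no compression.
    ratios:    candidate eviction ratios to try.
    tolerance: maximum acceptable increase over the baseline.
    Stops at the first ratio that fails; returns 0.0 if none pass.
    """
    best = 0.0
    for ratio in sorted(ratios):
        if evaluate(ratio) - baseline > tolerance:
            break
        best = ratio
    return best


# Stand-in metric for demonstration only: flat up to 50% eviction, then rising.
def fake_perplexity(ratio):
    return 10.0 if ratio <= 0.5 else 10.0 + 20.0 * (ratio - 0.5)


print(max_safe_eviction(fake_perplexity, baseline=10.0,
                        ratios=[0.1, 0.3, 0.5, 0.7, 0.9], tolerance=0.1))
```

For retrieval-style tasks, the same loop works with a higher-is-better accuracy metric if `evaluate` returns the negated accuracy and `baseline` is negated to match.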
Agent Features
Memory
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Evaluated only on models up to 8B parameters (Llama family, Gemma).
Correlation varies by layer and head; first two layers often weaker.
When Not To Use
On very large models not yet tested (scale-up unknown).
If layer/head analysis shows poor L2–attention correlation for your model.
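A simple per-head check is to correlate each cached key's L2 norm with the attention it receives. The sketch below uses a plain Pearson correlation as an assumed stand-in for whatever statistic the paper reports; the function name `pearson` and the toy numbers are illustrative. A strongly negative value for a head suggests the low-norm heuristic fits it, while a value near zero or positive is a warning sign.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy data for one head (made up): key norms and the mean attention
# weight each key received across decoding steps.
key_norms = [1.0, 2.0, 3.0, 4.0]
mean_attention = [0.4, 0.3, 0.2, 0.1]
print(pearson(key_norms, mean_attention))
```

Running this per layer and per head on a small sample of prompts gives a quick map of where the heuristic is likely to be safe, which matters given that the first layers are reported to show a weaker relationship.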
Failure Modes
Aggressive eviction beyond pretraining context length increases perplexity and hurts accuracy.
If you evict high-attention keys (e.g., by keeping the keys with the highest L2 norms instead of the lowest), performance can collapse.

