Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
Cut KV cache memory by half or more during decoding, without retraining and while staying compatible with FlashAttention. This lowers inference cost and eases deployment on bandwidth-limited hardware.
Summary TLDR
The authors find a strong empirical link between the L2 norm of key embeddings and attention weights in decoder-only Transformers: keys with a low L2 norm tend to receive high attention. Using this, they compress the KV cache by retaining the keys with the lowest L2 norms (and their values). This simple, training-free heuristic preserves language-model perplexity up to ~50% eviction and keeps near-perfect accuracy on retrieval tests (99% at 50% eviction on a needle-in-a-haystack task; 100% on passkey retrieval at 90% eviction). The method does not rely on attention scores and is compatible with FlashAttention, easing deployment.
Problem Statement
KV caches grow linearly with context length and can dominate memory and IO during decoding. Existing compression methods either require finetuning or need attention scores (which makes them incompatible with FlashAttention), limiting practical deployment. The paper asks: can we compress KV caches cheaply and plug the compression into real inference stacks without retraining?
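To make the linear growth concrete, here is a minimal back-of-the-envelope sketch of KV cache size. The function name `kv_cache_bytes` and its parameters are hypothetical, and the configuration used (32 layers, 32 KV heads, head dimension 128, fp16) is an assumed Llama-2-7b-like setup, not a figure taken from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch_size=1):
    """Approximate bytes needed to cache keys and values for all layers."""
    # Factor 2 covers one key vector plus one value vector per token, head, layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return batch_size * seq_len * per_token


# Assumed Llama-2-7b-like configuration in fp16 (illustrative only).
full = kv_cache_bytes(seq_len=4096, n_layers=32, n_kv_heads=32, head_dim=128)
# Evicting 50% of KV pairs is equivalent to caching half as many tokens.
half = kv_cache_bytes(seq_len=4096 // 2, n_layers=32, n_kv_heads=32, head_dim=128)

print(f"full cache: {full / 2**30:.1f} GiB, after 50% eviction: {half / 2**30:.1f} GiB")
```

Because the size is a product with `seq_len` as one factor, doubling the context doubles the cache, and any eviction ratio translates directly into the same fractional saving.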
Main Contribution
Empirical finding: low L2 norm of key embeddings often predicts high attention during decoding.
A zero-training heuristic: compress the KV cache by keeping keys with lowest L2 norms and their values.
Evidence that this heuristic preserves performance while cutting memory: ~50% eviction for language modelling and needle-in-a-haystack; up to 90% eviction for passkey retrieval.
Practical compatibility: method does not use attention scores, so it integrates with FlashAttention and standard inference pipelines.
Key Findings
Low L2 norm keys correlate with high attention scores.
Language-model perplexity stays stable when evicting up to 50% of KV pairs by this rule.
Needle-in-a-haystack accuracy is preserved: ~99% at 50% eviction.
Passkey retrieval accuracy remains at 100% even at 90% eviction.
Method does not require attention scores and so works with FlashAttention.
Low-L2 keys show sparse peaked activations dominated by few dimensions (attention sinks).
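One way to check the norm-attention relationship on your own model is a rank correlation per head. The sketch below is an illustration, not the paper's analysis code: it assumes you have already extracted, for one head, the key vectors (`keys`) and the average attention each key receives (`attn`). All function names are hypothetical.

```python
import math


def l2_norm(vec):
    """Euclidean norm of one key vector."""
    return math.sqrt(sum(x * x for x in vec))


def ranks(values):
    """Rank of each element (0 = smallest); ties broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation (no tie correction; needs at least 2 points)."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def norm_attention_correlation(keys, attn):
    """Correlation between key L2 norms and attention received, for one head."""
    return spearman([l2_norm(k) for k in keys], attn)


# Toy data: the lowest-norm key receives the most attention.
toy_keys = [[0.1, 0.0], [1.0, 0.0], [3.0, 4.0]]
toy_attn = [0.7, 0.2, 0.1]
print(norm_attention_correlation(toy_keys, toy_attn))
```

A strongly negative value for a head is consistent with the finding above; values near zero suggest the heuristic may be less reliable for that head.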
Results
Perplexity (language modelling)
Accuracy
Accuracy
Compatibility with FlashAttention and baseline comparison
Who Should Care
What To Try In 7 Days
Implement L2-norm eviction: compute key L2 norms and, during inference, evict the X% of KV pairs with the highest norms so that the lowest-norm keys are kept.
Start with X=50% eviction for general LM workloads; verify perplexity on a small Wikipedia slice.
Try aggressive eviction (X=80–90%) on retrieval-style tasks, and verify accuracy on a few synthetic queries first (passkey/needle tasks).
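The steps above can be sketched for a single attention head as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: `compress_kv_by_l2` and `evict_ratio` are hypothetical names, keys and values are plain lists, and a real inference stack would apply the same idea to cached tensors per layer and head.

```python
import math


def compress_kv_by_l2(keys, values, evict_ratio=0.5):
    """Drop the `evict_ratio` fraction of KV pairs with the highest key L2 norm.

    Returns the kept keys, kept values, and their original positions.
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    if not 0.0 <= evict_ratio < 1.0:
        raise ValueError("evict_ratio must be in [0, 1)")

    n_keep = len(keys) - int(len(keys) * evict_ratio)
    norms = [math.sqrt(sum(x * x for x in k)) for k in keys]

    # Indices of the n_keep keys with the lowest norms.
    lowest = sorted(range(len(keys)), key=lambda i: norms[i])[:n_keep]
    kept = sorted(lowest)  # restore original token order

    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy cache with four tokens; key norms are 5.0, 0.1, 1.0 and 10.0.
toy_keys = [[3.0, 4.0], [0.1, 0.0], [1.0, 0.0], [6.0, 8.0]]
toy_values = ["v0", "v1", "v2", "v3"]
k, v, positions = compress_kv_by_l2(toy_keys, toy_values, evict_ratio=0.5)
print(positions, v)
```

Only the key norms are needed to decide what to keep, so no attention matrix has to be materialised. Returning the kept positions is a design choice of this sketch, so that position-dependent state can be kept aligned with the compressed cache.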
Agent Features
Memory
- KV cache compression
Optimization Features
Token Efficiency
- Context Compression
Infra Optimization
- Reduced HBM/IO pressure
System Optimization
- FlashAttention-compatible compression
Inference Optimization
- KV Cache Optimization
- Efficient Inference
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluated only on models up to 8B parameters (Llama family, Gemma).
- Correlation varies by layer and head; first two layers often weaker.
- No formal theoretical explanation for why low L2 norm predicts high attention.
When Not To Use
- On models larger than those tested (above 8B parameters); behaviour at scale is unknown.
- If layer/head analysis shows poor L2–attention correlation for your model.
- When you cannot tolerate any accuracy drop and have ample memory.
Failure Modes
- Aggressive eviction beyond pretraining context length increases perplexity and hurts accuracy.
- If you evict high-attention keys (e.g., by keeping high L2 norm instead), performance can collapse.
- Per-head and per-layer variability can make a fixed global cutoff suboptimal.
Core Entities
Models
- Llama-2-7b
- Llama-2-7b-32k
- Llama-2-7b-80k
- Llama-3-8b
- Llama-3.1-8b
- Gemma
Metrics
- perplexity
- accuracy
- attention loss (ALR)
Datasets
- Wikipedia (chunks)
- LongBench subsets (NarrativeQA, Qasper, HotpotQA, 2WikiMQA, QMSum)
- needle-in-a-haystack (synthetic)
- passkey retrieval (synthetic)
Benchmarks
- LongBench
- needle-in-a-haystack
- passkey retrieval

