Keep only low L2‑norm keys to cut KV cache 50–90% with little accuracy loss

Overview

Decision Snapshot: Needs Validation

Empirical results on multiple Llama and Gemma variants show strong practical gains, but tests are limited to ≤8B models and synthetic long-context tasks.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 50%

Authors

Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini

Why It Matters For Business

Halve or more the KV cache memory during decoding without retraining and keep compatibility with FlashAttention, lowering inference cost and easing deployment on bandwidth-limited hardware.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager, Data Scientist

Summary TLDR

The authors find a strong empirical link between the L2 norm of key embeddings and attention weights in decoder-only Transformers: keys with low L2 norm tend to receive high attention. Using this, they compress the KV cache by retaining the keys with the lowest L2 norms (and their values). This simple, training-free heuristic preserves language-model perplexity up to ~50% eviction and keeps near-perfect accuracy on retrieval tests (99% accuracy at 50% eviction on a needle-in-a-haystack task; 100% on passkey retrieval at 90% eviction). The method does not rely on attention scores and is compatible with FlashAttention, easing deployment.

Problem Statement

KV caches grow linearly with context length and can dominate memory and IO during decoding. Existing compression methods either require finetuning or need attention scores (which makes them incompatible with FlashAttention), limiting practical deployment. The paper asks: can KV caches be compressed cheaply, without retraining, in a way that plugs into real inference stacks?
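To make the linear growth concrete, here is a rough back-of-the-envelope sketch. The function name `kv_cache_bytes` and the model shape used in the example (32 layers, 32 KV heads, head dimension 128, fp16) are illustrative assumptions, not figures taken from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Approximate KV cache size: one key and one value vector
    per token, per KV head, per layer."""
    # The factor 2 accounts for storing both keys and values.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Assumed shape for illustration only: 32 layers, 32 KV heads, head_dim 128, fp16.
full = kv_cache_bytes(32, 32, 128, seq_len=4096)
half = kv_cache_bytes(32, 32, 128, seq_len=4096 // 2)  # 50% of KV pairs evicted
print(full / 2**30, "GiB ->", half / 2**30, "GiB")
```

Because the size is proportional to the number of cached tokens, evicting a fraction of KV pairs reduces cache memory by roughly the same fraction.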

Main Contribution

Empirical finding: low L2 norm of key embeddings often predicts high attention during decoding.

A zero-training heuristic: compress the KV cache by keeping keys with lowest L2 norms and their values.
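A minimal sketch of the heuristic for a single attention head, in plain Python. The function names (`l2_norm`, `compress_kv`) and the `keep_ratio` parameter are hypothetical and are not taken from the authors' code; a real implementation would operate on batched tensors per layer and head.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def compress_kv(keys, values, keep_ratio):
    """Keep the KV pairs whose keys have the lowest L2 norm (one head).

    keys, values: lists of equal length, one entry per cached token.
    keep_ratio:   fraction of pairs to retain, in (0, 1].
    Returns the retained keys, values, and positions in original token order.
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = math.ceil(len(keys) * keep_ratio)
    # Rank positions by key norm (ascending) and keep the n_keep smallest.
    ranked = sorted(range(len(keys)), key=lambda i: l2_norm(keys[i]))
    kept = sorted(ranked[:n_keep])  # restore original token order
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy example: key norms are 5.0, ~0.22, 1.0, and 10.0.
keys = [[3.0, 4.0], [0.1, 0.2], [1.0, 0.0], [6.0, 8.0]]
values = ["v0", "v1", "v2", "v3"]
kept_keys, kept_values, kept_pos = compress_kv(keys, values, keep_ratio=0.5)
print(kept_pos, kept_values)
```

Note that the selection uses only the cached keys, so no attention scores are needed at eviction time.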

Key Findings

Low L2 norm keys correlate with high attention scores.

Practical Use: You can estimate token importance using the L2 norm of cached keys before queries arrive.

Evidence Ref: Figure 1; Figure 2

Language-model perplexity stays stable when evicting up to 50% of KV pairs by this rule.

Numbers: ≤50% eviction → no perceptible perplexity increase (Wikipedia eval)

Practical Use: Evict the half of the KV pairs with the highest key norms (keeping the low-L2 keys) to halve memory use without hurting LM quality.

Evidence Ref: Figure 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Perplexity (language modelling)	No significant change up to 50% KV eviction	no compression	—	Wikipedia	Figure 3	Figure 3
Accuracy	99% at 50% KV eviction	near-100% (no compression)	≈ -1 percentage point vs. baseline	needle-in-a-haystack (synthetic)	Figure 4a	Figure 4a

What To Try In 7 Days

Implement L2-norm eviction: compute key L2 norms and, during inference, evict the X% of KV pairs with the highest norms (keeping the lowest-norm keys).

Start with X=50% for general LM workloads; verify perplexity on a small Wikipedia slice.

Try aggressive X=80–90% on retrieval-style tasks and verify accuracy on a few synthetic queries first (passkey/needle tasks).
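For a first experiment during decoding, one option is a cache with a fixed budget that drops the highest-norm key whenever it overflows. The sketch below is a toy, single-head illustration; the class name `BudgetedKVCache` and the `max_entries` parameter are hypothetical, and the budget would be set to roughly (1 - X) times the target context length for an eviction ratio X.

```python
import math


class BudgetedKVCache:
    """Toy single-head cache that holds at most max_entries KV pairs."""

    def __init__(self, max_entries):
        if max_entries < 1:
            raise ValueError("max_entries must be >= 1")
        self.max_entries = max_entries
        self.entries = []  # (position, key_norm, key, value), in token order

    def append(self, position, key, value):
        norm = math.sqrt(sum(x * x for x in key))
        self.entries.append((position, norm, key, value))
        if len(self.entries) > self.max_entries:
            # Evict the entry with the largest key norm, i.e. the one the
            # heuristic predicts will receive the least attention.
            worst = max(range(len(self.entries)),
                        key=lambda i: self.entries[i][1])
            del self.entries[worst]

    def positions(self):
        return [entry[0] for entry in self.entries]


# Budget of 2 entries over a 4-token toy sequence (X = 50% eviction).
cache = BudgetedKVCache(max_entries=2)
for pos, key in enumerate([[1.0, 0.0], [5.0, 0.0], [0.5, 0.0], [3.0, 0.0]]):
    cache.append(pos, key, value=f"v{pos}")
print(cache.positions())
```

In this toy version a newly appended token can be evicted immediately if its key norm is the largest. Whether to exempt the most recent tokens is a design choice this sketch does not model.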

Agent Features

Memory

KV cache compression

Optimization Features

Token Efficiency

Context Compression

Infra Optimization

Reduced HBM/IO pressure

System Optimization

FlashAttention-compatible compression

Inference Optimization

KV Cache Optimization, Efficient Inference

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/alessiodevoto/l2compress

Risks & Boundaries

Limitations

Evaluated only on models up to 8B parameters (Llama family, Gemma).

Correlation varies by layer and head; first two layers often weaker.
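Given the weaker correlation in the earliest layers, one plausible mitigation is to leave those layers uncompressed and apply the keep ratio only to the rest. The sketch below shows how that changes the overall memory saving; `layer_keep_ratios`, `overall_kept_fraction`, and the default `skip_layers=2` are illustrative assumptions, not settings confirmed by the source.

```python
def layer_keep_ratios(n_layers, keep_ratio, skip_layers=2):
    """Per-layer keep ratios: the first skip_layers stay uncompressed."""
    if not 0 <= skip_layers <= n_layers:
        raise ValueError("skip_layers must be between 0 and n_layers")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    return [1.0 if layer < skip_layers else keep_ratio
            for layer in range(n_layers)]


def overall_kept_fraction(ratios):
    """Fraction of the full KV cache that remains, assuming equal-size layers."""
    return sum(ratios) / len(ratios)


# Assumed 32-layer model, 50% kept per compressed layer, first 2 layers skipped.
ratios = layer_keep_ratios(n_layers=32, keep_ratio=0.5, skip_layers=2)
print(overall_kept_fraction(ratios))
```

With these assumed numbers the cache shrinks to about 53% of its original size rather than 50%, so skipping a couple of layers costs little memory.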

When Not To Use

On very large models not yet tested (scale-up unknown).

If layer/head analysis shows poor L2–attention correlation for your model.
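A quick way to run such a check for one head is to compare key norms against the attention weights a query assigns to those keys. The sketch below computes a Spearman rank correlation in plain Python; all function names are hypothetical, ties are broken by position for simplicity, and in practice one would average over many queries, heads, and layers taken from the actual model.

```python
import math


def attention_weights(query, keys):
    """Softmax of scaled dot products between one query and all cached keys."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ranks(xs):
    """Rank values from 0 (smallest); ties are broken by position."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0] * len(xs)
    for rank, i in enumerate(order):
        out[i] = rank
    return out


def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def norm_attention_correlation(query, keys):
    """Correlation between key L2 norms and attention received (needs >= 2 keys)."""
    norms = [math.sqrt(sum(x * x for x in key)) for key in keys]
    return spearman(norms, attention_weights(query, keys))


# Toy data built so that larger-norm keys receive less attention.
rho = norm_attention_correlation([1.0, 0.0],
                                 [[-1.0, 0.0], [-2.0, 0.0], [-3.0, 0.0]])
print(rho)
```

A clearly negative value suggests that low-norm keys receive high attention for that head, which is the condition the heuristic relies on. Values near zero or positive would be a warning sign.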

Failure Modes

Aggressive eviction beyond pretraining context length increases perplexity and hurts accuracy.

If you evict high-attention keys (e.g., by keeping high L2 norm instead), performance can collapse.

Core Entities

Models

Llama-2-7b, Llama-2-7b-32k, Llama-2-7b-80k, Llama-3-8b, Llama3.1-8b, Gemma

Metrics

perplexity, accuracy, attention loss (ALR)

Datasets

Wikipedia (chunks), LongBench subsets (NarrativeQA, Qasper, HotpotQA, 2WikiMQA, QMSum), needle-in-a-haystack (synthetic), passkey retrieval (synthetic)

Benchmarks

LongBench, needle-in-a-haystack, passkey retrieval
