Keep only low L2‑norm keys to cut KV cache 50–90% with little accuracy loss

June 17, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

Empirical results on multiple Llama and Gemma variants show strong practical gains, but tests are limited to ≤8B models and synthetic long-context tasks.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 50%

Authors

Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini

Links

Abstract / PDF / Code

Why It Matters For Business

Halve or more the KV cache memory during decoding without retraining and keep compatibility with FlashAttention, lowering inference cost and easing deployment on bandwidth-limited hardware.

Who Should Care

Summary TLDR

The authors find a strong empirical link between the L2 norm of key embeddings and attention weights in decoder-only Transformers: keys with a low L2 norm tend to receive high attention. Using this, they compress the KV cache by retaining the keys with the lowest L2 norms, together with their values. This simple, training-free heuristic preserves language-model perplexity up to ~50% eviction and keeps near-perfect accuracy on retrieval tests (99% accuracy at 50% eviction on a needle-in-a-haystack task; 100% on passkey retrieval at 90% eviction). The method does not rely on attention scores and is compatible with FlashAttention, easing deployment.

Problem Statement

KV caches grow linearly with context length and can dominate memory and IO during decoding. Existing compression methods either require finetuning or need attention scores (which makes them incompatible with FlashAttention), limiting practical deployment. The paper asks: can we compress KV caches cheaply, in a way that plugs into real inference stacks without retraining?
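To make the linear growth concrete, here is a back-of-the-envelope sketch of KV cache size. The function name `kv_cache_bytes` is hypothetical, and the shape used (32 layers, 32 KV heads, head dimension 128, 2-byte elements) is an assumption chosen to resemble a 7B Llama-style model, not a figure taken from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to cache keys and values for `seq_len` tokens."""
    # Factor 2: one key vector and one value vector per token, layer, and head.
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Assumed shape (illustrative only): 32 layers, 32 KV heads, head dim 128, fp16.
full = kv_cache_bytes(seq_len=4096, n_layers=32, n_kv_heads=32, head_dim=128)
# Evicting 50% of the cached pairs is equivalent to caching half as many tokens.
half = kv_cache_bytes(seq_len=4096 // 2, n_layers=32, n_kv_heads=32, head_dim=128)

print(f"full cache: {full / 2**30:.1f} GiB, after 50% eviction: {half / 2**30:.1f} GiB")
```

Because every factor enters multiplicatively, the saving from eviction scales directly with the fraction of pairs dropped.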

Main Contribution

Empirical finding: low L2 norm of key embeddings often predicts high attention during decoding.

A zero-training heuristic: compress the KV cache by keeping keys with lowest L2 norms and their values.
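The heuristic can be sketched for a single attention head as follows. This is a minimal illustration in plain Python lists, not the authors' implementation: the names `compress_kv` and `keep_ratio` are hypothetical, and a real inference stack would apply the same idea to batched tensors per layer and head.

```python
import math


def l2_norm(vec):
    """Euclidean norm of one key vector."""
    return math.sqrt(sum(x * x for x in vec))


def compress_kv(keys, values, keep_ratio):
    """Keep the `keep_ratio` fraction of (key, value) pairs whose keys
    have the LOWEST L2 norm. Returns (kept_keys, kept_values, kept_indices),
    with the survivors left in their original sequence order."""
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, math.ceil(keep_ratio * len(keys)))
    # Rank positions by key norm, ascending, and keep the first n_keep.
    ranked = sorted(range(len(keys)), key=lambda i: l2_norm(keys[i]))
    kept = sorted(ranked[:n_keep])  # restore positional order
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy example: four cached keys with norms 5.0, ~0.22, 1.0, ~0.71.
keys = [[3.0, 4.0], [0.1, 0.2], [1.0, 0.0], [0.5, 0.5]]
values = ["v0", "v1", "v2", "v3"]
kept_keys, kept_values, kept_idx = compress_kv(keys, values, keep_ratio=0.5)
print(kept_idx, kept_values)
```

Note that the selection needs only the cached keys, never the attention matrix, which is why the approach does not conflict with fused attention kernels.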

Key Findings

Low L2 norm keys correlate with high attention scores.

Practical Use: You can estimate token importance using the L2 norm of cached keys before queries arrive.

Evidence Ref: Figure 1; Figure 2

Language-model perplexity stays stable when evicting up to 50% of KV pairs by this rule.

Numbers: ≤50% eviction → no perceptible perplexity increase (Wikipedia eval)

Practical Use: Evict half of the KV cache using low-L2 selection to halve memory use without hurting LM quality.

Evidence Ref: Figure 3

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Perplexity (language modelling) | No significant change up to 50% KV eviction | no compression | not stated | Wikipedia | Figure 3 | Figure 3 |
| Accuracy | 99% accuracy at 50% KV eviction | near-100% (no compression) | ≈ -1 percentage point vs baseline | needle-in-a-haystack (synthetic) | Figure 4a | Figure 4a |

What To Try In 7 Days

Implement L2-norm eviction: compute key L2 norms and keep lowest X% during inference.

Start with X=50% for general LM workloads; verify perplexity on a small Wikipedia slice.

Try aggressive X=80–90% on retrieval-style tasks and verify accuracy on a few synthetic queries first (passkey/needle tasks).
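For the retrieval check in the last step, a small synthetic passkey harness might look like the sketch below. Everything here is an assumption made for illustration: the filler sentence, the prompt wording, and the `generate` callable (a stand-in for whatever function runs your model with the compressed cache and returns its text output).

```python
import random


def make_passkey_prompt(n_filler, seed=0):
    """Build a prompt that hides a 5-digit passkey among filler lines."""
    rng = random.Random(seed)
    passkey = str(rng.randint(10000, 99999))
    filler = "The grass is green. The sky is blue. The sun is yellow."
    needle = f"The pass key is {passkey}. Remember it."
    lines = [filler] * n_filler
    # Insert the needle at a random depth, including the very start or end.
    lines.insert(rng.randint(0, n_filler), needle)
    prompt = "\n".join(lines) + "\nWhat is the pass key? The pass key is"
    return prompt, passkey


def passkey_accuracy(generate, n_trials=20, n_filler=200):
    """Fraction of trials in which the model output contains the passkey.
    `generate` is a hypothetical callable: prompt string -> output string."""
    hits = 0
    for seed in range(n_trials):
        prompt, passkey = make_passkey_prompt(n_filler, seed)
        if passkey in generate(prompt):
            hits += 1
    return hits / n_trials
```

Running `passkey_accuracy` once with the uncompressed cache and once per eviction level gives a quick before/after comparison on your own model.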

Agent Features

Memory: KV cache compression

Optimization Features

Token Efficiency: Context Compression
Infra Optimization: Reduced HBM/IO pressure
System Optimization: FlashAttention-compatible compression
Inference Optimization: KV Cache Optimization, Efficient Inference

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluated only on models up to 8B parameters (Llama family, Gemma).

Correlation varies by layer and head; first two layers often weaker.

When Not To Use

On very large models not yet tested (scale-up unknown).

If layer/head analysis shows poor L2–attention correlation for your model.
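One rough way to run such a layer/head check is to compute a rank correlation between key norms and attention weights for a sample query. The sketch below is an assumed diagnostic, not the metric used in the paper; the function names are hypothetical, and the simple ranking ignores ties.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ranks(xs):
    """Rank of each element (0 = smallest). Ties are broken by position."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0] * len(xs)
    for rank, i in enumerate(order):
        out[i] = rank
    return out


def spearman(xs, ys):
    """Spearman rank correlation, computed as Pearson correlation on ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(xs), ranks(ys)
    mean = (len(xs) - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var


def norm_attention_correlation(query, keys):
    """Correlation between key L2 norms and the attention one query pays
    to each key, for a single head. Strongly negative values mean that
    low-norm keys receive high attention, which is the case the heuristic needs."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    attn = softmax(scores)
    norms = [math.sqrt(sum(k * k for k in key)) for key in keys]
    return spearman(norms, attn)
```

Heads where this value sits near zero or is positive across many sampled queries are candidates for leaving uncompressed.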

Failure Modes

Aggressive eviction beyond pretraining context length increases perplexity and hurts accuracy.

If you evict high-attention keys (e.g., by keeping the keys with the highest L2 norms instead of the lowest), performance can collapse.

Core Entities

Models

Llama-2-7b
Llama-2-7b-32k
Llama-2-7b-80k
Llama-3-8b
Llama3.1-8b
Gemma

Metrics

perplexity
Accuracy
attention loss (ALR)

Datasets

Wikipedia (chunks)
LongBench subsets (NarrativeQA, Qasper, HotpotQA, 2WikiMQA, QMSum)
needle-in-a-haystack (synthetic)
passkey retrieval (synthetic)

Benchmarks

LongBench
needle-in-a-haystack
passkey retrieval