Keep only low L2‑norm keys to cut the KV cache by 50–90% with little accuracy loss

June 17, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.8

Citation Count

0

Authors

Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini

Links

Abstract / PDF

Why It Matters For Business

Cut KV cache memory by half or more during decoding, without retraining, while staying compatible with FlashAttention. This lowers inference cost and eases deployment on bandwidth-limited hardware.

Summary TLDR

The authors find a strong empirical link between the L2 norm of key embeddings and attention weights in decoder-only Transformers: keys with a low L2 norm tend to receive high attention. Using this, they compress the KV cache by retaining the keys with the lowest L2 norms (and their values). This simple, training-free heuristic preserves language-model perplexity up to ~50% eviction and keeps near-perfect accuracy on retrieval tests (99% accuracy at 50% eviction on a needle-in-a-haystack task; 100% on passkey retrieval at 90% eviction). The method does not rely on attention scores and is compatible with FlashAttention, easing deployment.

Problem Statement

KV caches grow linearly with context length and can dominate memory and IO during decoding. Existing compression methods either require fine-tuning or need attention scores, which makes them incompatible with FlashAttention and limits practical deployment. The paper asks: can KV caches be compressed cheaply, without retraining, in a way that plugs into real inference stacks?
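To make the linear growth concrete, here is a back-of-the-envelope sketch of KV cache size. The function name `kv_cache_bytes` and the example shape (32 layers, 32 KV heads, head dimension 128, fp16) are assumptions chosen to resemble a Llama-2-7b-style model; they are not figures reported in the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """Approximate KV cache size: one key and one value vector
    per token, per KV head, per layer."""
    # Factor 2 accounts for storing both keys and values.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return batch * seq_len * per_token


# Assumed shape (illustrative only): 32 layers, 32 KV heads, head_dim 128, fp16.
full = kv_cache_bytes(seq_len=4096, n_layers=32, n_kv_heads=32, head_dim=128)
# Evicting 50% of KV pairs is equivalent to caching half as many tokens.
half = kv_cache_bytes(seq_len=2048, n_layers=32, n_kv_heads=32, head_dim=128)

print(full / 2**30)  # 2.0 (GiB)
print(half / 2**30)  # 1.0 (GiB)
```

Because the size is proportional to the number of cached tokens, an eviction ratio translates directly into the same fractional memory saving under these assumptions.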

Main Contribution

Empirical finding: low L2 norm of key embeddings often predicts high attention during decoding.

A zero-training heuristic: compress the KV cache by keeping the keys with the lowest L2 norms and their values.

Evidence that the heuristic preserves performance while cutting memory: ~50% eviction for language modelling and needle-in-a-haystack; up to 90% for passkey retrieval.

Practical compatibility: method does not use attention scores, so it integrates with FlashAttention and standard inference pipelines.
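The heuristic described above can be sketched for a single attention head in a few lines. This is a framework-free illustration, not the authors' implementation: the function name `compress_kv`, the `keep_ratio` parameter, and the choice to preserve the original token order among the kept pairs are all assumptions.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a key vector."""
    return math.sqrt(sum(x * x for x in vec))


def compress_kv(keys, values, keep_ratio):
    """Keep the KV pairs whose keys have the lowest L2 norm.

    keys, values: per-token lists for one attention head.
    keep_ratio:   fraction of pairs to retain (1 - eviction ratio).
    Returns (kept_keys, kept_values, kept_indices), in original token order.
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    n_keep = max(1, math.ceil(len(keys) * keep_ratio))
    # Rank token positions by key norm, ascending; no attention scores needed.
    ranked = sorted(range(len(keys)), key=lambda i: l2_norm(keys[i]))
    kept = sorted(ranked[:n_keep])  # restore token order (assumption)
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy data: norms are 5.0, ~0.22, ~1.41, 5.0.
keys = [[3.0, 4.0], [0.1, 0.2], [1.0, 1.0], [5.0, 0.0]]
values = ["v0", "v1", "v2", "v3"]
kept_keys, kept_values, kept_idx = compress_kv(keys, values, keep_ratio=0.5)
print(kept_idx)     # [1, 2]
print(kept_values)  # ['v1', 'v2']
```

The ranking uses only the cached keys, which is why a rule of this kind does not need the attention matrix that fused kernels such as FlashAttention avoid materializing.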

Key Findings

Low L2 norm keys correlate with high attention scores.

Language-model perplexity stays stable when evicting up to 50% of KV pairs by this rule.

Numbers: ≤50% eviction → no perceptible perplexity increase (Wikipedia eval)

Needle-in-a-haystack accuracy is preserved: ~99% at 50% eviction.

Numbers: 50% eviction → 99% accuracy

Passkey retrieval remains perfect even with aggressive compression.

Numbers: 90% eviction → 100% accuracy

The method does not require attention scores, so it works with FlashAttention.

Low-L2-norm keys show sparse, peaked activations dominated by a few dimensions (attention sinks).
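One way to check the norm-attention relationship on your own model is to correlate each key's L2 norm with the attention it receives, head by head. The sketch below does this for a single head with plain Python lists. The function names are hypothetical, causal masking is ignored for brevity, and the toy vectors were built by hand only to exercise the code; they are not evidence for the finding.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = 1.0 / math.sqrt(len(query))
    return softmax([scale * sum(q * k for q, k in zip(query, key))
                    for key in keys])


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def norm_attention_correlation(queries, keys):
    """Correlate each key's L2 norm with its mean attention over queries."""
    norms = [math.sqrt(sum(k * k for k in key)) for key in keys]
    mean_attn = [0.0] * len(keys)
    for q in queries:
        for i, w in enumerate(attention_weights(q, keys)):
            mean_attn[i] += w / len(queries)
    return pearson(norms, mean_attn)


# Hand-built toy data (not from the paper).
keys = [[0.1, 0.0], [-2.0, 0.0], [-3.0, 0.0]]
queries = [[1.0, 0.0], [2.0, 0.0]]
r = norm_attention_correlation(queries, keys)
print(r < 0)  # True for this toy data: lower norm, more attention
```

A clearly negative coefficient for a head is consistent with the heuristic; a value near zero or positive suggests that head is a poor candidate for norm-based eviction.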

Results

Perplexity (language modelling)

Value: no significant change up to 50% KV eviction

Baseline: no compression

Accuracy (needle-in-a-haystack)

Value: 99% accuracy at 50% KV eviction

Baseline: near-100% with no compression

Accuracy (passkey retrieval)

Value: 100% accuracy at 90% KV eviction

Baseline: 100% with no compression

Compatibility with FlashAttention and baseline comparison

Value: outperforms FastGen up to 50% eviction; works with FlashAttention since it avoids attention scores

Baseline: FastGen (attention-based)

What To Try In 7 Days

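Implement L2-norm eviction: compute key L2 norms and, at eviction ratio X%, drop the X% of KV pairs with the highest key norms during inference, keeping the lowest-norm keys and their values.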

Start with X=50% for general LM workloads; verify perplexity on a small Wikipedia slice.

Try aggressive X=80–90% on retrieval-style tasks and verify accuracy on a few synthetic queries first (passkey/needle tasks).
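When trying several values of X, a small helper can pick the most aggressive eviction ratio that stays within an accuracy budget. This is only a sketch: `max_safe_eviction` and the `evaluate` callback are hypothetical names, and `placeholder_eval` returns made-up scores so the example runs on its own. In practice `evaluate` would run your model on a passkey, needle, or perplexity check at the given ratio.

```python
def max_safe_eviction(evaluate, ratios, baseline, tolerance):
    """Return the largest eviction ratio whose score stays within
    `tolerance` of `baseline`, scanning ratios in ascending order.

    evaluate: callable mapping an eviction ratio to a higher-is-better score.
    Stops at the first ratio that exceeds the budget.
    """
    best = 0.0
    for ratio in sorted(ratios):
        if baseline - evaluate(ratio) <= tolerance:
            best = ratio
        else:
            break
    return best


def placeholder_eval(ratio):
    """Stand-in scorer with invented values; replace with a real evaluation."""
    return 1.0 - ratio * ratio


chosen = max_safe_eviction(placeholder_eval,
                           ratios=[0.25, 0.5, 0.75, 0.9],
                           baseline=1.0,
                           tolerance=0.1)
print(chosen)  # 0.25 with the placeholder scorer above
```

For a lower-is-better metric such as perplexity, pass a negated score or flip the comparison.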

Agent Features

Memory

  • KV cache compression

Optimization Features

Token Efficiency

  • Context Compression

Infra Optimization

  • Reduced HBM/IO pressure

System Optimization

  • FlashAttention-compatible compression

Inference Optimization

  • KV Cache Optimization
  • Efficient Inference

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluated only on models up to 8B parameters (Llama family, Gemma).
  • Correlation varies by layer and head; the first two layers often show a weaker correlation.
  • No formal theoretical explanation for why low L2 norm predicts high attention.

When Not To Use

  • On models larger than those tested (behavior at scale is unknown).
  • If layer/head analysis shows poor L2–attention correlation for your model.
  • When you cannot tolerate any accuracy drop and have ample memory.

Failure Modes

  • Aggressive eviction beyond pretraining context length increases perplexity and hurts accuracy.
  • If you evict high-attention keys (e.g., by keeping high L2 norm instead), performance can collapse.
  • Per-head and per-layer variability can make a fixed global cutoff suboptimal.
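One way to soften the fixed-cutoff problem is to vary the keep ratio by layer, for example by leaving layers with a weak norm-attention correlation uncompressed. The sketch below assumes a nested-list cache where `cache[layer][head]` is a `(keys, values)` pair; that layout, the names `keep_indices` and `compress_cache`, and the default of skipping layers 0 and 1 are illustrative assumptions, not settings confirmed by the source.

```python
import math


def keep_indices(keys, keep_ratio):
    """Token positions of the lowest-norm keys, in original order."""
    n_keep = max(1, math.ceil(len(keys) * keep_ratio))
    # Squared norm gives the same ordering as the norm itself.
    by_norm = sorted(range(len(keys)),
                     key=lambda i: sum(x * x for x in keys[i]))
    return sorted(by_norm[:n_keep])


def compress_cache(cache, keep_ratio, skip_layers=(0, 1)):
    """Apply norm-based eviction per head, leaving `skip_layers` untouched.

    cache[layer][head] is a (keys, values) pair of per-token lists.
    Returns a new cache with the same nesting.
    """
    compressed = []
    for layer_idx, heads in enumerate(cache):
        ratio = 1.0 if layer_idx in skip_layers else keep_ratio
        new_heads = []
        for keys, values in heads:
            idx = keep_indices(keys, ratio)
            new_heads.append(([keys[i] for i in idx],
                              [values[i] for i in idx]))
        compressed.append(new_heads)
    return compressed


# Toy cache: 3 layers, 1 head each, 4 tokens per head.
toy_keys = [[3.0, 4.0], [0.1, 0.2], [1.0, 1.0], [5.0, 0.0]]
toy_values = ["v0", "v1", "v2", "v3"]
cache = [[(toy_keys, toy_values)] for _ in range(3)]

out = compress_cache(cache, keep_ratio=0.5)
print([len(layer[0][0]) for layer in out])  # [4, 4, 2]
```

The same structure extends to per-head ratios by replacing the single `keep_ratio` with a lookup keyed on layer and head.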

Core Entities

Models

  • Llama-2-7b
  • Llama-2-7b-32k
  • Llama-2-7b-80k
  • Llama-3-8b
  • Llama-3.1-8b
  • Gemma

Metrics

  • perplexity
  • accuracy
  • attention loss (ALR)

Datasets

  • Wikipedia (chunks)
  • LongBench subsets (NarrativeQA, Qasper, HotpotQA, 2WikiMQA, QMSum)
  • needle-in-a-haystack (synthetic)
  • passkey retrieval (synthetic)

Benchmarks

  • LongBench
  • needle-in-a-haystack
  • passkey retrieval