Cut KV-cache memory up to 5× at test time by storing only the persistently important tokens

Overview

Decision Snapshot: Ready For Pilot

The method is simple, runs at test time without finetuning, and shows consistent empirical gains on OPT models. However, behavior on the largest models is not shown, and the method introduces some hyperparameters and compute overhead.

Citations: 10

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: No

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, Anshumali Shrivastava

Links

Abstract / PDF

Why It Matters For Business

Shrinking KV cache at inference increases batch size and throughput on fixed-memory GPUs without retraining, cutting hosting cost or enabling longer contexts on the same hardware.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager

Summary TLDR

The KV cache (keys/values saved during generation) often uses more memory than the model weights and blocks large-batch inference. The authors observe that a small set of "pivotal" tokens repeatedly attracts high attention. They propose SCISSORHANDS, a budgeted KV-cache algorithm that keeps only the likely-important tokens during generation and drops the others, without finetuning. Empirically, on OPT-family models and multiple tasks, SCISSORHANDS reduces KV-cache memory by up to 5× with little or no accuracy loss, and it works with 4-bit quantization.

Problem Statement

KV-cache size grows with sequence length and batch size and can be several times larger than the model weights, which limits batch throughput. We need a test-time method that reduces KV-cache memory along the sequence-length dimension without retraining and without harming model quality.

Main Contribution

Persistence of Importance hypothesis: pivotal tokens that were important once stay important later; empirical support across OPT layers and datasets.

SCISSORHANDS: a lightweight, test-time algorithm that enforces a fixed KV-cache budget by tracking attention-based importance and dropping low-impact tokens.
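The idea of a fixed KV-cache budget can be illustrated with a minimal single-head sketch. This is not the authors' algorithm: it swaps in a simplified rule (evict the token with the lowest accumulated attention, while always keeping the most recent tokens), and the names `BudgetedKVCache`, `budget`, and `recent_window` are hypothetical choices made for this example.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


class BudgetedKVCache:
    """Single-head KV cache that holds at most `budget` tokens after each step.

    Importance here is accumulated attention mass, a simplified stand-in
    for the attention-based importance tracking described above.
    """

    def __init__(self, budget, recent_window=2):
        if not 0 <= recent_window < budget:
            raise ValueError("recent_window must satisfy 0 <= recent_window < budget")
        self.budget = budget
        self.recent_window = recent_window
        self.keys, self.values = [], []
        self.scores = []      # accumulated attention received by each cached token
        self.positions = []   # original sequence position of each cached token

    def step(self, q, k, v, position):
        """Append the new token, attend over the cache, then enforce the budget."""
        self.keys.append(k)
        self.values.append(v)
        self.scores.append(0.0)
        self.positions.append(position)

        K = np.stack(self.keys)
        V = np.stack(self.values)
        attn = softmax(K @ q / np.sqrt(q.shape[0]))
        for i, a in enumerate(attn):
            self.scores[i] += float(a)
        out = attn @ V

        if len(self.keys) > self.budget:
            self._evict_one()
        return out

    def _evict_one(self):
        # Only tokens older than the recent window are eviction candidates.
        n_candidates = len(self.keys) - self.recent_window
        drop = int(np.argmin(self.scores[:n_candidates]))
        for buf in (self.keys, self.values, self.scores, self.positions):
            del buf[drop]


# Illustrative run on random vectors (no real model involved).
rng = np.random.default_rng(0)
cache = BudgetedKVCache(budget=8, recent_window=2)
for pos in range(50):
    q, k, v = rng.normal(size=(3, 16))
    cache.step(q, k, v, pos)
print("retained positions:", cache.positions)
```

Because one token is evicted whenever the cache exceeds the budget, memory stays constant in sequence length. A real implementation would operate per head and per layer on batched tensors, and the choice of importance statistic is the part that would need to follow the paper.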

Key Findings

KV cache can be several times larger than model weights and becomes the memory bottleneck.

Numbers: KV cache is 2.5–5× larger than weights (Table 1; e.g., OPT-175B weights 325GB vs KV cache 1152GB at batch size 128, sequence length 2048)

Practical Use: Reducing KV-cache memory directly increases feasible batch size and throughput on fixed-memory GPUs.

Evidence Ref: Table 1
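The OPT-175B example can be checked with a back-of-the-envelope calculation. The sketch below assumes values not stated in this summary: 96 layers, hidden size 12288, and fp16 storage (2 bytes per value) for both the weights and the cache. The function name `kv_cache_bytes` is hypothetical. Under these assumptions the numbers match Table 1 if "GB" is read as GiB.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, d_model, batch_size, seq_len, bytes_per_value=2):
    """Bytes needed to store keys and values for every token in the batch."""
    # Two tensors (K and V) per layer, each holding d_model values per token.
    return 2 * n_layers * d_model * batch_size * seq_len * bytes_per_value


# Assumed OPT-175B shape: 96 layers, hidden size 12288, fp16 values.
kv = kv_cache_bytes(n_layers=96, d_model=12288, batch_size=128, seq_len=2048)
weights = 175e9 * 2  # 175B parameters at 2 bytes each (assumed fp16)

print(f"KV cache: {kv / GIB:.0f} GiB")
print(f"Weights:  {weights / GIB:.0f} GiB")
print(f"Ratio:    {kv / weights:.2f}x")
```

The cache grows linearly in both batch size and sequence length while the weights stay fixed, which is why the cache dominates at large batch sizes.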

A small subset of tokens in the early part of a sequence accounts for almost all later attention (high persistence ratio).

Numbers: Persistence ratio >95% in most transformer layers (Figure 2)

Practical Use: You can predict which historical tokens will matter later and safely omit many others from the KV cache.

Evidence Ref: Figure 2
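One way to measure a persistence ratio on your own attention maps is sketched below. The definitions are assumptions made for illustration, not taken from this summary: a token counts as pivotal for a query if it receives more than the uniform share of attention (1 / number of visible tokens), and the ratio is the fraction of later pivotal tokens, restricted to tokens that already existed in the early part, that were also pivotal early. The function names are hypothetical.

```python
import numpy as np


def pivotal_tokens(attn, query_rows):
    """Token indices that receive above-uniform attention from any query row.

    `attn` is a causal attention matrix of shape (T, T) for one head;
    row t holds the attention of query t over tokens 0..t.
    """
    pivotal = set()
    for t in query_rows:
        n_visible = t + 1
        threshold = 1.0 / n_visible  # assumed threshold: the uniform share
        hits = np.nonzero(attn[t, :n_visible] > threshold)[0]
        pivotal.update(int(j) for j in hits)
    return pivotal


def persistence_ratio(attn, split):
    """Fraction of later pivotal tokens that were already pivotal before `split`."""
    early = pivotal_tokens(attn, range(split))
    late = pivotal_tokens(attn, range(split, attn.shape[0]))
    # Only tokens that existed in the early part can have been pivotal there.
    late_seen_early = {j for j in late if j < split}
    if not late_seen_early:
        return float("nan")
    return len(late_seen_early & early) / len(late_seen_early)
```

A ratio near 1.0 for a layer suggests that tokens selected early remain the ones that matter later, which is the property a budgeted cache relies on.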

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
KV cache reduction	up to 5×	original KV cache	≈80% reduction	OPT family; language modeling and several few-shot tasks	No accuracy degradation reported up to 5× on OPT-66B (Figure 3)	Figure 3; Section 5
Persistence ratio (overlap of pivotal tokens)	>95%	na	—	Measured across layers on OPT models (C4, OpenBookQA, WikiText)	Persistence ratio over 95% in most layers, indicating later pivotal tokens are in the early set	Figure 2; Section 3.2

What To Try In 7 Days

Prototype SCISSORHANDS on a small OPT or LLaMA model to measure KV-cache memory vs accuracy trade-offs.

Run lm-eval-harness few-shot tasks to compare downstream accuracy at target compression ratios.

Combine SCISSORHANDS with existing 4-bit quantization to multiply memory savings and re-evaluate throughput.
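When planning the quantization experiment, it helps to estimate the combined saving first. The sketch below assumes an fp16 baseline and treats the two savings as independent multipliers; it ignores quantization metadata (scales, zero points) and any bookkeeping the cache needs for importance tracking, so real savings would be somewhat lower. The function names are hypothetical and the outputs are arithmetic estimates, not results from the paper.

```python
def combined_factor(token_compression=5.0, baseline_bits=16, quant_bits=4):
    """Overall KV-cache reduction when token dropping and quantization stack."""
    return token_compression * baseline_bits / quant_bits


def compressed_kv_size(baseline_size, token_compression=5.0,
                       baseline_bits=16, quant_bits=4):
    """Estimated KV-cache size in the same unit as `baseline_size`."""
    return baseline_size / combined_factor(token_compression, baseline_bits, quant_bits)


# Example: the 1152GB cache quoted in Key Findings, at 5x token compression plus 4-bit values.
print(f"combined factor: {combined_factor():.0f}x")
print(f"estimated size:  {compressed_kv_size(1152):.1f} GB")
```

Comparing this estimate against measured memory in the prototype shows how much overhead the metadata adds in practice.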

Optimization Features

Token Efficiency

budgeted token retention (drop low-importance tokens)

Infra Optimization

enables larger batch size on fixed GPU memory

System Optimization

fixed-memory KV cache management, head-wise budget allocation

Inference Optimization

KV Cache Optimization, Context Compression, Efficient Inference

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: No

License: Unknown

Risks & Boundaries

Limitations

Experiments go only up to OPT-66B; behavior on the largest models (e.g., OPT-175B, GPT-4 scale) is untested.

Algorithm adds occasional extra attention passes to collect importance, creating short spikes in compute.

When Not To Use

If KV-cache memory is not the deployment bottleneck (e.g., CPU offload is already in use).

On models or tasks where attention is diffuse and no clear pivotal tokens exist (e.g., random or unusual training regimes).

Failure Modes

Dropping a small set of truly important tokens causes local attention errors that can cascade in long runs.

Lower persistence ratio in later layers may require larger budgets there; wrong allocation hurts quality.

Core Entities

Models

OPT-6B, OPT-13B, OPT-66B, OPT-175B, LLaMA-65B, BLOOM

Metrics

KV cache size (GB), Compression factor (×), Perplexity, Accuracy, Persistence ratio (>95%)

Datasets

C4, OpenBookQA, WikiText, Hellaswag, MathQA, PIQA, Winogrande

Benchmarks

language modeling (perplexity), Accuracy
