Cut KV-cache memory up to 5× at test time by storing only the persistently important tokens

May 26, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is simple, runs at test time without finetuning, and shows consistent empirical gains on OPT models. However, behavior on the largest models is not shown, and the method introduces some hyperparameters and compute overhead.

Citations: 10

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: No

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, Anshumali Shrivastava

Links

Abstract / PDF

Why It Matters For Business

Shrinking KV cache at inference increases batch size and throughput on fixed-memory GPUs without retraining, cutting hosting cost or enabling longer contexts on the same hardware.

Who Should Care

Summary TLDR

The KV cache (keys and values saved during generation) often uses more memory than the model weights and blocks large-batch inference. The authors observe that a small set of tokens repeatedly attracts high attention (the "pivotal" tokens). They propose SCISSORHANDS, a budgeted KV-cache algorithm that keeps only likely-important tokens during generation and drops the others, without finetuning. Empirically, on OPT-family models and multiple tasks, SCISSORHANDS reduces KV-cache memory up to 5× with little or no accuracy loss, and it works with 4-bit quantization.

Problem Statement

KV cache size grows with sequence length and batch size and can be several times larger than the model weights, which limits batch size and throughput. We need a test-time method that reduces KV-cache memory along the sequence-length dimension without retraining and without harming model quality.

Main Contribution

Persistence of Importance hypothesis: pivotal tokens that were important once stay important later; empirical support across OPT layers and datasets.

SCISSORHANDS: a lightweight, test-time algorithm that enforces a fixed KV-cache budget by tracking attention-based importance and dropping low-impact tokens.
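To make the idea of a fixed KV-cache budget concrete, here is a minimal sketch for a single attention head. It is not the authors' algorithm: the class name `BudgetedKVCache`, the `keep_recent` parameter, and the use of accumulated attention weight as the importance score are all assumptions chosen for illustration. The paper's actual bookkeeping and drop schedule may differ.

```python
import numpy as np


class BudgetedKVCache:
    """Hypothetical fixed-budget KV cache for one attention head (illustrative only)."""

    def __init__(self, budget, keep_recent):
        assert 0 <= keep_recent < budget
        self.budget = budget            # maximum number of tokens retained
        self.keep_recent = keep_recent  # newest tokens that are never evicted
        self.keys = []                  # one key vector per retained token
        self.values = []                # one value vector per retained token
        self.positions = []             # original position of each retained token
        self.importance = []            # attention mass accumulated by each token
        self.next_pos = 0

    def append(self, key, value):
        """Store a new token, then evict one token if the budget is exceeded."""
        self.keys.append(key)
        self.values.append(value)
        self.positions.append(self.next_pos)
        self.importance.append(0.0)
        self.next_pos += 1
        if len(self.keys) > self.budget:
            self._evict()

    def attend(self, query):
        """Softmax attention over retained tokens; also updates importance scores."""
        k = np.stack(self.keys)
        v = np.stack(self.values)
        logits = k @ query / np.sqrt(query.shape[0])
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()
        for i, w in enumerate(weights):
            self.importance[i] += float(w)
        return weights @ v

    def _evict(self):
        # Only tokens outside the protected recent window are candidates.
        n_candidates = len(self.keys) - self.keep_recent
        drop = min(range(n_candidates), key=lambda i: self.importance[i])
        for buf in (self.keys, self.values, self.positions, self.importance):
            del buf[drop]


# Toy usage with random vectors standing in for real keys, values, and queries.
rng = np.random.default_rng(0)
cache = BudgetedKVCache(budget=8, keep_recent=2)
for _ in range(32):
    k, v, q = rng.normal(size=(3, 16))
    cache.append(k, v)
    out = cache.attend(q)

print(len(cache.keys), cache.positions)
```

Memory stays bounded by `budget` regardless of how many tokens are generated. One known weakness of this simplification is that accumulated scores favor older tokens, so a real implementation would likely score importance over a recent history window instead.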

Key Findings

KV cache can be several times larger than model weights and becomes the memory bottleneck.

Numbers: KV cache several times larger than weights (Table 1; e.g., OPT-175B: weights 325 GB vs. KV cache 1152 GB, about 3.5×, at batch size 128 and sequence length 2048)

Practical Use: Reducing KV-cache memory directly increases feasible batch size and throughput on fixed-memory GPUs.

Evidence Ref: Table 1
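The OPT-175B figures quoted for this finding can be checked with back-of-the-envelope arithmetic. The sketch below assumes the published OPT-175B shape (96 layers, hidden size 12288) and fp16 storage at 2 bytes per value. These details are not stated in this summary, and the "GB" figures are interpreted here as GiB.

```python
def kv_cache_bytes(num_layers, hidden_size, batch_size, seq_len, bytes_per_value=2):
    """Bytes needed to store keys and values for every layer and every token."""
    # Factor of 2: one key vector and one value vector per token per layer.
    return 2 * num_layers * hidden_size * batch_size * seq_len * bytes_per_value


GIB = 1024 ** 3

# Assumed OPT-175B shape: 96 layers, hidden size 12288, fp16 (2 bytes per value).
kv_gib = kv_cache_bytes(num_layers=96, hidden_size=12288,
                        batch_size=128, seq_len=2048) / GIB
weights_gib = 175e9 * 2 / GIB  # 175B parameters stored in fp16

print(f"KV cache: {kv_gib:.0f} GiB")
print(f"Weights:  {weights_gib:.0f} GiB")
print(f"Ratio:    {kv_gib / weights_gib:.2f}x")
```

Because the formula is linear in both `batch_size` and `seq_len`, halving the number of retained tokens frees enough memory to roughly double the batch size, all else equal.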

A small subset of tokens from the early part of a sequence accounts for almost all later attention (the persistence ratio is high).

Numbers: Persistence ratio >95% in most transformer layers (Figure 2)

Practical Use: You can predict which historical tokens will matter later and safely omit many others from the KV cache.

Evidence Ref: Figure 2
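One plausible way to measure this overlap on your own attention maps is sketched below. The definition is an assumption for illustration: a token counts as "pivotal" if it receives attention above a threshold at any step, and the ratio is the share of later pivotal tokens (restricted to early positions) that were already pivotal in the early part. The function names, the threshold, and the toy matrix are hypothetical, and the paper's exact definition may differ.

```python
import numpy as np


def pivotal_tokens(attn, threshold):
    """Indices of tokens that receive attention above `threshold` at any step.

    attn has shape (num_steps, num_tokens); row t holds the weights of step t.
    """
    return set(np.flatnonzero((attn > threshold).any(axis=0)).tolist())


def persistence_ratio(attn, split, threshold):
    """Share of later pivotal tokens that were already pivotal before `split`."""
    early = pivotal_tokens(attn[:split, :split], threshold)
    # Later steps may attend anywhere, but only tokens before `split` can overlap.
    late = {i for i in pivotal_tokens(attn[split:], threshold) if i < split}
    if not late:
        return float("nan")
    return len(early & late) / len(late)


# Toy causal attention map for a 6-token sequence (each row sums to 1).
attn = np.array([
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.60, 0.40, 0.00, 0.00, 0.00, 0.00],
    [0.50, 0.40, 0.10, 0.00, 0.00, 0.00],
    [0.50, 0.30, 0.10, 0.10, 0.00, 0.00],
    [0.40, 0.40, 0.05, 0.05, 0.10, 0.00],
    [0.45, 0.35, 0.05, 0.05, 0.05, 0.05],
])

ratio = persistence_ratio(attn, split=3, threshold=0.2)
print(ratio)
```

In this toy map, tokens 0 and 1 dominate attention in both halves, so every later pivotal token was already pivotal early on.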

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
KV cache reduction | up to 5× | original KV cache | ≈80% reduction | OPT family; language modeling and several few-shot tasks | No accuracy degradation reported up to 5× on OPT-66B (Figure 3) | Figure 3; Section 5
Persistence ratio (overlap of pivotal tokens) | >95% | n/a | | Measured across layers on OPT models (C4, OpenBookQA, WikiText) | Persistence ratio over 95% in most layers, indicating later pivotal tokens are in the early set | Figure 2; Section 3.2

What To Try In 7 Days

Prototype SCISSORHANDS on a small OPT or LLaMA model to measure KV-cache memory vs accuracy trade-offs.

Run lm-eval-harness few-shot tasks to compare downstream accuracy at target compression ratios.

Combine SCISSORHANDS with existing 4-bit quantization to multiply memory savings and re-evaluate throughput.
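When planning the pilot, it helps to tabulate the best-case combined savings first. The sketch below assumes an fp16 baseline and that token dropping and quantization stack multiplicatively. It ignores quantization metadata such as scales and zero points, so real savings will be somewhat lower. The function name and parameters are hypothetical.

```python
def combined_reduction(keep_fraction, baseline_bits=16, quant_bits=16):
    """Idealized KV-cache shrink factor from token dropping plus quantization."""
    return (baseline_bits / quant_bits) / keep_fraction


# Sweep a few token budgets against fp16 and 4-bit storage.
for keep_fraction in (1.0, 0.5, 0.2):
    for quant_bits in (16, 4):
        factor = combined_reduction(keep_fraction, quant_bits=quant_bits)
        print(f"keep {keep_fraction:.0%} of tokens, {quant_bits}-bit: {factor:.1f}x smaller")
```

Keeping 20% of tokens corresponds to the 5× figure reported above. Pairing that with 4-bit storage gives an idealized 20× reduction, which is the upper bound to compare measured memory against.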

Optimization Features

Token Efficiency
budgeted token retention (drop low-importance tokens)
Infra Optimization
enables larger batch size on fixed GPU memory
System Optimization
fixed-memory KV cache management, head-wise budget allocation
Inference Optimization
KV Cache Optimization, Context Compression, Efficient Inference

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: No
License: Unknown

Risks & Boundaries

Limitations

Experiments go only as far as OPT-66B; behavior on the very largest models (e.g., OPT-175B, GPT-4 scale) is untested.

Algorithm adds occasional extra attention passes to collect importance, creating short spikes in compute.

When Not To Use

If KV-cache memory is not the deployment bottleneck (e.g., CPU offload is already in use).

On models or tasks where attention is diffuse and no clear pivotal tokens exist (e.g., random or unusual training regimes).

Failure Modes

Dropping a small set of truly important tokens causes local attention errors that can cascade in long runs.

Lower persistence ratio in later layers may require larger budgets there; wrong allocation hurts quality.

Core Entities

Models

OPT-6.7B, OPT-13B, OPT-66B, OPT-175B, LLaMA-65B, BLOOM

Metrics

KV cache size (GB), Compression factor (×), Perplexity, Accuracy, Persistence ratio (>95%)

Datasets

C4, OpenBookQA, WikiText, Hellaswag, MathQA, PIQA, Winogrande

Benchmarks

Language modeling (perplexity), Accuracy