Cut KV-cache memory up to 5× at test time by storing only the persistently important tokens

May 26, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

10

Authors

Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, Anshumali Shrivastava

Links

Abstract / PDF

Why It Matters For Business

Shrinking KV cache at inference increases batch size and throughput on fixed-memory GPUs without retraining, cutting hosting cost or enabling longer contexts on the same hardware.

Summary TLDR

The KV cache (keys and values saved during generation) often uses more memory than the model weights and blocks large-batch inference. The authors observe that a small set of tokens repeatedly attracts high attention (the "pivotal" tokens). They propose SCISSORHANDS, a budgeted KV-cache algorithm that keeps only the tokens likely to stay important during generation and drops the rest, with no finetuning. On OPT-family models and multiple tasks, SCISSORHANDS reduces KV-cache memory by up to 5× with little or no accuracy loss, and it works with 4-bit quantization.

Problem Statement

KV-cache size grows with sequence length and batch size and can be several times larger than the model weights, which limits batch throughput. We need a test-time method that reduces KV-cache memory along the sequence-length dimension without retraining and without harming model quality.
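The growth described above can be checked with a back-of-the-envelope calculation. This is a minimal sketch, not taken from the paper: the function name `kv_cache_bytes` is hypothetical, and the OPT-175B shape (96 layers, hidden size 12288) and the fp16 cache (2 bytes per value) are assumptions that this summary does not state.

```python
def kv_cache_bytes(n_layers, hidden_size, batch_size, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and every token."""
    # Factor 2: one key vector and one value vector per token per layer.
    return 2 * n_layers * hidden_size * batch_size * seq_len * bytes_per_value


# Assumed OPT-175B shape (96 layers, hidden size 12288) with an fp16 cache.
size = kv_cache_bytes(n_layers=96, hidden_size=12288, batch_size=128, seq_len=2048)
size_gib = size / 2**30
print(f"{size_gib:.0f} GiB")  # prints "1152 GiB"
```

Under these assumptions the result matches the 1152 GB quoted for OPT-175B in Key Findings, if that figure is read in binary gigabytes. The size is linear in both batch size and sequence length, which is why dropping tokens along the sequence dimension translates directly into room for a larger batch.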

Main Contribution

Persistence of Importance hypothesis: pivotal tokens that were important once stay important later; empirical support across OPT layers and datasets.

SCISSORHANDS: a lightweight, test-time algorithm that enforces a fixed KV-cache budget by tracking attention-based importance and dropping low-impact tokens.

Theoretical bounds that connect power-law attention distributions to bounded approximation error when dropping low-importance tokens.

Empirical validation: up to 5× KV-cache reduction on OPT models with minimal accuracy loss and compatibility with 4-bit quantization.
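The budgeted-cache idea can be illustrated with a simplified eviction rule. This is a sketch under stated assumptions, not the authors' exact algorithm: the function name `select_tokens_to_keep`, the use of a single importance score per token (for example, accumulated attention), and the always-keep-recent-tokens rule as written here are all illustrative choices.

```python
import numpy as np


def select_tokens_to_keep(importance, budget, recent):
    """Return sorted indices of cached tokens to keep under a fixed budget.

    importance: 1-D array with one score per cached token (assumed to be
                something like accumulated attention; hypothetical choice).
    budget:     maximum number of tokens the cache may hold.
    recent:     number of most recent tokens that are always kept.
    """
    n = len(importance)
    if n <= budget:
        return np.arange(n)  # cache still fits, nothing to drop
    if recent > budget:
        raise ValueError("recent window cannot exceed the budget")
    recent_idx = np.arange(n - recent, n)
    older_scores = importance[: n - recent]
    # Highest-scoring older tokens fill the slots left after the recent window.
    top_older = np.argsort(-older_scores, kind="stable")[: budget - recent]
    return np.sort(np.concatenate([top_older, recent_idx]))


importance = np.array([0.9, 0.1, 0.05, 0.7, 0.2, 0.3])
keep = select_tokens_to_keep(importance, budget=4, recent=2)
print(keep.tolist())  # prints [0, 3, 4, 5]

# Applying the selection to a toy cache of shape (tokens, head_dim).
keys = np.random.default_rng(0).normal(size=(6, 8))
keys = keys[keep]  # now holds 4 tokens
```

Keeping the most recent tokens unconditionally is a design choice made here because a freshly generated token has had no chance to accumulate attention yet.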

Key Findings

KV cache can be several times larger than model weights and becomes the memory bottleneck.

Numbers: KV cache 2.5–5× larger than weights (Table 1; e.g., OPT-175B weights 325 GB vs. KV cache 1152 GB at batch size 128, sequence length 2048)

A small subset of tokens in the early part of a sequence accounts for almost all later attention (the persistence ratio is high).

Numbers: Persistence ratio >95% in most transformer layers (Figure 2)

SCISSORHANDS reduces KV-cache memory up to 5× with no clear accuracy drop on evaluated models/tasks.

Numbers: Up to 5× KV-cache reduction without accuracy loss on OPT-66B (Figure 3; Conclusion)

SCISSORHANDS is compatible with low-bit quantization without compounding errors.

Numbers: Hellaswag scores unchanged when adding 4-bit quantization on top of SCISSORHANDS (Table 3)
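The persistence ratio mentioned in the findings above can be estimated from an attention map. The sketch below uses one plausible reading of the metric: of the early tokens that receive high attention in the second part of a sequence, what fraction already received high attention in the first part. The function name `persistence_ratio`, the threshold `alpha`, and the split point are assumptions for illustration, not values from the paper.

```python
import numpy as np


def persistence_ratio(attn, split, alpha):
    """attn[i, j] is the attention that generation step i pays to token j (j <= i)."""
    # Early tokens that exceed alpha at some step within the first part.
    early_pivotal = (attn[:split, :split] > alpha).any(axis=0)
    # The same early tokens, judged by the attention they get from later steps.
    late_pivotal = (attn[split:, :split] > alpha).any(axis=0)
    if not late_pivotal.any():
        return float("nan")  # no early token is pivotal later; ratio undefined
    return (early_pivotal & late_pivotal).sum() / late_pivotal.sum()


# Toy causal attention map for a 4-token sequence (each row sums to 1).
attn = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.6, 0.1, 0.3, 0.0],
    [0.5, 0.3, 0.1, 0.1],
])
ratio = persistence_ratio(attn, split=2, alpha=0.25)
print(ratio)  # prints 0.5
```

In the toy map, token 0 is pivotal in both halves while token 1 only becomes pivotal late, so the ratio is 0.5. A ratio above 95% would mean almost every token that matters later was already identifiable early.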

Results

KV cache reduction

Value: up to 5×

Baseline: original KV cache

Persistence ratio (overlap of pivotal tokens)

Value: >95%

Baseline: n/a

Perplexity maintenance

Value: no drop until the compression ratio passes a threshold

Baseline: original perplexity

Accuracy

Value: unchanged

Baseline: original model accuracy (e.g., 0.702 for OPT-6.7B)

Who Should Care

What To Try In 7 Days

Prototype SCISSORHANDS on a small OPT or LLaMA model to measure KV-cache memory vs accuracy trade-offs.

Run lm-eval-harness few-shot tasks to compare downstream accuracy at target compression ratios.

Combine SCISSORHANDS with existing 4-bit quantization to multiply memory savings and re-evaluate throughput.
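For the last step, a quick estimate shows how the two savings compound. This is a sketch that assumes an fp16 (16-bit) baseline cache and ignores quantization metadata such as scales and zero points; the function name `combined_compression` is hypothetical.

```python
def combined_compression(token_factor, baseline_bits=16, quantized_bits=4):
    """Overall KV-cache reduction from dropping tokens and shrinking each value."""
    # Token dropping and bit-width reduction act on independent dimensions,
    # so their factors multiply.
    return token_factor * baseline_bits / quantized_bits


print(combined_compression(5))  # prints 20.0
```

Real savings will be somewhat lower once per-group quantization parameters are counted, so measure actual GPU memory rather than relying on this estimate.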

Optimization Features

Token Efficiency

  • budgeted token retention (drop low-importance tokens)

Infra Optimization

  • enables larger batch size on fixed GPU memory

System Optimization

  • fixed-memory KV cache management
  • head-wise budget allocation

Inference Optimization

  • KV Cache Optimization
  • Context Compression
  • Efficient Inference

Reproducibility

Data Available

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Experiments go only up to OPT-66B; behavior on the very largest models (e.g., OPT-175B, GPT-4 scale) is untested.
  • Algorithm adds occasional extra attention passes to collect importance, creating short spikes in compute.
  • Method relies on the model's learned attention patterns; randomly initialized or differently trained models may not show persistence.
  • Budget allocation per head/layer requires tuning the hyperparameters (w, r, m); the authors rely on a rule of thumb.

When Not To Use

  • If KV-cache memory is not the deployment bottleneck (e.g., CPU offload is already in use).
  • On models or tasks where attention is diffuse and no clear pivotal tokens exist (e.g., random or unusual training regimes).
  • When strict bit-for-bit reproducibility of generation is required.

Failure Modes

  • Dropping a small set of truly important tokens causes local attention errors that can cascade in long runs.
  • Lower persistence ratio in later layers may require larger budgets there; wrong allocation hurts quality.
  • Accumulated approximation error over very long generation can degrade outputs beyond tested sequence lengths.

Core Entities

Models

  • OPT-6.7B
  • OPT-13B
  • OPT-66B
  • OPT-175B
  • LLaMA-65B
  • BLOOM

Metrics

  • KV cache size (GB)
  • Compression factor (×)
  • Perplexity
  • Accuracy
  • Persistence ratio (>95%)

Datasets

  • C4
  • OpenBookQA
  • WikiText
  • Hellaswag
  • MathQA
  • PIQA
  • Winogrande

Benchmarks

  • language modeling (perplexity)
  • Accuracy