Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
10
Why It Matters For Business
Shrinking KV cache at inference increases batch size and throughput on fixed-memory GPUs without retraining, cutting hosting cost or enabling longer contexts on the same hardware.
Summary TLDR
The KV cache (keys and values saved during generation) often uses more memory than the model weights and limits batch size at inference. The authors observe that a small set of "pivotal" tokens repeatedly attracts high attention. They propose SCISSORHANDS, a budgeted KV-cache algorithm that keeps only the tokens likely to matter during generation and drops the rest, with no finetuning. On OPT-family models across multiple tasks, SCISSORHANDS reduces KV-cache memory by up to 5× with little or no accuracy loss, and it remains compatible with 4-bit quantization.
Problem Statement
KV-cache size grows with sequence length and batch size and can exceed the model weights several times over, which limits batch throughput. The goal is a test-time method that shrinks the KV cache along the sequence-length dimension without retraining and without harming model quality.
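The growth described above can be made concrete with a back-of-the-envelope estimate. The sketch below assumes a standard decoder that stores one key and one value vector of width `hidden_size` per token per layer in 16-bit precision. The configuration (64 layers, hidden size 9216, 66 billion parameters, batch size 64, sequence length 2048) is an illustrative assumption, not a figure taken from the paper.

```python
GIB = 2 ** 30


def kv_cache_bytes(num_layers, hidden_size, seq_len, batch_size, bytes_per_value=2):
    """Estimate KV-cache size: one key and one value vector per token per layer."""
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_value


def weight_bytes(num_params, bytes_per_value=2):
    """Estimate the memory taken by the model weights alone."""
    return num_params * bytes_per_value


# Hypothetical configuration, chosen only for illustration.
cache = kv_cache_bytes(num_layers=64, hidden_size=9216, seq_len=2048, batch_size=64)
weights = weight_bytes(num_params=66_000_000_000)

print(f"KV cache: {cache / GIB:.0f} GiB")
print(f"Weights:  {weights / GIB:.0f} GiB")
print(f"Ratio:    {cache / weights:.2f}x")
```

Under these assumptions the cache is more than twice the size of the weights, and the ratio scales linearly with both batch size and sequence length.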
Main Contribution
Persistence of Importance hypothesis: pivotal tokens that were important once stay important later; empirical support across OPT layers and datasets.
SCISSORHANDS: a lightweight, test-time algorithm that enforces a fixed KV-cache budget by tracking attention-based importance and dropping low-impact tokens.
Theoretical bounds that connect power-law attention distributions to bounded approximation error when dropping low-importance tokens.
Empirical validation: up to 5× KV-cache reduction on OPT models with minimal accuracy loss and compatibility with 4-bit quantization.
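The idea of a fixed KV-cache budget can be sketched as a small eviction policy. This is a simplified toy, not the authors' algorithm: it scores each cached position by the cumulative attention it has received (an assumed scoring rule), always protects the most recent positions, and drops the lowest-scoring older positions once the budget is exceeded. The class name, the `recent_window` parameter, and the attention pattern in the usage section are all hypothetical.

```python
class BudgetedKVCache:
    """Toy per-head cache index that keeps at most `budget` token positions."""

    def __init__(self, budget, recent_window):
        if recent_window >= budget:
            raise ValueError("recent_window must be smaller than budget")
        self.budget = budget
        self.recent_window = recent_window
        self.positions = []  # cached token positions, oldest first
        self.scores = {}     # position -> cumulative attention received

    def append(self, position):
        """Register a newly generated token position."""
        self.positions.append(position)
        self.scores[position] = 0.0

    def observe(self, attention):
        """Accumulate one step of attention weights (position -> weight)."""
        for pos, weight in attention.items():
            if pos in self.scores:
                self.scores[pos] += weight

    def evict(self):
        """Drop the lowest-scoring unprotected positions until within budget."""
        excess = len(self.positions) - self.budget
        if excess <= 0:
            return []
        protected = set(self.positions[-self.recent_window:]) if self.recent_window else set()
        candidates = [p for p in self.positions if p not in protected]
        candidates.sort(key=lambda p: (self.scores[p], p))
        dropped = candidates[:excess]
        dropped_set = set(dropped)
        self.positions = [p for p in self.positions if p not in dropped_set]
        for p in dropped:
            del self.scores[p]
        return dropped


# Usage with a made-up attention pattern in which position 0 is "pivotal".
cache = BudgetedKVCache(budget=4, recent_window=2)
for pos in range(6):
    cache.append(pos)
    others = max(len(cache.positions) - 1, 1)
    step_attention = {p: (0.7 if p == 0 else 0.3 / others) for p in cache.positions}
    cache.observe(step_attention)
    cache.evict()

print(cache.positions)  # [0, 1, 4, 5]
```

In a real system the same index would decide which key and value tensors to keep per head. The toy tracks positions only, so the eviction logic stays visible.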
Key Findings
KV cache can be several times larger than model weights and becomes the memory bottleneck.
A small subset of tokens from the early part of a sequence accounts for almost all later attention, giving a high persistence ratio (>95%).
SCISSORHANDS reduces KV-cache memory up to 5× with no clear accuracy drop on evaluated models/tasks.
SCISSORHANDS is compatible with low-bit quantization without compounding errors.
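One plausible way to measure the overlap of pivotal tokens is sketched below. It treats a position as pivotal if it receives attention above a threshold in any step of a window, then reports the fraction of late-window pivotal positions that were already pivotal in the early window. The threshold, the split point, and the exact ratio are assumptions for illustration, and the paper's definition may differ in detail.

```python
def pivotal_positions(attention_rows, threshold):
    """Positions whose attention weight exceeds `threshold` in any row."""
    return {
        pos
        for row in attention_rows
        for pos, weight in row.items()
        if weight > threshold
    }


def persistence_ratio(attention_rows, split, threshold):
    """Share of late pivotal positions (among those before `split`)
    that were already pivotal during the first `split` steps."""
    early = pivotal_positions(attention_rows[:split], threshold)
    late = {
        pos
        for pos in pivotal_positions(attention_rows[split:], threshold)
        if pos < split
    }
    if not late:
        return 0.0
    return len(early & late) / len(late)


# Made-up attention rows: row i maps each visible position to its weight.
rows = [
    {0: 1.0},
    {0: 0.8, 1: 0.2},
    {0: 0.6, 1: 0.1, 2: 0.3},
    {0: 0.5, 1: 0.3, 2: 0.05, 3: 0.15},
]

print(persistence_ratio(rows, split=2, threshold=0.25))  # 0.5
```

Here position 0 is pivotal in both windows while position 1 becomes pivotal only later, so half of the late pivotal set persists from the early window.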
Results
KV cache reduction
Persistence ratio (overlap of pivotal tokens)
Perplexity maintenance
Accuracy
Who Should Care
What To Try In 7 Days
Prototype SCISSORHANDS on a small OPT or LLaMA model to measure KV-cache memory vs accuracy trade-offs.
Run lm-eval-harness few-shot tasks to compare downstream accuracy at target compression ratios.
Combine SCISSORHANDS with existing 4-bit quantization to multiply memory savings and re-evaluate throughput.
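When planning these experiments, a rough estimate of the batch-size headroom is useful before touching a GPU. The sketch below assumes that token dropping and quantization shrink the per-sequence KV cache by independent integer factors. It ignores activation memory, quantization metadata, and allocator fragmentation, so treat the result as an upper bound. The 80 GiB device, 26 GiB of weights, and 1 GiB of KV cache per sequence are hypothetical inputs.

```python
GIB = 2 ** 30


def max_batch_size(gpu_bytes, weight_bytes, kv_bytes_per_sequence,
                   token_factor=1, quant_factor=1):
    """Upper bound on batch size given memory left after loading the weights.

    token_factor: KV-cache reduction from dropping tokens (e.g. 5 for 5x).
    quant_factor: reduction from lower precision (e.g. 4 for 16-bit to 4-bit).
    """
    free = gpu_bytes - weight_bytes
    if free <= 0:
        return 0
    # Integer arithmetic avoids floating-point rounding in the division.
    return (free * token_factor * quant_factor) // kv_bytes_per_sequence


baseline = max_batch_size(80 * GIB, 26 * GIB, 1 * GIB)
pruned = max_batch_size(80 * GIB, 26 * GIB, 1 * GIB, token_factor=5)
pruned_and_quantized = max_batch_size(80 * GIB, 26 * GIB, 1 * GIB,
                                      token_factor=5, quant_factor=4)

print(baseline, pruned, pruned_and_quantized)  # 54 270 1080
```

Measured throughput will be lower than this bound suggests, which is why the re-evaluation step above matters.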
Optimization Features
Token Efficiency
- budgeted token retention (drop low-importance tokens)
Infra Optimization
- enables larger batch size on fixed GPU memory
System Optimization
- fixed-memory KV cache management
- head-wise budget allocation
Inference Optimization
- KV Cache Optimization
- Context Compression
- Efficient Inference
Reproducibility
Data Available
Open Source Status
- no
Risks & Boundaries
Limitations
- Experiments cover models only up to OPT-66B; behavior on the largest models (e.g., OPT-175B, GPT-4 scale) is untested.
- Algorithm adds occasional extra attention passes to collect importance, creating short spikes in compute.
- Method relies on the model's learned attention patterns; randomly initialized or differently trained models may not show persistence.
- Budget allocation per head/layer and the hyperparameters (w, r, m) require tuning; the authors rely on a rule of thumb rather than a principled selection procedure.
When Not To Use
- If KV-cache memory is not the deployment bottleneck (e.g., CPU offload is already in use).
- On models or tasks where attention is diffuse and no clear pivotal tokens exist (e.g., random or unusual training regimes).
- When strict bit-for-bit reproducibility of generation is required.
Failure Modes
- Dropping a small set of truly important tokens causes local attention errors that can cascade in long runs.
- Lower persistence ratio in later layers may require larger budgets there; wrong allocation hurts quality.
- Accumulated approximation error over very long generation can degrade outputs beyond tested sequence lengths.
Core Entities
Models
- OPT-6.7B
- OPT-13B
- OPT-66B
- OPT-175B
- LLaMA-65B
- BLOOM
Metrics
- KV cache size (GB)
- Compression factor (×)
- Perplexity
- Accuracy
- Persistence ratio (>95%)
Datasets
- C4
- OpenBookQA
- WikiText
- Hellaswag
- MathQA
- PIQA
- Winogrande
Benchmarks
- language modeling (perplexity)
- downstream task accuracy (few-shot)

