LESS: add a tiny constant low-rank state to sparse KV caches and recover much of full-cache quality while cutting memory

February 14, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

LESS shows consistent gains across multiple models, datasets, and sparse policies with concrete speed and memory benefits, but it does not fully match a full KV cache and results vary by task and sparsity level.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, Beidi Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LESS cuts KV-cache memory needs with tiny extra state while restoring much of full-cache quality, lowering GPU costs and enabling larger batches or longer sequences in production.

Summary TLDR

LESS adds a tiny, constant-sized low-rank state to any eviction-based KV cache. It learns to approximate the attention residual left by sparse caching, so evicted tokens still influence future decoding. Results on Llama 2 and Falcon show that LESS recovers a substantial fraction of full-cache quality (e.g., ~41% of the ROUGE-1 drop recovered on CNN/DailyMail for Falcon 7B), uses negligible extra memory (≈4 token-equivalents), trains cheaply per layer on a single GPU, and reduces end-to-end latency while improving throughput compared to full caching.

Problem Statement

KV caches store past keys/values to avoid recomputing attention during decoding, but the cache memory can exceed model memory and block deployment (e.g., 64 GB KV cache vs 26 GB model). Existing sparse eviction policies prune many KV pairs and save memory but can break tasks that later need the evicted tokens. We need a cheap, easy-to-integrate cache design that keeps memory low yet preserves the ability to recall discarded tokens later.
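To make the memory pressure concrete, here is a minimal sketch of the usual KV-cache size arithmetic. The configuration values are illustrative assumptions for a 13B-class model with standard multi-head attention and 16-bit values; they are not taken from the paper and are not meant to reproduce the 64 GB figure exactly.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, batch_size, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and token.

    Assumes standard multi-head attention, where keys and values each
    have hidden_size entries per token per layer.
    """
    # Factor 2: one key vector and one value vector per token per layer.
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_value


# Hypothetical configuration (assumed values, not from the paper).
cache = kv_cache_bytes(num_layers=40, hidden_size=5120, seq_len=4096, batch_size=16)
cache_gib = cache / 2**30
print(f"KV cache: {cache_gib:.1f} GiB")
```

The cache grows linearly with both batch size and sequence length, while the model weights stay fixed, which is why the cache can end up larger than the model itself.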

Main Contribution

LESS: a method that pairs any eviction-based sparse KV cache with a tiny constant-sized low-rank state that accumulates information from evicted KV pairs.

Training protocol that fits per-attention-layer only (no model weight updates), so kernels can be trained on a single GPU and then used without retraining the base model.
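The idea of a constant-sized state that absorbs evicted KV pairs can be sketched in a few lines of NumPy. This is a toy illustration in the style of linear attention, not the authors' implementation: `LowRankState`, `attend`, and `toy_feature_map` are hypothetical names, and the random positive feature map stands in for the learned per-layer kernels.

```python
import numpy as np


def toy_feature_map(x, W):
    # Placeholder for a learned kernel (assumption). A positive map keeps
    # the attention denominator positive.
    return np.exp(W @ x / np.sqrt(x.shape[-1]))


class LowRankState:
    """Constant-size accumulator for evicted key/value pairs (illustrative)."""

    def __init__(self, rank, head_dim):
        self.H = np.zeros((rank, head_dim))  # running sum of outer(feature(k), v)
        self.z = np.zeros(rank)              # running sum of feature(k)

    def absorb(self, k_feat, v):
        # Called once per evicted token; the state size never changes.
        self.H += np.outer(k_feat, v)
        self.z += k_feat


def attend(q, q_feat, K_kept, V_kept, state):
    """Exact attention over kept tokens plus the low-rank contribution."""
    d = q.shape[-1]
    w = np.exp(K_kept @ q / np.sqrt(d))    # unnormalized weights (no max-shift; toy scale only)
    numer = w @ V_kept + q_feat @ state.H  # sparse part + low-rank part
    denom = w.sum() + q_feat @ state.z
    return numer / denom


rng = np.random.default_rng(0)
rank, head_dim = 8, 16
W_q = rng.normal(size=(rank, head_dim))
W_k = rng.normal(size=(rank, head_dim))
K = rng.normal(size=(6, head_dim))
V = rng.normal(size=(6, head_dim))
q = rng.normal(size=head_dim)

state = LowRankState(rank, head_dim)
for k, v in zip(K[:4], V[:4]):  # pretend the eviction policy dropped the first 4 tokens
    state.absorb(toy_feature_map(k, W_k), v)

out = attend(q, toy_feature_map(q, W_q), K[4:], V[4:], state)
```

With an empty state the function reduces to ordinary softmax attention over the kept tokens, so the low-rank term acts purely as a correction for what was evicted.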

Key Findings

LESS recovers a substantial fraction of quality lost by sparse caching on summarization.

Numbers: 41.4% of ROUGE-1 degradation recovered (Falcon 7B, CNN/DailyMail)

Practical Use: If your model degrades with a sparse KV policy, adding LESS can recover a large part of that drop without storing many extra tokens.

Evidence Ref: Abstract; Section 4.2; Table 4

LESS reduces language-model perplexity vs sparse-only baselines using very small extra state.

Numbers: WikiText PPL: Full 8.79, Baseline 13.33, LESS (2%) 10.745 (Llama 2 7B, 2% H2O)

Practical Use: On next-token tasks, LESS often improves predictive quality more than spending the same memory on extra cached KV pairs.

Evidence Ref: Table 2 (WikiText)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| WikiText word perplexity (Llama 2 7B, 2% H2O) | LESS 10.745 | Full cache 8.791; Baseline 13.333 | LESS reduces PPL vs Baseline by 19.4% (13.333 → 10.745) | WikiText | Table 2 (word PPL) | Table 2 |
| PG-19 word perplexity (Llama 2 7B, 2% H2O) | LESS 32.157 | Full cache 23.787; Baseline 37.013 | LESS reduces PPL vs Baseline by 13.1% (37.013 → 32.157) | PG-19 | Table 2 (word PPL) | Table 2 |
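The Delta values can be rechecked from the perplexities in the table with a short script. The second helper, `gap_recovered`, is an additional derived figure (the share of the baseline-to-full-cache gap that LESS closes); it is computed here from the same numbers and may not match the exact definition of "recovered" used in the paper.

```python
def relative_reduction(baseline, method):
    """Fractional drop in perplexity relative to the sparse baseline."""
    return (baseline - method) / baseline


def gap_recovered(full, baseline, method):
    """Share of the baseline-to-full-cache gap closed by the method."""
    return (baseline - method) / (baseline - full)


# Word perplexities copied from the Results table (Llama 2 7B, 2% H2O).
rows = {
    "WikiText": {"full": 8.791, "baseline": 13.333, "less": 10.745},
    "PG-19": {"full": 23.787, "baseline": 37.013, "less": 32.157},
}

for name, r in rows.items():
    red = relative_reduction(r["baseline"], r["less"])
    gap = gap_recovered(r["full"], r["baseline"], r["less"])
    print(f"{name}: {red:.1%} lower PPL than baseline, {gap:.1%} of gap closed")
```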

What To Try In 7 Days

Run the LESS repo on a small instance of your LLM (use the provided code link) and compare memory, latency, and quality to your current sparse cache.

Train the per-layer kernels on a few hundred sequences sampled from your workload (single GPU) to test transfer and quality gains.

Benchmark end-to-end throughput/latency with your prompt and generation lengths to measure real deployment impact.
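For the benchmarking step, a small timing harness along these lines may be enough to get started. It is a sketch under assumptions: `generate_fn` is a hypothetical callable wrapping your own model (full cache, sparse cache, or sparse cache with LESS) that returns one list of generated tokens per prompt; it is not part of the LESS repository.

```python
import statistics
import time


def benchmark(generate_fn, prompts, max_new_tokens, repeats=3):
    """Time generate_fn(prompts, max_new_tokens); report median latency and throughput."""
    latencies = []
    outputs = []
    for _ in range(repeats):
        start = time.perf_counter()
        outputs = generate_fn(prompts, max_new_tokens)
        # On a GPU, synchronize the device inside generate_fn before it
        # returns, otherwise the measured time can be too short.
        latencies.append(time.perf_counter() - start)
    new_tokens = sum(len(o) for o in outputs)
    latency = statistics.median(latencies)
    return {
        "latency_s": latency,
        "new_tokens": new_tokens,
        "tokens_per_s": new_tokens / latency,
    }


def fake_generate(prompts, max_new_tokens):
    # Stand-in for a real model call, used only to exercise the harness.
    time.sleep(0.01)
    return [[0] * max_new_tokens for _ in prompts]


result = benchmark(fake_generate, prompts=["a", "b"], max_new_tokens=5)
print(result)
```

Running the same harness with identical prompts and generation lengths for each cache variant keeps the comparison fair; the median is used so that a single slow warm-up run does not dominate.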

Optimization Features

Token Efficiency
Extra state equals ~4 token-equivalents (R=8) in experiments
Recovers performance better than spending same memory on extra cached tokens
Infra Optimization
Small kernels and per-layer training fit on one GPU; no multi-GPU finetuning required
System Optimization
Reduces end-to-end latency by 1.1–1.3× relative to a full cache and increases throughput by up to 1.7× in tests
Training Optimization
Per-layer kernel training only (no base model updates)
Trains on a single GPU; parallelizable across layers
Inference Optimization
Constant-sized low-rank state (memory independent of sequence length)
Synthesizes sparse eviction policy outputs with low-rank approximation
Replaces storing many evicted KV pairs with recursive low-rank updates
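The "~4 token-equivalents" figure for R=8 can be sanity-checked with simple arithmetic. The sketch below assumes the per-head state consists of an R×D matrix plus an R-dimensional normalization vector, and that one cached token costs one key and one value vector of dimension D; the head dimension of 128 is an assumed value for illustration.

```python
def state_token_equivalents(rank, head_dim):
    """Size of the low-rank state, measured in cached KV pairs."""
    state_values = rank * head_dim + rank  # rank x head_dim matrix plus a rank-vector
    token_values = 2 * head_dim            # one key vector and one value vector
    return state_values / token_values


equiv = state_token_equivalents(rank=8, head_dim=128)
print(f"State size: {equiv:.2f} token-equivalents")
```

Under these assumptions the ratio is roughly R/2 regardless of the head dimension, and it does not depend on sequence length at all.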

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

C4 (used for kernel training), WikiText, PG-19, CNN/DailyMail, XSum, MultiNews

Risks & Boundaries

Limitations

LESS does not fully recover full-cache performance in all settings; gaps remain in many tasks.

Benefits vary by sparse policy, model, and task, and transfer across sparsity levels is sometimes imperfect.

When Not To Use

When you can afford a full KV cache and require exact token recall for safety-critical tasks.

When a sparse policy already matches full-cache quality for your workload.

Failure Modes

Still far from full-cache performance on some benchmarks and sparsity settings.

Mismatch between training sparsity and test-time sparsity can reduce effectiveness.

Core Entities

Models

Llama 2 7B, Llama 2 13B, Falcon 7B

Metrics

word perplexity, ROUGE-1, ROUGE-2, ROUGE-L, latency (s), throughput (tokens/s), Hellinger distance

Datasets

WikiText, PG-19, CNN/DailyMail, XSum, MultiNews, C4

Context Entities

Models

Llama 2 7B, Llama 2 13B, Falcon 7B

Metrics

word perplexity, ROUGE-1, latency, throughput

Datasets

WikiText, PG-19, CNN/DailyMail, XSum, MultiNews, C4