LESS: add a tiny constant low-rank state to sparse KV caches and recover much of full-cache quality while cutting memory

Overview

Decision Snapshot: Ready For Pilot

LESS shows consistent gains across multiple models, datasets, and sparse policies with concrete speed and memory benefits, but it does not fully match a full KV cache and results vary by task and sparsity level.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, Beidi Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LESS cuts KV-cache memory needs with tiny extra state while restoring much of full-cache quality, lowering GPU costs and enabling larger batches or longer sequences in production.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

LESS adds a tiny, constant-sized low-rank state to any eviction-based KV cache. The state learns to approximate the attention residual left by sparse caching, so evicted tokens still influence future decoding. Results on Llama 2 and Falcon show that LESS recovers a substantial fraction of full-cache quality (e.g., ~41% of the ROUGE-1 drop recovered on CNN/DailyMail for Falcon 7B), uses negligible extra memory (≈4 token-equivalents), trains cheaply per layer on a single GPU, and reduces end-to-end latency while improving throughput compared to full caching.

Problem Statement

KV caches store past keys/values to avoid recomputing attention during decoding, but the cache memory can exceed model memory and block deployment (e.g., 64 GB KV cache vs 26 GB model). Existing sparse eviction policies prune many KV pairs and save memory but can break tasks that later need the evicted tokens. We need a cheap, easy-to-integrate cache design that keeps memory low yet preserves the ability to recall discarded tokens later.
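To make the memory pressure concrete, here is a back-of-envelope sketch of KV cache size. The configuration (40 layers, 40 heads, head dimension 128, fp16, batch 64, 1,024 tokens) is an assumption chosen for illustration and is not claimed to be the setup behind the 64 GB figure above; `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer, head, and token."""
    per_token = 2 * layers * heads * head_dim * bytes_per_elem  # 2 = key + value
    return per_token * seq_len * batch

# Assumed 13B-class configuration in fp16 (illustrative only).
full = kv_cache_bytes(layers=40, heads=40, head_dim=128, seq_len=1024, batch=64)

# A sparse policy that keeps roughly 2% of tokens shrinks the cache proportionally.
kept_tokens = int(1024 * 0.02)
sparse = kv_cache_bytes(layers=40, heads=40, head_dim=128, seq_len=kept_tokens, batch=64)

print(f"full cache:   {full / 2**30:.1f} GiB")
print(f"sparse cache: {sparse / 2**30:.2f} GiB")
```

The cache grows linearly in both batch size and sequence length, which is why eviction helps so much and why a constant-size addition is attractive.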

Main Contribution

LESS: a method that pairs any eviction-based sparse KV cache with a tiny constant-sized low-rank state that accumulates information from evicted KV pairs.

Training protocol that fits per-attention-layer only (no model weight updates), so kernels can be trained on a single GPU and then used without retraining the base model.
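The idea of a constant-sized state that accumulates evicted KV pairs can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `phi` is a fixed softplus feature map standing in for the learned kernels, the rank equals the key dimension because `phi` is elementwise, and `LowRankState` and `attend` are hypothetical names.

```python
import math

def phi(x):
    # Hypothetical positive feature map (softplus); LESS learns its kernels instead.
    return [math.log1p(math.exp(v)) for v in x]

class LowRankState:
    """Constant-size summary of evicted KV pairs: H is rank x value_dim, z is rank."""
    def __init__(self, rank, value_dim):
        self.H = [[0.0] * value_dim for _ in range(rank)]
        self.z = [0.0] * rank

    def absorb(self, key, value):
        # Fold one evicted (key, value) pair into the state; memory does not grow.
        for r, fr in enumerate(phi(key)):
            self.z[r] += fr
            for d, vd in enumerate(value):
                self.H[r][d] += fr * vd

def attend(query, keys, values, state):
    """Exact attention over cached pairs plus a low-rank estimate of evicted pairs."""
    scale = 1.0 / math.sqrt(len(query))
    w = [math.exp(scale * sum(q * k for q, k in zip(query, key))) for key in keys]
    fq = phi(query)
    denom = sum(w) + sum(a * b for a, b in zip(fq, state.z))
    numer = [
        sum(wi * v[d] for wi, v in zip(w, values))
        + sum(fq[r] * state.H[r][d] for r in range(len(fq)))
        for d in range(len(state.H[0]))
    ]
    return [n / denom for n in numer]

state = LowRankState(rank=2, value_dim=2)
state.absorb(key=[0.5, -1.0], value=[1.0, 2.0])  # a pair the sparse policy evicted
out = attend([0.2, 0.1], keys=[[1.0, 0.0]], values=[[1.0, 2.0]], state=state)
```

Because the cached and evicted values are identical in this toy call, the output equals that shared value, which is a quick sanity check that the two terms share one normalizer. With an empty state, `attend` reduces to ordinary softmax attention over the cached pairs.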

Key Findings

LESS recovers a substantial fraction of quality lost by sparse caching on summarization.

Numbers: 41.4% of ROUGE-1 degradation recovered (Falcon 7B, CNN/DailyMail)

Practical Use: If your model degrades with a sparse KV policy, adding LESS can recover a large part of that drop without storing many extra tokens.

Evidence Ref: Abstract; Section 4.2; Table 4

LESS reduces language-model perplexity vs sparse-only baselines using very small extra state.

Numbers: WikiText PPL: Full 8.79, Baseline 13.33, LESS (2%) 10.745 (Llama 2 7B, 2% H2O)

Practical Use: On next-token tasks, LESS often improves predictive quality more than spending the same memory on extra cached KV pairs.

Evidence Ref: Table 2 (WikiText)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
WikiText word perplexity (Llama 2 7B, 2% H2O)	LESS 10.745	Full cache 8.791; Baseline 13.333	LESS reduces PPL vs Baseline by 19.4% (13.333→10.745)	WikiText	Table 2 (word PPL)	Table 2
PG-19 word perplexity (Llama 2 7B, 2% H2O)	LESS 32.157	Full cache 23.787; Baseline 37.013	LESS reduces PPL vs Baseline by 13.1% (37.013→32.157)	PG-19	Table 2 (word PPL)	Table 2
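The Delta column above is a reduction relative to the sparse baseline. A complementary view is the fraction of the baseline-to-full-cache gap that LESS closes, which is similar in spirit to the "degradation recovered" figure in Key Findings. The sketch below computes both from the table values; the gap-recovered percentages are derived here for illustration and are not reported in the source, and both function names are hypothetical.

```python
def relative_reduction(baseline, value):
    """Fractional perplexity reduction versus the sparse baseline."""
    return (baseline - value) / baseline

def gap_recovered(full, baseline, value):
    """Fraction of the baseline-to-full-cache gap that is closed."""
    return (baseline - value) / (baseline - full)

# (full cache, sparse baseline, LESS) word perplexities from the table above.
rows = {
    "WikiText": (8.791, 13.333, 10.745),
    "PG-19": (23.787, 37.013, 32.157),
}

for name, (full, baseline, less) in rows.items():
    print(
        f"{name}: {relative_reduction(baseline, less):.1%} lower than baseline, "
        f"{gap_recovered(full, baseline, less):.1%} of the gap to full cache closed"
    )
```

Reporting both numbers guards against over-reading either one: a large reduction versus the baseline can still leave a sizable gap to the full cache.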

What To Try In 7 Days

Run the LESS repo on a small instance of your LLM (use the provided code link) and compare memory, latency, and quality to your current sparse cache.

Train the per-layer kernels on a few hundred sequences sampled from your workload (single GPU) to test transfer and quality gains.

Benchmark end-to-end throughput/latency with your prompt and generation lengths to measure real deployment impact.
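For the benchmarking step, a small timing harness is usually enough to get started. The sketch below assumes a hypothetical `generate(prompts, max_new_tokens)` callable that returns one list of tokens per prompt; it is not part of the LESS repo, and `fake_generate` is only a stand-in so the example runs without a model.

```python
import time

def benchmark(generate, prompts, max_new_tokens, repeats=3):
    """Time a generation callable; returns (mean latency in s, tokens per s)."""
    total_time, total_tokens = 0.0, 0
    for _ in range(repeats):
        start = time.perf_counter()
        outputs = generate(prompts, max_new_tokens)  # expected: list of token lists
        total_time += time.perf_counter() - start
        total_tokens += sum(len(o) for o in outputs)
    mean_latency = total_time / repeats
    throughput = total_tokens / total_time if total_time > 0 else float("inf")
    return mean_latency, throughput

def fake_generate(prompts, max_new_tokens):
    # Stand-in for a real model call (assumption for illustration).
    time.sleep(0.001)
    return [[0] * max_new_tokens for _ in prompts]

latency, throughput = benchmark(fake_generate, ["prompt a", "prompt b"], max_new_tokens=16)
print(f"mean latency: {latency:.4f} s, throughput: {throughput:.0f} tokens/s")
```

Run the same harness against the full cache, your current sparse cache, and the sparse cache with LESS, keeping prompt and generation lengths fixed so the comparison reflects your workload.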

Optimization Features

Token Efficiency

Extra state equals ~4 token-equivalents (R=8) in experiments

Recovers performance better than spending same memory on extra cached tokens
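The "~4 token-equivalents" figure can be checked with simple arithmetic. The sketch assumes the state per attention head is a rank-by-dimension matrix plus a rank-length normalizer vector, and that one cached token costs one key and one value vector; the head dimension of 128 and the function name are also assumptions.

```python
def state_token_equivalents(rank, head_dim):
    """Size of the low-rank state, measured in cached-token equivalents."""
    state_floats = rank * head_dim + rank  # matrix (rank x head_dim) plus normalizer (rank)
    token_floats = 2 * head_dim            # one cached key plus one cached value
    return state_floats / token_floats

print(state_token_equivalents(rank=8, head_dim=128))  # 4.03125
```

Under these assumptions the cost is about R/2 tokens regardless of sequence length, so R=8 lands at roughly 4.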

Infra Optimization

Small kernels and per-layer training fit on one GPU; no multi-GPU finetuning required

System Optimization

Reduces end-to-end latency vs full cache (1.1–1.3×) and increases throughput up to 1.7× in tests

Training Optimization

Per-layer kernel training only (no base model updates)

Trains on a single GPU; parallelizable across layers

Inference Optimization

Constant-sized low-rank state (memory independent of sequence length)

Synthesizes sparse eviction policy outputs with low-rank approximation

Replaces storing many evicted KV pairs with recursive low-rank updates

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/hdong920/LESS

Data URLs

C4 (used for kernel training), WikiText, PG-19, CNN/DailyMail, XSum, MultiNews

Risks & Boundaries

Limitations

LESS does not fully recover full-cache performance in all settings; gaps remain in many tasks.

Benefits vary by sparse policy, model, and task, and transfer across sparsity levels is sometimes imperfect.

When Not To Use

When you can afford a full KV cache and require exact token recall for safety-critical tasks.

When a sparse policy already matches full-cache quality for your workload.

Failure Modes

Still far from full-cache performance on some benchmarks and sparsity settings.

Mismatch between training sparsity and test-time sparsity can reduce effectiveness.

Core Entities

Models

Llama 2 7B, Llama 2 13B, Falcon 7B

Metrics

word perplexity, ROUGE-1, ROUGE-2, ROUGE-L, latency (s), throughput (tokens/s), Hellinger distance

Datasets

WikiText, PG-19, CNN/DailyMail, XSum, MultiNews, C4

Context Entities

Models

Llama 2 7B, Llama 2 13B, Falcon 7B

Metrics

word perplexity, ROUGE-1, latency, throughput

Datasets

WikiText, PG-19, CNN/DailyMail, XSum, MultiNews, C4

