LESS: add a tiny constant low-rank state to sparse KV caches and recover much of full-cache quality while cutting memory

February 14, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

1

Authors

Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, Beidi Chen

Links

Abstract / PDF

Why It Matters For Business

LESS cuts KV-cache memory needs with tiny extra state while restoring much of full-cache quality, lowering GPU costs and enabling larger batches or longer sequences in production.

Summary TLDR

LESS adds a tiny, constant-sized low-rank state to any eviction-based KV cache. The state is trained to approximate the attention residual left by sparse caching, so evicted tokens still influence future decoding. Results on Llama 2 and Falcon show that LESS recovers a substantial fraction of full-cache quality (e.g., about 41% of the ROUGE-1 drop recovered on CNN/DailyMail for Falcon 7B), uses negligible extra memory (≈4 token-equivalents), trains cheaply per layer on a single GPU, and reduces end-to-end latency while improving throughput compared to full caching.

Problem Statement

KV caches store past keys/values to avoid recomputing attention during decoding, but the cache memory can exceed model memory and block deployment (e.g., 64 GB KV cache vs 26 GB model). Existing sparse eviction policies prune many KV pairs and save memory but can break tasks that later need the evicted tokens. We need a cheap, easy-to-integrate cache design that keeps memory low yet preserves the ability to recall discarded tokens later.
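To see why the cache can outgrow the model, the arithmetic can be sketched as below. The function name and the workload (batch 20, 4096 tokens) are hypothetical and chosen only for illustration; they are not the configuration behind the 64 GB figure. The layer, head, and head-dimension values are the standard Llama 2 13B shape, and FP16 storage (2 bytes per element) is assumed.

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer, head, and token."""
    # Factor 2 accounts for storing both a key and a value per token.
    return 2 * n_layers * n_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Llama 2 13B shape: 40 layers, 40 heads, head dimension 128.
per_token = kv_cache_bytes(40, 40, 128, seq_len=1, batch_size=1)

# Hypothetical workload (assumption): 20 sequences of 4096 tokens each.
cache = kv_cache_bytes(40, 40, 128, seq_len=4096, batch_size=20)

# Weights at FP16: roughly 13e9 parameters * 2 bytes.
weights = 13e9 * 2

print(f"KV cache per token: {per_token / 1e6:.2f} MB")
print(f"KV cache total:     {cache / 1e9:.1f} GB")
print(f"Model weights:      {weights / 1e9:.1f} GB")
```

The cache grows linearly with both batch size and sequence length, while the weights stay fixed, which is why long or heavily batched decoding becomes cache-bound.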

Main Contribution

LESS: a method that pairs any eviction-based sparse KV cache with a tiny constant-sized low-rank state that accumulates information from evicted KV pairs.

Training protocol that fits per-attention-layer only (no model weight updates), so kernels can be trained on a single GPU and then used without retraining the base model.

Empirical evaluation on Llama 2 and Falcon shows substantial quality recovery versus sparse-only baselines, with low memory overhead and better latency/throughput than full caching.
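One way to write down the mechanism described above is the following sketch. The notation is an assumption introduced here, not taken from this summary: S_t is the set of tokens the sparse policy still keeps at step t, (k_e, v_e) is a pair being evicted, φ and ψ are the small learned kernels mapping a D-dimensional key or query to R dimensions, and H_t (R × D) and z_t (R × 1) form the constant-sized low-rank state. Vectors are treated as rows.

```latex
% Sketch under assumed notation; phi, psi are the learned kernels (R^D -> R^R)
\begin{aligned}
a_t &= \frac{\sum_{i \in S_t} \exp\!\left(q_t k_i^\top / \sqrt{D}\right) v_i \;+\; \psi(q_t)\, H_t}
            {\sum_{i \in S_t} \exp\!\left(q_t k_i^\top / \sqrt{D}\right) \;+\; \psi(q_t)\, z_t} \\[4pt]
H_{t+1} &= H_t + \phi(k_e)^\top v_e, \qquad z_{t+1} = z_t + \phi(k_e)^\top
\end{aligned}
```

With nothing evicted (H_t = 0, z_t = 0) the expression reduces to ordinary softmax attention over the kept tokens, and the state size never depends on how many tokens have been evicted.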

Key Findings

LESS recovers a substantial fraction of quality lost by sparse caching on summarization.

Numbers: 41.4% of ROUGE-1 degradation recovered (Falcon 7B, CNN/DailyMail)

LESS reduces language-model perplexity vs sparse-only baselines using very small extra state.

Numbers: WikiText PPL: Full 8.79, Baseline 13.33, LESS (2%) 10.745 (Llama 2 7B, 2% H2O)

LESS uses nearly constant and tiny extra memory equivalent to ~4 tokens in experiments.

Numbers: Extra storage ≈ 4 tokens (kernel R=8)

LESS speeds end-to-end generation vs full caching and enables larger batches.

Numbers: Latency reduced 1.1–1.3×; throughput increased up to 1.7× (Llama 2 7B/13B on A100 FP16)

LESS is cheap to train and integrates without changing model weights.

Numbers: Per-layer training on a single GPU; only tiny MLP kernels added (<2% params for Llama 2 13B)

Results

WikiText word perplexity (Llama 2 7B, 2% H2O)

Value: LESS 10.745

Baselines: full cache 8.791; sparse-only baseline 13.333

PG-19 word perplexity (Llama 2 7B, 2% H2O)

Value: LESS 32.157

Baselines: full cache 23.787; sparse-only baseline 37.013
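A simple way to read the two perplexity results above is as the fraction of the sparse-cache degradation that LESS closes. The helper below is a hypothetical convenience, not something from the paper; it only rearranges the three numbers reported for each benchmark.

```python
def recovered_fraction(full, sparse, less):
    """Share of the gap between sparse-only and full cache that LESS closes.

    Works for lower-is-better metrics (perplexity) and higher-is-better
    metrics (ROUGE) alike, because the sign cancels in the ratio.
    """
    return (sparse - less) / (sparse - full)


# Numbers as reported above (Llama 2 7B, 2% H2O).
wikitext = recovered_fraction(full=8.791, sparse=13.333, less=10.745)
pg19 = recovered_fraction(full=23.787, sparse=37.013, less=32.157)

print(f"WikiText: {wikitext:.1%} of the perplexity gap recovered")
print(f"PG-19:    {pg19:.1%} of the perplexity gap recovered")
```

On these rows the ratio comes out to roughly 57% for WikiText and 37% for PG-19, which matches the qualitative claim that LESS recovers much, but not all, of the lost quality.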

ROUGE-1 (CNN/DailyMail, Llama 2 13B, 408 tokens ~10%)

Value: LESS 25.27

Baselines: full cache 27.55; sparse-only baseline 23.57

ROUGE-1 (CNN/DailyMail, Falcon 7B, 408 tokens)

Value: LESS 23.00 (5% H2O)

Baselines: full cache 25.92; sparse-only baseline 21.26

Latency and throughput (Llama 2 7B on A100, 2048+2048, batch 24)

Value: LESS latency 95.1 s; throughput 516.9 tokens/s

Baseline: full cache latency 116.7 s; throughput 421.2 tokens/s
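The latency and throughput figures above are consistent with each other under one reading, sketched here as an assumption: "2048+2048" means 2048 prompt tokens plus 2048 generated tokens, and throughput counts generated tokens only across the batch of 24.

```python
def throughput_tokens_per_s(batch_size, generated_tokens, latency_s):
    """Generated tokens per second across the whole batch."""
    return batch_size * generated_tokens / latency_s


BATCH, GEN_TOKENS = 24, 2048  # assumed reading of "2048+2048, batch 24"

less_tps = throughput_tokens_per_s(BATCH, GEN_TOKENS, latency_s=95.1)
full_tps = throughput_tokens_per_s(BATCH, GEN_TOKENS, latency_s=116.7)
speedup = 116.7 / 95.1  # latency ratio, full cache vs LESS

print(f"LESS:       {less_tps:.1f} tokens/s")
print(f"Full cache: {full_tps:.1f} tokens/s")
print(f"Speedup:    {speedup:.2f}x")
```

Both recomputed throughputs land within rounding distance of the reported values, and the speedup of about 1.23× sits inside the 1.1–1.3× latency range quoted elsewhere in this summary.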

Who Should Care

What To Try In 7 Days

Run the LESS repo on a small instance of your LLM (use the provided code link) and compare memory, latency, and quality to your current sparse cache.

Train the per-layer kernels on a few hundred sequences sampled from your workload (single GPU) to test transfer and quality gains.

Benchmark end-to-end throughput/latency with your prompt and generation lengths to measure real deployment impact.
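For the benchmarking step, a minimal timing harness might look like the sketch below. The `benchmark` function and the `fake_generate` stand-in are hypothetical; in practice `generate` would wrap your own model call with your real prompt and generation lengths, run once per cache configuration.

```python
import statistics
import time


def benchmark(generate, batch_size, gen_tokens, warmup=1, repeats=3):
    """Time a zero-argument `generate` callable and report median latency/throughput."""
    for _ in range(warmup):
        generate()  # warm-up runs are discarded (allocation, compilation, caches)
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        generate()
        # On a GPU, synchronize the device inside `generate` before returning,
        # otherwise asynchronous kernels make the measured time too short.
        latencies.append(time.perf_counter() - start)
    median = statistics.median(latencies)
    return {"latency_s": median, "tokens_per_s": batch_size * gen_tokens / median}


def fake_generate():
    """Stand-in workload (assumption): sleeps instead of running a model."""
    time.sleep(0.01)


result = benchmark(fake_generate, batch_size=4, gen_tokens=16)
print(result)
```

Running the same harness against full cache, sparse-only cache, and sparse cache plus LESS at identical batch size and lengths gives a like-for-like comparison.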

Optimization Features

Token Efficiency

  • Extra state equals ~4 token-equivalents (R=8) in experiments
  • Recovers performance better than spending the same memory on extra cached tokens
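The "≈4 token-equivalents" figure can be sanity-checked with a short calculation. The accounting below is an assumption about how the state is counted: per attention head, the low-rank state holds an R × D matrix plus an R-dimensional normalizer, while one cached token holds a D-dimensional key and a D-dimensional value. The head dimension of 128 is the Llama 2 value.

```python
def less_state_token_equivalents(rank, head_dim):
    """Size of the low-rank state, expressed in units of one cached KV pair."""
    state_elems = rank * head_dim + rank  # matrix (R x D) plus normalizer (R)
    token_elems = 2 * head_dim            # one key vector plus one value vector
    return state_elems / token_elems


equiv = less_state_token_equivalents(rank=8, head_dim=128)
print(equiv)  # 4.03125
```

Under this accounting the cost is about R/2 tokens regardless of sequence length, so R=8 lands at roughly 4.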

Infra Optimization

  • Small kernels and per-layer training fit on one GPU; no multi-GPU finetuning required

System Optimization

  • Reduces end-to-end latency vs full cache (1.1–1.3×) and increases throughput up to 1.7× in tests

Training Optimization

  • Per-layer kernel training only (no base model updates)
  • Trains on a single GPU; parallelizable across layers

Inference Optimization

  • Constant-sized low-rank state (memory independent of sequence length)
  • Combines the sparse eviction policy's output with a low-rank approximation of the evicted tokens
  • Replaces storing many evicted KV pairs with recursive low-rank updates
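The recursive update in the bullets above can be illustrated with a small, dependency-free sketch. Several pieces are stand-ins and should be read as assumptions: `feature_map` uses a fixed exp(Wx) projection where LESS trains small MLP kernels, a single map is shared by queries and keys, and the eviction rule is a plain sliding window rather than a policy such as H2O. All names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def feature_map(W, x):
    # ASSUMPTION: fixed positive map exp(W x); LESS learns MLP kernels instead.
    return [math.exp(dot(row, x)) for row in W]


class LowRankState:
    """Constant-sized summary of evicted KV pairs: H is R x D, z has length R."""

    def __init__(self, rank, dim):
        self.H = [[0.0] * dim for _ in range(rank)]
        self.z = [0.0] * rank

    def absorb(self, W, k, v):
        # Recursive update: fold one evicted (key, value) pair into the state.
        for r, f in enumerate(feature_map(W, k)):
            self.z[r] += f
            for d, vd in enumerate(v):
                self.H[r][d] += f * vd


def attend(q, kept, state, W):
    """Attention over kept pairs plus the low-rank term for evicted ones."""
    scale = 1.0 / math.sqrt(len(q))
    weights = [math.exp(scale * dot(q, k)) for k, _ in kept]
    f = feature_map(W, q)
    denom = sum(weights) + dot(f, state.z)
    out = []
    for d in range(len(kept[0][1])):
        sparse = sum(w * v[d] for w, (_, v) in zip(weights, kept))
        low_rank = sum(fr * state.H[r][d] for r, fr in enumerate(f))
        out.append((sparse + low_rank) / denom)
    return out


# Toy run: rank 2, head dimension 4, cache budget of 2 tokens.
W = [[0.1, -0.2, 0.0, 0.3], [0.0, 0.1, 0.2, -0.1]]
state = LowRankState(rank=2, dim=4)
budget, kept = 2, []
tokens = [([0.1 * i, 0.2, -0.1 * i, 0.3], [float(i), 1.0, 0.0, -float(i)])
          for i in range(5)]
for k, v in tokens:
    kept.append((k, v))
    if len(kept) > budget:
        state.absorb(W, *kept.pop(0))  # evict oldest, keep its trace in the state

out = attend([0.5, -0.5, 0.25, 0.0], kept, state, W)
print(out)
```

The cache never holds more than `budget` pairs and the state stays at R × D + R numbers however many tokens are evicted, yet the output remains a positively weighted average over all value vectors seen so far.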

Reproducibility

Data Urls

  • C4 (used for kernel training)
  • WikiText
  • PG-19
  • CNN/DailyMail
  • XSum
  • MultiNews

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • LESS does not fully recover full-cache performance in all settings; gaps remain in many tasks.
  • Benefits vary by sparse policy, model, and task; transfer across sparsity levels is sometimes imperfect.
  • Low-rank condensation can dilute very strong token-specific signals needed for exact recall.

When Not To Use

  • When you can afford a full KV cache and require exact token recall for safety-critical tasks.
  • When a sparse policy already matches full-cache quality for your workload.

Failure Modes

  • Still far from full-cache performance on some benchmarks and sparsity settings.
  • Mismatch between training sparsity and test-time sparsity can reduce effectiveness.
  • Extra kernel computations add ≈15% decoding overhead versus Baseline+ (the sparse-only cache given the same extra memory as cached tokens) and can slightly weaken latency gains vs the fastest sparse-only methods.

Core Entities

Models

  • Llama 2 7B
  • Llama 2 13B
  • Falcon 7B

Metrics

  • word perplexity
  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • latency (s)
  • throughput (tokens/s)
  • Hellinger distance

Datasets

  • WikiText
  • PG-19
  • CNN/DailyMail
  • XSum
  • MultiNews
  • C4

Context Entities

Models

  • Llama 2 7B
  • Llama 2 13B
  • Falcon 7B

Metrics

  • word perplexity
  • ROUGE-1
  • latency
  • throughput

Datasets

  • WikiText
  • PG-19
  • CNN/DailyMail
  • XSum
  • MultiNews
  • C4