Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
LESS cuts KV-cache memory needs with only a tiny extra state while restoring much of full-cache quality, which lowers GPU costs and enables larger batches or longer sequences in production.
Summary TLDR
LESS adds a tiny, constant-sized low-rank state to any eviction-based KV cache. It learns to approximate the attention residual left by sparse caching, so evicted tokens still influence future decoding. Results on Llama 2 and Falcon show that LESS recovers a substantial fraction of full-cache quality (e.g., ~41% of the ROUGE-1 drop on CNN/DailyMail for Falcon 7B), uses negligible extra memory (≈4 token-equivalents), trains cheaply per layer on a single GPU, and reduces end-to-end latency while improving throughput compared to full caching.
Problem Statement
KV caches store past keys/values to avoid recomputing attention during decoding, but the cache memory can exceed model memory and block deployment (e.g., 64 GB KV cache vs 26 GB model). Existing sparse eviction policies prune many KV pairs and save memory but can break tasks that later need the evicted tokens. We need a cheap, easy-to-integrate cache design that keeps memory low yet preserves the ability to recall discarded tokens later.
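To make the scale of the problem concrete, here is a minimal sketch of the usual back-of-the-envelope KV cache size estimate. The function name is hypothetical, and the batch size and sequence length are illustrative assumptions; the source does not state which configuration its 64 GB figure refers to.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and every token."""
    # Factor 2 = one key vector plus one value vector per head, per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Llama 2 7B dimensions (32 layers, 32 heads, head dim 128) stored in fp16.
# seq_len and batch_size are assumptions chosen for illustration only.
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                      seq_len=2048, batch_size=64)
print(f"{size / 2**30:.1f} GiB")
```

Under these assumptions the cache alone comes to 64 GiB, several times the fp16 weights of the same model, and it grows linearly with both batch size and sequence length.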
Main Contribution
LESS: a method that pairs any eviction-based sparse KV cache with a tiny constant-sized low-rank state that accumulates information from evicted KV pairs.
A training protocol that fits the kernels for each attention layer independently (no model weight updates), so they can be trained on a single GPU and then used without retraining the base model.
Empirical evaluation on Llama 2 and Falcon shows substantial quality recovery versus sparse-only baselines, with low memory overhead and better latency/throughput than full caching.
Key Findings
LESS recovers a substantial fraction of quality lost by sparse caching on summarization.
LESS reduces language-model perplexity vs sparse-only baselines using very small extra state.
LESS uses nearly constant and tiny extra memory equivalent to ~4 tokens in experiments.
LESS speeds end-to-end generation vs full caching and enables larger batches.
LESS is cheap to train and integrates without changing model weights.
Results
WikiText word perplexity (Llama 2 7B, 2% H2O)
PG-19 word perplexity (Llama 2 7B, 2% H2O)
ROUGE-1 (CNN/DailyMail, Llama 2 13B, 408 tokens, ~10%)
ROUGE-1 (CNN/DailyMail, Falcon 7B, 408 tokens)
Latency and throughput (Llama 2 7B on A100, 2048+2048, batch 24)
Who Should Care
What To Try In 7 Days
Run the LESS repo on a small instance of your LLM (use the provided code link) and compare memory, latency, and quality to your current sparse cache.
Train the per-layer kernels on a few hundred sequences sampled from your workload (single GPU) to test transfer and quality gains.
Benchmark end-to-end throughput/latency with your prompt and generation lengths to measure real deployment impact.
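For the benchmarking step, a small timing harness along these lines may be enough to get started. It is a sketch, not part of the LESS code: `benchmark` and `fake_generate` are hypothetical names, and `generate_fn` stands in for whatever wraps your serving stack and returns the number of tokens it produced.

```python
import time


def benchmark(generate_fn, prompts, max_new_tokens, warmup=1, runs=3):
    """Time a batched generation call and report mean latency and throughput.

    generate_fn(prompts, max_new_tokens) is a hypothetical callable that runs
    generation to completion and returns the number of generated tokens.
    """
    for _ in range(warmup):  # warm-up runs are not timed
        generate_fn(prompts, max_new_tokens)
    latencies, tokens = [], 0
    for _ in range(runs):
        start = time.perf_counter()
        tokens += generate_fn(prompts, max_new_tokens)
        latencies.append(time.perf_counter() - start)
    total = sum(latencies)
    return {
        "mean_latency_s": total / runs,
        "throughput_tok_per_s": tokens / total if total > 0 else float("inf"),
        "tokens": tokens,
    }


def fake_generate(prompts, max_new_tokens):
    # Stand-in for a real model call so the harness can run anywhere.
    time.sleep(0.001)
    return len(prompts) * max_new_tokens


stats = benchmark(fake_generate, ["prompt a", "prompt b"], max_new_tokens=16)
print(stats)
```

Run the same harness against the full cache, the sparse-only cache, and the sparse cache with LESS, using your real prompt and generation lengths. On a GPU, make sure the wrapped call blocks until generation has actually finished, otherwise the timings will be too optimistic.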
Optimization Features
Token Efficiency
- Extra state equals ~4 token-equivalents (R=8) in experiments
- Recovers performance better than spending same memory on extra cached tokens
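The ~4 token-equivalents figure can be checked with simple arithmetic. The sketch below assumes the low-rank state is an R×D matrix plus a length-R normalizer per head, and that one cached token costs one key and one value of dimension D; both the state layout and the head dimension of 128 are assumptions for illustration, and the function name is hypothetical.

```python
def less_state_token_equivalents(rank, head_dim):
    """Size of the low-rank state, measured in cached-token equivalents."""
    state_elems = rank * head_dim + rank  # assumed: R x D matrix + length-R normalizer
    kv_pair_elems = 2 * head_dim          # one key plus one value vector
    return state_elems / kv_pair_elems


equiv = less_state_token_equivalents(rank=8, head_dim=128)
print(equiv)
```

With R=8 the matrix alone is worth R/2 = 4 tokens, and the normalizer adds only a small fraction of one more, regardless of sequence length.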
Infra Optimization
- Small kernels and per-layer training fit on one GPU; no multi-GPU finetuning required
System Optimization
- Reduces end-to-end latency vs full cache (1.1–1.3×) and increases throughput up to 1.7× in tests
Training Optimization
- Per-layer kernel training only (no base model updates)
- Trains on a single GPU; parallelizable across layers
Inference Optimization
- Constant-sized low-rank state (memory independent of sequence length)
- Combines the output of the sparse eviction policy with a low-rank approximation of the evicted tokens' contribution
- Replaces storing many evicted KV pairs with recursive low-rank updates
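The points above can be sketched as a single-head decoding loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature maps here are fixed random softplus projections (LESS learns its kernels), the eviction policy is plain first-in-first-out rather than H2O, and all names (`LowRankState`, `attend`, `feature_map`, `W_q`, `W_k`) are hypothetical. It assumes the state is updated recursively as H += ψ(k)ᵀv and z += ψ(k) whenever a KV pair is evicted.

```python
import numpy as np


def feature_map(x, W):
    """Placeholder positive feature map (softplus of a random projection)."""
    return np.logaddexp(0.0, x @ W)


class LowRankState:
    """Constant-size summary of every evicted KV pair."""

    def __init__(self, rank, head_dim):
        self.H = np.zeros((rank, head_dim))  # running sum of psi(k)^T v
        self.z = np.zeros(rank)              # running sum of psi(k), for normalization

    def absorb(self, k, v, W_k):
        feat = feature_map(k, W_k)
        self.H += np.outer(feat, v)
        self.z += feat


def attend(q, keys, values, state, W_q):
    """Softmax attention over kept tokens plus the low-rank term for evicted ones."""
    # No max-subtraction for brevity; fine for the small scores used here.
    w = np.exp(keys @ q / np.sqrt(q.shape[-1]))  # unnormalized weights of kept tokens
    phi_q = feature_map(q, W_q)
    numer = w @ values + phi_q @ state.H
    denom = w.sum() + phi_q @ state.z
    return numer / denom


rng = np.random.default_rng(0)
D, R, budget = 16, 8, 4                      # head dim, state rank, kept tokens
W_q = 0.1 * rng.normal(size=(D, R))
W_k = 0.1 * rng.normal(size=(D, R))
state = LowRankState(R, D)
keys, values = [], []
for t in range(10):
    k, v, q = rng.normal(size=(3, D))
    keys.append(k)
    values.append(v)
    if len(keys) > budget:                   # FIFO eviction as a stand-in policy
        state.absorb(keys.pop(0), values.pop(0), W_k)
    out = attend(q, np.array(keys), np.array(values), state, W_q)
```

The cache never holds more than `budget` KV pairs, and the state keeps its R×D shape no matter how many tokens have been evicted. With an empty state the function reduces to ordinary softmax attention over the kept tokens.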
Reproducibility
Code Urls
Data Urls
- C4 (used for kernel training)
- WikiText
- PG-19
- CNN/DailyMail
- XSum
- MultiNews
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- LESS does not fully recover full-cache performance in all settings; gaps remain in many tasks.
- Benefits vary by sparse policy, model, and task, and transfer across sparsity levels is sometimes imperfect.
- Low-rank condensation can dilute very strong token-specific signals needed for exact recall.
When Not To Use
- When you can afford a full KV cache and require exact token recall for safety-critical tasks.
- When a sparse policy already matches full-cache quality for your workload.
Failure Modes
- Still far from full-cache performance on some benchmarks and sparsity settings.
- Mismatch between training sparsity and test-time sparsity can reduce effectiveness.
- Extra kernel computations add ≈15% decoding overhead versus baseline+, which can slightly erode the latency gains relative to the fastest sparse-only methods.
Core Entities
Models
- Llama 2 7B
- Llama 2 13B
- Falcon 7B
Metrics
- word perplexity
- ROUGE-1
- ROUGE-2
- ROUGE-L
- latency (s)
- throughput (tokens/s)
- Hellinger distance
Datasets
- WikiText
- PG-19
- CNN/DailyMail
- XSum
- MultiNews
- C4
Context Entities
Models
- Llama 2 7B
- Llama 2 13B
- Falcon 7B
Metrics
- word perplexity
- ROUGE-1
- latency
- throughput
Datasets
- WikiText
- PG-19
- CNN/DailyMail
- XSum
- MultiNews
- C4

