Overview
LESS shows consistent gains across multiple models, datasets, and sparse policies with concrete speed and memory benefits, but it does not fully match a full KV cache and results vary by task and sparsity level.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
LESS cuts KV-cache memory needs with tiny extra state while restoring much of full-cache quality, lowering GPU costs and enabling larger batches or longer sequences in production.
Summary TLDR
LESS adds a tiny, constant-sized low-rank state to any eviction-based KV cache. The state learns to approximate the attention residual left by sparse caching, so evicted tokens still influence future decoding. Results on Llama 2 and Falcon show that LESS recovers a substantial fraction of full-cache quality (e.g., ~41% of the ROUGE-1 drop on CNN/DailyMail for Falcon 7B), uses negligible extra memory (≈4 token-equivalents), trains cheaply per layer on a single GPU, and reduces end-to-end latency while improving throughput compared to full caching.
Problem Statement
KV caches store past keys/values to avoid recomputing attention during decoding, but the cache memory can exceed model memory and block deployment (e.g., 64 GB KV cache vs 26 GB model). Existing sparse eviction policies prune many KV pairs and save memory but can break tasks that later need the evicted tokens. We need a cheap, easy-to-integrate cache design that keeps memory low yet preserves the ability to recall discarded tokens later.
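The memory pressure described above follows from simple arithmetic: every cached token stores one key and one value vector per head per layer. The sketch below is a back-of-the-envelope estimate only. The function name `kv_cache_bytes` and the configuration (32 layers, 32 heads, head dimension 128, fp16, 4096 tokens, batch 16) are illustrative assumptions, not the setup behind the 64 GB figure quoted in the text.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_value: int = 2,  # 2 bytes assumes fp16/bf16 storage
) -> int:
    """Estimate KV-cache size: keys and values for every layer, head, and token."""
    keys_and_values = 2  # one key vector plus one value vector per token
    return (
        keys_and_values
        * num_layers
        * num_kv_heads
        * head_dim
        * seq_len
        * batch_size
        * bytes_per_value
    )


# Hypothetical configuration, chosen only to make the arithmetic concrete.
full = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=16)
print(full / 2**30)  # 32.0 (GiB)
```

Because the estimate is linear in `seq_len`, a cache that keeps only a small fraction of tokens shrinks by roughly that fraction, which is why eviction policies are attractive despite their recall problem.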
Main Contribution
LESS: a method that pairs any eviction-based sparse KV cache with a tiny, constant-sized low-rank state that accumulates information from evicted KV pairs.
A training protocol that fits the kernels for each attention layer separately, with no updates to the base model's weights, so the kernels can be trained on a single GPU and then used without retraining the base model.
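The idea of a constant-sized state that accumulates evicted KV pairs can be illustrated with a linear-attention-style running sum. This is a minimal sketch under stated assumptions, not the authors' implementation. The feature map `phi` is a fixed placeholder (elementwise exponential), whereas LESS trains its kernels. The names `LowRankState`, `absorb`, and `attend`, and the exact way the two terms are combined, are hypothetical. In this toy version the rank equals the key dimension because `phi` is elementwise.

```python
import math
from typing import List, Sequence

Vector = List[float]


def phi(x: Sequence[float]) -> Vector:
    """Placeholder feature map (assumption); LESS learns its kernels instead."""
    return [math.exp(xi) for xi in x]


class LowRankState:
    """Constant-size summary of evicted key/value pairs."""

    def __init__(self, rank: int, value_dim: int) -> None:
        self.H = [[0.0] * value_dim for _ in range(rank)]  # running sum of phi(k)^T v
        self.z = [0.0] * rank  # running sum of phi(k), used for normalization

    def absorb(self, key: Sequence[float], value: Sequence[float]) -> None:
        """Fold one evicted (key, value) pair into the state; its size never grows."""
        for r, f in enumerate(phi(key)):
            self.z[r] += f
            for d, v in enumerate(value):
                self.H[r][d] += f * v


def attend(
    query: Sequence[float],
    kept_keys: Sequence[Sequence[float]],
    kept_values: Sequence[Sequence[float]],
    state: LowRankState,
) -> Vector:
    """Combine exact softmax terms for kept tokens with the low-rank term."""
    scale = 1.0 / math.sqrt(len(query))
    weights = [
        math.exp(scale * sum(q * k for q, k in zip(query, key))) for key in kept_keys
    ]
    fq = phi(query)
    value_dim = len(state.H[0])
    numer = [
        sum(w * v[d] for w, v in zip(weights, kept_values))
        + sum(f * row[d] for f, row in zip(fq, state.H))
        for d in range(value_dim)
    ]
    denom = sum(weights) + sum(f * z for f, z in zip(fq, state.z))
    return [n / denom for n in numer]
```

With an empty state, `attend` reduces to ordinary softmax attention over the kept tokens. Once `absorb` has been called, evicted tokens still pull the output toward their values, while the state keeps the same shape no matter how many tokens are evicted.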
Key Findings
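LESS recovers a substantial fraction of the quality lost to sparse caching on summarization (e.g., ~41% of the ROUGE-1 drop on CNN/DailyMail for Falcon 7B).
LESS lowers language-modeling perplexity relative to sparse-only baselines while using very small extra state (e.g., WikiText word perplexity 13.333→10.745 for Llama 2 7B at 2% H2O).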
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| WikiText word perplexity (Llama 2 7B, 2% H2O) | LESS 10.745 | Full cache 8.791; Baseline 13.333 | LESS reduces PPL vs Baseline by 19.4% (13.333→10.745) | WikiText | Table 2 (word PPL) | Table 2 |
| PG-19 word perplexity (Llama 2 7B, 2% H2O) | LESS 32.157 | Full cache 23.787; Baseline 37.013 | LESS reduces PPL vs Baseline by 13.1% (37.013→32.157) | PG-19 | Table 2 (word PPL) | Table 2 |
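The Delta column can be re-derived from the perplexities in the table. The short script below recomputes the relative reduction against the sparse baseline. It also computes the share of the baseline-to-full-cache gap that LESS closes, which is a derived quantity and not a number reported in the table. The function names are illustrative.

```python
def relative_reduction_pct(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


def gap_recovered_pct(full: float, baseline: float, value: float) -> float:
    """Share of the baseline-to-full-cache gap closed (derived, not from the table)."""
    return 100.0 * (baseline - value) / (baseline - full)


# Word perplexities copied from the table (Llama 2 7B, 2% H2O).
rows = {
    "WikiText": {"full": 8.791, "baseline": 13.333, "less": 10.745},
    "PG-19": {"full": 23.787, "baseline": 37.013, "less": 32.157},
}

for name, r in rows.items():
    reduction = relative_reduction_pct(r["baseline"], r["less"])
    recovered = gap_recovered_pct(r["full"], r["baseline"], r["less"])
    print(name, round(reduction, 1), round(recovered, 1))
# WikiText 19.4 57.0
# PG-19 13.1 36.7
```

The second number makes the remaining gap explicit: on these two rows LESS closes a little over half of the gap to the full cache on WikiText and a little over a third on PG-19.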
What To Try In 7 Days
Run the LESS repo on a small instance of your LLM (use the provided code link) and compare memory, latency, and quality to your current sparse cache.
Train the per-layer kernels on a few hundred sequences sampled from your workload (single GPU) to test transfer and quality gains.
Benchmark end-to-end throughput/latency with your prompt and generation lengths to measure real deployment impact.
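For the benchmarking step, a small timing harness is enough to compare cache configurations on your own prompts. The sketch below assumes you can wrap each setup (full cache, sparse cache, sparse cache plus LESS) in a callable `generate(prompt, max_new_tokens)` that returns the generated token ids. That callable and the `benchmark` helper are hypothetical and not part of the LESS repository.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def benchmark(
    generate: Callable[[str, int], List[int]],
    prompts: List[str],
    max_new_tokens: int,
    warmup: int = 1,
) -> Dict[str, float]:
    """Measure mean per-request latency and overall generated tokens per second."""
    # Warm-up runs are excluded so one-time setup costs do not skew the timings.
    for prompt in prompts[:warmup]:
        generate(prompt, max_new_tokens)

    latencies: List[float] = []
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        tokens = generate(prompt, max_new_tokens)
        latencies.append(time.perf_counter() - t0)
        total_tokens += len(tokens)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval

    return {
        "mean_latency_s": mean(latencies),
        "tokens_per_s": total_tokens / elapsed,
        "total_tokens": float(total_tokens),
    }
```

Run it once per configuration with the same prompts and `max_new_tokens`, then compare the returned dictionaries. When the model runs on a GPU, the wrapped `generate` should wait for the device to finish before returning, otherwise the measured latency will be too low.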
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
LESS does not fully recover full-cache performance in all settings; gaps remain in many tasks.
Benefits vary by sparse policy, model, and task, and kernels do not always transfer cleanly across sparsity levels.
When Not To Use
When you can afford a full KV cache and require exact token recall for safety-critical tasks.
When a sparse policy already matches full-cache quality for your workload.
Failure Modes
LESS can remain far from full-cache performance on some benchmarks and sparsity settings.
Mismatch between training sparsity and test-time sparsity can reduce effectiveness.

