Overview
Production Readiness
0.8
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
3
Why It Matters For Business
MiniCache cuts KV cache memory by up to 41% and can raise throughput ~5× without retraining, enabling lower GPU costs, larger batches, and longer contexts for production LLM services.
Summary TLDR
MiniCache compresses the Key-Value (KV) cache used during autoregressive decoding by merging KV states across adjacent transformer layers (depth dimension). It decomposes each KV vector into direction and magnitude, interpolates directions (SLERP), and retains a small set of outlier tokens to avoid quality loss. The method is training-free, works with quantization, and on evaluated models (LLaMA-2/3, Phi-3, Mistral/Mixtral) reaches up to 5.02× compression, ~5× throughput, and ~41% memory reduction versus an FP16 full cache baseline with near-lossless accuracy on tested benchmarks.
Problem Statement
KV cache size grows linearly with sequence length and becomes the dominant GPU memory cost during generation. Existing cache compression focuses inside each layer (quantize/prune per-layer). Cross-layer redundancy (similar KV states across depth) is under-exploited but promising for reducing memory while keeping quality.
Main Contribution
Introduce MiniCache, a training-free method that merges KV cache states across adjacent layers to reduce memory.
Decompose KV vectors into direction and magnitude; interpolate directions with SLERP while preserving magnitudes to reduce information loss.
Add a lightweight token retention rule to keep distinct tokens unmerged and prevent quality drops.
Show compatibility with KV quantization and demonstrate large memory and throughput gains across multiple LLM families and benchmarks.
Key Findings
Up to 5.02× KV cache compression when combined with 4-bit KV quantization.
Throughput increased by about 5× versus FP16 full-cache baseline in batch-serving tests.
Memory footprint reduced by ~41% compared to FP16 full cache.
Merging works especially well in middle-to-deep layers; up to 87.5% of layers merged with near-zero drop on COQA for LLaMA-3-70B.
Results
Compression ratio
Decoding throughput
Peak GPU memory
Throughput vs 2-bit KIVI
Layer merge robustness
Who Should Care
What To Try In 7 Days
Run MiniCache code on a dev GPU with your model and sample workloads (project: https://minicache.vmv.re).
Start merging from model midpoint (S = L/2) and use t≈0.6 and retention γ≈0.05 as initial hyperparameters.
Combine MiniCache with an existing KV quantization (e.g., KIVI 4-bit) to maximize memory savings.
Optimization Features
Token Efficiency
- token retention (keep 5% by default)
- retention threshold γ=0.05
Infra Optimization
- lower GPU memory cost per request
- higher serving throughput
Model Optimization
- training-free (post-training)
System Optimization
- enables larger batch-serving
- reduces peak GPU memory
Inference Optimization
- KV Cache Optimization
- Context Compression
- Merge-from-middle (S=L/2)
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- SLERP merge supports only two-layer interpolation; cannot directly merge many layers at once.
- Relies on high similarity in middle-to-deep layers; shallow layers show low mergeability.
- Retained-token overhead grows if many tokens are distinct, reducing net compression.
- Experiments cover mainstream open models and selected benchmarks; other models or tasks may differ.
When Not To Use
- When shallow layers carry unique, layer-specific signals you cannot afford to lose.
- If your deployment cannot accept any risk of quality change (zero-risk scenarios).
- When inter-layer similarity is low for your model or domain.
Failure Modes
- Merging low-similarity token pairs causes performance drops if retention threshold is set too low.
- Wrong interpolation parameter t can bias merged directions and hurt accuracy.
- Attempting multi-layer merges with SLERP may fail because SLERP is defined for pairs.
Core Entities
Models
- LLaMA-2-7B
- LLaMA-2-13B
- LLaMA-3-8B
- LLaMA-3-70B
- Phi-3-Mini
- Mistral-7B
- Mixtral-8x7B
Metrics
- compression ratio
- decoding throughput
- peak GPU memory
- Accuracy
Datasets
- ShareGPT
- LongBench
- GSM8K
- COQA
- TruthfulQA
- COPA
- MathQA
- OpenBookQA
- PIQA
- RTE
- WinoGrande
- XSUM
- CNN/DailyMail
Benchmarks
- LongBench
- lm-eval-harness (selected tasks)

