Overview
The method is training-free, was tested on several popular LLMs and benchmarks, and shows repeatable memory and throughput gains. Two limits remain: it cannot merge more than two layers at once, and it depends on high inter-layer similarity.
Citations: 3
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 80%
Novelty: 60%
Why It Matters For Business
MiniCache cuts KV cache memory by up to 41% and can raise throughput about 5× over an FP16 full-cache baseline without retraining. That enables lower GPU costs, larger batches, and longer contexts for production LLM services.
Summary TLDR
MiniCache compresses the Key-Value (KV) cache used during autoregressive decoding by merging KV states across adjacent transformer layers (the depth dimension). It decomposes each KV vector into a direction and a magnitude, interpolates the directions with spherical linear interpolation (SLERP), and retains a small set of outlier tokens unmerged to avoid quality loss. The method is training-free and works with quantization. On the evaluated models (LLaMA-2/3, Phi-3, Mistral/Mixtral) it reaches up to 5.02× compression when combined with 4-bit KV quantization, ~5× throughput, and ~41% memory reduction versus an FP16 full-cache baseline, with near-lossless accuracy on the tested benchmarks.
Problem Statement
KV cache size grows linearly with sequence length and becomes the dominant GPU memory cost during generation. Existing cache compression works within each layer (per-layer quantization or pruning). Cross-layer redundancy, meaning similar KV states at different depths, is under-exploited but promising for reducing memory while keeping quality.
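As a rough illustration of the linear growth described above, the sketch below computes the KV cache footprint of a dense decoder. The function name `kv_cache_bytes` and the example dimensions (32 layers, 32 KV heads, head dimension 128, roughly a 7B-class model) are assumptions for illustration, not figures from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size in bytes (hypothetical helper).

    The leading factor 2 counts one key tensor and one value tensor per layer.
    bytes_per_value=2 corresponds to FP16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Assumed example dimensions; doubling the sequence length doubles the cache.
short_ctx = kv_cache_bytes(32, 32, 128, seq_len=4096)
long_ctx = kv_cache_bytes(32, 32, 128, seq_len=8192)
print(short_ctx / 2**30, "GiB at 4096 tokens")
print(long_ctx / 2**30, "GiB at 8192 tokens")
```

Because the layer count is a direct multiplier in this formula, storing fewer distinct layers reduces the footprint proportionally, which is the dimension MiniCache targets.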
Main Contribution
Introduce MiniCache, a training-free method that merges KV cache states across adjacent layers to reduce memory.
Decompose KV vectors into direction and magnitude; interpolate directions with SLERP while preserving magnitudes to reduce information loss.
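The direction/magnitude decomposition and SLERP merge listed above can be sketched for a single token's vector from two adjacent layers. This is a minimal pure-Python illustration, not the authors' implementation: the helper names (`split`, `slerp_direction`, `merge_pair`, `restore`) are hypothetical, the weighting follows the standard SLERP convention (t = 0 returns the first direction, t = 1 the second), and a real implementation would operate on batched GPU tensors.

```python
import math


def split(vec):
    """Decompose a vector into (unit direction, magnitude)."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return [0.0] * len(vec), 0.0
    return [v / norm for v in vec], norm


def slerp_direction(u_a, u_b, t):
    """Spherical linear interpolation between two unit vectors."""
    cos = max(-1.0, min(1.0, sum(a * b for a, b in zip(u_a, u_b))))
    omega = math.acos(cos)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-6:
        # Nearly parallel directions: fall back to linear weights.
        w_a, w_b = 1.0 - t, t
    else:
        w_a = math.sin((1.0 - t) * omega) / sin_omega
        w_b = math.sin(t * omega) / sin_omega
    mixed = [w_a * a + w_b * b for a, b in zip(u_a, u_b)]
    direction, _ = split(mixed)  # renormalize against rounding error
    return direction


def merge_pair(x_a, x_b, t=0.6):
    """Return one shared direction plus each layer's own magnitude."""
    u_a, n_a = split(x_a)
    u_b, n_b = split(x_b)
    return slerp_direction(u_a, u_b, t), n_a, n_b


def restore(direction, norm):
    """Rebuild a layer's approximate vector from the shared direction."""
    return [norm * d for d in direction]


shared, norm_a, norm_b = merge_pair([2.0, 0.0], [0.0, 3.0], t=0.5)
print(restore(shared, norm_a), restore(shared, norm_b))
```

In this sketch only one direction is stored for the pair, while the two scalar magnitudes are kept separately so each layer's vector can be rescaled to its original length.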
Key Findings
Up to 5.02× KV cache compression when combined with 4-bit KV quantization.
Throughput increased by about 5× versus FP16 full-cache baseline in batch-serving tests.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Compression ratio | 5.02× | FP16 full cache | ×5.02 | LongBench / ShareGPT scenarios | MiniCache + KIVI-4bit achieves 5.02× (Table 1) | Table 1 |
| Decoding throughput | ≈5× | FP16 full cache | ≈×5 | ShareGPT synthetic workloads, batch size 128 | 4-bit MiniCache reaches ~5× throughput vs FP16 baseline (Figure 5) | Figure 5 |
What To Try In 7 Days
Run the MiniCache code on a dev GPU with your model and sample workloads (project: https://minicache.vmv.re).
Start merging at the model midpoint (start layer S = L/2 for an L-layer model) and use interpolation parameter t≈0.6 and retention threshold γ≈0.05 as initial hyperparameters.
Combine MiniCache with an existing KV quantization method (e.g., KIVI 4-bit) to maximize memory savings.
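The starting hyperparameters above can be captured in a small config object, together with a back-of-envelope estimate of the resulting compression. Everything here is a hypothetical sketch: `MergeConfig`, `starting_config`, and `estimate_compression` are not part of the MiniCache code, and the estimate assumes layers from the start layer onward are merged in pairs, each pair storing one shared state plus the retained fraction of tokens for both layers. It ignores per-token magnitudes and quantization metadata, so it will not reproduce the reported 5.02× exactly.

```python
from dataclasses import dataclass


@dataclass
class MergeConfig:
    """Hypothetical container for the suggested starting values."""
    num_layers: int
    start_layer: int        # S: first layer eligible for merging
    t: float = 0.6          # SLERP interpolation parameter
    retention: float = 0.05  # gamma: fraction of tokens kept unmerged
    kv_bits: int = 4        # bit width of the quantized KV cache


def starting_config(num_layers, kv_bits=4):
    """Start merging at the model midpoint, S = L/2."""
    return MergeConfig(num_layers=num_layers,
                       start_layer=num_layers // 2,
                       kv_bits=kv_bits)


def estimate_compression(cfg, baseline_bits=16):
    """Crude compression estimate versus a full cache at baseline_bits."""
    pairs, leftover = divmod(cfg.num_layers - cfg.start_layer, 2)
    # Unmerged layers cost 1 unit each; a merged pair costs 1 shared unit
    # plus the retained tokens stored separately for both layers.
    units = cfg.start_layer + leftover + pairs * (1.0 + 2.0 * cfg.retention)
    layer_ratio = cfg.num_layers / units
    quant_ratio = baseline_bits / cfg.kv_bits
    return layer_ratio * quant_ratio


cfg = starting_config(num_layers=32)
print(cfg)
print(round(estimate_compression(cfg), 2))
```

An estimate like this is only useful for sanity-checking measured memory numbers on your own workload; the measured figures should take precedence.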
Risks & Boundaries
Limitations
SLERP merge supports only two-layer interpolation; cannot directly merge many layers at once.
Relies on high similarity in middle-to-deep layers; shallow layers show low mergeability.
When Not To Use
When shallow layers carry unique, layer-specific signals you cannot afford to lose.
If your deployment cannot accept any risk of quality change (zero-risk scenarios).
Failure Modes
Merging low-similarity token pairs causes performance drops if the retention threshold is set too low.
A poorly chosen interpolation parameter t can bias merged directions and hurt accuracy.
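To make the first failure mode concrete, the sketch below picks which tokens to keep unmerged by ranking token pairs by cosine similarity and retaining the least similar fraction. This quantile rule is an assumption chosen for illustration and may differ from the paper's exact thresholding rule; `cosine` and `select_retained` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_retained(layer_a, layer_b, gamma=0.05):
    """Indices of tokens to keep unmerged in both layers.

    layer_a and layer_b hold one vector per token from adjacent layers.
    The ceil(gamma * tokens) least similar pairs are retained (assumed rule).
    """
    sims = [cosine(a, b) for a, b in zip(layer_a, layer_b)]
    n_keep = math.ceil(gamma * len(sims))
    by_similarity = sorted(range(len(sims)), key=lambda i: sims[i])
    return sorted(by_similarity[:n_keep])


layer_a = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
layer_b = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
print(select_retained(layer_a, layer_b, gamma=0.5))
print(select_retained(layer_a, layer_b, gamma=0.0))
```

With `gamma=0.0` nothing is retained, so the dissimilar pairs at indices 1 and 3 would be merged, which is the situation the failure mode warns about.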

