MiniCache: merge adjacent layers' KV caches to cut memory and speed up LLM inference

May 23, 2024 | 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is training-free, tested on multiple popular LLMs and benchmarks, and demonstrates repeatable memory/throughput gains; limits remain in merging more than two layers and in models with low inter-layer similarity.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 60%

Authors

Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang

Why It Matters For Business

MiniCache cuts KV cache memory by up to 41% and can raise throughput ~5× without retraining, enabling lower GPU costs, larger batches, and longer contexts for production LLM services.

Summary TLDR

MiniCache compresses the Key-Value (KV) cache used during autoregressive decoding by merging KV states across adjacent transformer layers, that is, along the depth dimension. It decomposes each KV vector into direction and magnitude, interpolates the directions with spherical linear interpolation (SLERP), and keeps a small set of outlier tokens unmerged to avoid quality loss. The method is training-free and works with quantization. On the evaluated models (LLaMA-2/3, Phi-3, Mistral/Mixtral) it reaches up to 5.02× compression when combined with 4-bit KV quantization, ~5× throughput, and ~41% memory reduction versus an FP16 full-cache baseline, with near-lossless accuracy on the tested benchmarks.

Problem Statement

KV cache size grows linearly with sequence length and becomes the dominant GPU memory cost during generation. Existing cache compression methods work within each layer, quantizing or pruning the cache per layer. Cross-layer redundancy (similar KV states across depth) is under-exploited but promising for reducing memory while keeping quality.
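To make the linear growth concrete, here is a minimal sketch of the usual full-cache size estimate. The function name and the LLaMA-2-7B-like shape (32 layers, 32 KV heads, head dimension 128, FP16) are illustrative assumptions, not figures taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Size of a full KV cache: keys and values for every layer, head, and token."""
    # Factor 2 accounts for storing both keys and values.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Assumed LLaMA-2-7B-like shape; FP16 means 2 bytes per value.
for seq_len in (1024, 4096, 16384):
    gib = kv_cache_bytes(32, 32, 128, seq_len) / 2**30
    print(f"{seq_len:>6} tokens -> {gib:.2f} GiB per sequence")
```

Under these assumptions each token costs 0.5 MiB of cache, so a 4096-token sequence needs about 2 GiB before any batching.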

Main Contribution

Introduce MiniCache, a training-free method that merges KV cache states across adjacent layers to reduce memory.

Decompose KV vectors into direction and magnitude; interpolate directions with SLERP while preserving magnitudes to reduce information loss.
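The direction/magnitude idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the helper names (`slerp_direction`, `merge_pair`, `restore`) are hypothetical, the convention that t=0 returns the first vector's direction is my choice, and the inputs are assumed to be non-zero and not exactly opposite.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def slerp_direction(a, b, t=0.6):
    """Interpolate between the unit directions of a and b on the sphere.

    Convention assumed here: t=0 returns a's direction, t=1 returns b's.
    """
    na, nb = norm(a), norm(b)
    ua = [x / na for x in a]
    ub = [x / nb for x in b]
    cos = max(-1.0, min(1.0, sum(x * y for x, y in zip(ua, ub))))
    omega = math.acos(cos)
    if math.sin(omega) < 1e-6:
        # Nearly parallel vectors: a plain linear blend is numerically stable.
        wa, wb = 1.0 - t, t
    else:
        wa = math.sin((1.0 - t) * omega) / math.sin(omega)
        wb = math.sin(t * omega) / math.sin(omega)
    e = [wa * x + wb * y for x, y in zip(ua, ub)]
    ne = norm(e)
    return [x / ne for x in e]


def merge_pair(a, b, t=0.6):
    """Keep one shared direction plus each layer's own magnitude."""
    return slerp_direction(a, b, t), norm(a), norm(b)


def restore(direction, magnitude):
    """Rebuild an approximate per-layer vector from the shared direction."""
    return [magnitude * x for x in direction]


# Toy vectors standing in for one token's key state in two adjacent layers.
k_lower = [3.0, 0.0]
k_upper = [0.0, 4.0]
shared, mag_lower, mag_upper = merge_pair(k_lower, k_upper, t=0.5)
print(shared)                      # unit vector halfway between the two directions
print(restore(shared, mag_lower))  # keeps k_lower's length of 3
print(restore(shared, mag_upper))  # keeps k_upper's length of 4
```

Storing one direction and two scalar magnitudes per token pair is what saves memory; the restored vectors keep their original lengths and differ from the originals only in angle.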

Key Findings

Up to 5.02× KV cache compression when combined with 4-bit KV quantization.

Numbers: 5.02× compression (Table 1, LongBench average)

Practical Use: Combine MiniCache with 4-bit KV quantization to get ~5× smaller cache for similar task accuracy on evaluated benchmarks.

Evidence Ref: Table 1; Abstract
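A back-of-envelope estimate helps show how layer merging and 4-bit quantization could combine to a ratio of roughly this size. The accounting below is my own simplification, not the paper's: it assumes layers past the midpoint are merged in pairs, each pair keeps one shared cache plus a retained fraction of unmerged tokens for both layers, and per-token magnitudes and quantization metadata are ignored.

```python
def estimated_compression(start_frac=0.5, retention=0.05,
                          fp_bits=16, quant_bits=4):
    """Rough KV-cache compression ratio for pairwise merging plus quantization.

    All modelling choices here are illustrative assumptions (see lead-in).
    """
    # Per merged pair: 1 shared cache + 2 * retention unmerged, over 2 layers.
    per_layer_after_merge = 0.5 + retention
    remaining = start_frac + (1.0 - start_frac) * per_layer_after_merge
    return (fp_bits / quant_bits) / remaining


print(round(estimated_compression(), 2))               # merging + 4-bit
print(round(estimated_compression(quant_bits=16), 2))  # merging alone
```

With the defaults this gives about 5.16 for merging plus 4-bit quantization and about 1.29 for merging alone, which is in the same range as the reported 5.02×; the ignored overheads would pull the estimate down.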

Throughput increased by about 5× versus FP16 full-cache baseline in batch-serving tests.

Numbers: ≈5× throughput gain (ShareGPT synthetic workloads, Figure 5)

Practical Use: Use MiniCache to serve more concurrent requests or larger batches with the same GPU resources.

Evidence Ref: Figure 5; Abstract

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Compression ratio | 5.02× | FP16 full cache | ×5.02 | LongBench / ShareGPT scenarios | MiniCache + KIVI-4bit achieves 5.02× (Table 1) | Table 1 |
| Decoding throughput | ≈5× | FP16 full cache | ≈×5 | ShareGPT synthetic workloads, batch size 128 | 4-bit MiniCache reaches ~5× throughput vs FP16 baseline (Figure 5) | Figure 5 |

What To Try In 7 Days

Run MiniCache code on a dev GPU with your model and sample workloads (project: https://minicache.vmv.re).

Start merging from the model midpoint (S = L/2) and use t≈0.6 and retention γ≈0.05 as initial hyperparameters.

Combine MiniCache with an existing KV quantization (e.g., KIVI 4-bit) to maximize memory savings.
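The starting point above can be written down as a small plan before touching the real code. This sketch is not the MiniCache API: the `merge_plan` helper, the config keys, and the 0-indexed pairing of layers (S, S+1), (S+2, S+3), and so on are all assumptions made for illustration.

```python
def merge_plan(num_layers, start=None):
    """Pair adjacent layers from `start` upward; other layers keep their own cache."""
    if start is None:
        start = num_layers // 2  # merge-from-middle, S = L/2
    pairs = [(i, i + 1) for i in range(start, num_layers - 1, 2)]
    merged = {layer for pair in pairs for layer in pair}
    untouched = [i for i in range(num_layers) if i not in merged]
    return pairs, untouched


# Hypothetical starting configuration for a 32-layer model.
config = {"start_layer": 16, "t": 0.6, "retention": 0.05, "kv_quant_bits": 4}
pairs, untouched = merge_plan(num_layers=32, start=config["start_layer"])
print(pairs)      # layer pairs that would share one merged cache
print(untouched)  # shallow layers left with a full cache
```

Listing the untouched layers explicitly makes it easy to move the start layer up or down during a pilot and see how many layers still carry a full cache.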

Optimization Features

Token Efficiency
Token Efficiency: token retention (keep 5% by default); retention threshold γ=0.05
Infra Optimization: lower GPU memory cost per request; higher serving throughput
Model Optimization: training-free (post-training)
System Optimization: enables larger batch-serving; reduces peak GPU memory
Inference Optimization: KV Cache Optimization; Context Compression; Merge-from-middle (S=L/2)

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

The SLERP merge supports only two-layer interpolation; it cannot directly merge many layers at once.

Relies on high similarity in middle-to-deep layers; shallow layers show low mergeability.

When Not To Use

When shallow layers carry unique, layer-specific signals you cannot afford to lose.

If your deployment cannot accept any risk of quality change (zero-risk scenarios).

Failure Modes

Merging low-similarity token pairs causes performance drops if the retention threshold is set too low.

Wrong interpolation parameter t can bias merged directions and hurt accuracy.
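One way to guard against the first failure mode is to rank token pairs by angular distance and leave the most dissimilar ones unmerged. The sketch below uses a simple top-fraction rule, which is an illustrative assumption and may differ from the thresholding used in the paper; the function names are hypothetical.

```python
import math


def angular_distance(a, b):
    """Angle between two non-zero vectors, scaled to the range [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    cos = max(-1.0, min(1.0, dot / (na * nb)))
    return math.acos(cos) / math.pi


def tokens_to_retain(layer_a, layer_b, retention=0.05):
    """Indices of the most dissimilar token pairs, to be kept unmerged.

    Assumed rule: keep the top `retention` fraction, and at least one token.
    """
    dists = [angular_distance(a, b) for a, b in zip(layer_a, layer_b)]
    keep = max(1, math.ceil(retention * len(dists)))
    ranked = sorted(range(len(dists)), key=lambda i: dists[i], reverse=True)
    return sorted(ranked[:keep])


# Toy per-token states for two adjacent layers; token 2 disagrees strongly.
layer_a = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [1.0, 1.0]]
layer_b = [[1.0, 0.05], [1.0, 0.0], [1.0, 0.0], [1.0, 0.9]]
print(tokens_to_retain(layer_a, layer_b, retention=0.25))
```

Raising `retention` keeps more tokens out of the merge at the cost of memory, which is the trade-off to tune if quality drops on your workload.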

Core Entities

Models

LLaMA-2-7B, LLaMA-2-13B, LLaMA-3-8B, LLaMA-3-70B, Phi-3-Mini, Mistral-7B, Mixtral-8x7B

Metrics

compression ratio, decoding throughput, peak GPU memory, accuracy

Datasets

ShareGPT, LongBench, GSM8K, COQA, TruthfulQA, COPA, MathQA, OpenBookQA, PIQA, RTE, WinoGrande, XSUM, CNN/DailyMail

Benchmarks

LongBench, lm-eval-harness (selected tasks)