MiniCache: merge adjacent layers' KV caches to cut memory and speed up LLM inference

Overview

Decision Snapshot: Ready For Pilot

The method is training-free, tested on multiple popular LLMs and benchmarks, and demonstrates repeatable memory/throughput gains; limits remain in merging more than two layers and in models with low inter-layer similarity.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 60%

Authors

Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang

Why It Matters For Business

MiniCache cuts KV cache memory by up to 41% and can raise throughput ~5× without retraining, enabling lower GPU costs, larger batches, and longer contexts for production LLM services.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager, Founder

Summary TLDR

MiniCache compresses the Key-Value (KV) cache used during autoregressive decoding by merging KV states across adjacent transformer layers (the depth dimension). It decomposes each KV vector into a direction and a magnitude, interpolates the directions with spherical linear interpolation (SLERP), and retains a small set of outlier tokens unmerged to avoid quality loss. The method is training-free and works with quantization. On the evaluated models (LLaMA-2/3, Phi-3, Mistral/Mixtral), when combined with 4-bit KV quantization, it reaches up to 5.02× compression, ~5× throughput, and ~41% memory reduction versus an FP16 full-cache baseline, with near-lossless accuracy on the tested benchmarks.

Problem Statement

KV cache size grows linearly with sequence length and becomes the dominant GPU memory cost during generation. Existing cache compression works within each layer (per-layer quantization or pruning). Cross-layer redundancy, meaning similar KV states at different depths, is under-exploited but promising for reducing memory while keeping quality.
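To make the linear growth concrete, here is a back-of-the-envelope sketch of the cache size calculation. The function name `kv_cache_bytes` is hypothetical, and the model shape (32 layers, 32 KV heads, head dimension 128, FP16 values) is an assumption chosen to resemble a 7B-class model, not a figure taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # The leading factor 2 covers the separate key and value tensors.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Assumed 7B-class shape: 32 layers, 32 KV heads, head dimension 128, FP16.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
full_4k = kv_cache_bytes(32, 32, 128, seq_len=4096)

print(per_token / 2**20, "MiB per token")    # 0.5 MiB per token
print(full_4k / 2**30, "GiB at 4096 tokens")  # 2.0 GiB at 4096 tokens
```

Doubling the sequence length or the batch size doubles the result, which is why the cache tends to dominate memory in long-context or large-batch serving.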

Main Contribution

Introduce MiniCache, a training-free method that merges KV cache states across adjacent layers to reduce memory.

Decompose KV vectors into direction and magnitude; interpolate directions with SLERP while preserving magnitudes to reduce information loss.
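The direction/magnitude idea above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the helper names (`slerp_direction`, `merge_pair`, `restore`) are hypothetical, vectors are assumed to be non-zero, and the fallback for nearly parallel or antiparallel vectors is a simplification added here.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def slerp_direction(a, b, t):
    """Spherically interpolate between the unit directions of a and b."""
    na, nb = norm(a), norm(b)
    ua = [x / na for x in a]
    ub = [x / nb for x in b]
    cos_omega = max(-1.0, min(1.0, sum(x * y for x, y in zip(ua, ub))))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-6:
        # Nearly parallel or antiparallel: SLERP is ill-conditioned,
        # so this sketch falls back to the first direction.
        return ua
    wa = math.sin((1 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(ua, ub)]

def merge_pair(a, b, t=0.6):
    """Store one shared direction plus each layer's own magnitude."""
    return slerp_direction(a, b, t), norm(a), norm(b)

def restore(direction, magnitude):
    """Rebuild a layer's vector by rescaling the shared direction."""
    return [magnitude * x for x in direction]

# Toy example: the same token's key vector in two adjacent layers.
layer_a = [3.0, 0.0]
layer_b = [0.0, 2.0]
shared, mag_a, mag_b = merge_pair(layer_a, layer_b, t=0.6)
approx_a = restore(shared, mag_a)
approx_b = restore(shared, mag_b)
```

In this sketch the shared direction has unit length, so each restored vector keeps its original magnitude exactly and only its direction is approximated.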

Key Findings

Up to 5.02× KV cache compression when combined with 4-bit KV quantization.

Numbers: 5.02× compression (Table 1, LongBench average)

Practical Use: Combine MiniCache with 4-bit KV quantization to get ~5× smaller cache for similar task accuracy on evaluated benchmarks.

Evidence Ref: Table 1; Abstract

Throughput increased by about 5× versus FP16 full-cache baseline in batch-serving tests.

Numbers: ≈5× throughput gain (ShareGPT synthetic workloads, Figure 5)

Practical Use: Use MiniCache to serve more concurrent requests or larger batches with the same GPU resources.

Evidence Ref: Figure 5; Abstract

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Compression ratio	5.02×	FP16 full cache	×5.02	LongBench / ShareGPT scenarios	MiniCache + KIVI-4bit achieves 5.02× (Table 1)	Table 1
Decoding throughput	≈5×	FP16 full cache	≈×5	ShareGPT synthetic workloads, batch size 128	4-bit MiniCache reaches ~5× throughput vs FP16 baseline (Figure 5)	Figure 5

What To Try In 7 Days

Run MiniCache code on a dev GPU with your model and sample workloads (project: https://minicache.vmv.re).

Start merging at the model midpoint (start layer S = L/2, where L is the number of layers), with interpolation parameter t ≈ 0.6 and retention threshold γ ≈ 0.05 as initial hyperparameters.

Combine MiniCache with an existing KV quantization (e.g., KIVI 4-bit) to maximize memory savings.
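The starting hyperparameters above can be written down as a small planning sketch. The names `STARTING_POINT`, `start_layer`, and `merge_pairs` are hypothetical and not part of the MiniCache code. The pairing scheme (zero-indexed, non-overlapping adjacent pairs from the midpoint to the last layer) is an assumption about how "merge from the middle" could be applied, so check it against the project code before relying on it.

```python
# Suggested starting values from the list above.
STARTING_POINT = {"t": 0.6, "gamma": 0.05}

def start_layer(num_layers):
    """Merge-from-middle start layer, S = L/2 (zero-indexed, assumed)."""
    return num_layers // 2

def merge_pairs(num_layers):
    """Adjacent, non-overlapping layer pairs from S to the last layer."""
    s = start_layer(num_layers)
    return [(layer, layer + 1) for layer in range(s, num_layers - 1, 2)]

# Example with an assumed 32-layer model.
pairs = merge_pairs(32)
print(pairs[0], pairs[-1], len(pairs))  # (16, 17) (30, 31) 8
```

Printing the plan first makes it easy to see which layers stay untouched (the shallow half) before running any benchmark.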

Optimization Features

Token Efficiency

token retention (keep 5% by default), retention threshold γ=0.05
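As a rough illustration of token retention, the sketch below keeps the least similar token pairs unmerged. It is a simplification under stated assumptions: it reads "keep 5%" as a top-fraction rule over angular distance, which may differ from the exact thresholding rule in the paper, and the names `angular_distance` and `tokens_to_retain` are hypothetical.

```python
import math

def angular_distance(a, b):
    """Angle between two non-zero vectors, normalized to [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    cos = max(-1.0, min(1.0, dot / (na * nb)))
    return math.acos(cos) / math.pi

def tokens_to_retain(layer_a, layer_b, gamma=0.05):
    """Indices of the least similar token pairs, to be kept unmerged."""
    dists = [angular_distance(a, b) for a, b in zip(layer_a, layer_b)]
    keep = math.ceil(gamma * len(dists))
    # Rank token positions from most to least dissimilar across the two layers.
    ranked = sorted(range(len(dists)), key=lambda i: dists[i], reverse=True)
    return sorted(ranked[:keep])

# Toy example: 20 tokens, identical across layers except token 7.
layer_a = [[1.0, 0.0] for _ in range(20)]
layer_b = [[1.0, 0.0] for _ in range(20)]
layer_b[7] = [0.0, 1.0]
retained = tokens_to_retain(layer_a, layer_b, gamma=0.05)
print(retained)  # [7]
```

Tokens outside the retained set would be merged across the layer pair; the retained ones keep their original per-layer states.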

Infra Optimization

lower GPU memory cost per request, higher serving throughput

Model Optimization

training-free (post-training)

System Optimization

enables larger batch-serving, reduces peak GPU memory

Inference Optimization

KV Cache Optimization, Context Compression, Merge-from-middle (S=L/2)

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://minicache.vmv.re

https://arxiv.org/abs/2405.14366

Risks & Boundaries

Limitations

The SLERP merge interpolates only two layers at a time; it cannot directly merge more than two layers at once.

Relies on high similarity in middle-to-deep layers; shallow layers show low mergeability.

When Not To Use

When shallow layers carry unique, layer-specific signals you cannot afford to lose.

If your deployment cannot accept any risk of quality change (zero-risk scenarios).

Failure Modes

Merging low-similarity token pairs causes performance drops if the retention threshold γ is set too low, so that too few outlier tokens are kept unmerged.

A poorly chosen interpolation parameter t can bias the merged directions toward one layer and hurt accuracy.

Core Entities

Models

LLaMA-2-7B, LLaMA-2-13B, LLaMA-3-8B, LLaMA-3-70B, Phi-3-Mini, Mistral-7B, Mixtral-8x7B

Metrics

compression ratio, decoding throughput, peak GPU memory, accuracy

Datasets

ShareGPT, LongBench, GSM8K, COQA, TruthfulQA, COPA, MathQA, OpenBookQA, PIQA, RTE, WinoGrande, XSUM, CNN/DailyMail

Benchmarks

LongBench, lm-eval-harness (selected tasks)
