MiniCache: merge adjacent layers' KV caches to cut memory and speed up LLM inference

May 23, 20247 min

Overview

Production Readiness

0.8

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

3

Authors

Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang

Links

Abstract / PDF

Why It Matters For Business

MiniCache cuts KV cache memory by up to 41% and can raise throughput ~5× without retraining, enabling lower GPU costs, larger batches, and longer contexts for production LLM services.

Summary TLDR

MiniCache compresses the Key-Value (KV) cache used during autoregressive decoding by merging KV states across adjacent transformer layers (depth dimension). It decomposes each KV vector into direction and magnitude, interpolates directions (SLERP), and retains a small set of outlier tokens to avoid quality loss. The method is training-free, works with quantization, and on evaluated models (LLaMA-2/3, Phi-3, Mistral/Mixtral) reaches up to 5.02× compression, ~5× throughput, and ~41% memory reduction versus an FP16 full cache baseline with near-lossless accuracy on tested benchmarks.

Problem Statement

KV cache size grows linearly with sequence length and becomes the dominant GPU memory cost during generation. Existing cache compression focuses inside each layer (quantize/prune per-layer). Cross-layer redundancy (similar KV states across depth) is under-exploited but promising for reducing memory while keeping quality.

Main Contribution

Introduce MiniCache, a training-free method that merges KV cache states across adjacent layers to reduce memory.

Decompose KV vectors into direction and magnitude; interpolate directions with SLERP while preserving magnitudes to reduce information loss.

Add a lightweight token retention rule to keep distinct tokens unmerged and prevent quality drops.

Show compatibility with KV quantization and demonstrate large memory and throughput gains across multiple LLM families and benchmarks.

Key Findings

Up to 5.02× KV cache compression when combined with 4-bit KV quantization.

Numbers5.02× compression (Table 1, LongBench average)

Throughput increased by about 5× versus FP16 full-cache baseline in batch-serving tests.

Numbers≈5× throughput gain (ShareGPT synthetic workloads, Figure 5)

Memory footprint reduced by ~41% compared to FP16 full cache.

Numbers41% memory reduction (LLaMA-2-7B, batch 128, Figure 5)

Merging works especially well in middle-to-deep layers; up to 87.5% of layers merged with near-zero drop on COQA for LLaMA-3-70B.

Numbers87.5% layers merged with near-zero drop (Section 5, Main results)

Results

Compression ratio

Value5.02×

BaselineFP16 full cache

Decoding throughput

Value≈5×

BaselineFP16 full cache

Peak GPU memory

Value41% reduction

BaselineFP16 full cache

Throughput vs 2-bit KIVI

Value1.29×

BaselineKIVI 2-bit

Layer merge robustness

ValueNear-zero quality drop

Baselinefull cache

Who Should Care

What To Try In 7 Days

Run MiniCache code on a dev GPU with your model and sample workloads (project: https://minicache.vmv.re).

Start merging from model midpoint (S = L/2) and use t≈0.6 and retention γ≈0.05 as initial hyperparameters.

Combine MiniCache with an existing KV quantization (e.g., KIVI 4-bit) to maximize memory savings.

Optimization Features

Token Efficiency

  • token retention (keep 5% by default)
  • retention threshold γ=0.05

Infra Optimization

  • lower GPU memory cost per request
  • higher serving throughput

Model Optimization

  • training-free (post-training)

System Optimization

  • enables larger batch-serving
  • reduces peak GPU memory

Inference Optimization

  • KV Cache Optimization
  • Context Compression
  • Merge-from-middle (S=L/2)

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • SLERP merge supports only two-layer interpolation; cannot directly merge many layers at once.
  • Relies on high similarity in middle-to-deep layers; shallow layers show low mergeability.
  • Retained-token overhead grows if many tokens are distinct, reducing net compression.
  • Experiments cover mainstream open models and selected benchmarks; other models or tasks may differ.

When Not To Use

  • When shallow layers carry unique, layer-specific signals you cannot afford to lose.
  • If your deployment cannot accept any risk of quality change (zero-risk scenarios).
  • When inter-layer similarity is low for your model or domain.

Failure Modes

  • Merging low-similarity token pairs causes performance drops if retention threshold is set too low.
  • Wrong interpolation parameter t can bias merged directions and hurt accuracy.
  • Attempting multi-layer merges with SLERP may fail because SLERP is defined for pairs.

Core Entities

Models

  • LLaMA-2-7B
  • LLaMA-2-13B
  • LLaMA-3-8B
  • LLaMA-3-70B
  • Phi-3-Mini
  • Mistral-7B
  • Mixtral-8x7B

Metrics

  • compression ratio
  • decoding throughput
  • peak GPU memory
  • Accuracy

Datasets

  • ShareGPT
  • LongBench
  • GSM8K
  • COQA
  • TruthfulQA
  • COPA
  • MathQA
  • OpenBookQA
  • PIQA
  • RTE
  • WinoGrande
  • XSUM
  • CNN/DailyMail

Benchmarks

  • LongBench
  • lm-eval-harness (selected tasks)