Reuse multimodal KV caches at any position to cut first-token latency and double serving throughput

Overview

Decision Snapshot: Needs Validation

MPIC is a systems-level solution tested on common MLLMs and datasets, with clear runtime gains; adopt in services that reuse multimodal content after validating k and storage costs.

Citations: 1

Evidence Strength: 0.75

Confidence: 0.78

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Shiju Zhao, Junhao Hu, Rongxiao Huang, Jiaqi Zheng, Guihai Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

For multimodal services that reuse the same images or files, MPIC cut TTFT by up to 54.1% and raised serving throughput by up to 2× in the reported tests. That lowers per-request compute and improves capacity without changing model weights.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

MPIC is a serving system and algorithm that stores KV caches for images and other multimodal content on disk, loads them in parallel with computation, and uses a selective-attention "partial reuse" method that recomputes only the text tokens plus the first k image tokens. In the reported experiments this reduces Time-to-First-Token (TTFT) by up to 54.1% versus prefix caching and improves throughput by up to 2× in online simulations, while limiting quality loss. Evaluations use LLaVA-1.6 variants on MMDU and SparklesEval; the reported quality loss (GPT score) is up to ~13.6% in the tested settings.

Problem Statement

Existing context caching reuses KV caches only for exact prefixes. Small changes at the start of a prompt force full recomputation, which is costly for multimodal prompts where image KV caches are large. Fully reusing cached KV for arbitrary positions breaks autoregressive attention and hurts quality. The problem: enable position-independent reuse for multimodal inputs without large quality loss or extra engine passes.
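The difference between prefix-keyed and position-independent lookup can be illustrated with a toy cache. This is a minimal sketch, not the paper's implementation: the hashing scheme, the chunking into text and image segments, and all function names are assumptions made for illustration.

```python
import hashlib

def _digest(tokens):
    """Hash a token sequence into a cache key."""
    return hashlib.sha256(repr(tuple(tokens)).encode()).hexdigest()

def prefix_hits(cached_prompt, new_prompt):
    """Prefix caching: a chunk is reusable only if every token before it matches."""
    cached, seen = set(), []
    for chunk in cached_prompt:
        seen.extend(chunk)
        cached.add(_digest(seen))  # key covers the whole prefix so far
    hits, seen = 0, []
    for chunk in new_prompt:
        seen.extend(chunk)
        if _digest(seen) not in cached:
            break  # first mismatch invalidates everything after it
        hits += 1
    return hits

def content_hits(cached_prompt, new_prompt):
    """Position-independent lookup: a chunk is keyed by its own content only."""
    cached = {_digest(chunk) for chunk in cached_prompt}
    return sum(_digest(chunk) in cached for chunk in new_prompt)

image = list(range(1000, 1008))          # stand-in for one image's tokens
old = [[1, 2, 3], image, [4, 5]]         # previously served prompt
new = [[9, 2, 3], image, [4, 5]]         # only the first text token changed

print(prefix_hits(old, new))   # 0: the changed opening token breaks the prefix
print(content_hits(old, new))  # 2: the image and trailing text are still found
```

Finding a chunk by content is only the lookup half of the problem. As the paragraph above notes, reusing the found KV entries unchanged at a new position hurts quality, which is what the partial recomputation in MPIC addresses.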

Main Contribution

A practical position-independent context caching system (MPIC) for multimodal LLM serving.

Selective-attention algorithm (MPIC-k) that recomputes all text tokens plus the first k image tokens to avoid attention misalignment.
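The token selection described above can be sketched as a small planning function. This is an illustrative sketch under stated assumptions: the segment representation and the name `plan_recompute` are hypothetical, and applying k separately to each image is an assumption, not something the summary confirms.

```python
def plan_recompute(segments, k):
    """Split prompt positions into those to recompute and those to reuse.

    segments: list of (kind, length) pairs, kind being "text" or "image".
    k: number of leading tokens of each image to recompute (assumed per image).
    Returns (recompute_positions, reuse_positions) as lists of absolute indices.
    """
    recompute, reuse = [], []
    pos = 0
    for kind, length in segments:
        if kind == "text":
            # Text tokens are always recomputed.
            recompute.extend(range(pos, pos + length))
        elif kind == "image":
            head = min(k, length)
            # First k image tokens are recomputed, the rest come from the cache.
            recompute.extend(range(pos, pos + head))
            reuse.extend(range(pos + head, pos + length))
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
        pos += length
    return recompute, reuse

segments = [("text", 4), ("image", 10), ("text", 3)]
recompute, reuse = plan_recompute(segments, k=2)
print(recompute)  # [0, 1, 2, 3, 4, 5, 14, 15, 16]
print(reuse)      # [6, 7, 8, 9, 10, 11, 12, 13]
```

With k much smaller than the image length, most image positions fall in the reuse list, which is where the prefill savings would come from.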

Key Findings

MPIC-32 reduces Time-to-First-Token (TTFT) by up to 54.1% versus prefix caching.

Numbers: TTFT reduced by up to 54.1% (Fig. 9; §5.2)

Practical Use: If you precompute image KV caches and apply MPIC-k, expect up to roughly half the prefill latency of prefix-only caching on similar workloads.

Evidence Ref: Fig. 9, §5.2

Online simulation shows MPIC improving throughput by up to 2.0× over prefix caching and CacheBlend.

Numbers: Throughput improved by up to 2.0× (Fig. 10; §5.3)

Practical Use: At high request rates, MPIC can sustain up to roughly twice the throughput of prefix caching or CacheBlend in the tested setup.

Evidence Ref: Fig. 10, §5.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Time-to-First-Token (TTFT)	down by up to 54.1%	prefix caching	up to −54.1%	MMDU / LLaVA-1.6 variants (offline)	Fig. 9; authors report MPIC-32 reduces TTFT by up to 54.1% (§5.2)	Fig. 9, §5.2
Throughput (online simulation)	up to 2.0×	CacheBlend / prefix caching	up to 2.0× (about +100%)	Simulated request traces from MMDU (vLLM API) (§5.3)	Fig. 10; authors report 2.0× improvement in throughput	Fig. 10, §5.3

What To Try In 7 Days

Precompute and store KV caches for frequently referenced images, and set up a disk-backed static library.

Implement parallel disk-to-GPU loading for KV caches and benchmark TTFT vs prefix caching on a small workload.

Calibrate k (the number of initial image tokens to recompute) by measuring quality on a held-out set, then deploy MPIC-k in traffic shadowing.
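The calibration step above can be sketched as a search for the smallest k that keeps quality within a tolerance of full recomputation. This is a sketch under assumptions: `choose_k` and `evaluate` are hypothetical names, and the scores below are made-up placeholders, not results from the paper.

```python
def choose_k(candidates, evaluate, baseline_score, max_relative_drop):
    """Return (k, score) for the smallest k whose quality drop is acceptable.

    candidates: iterable of k values to try.
    evaluate: callable k -> quality score on a held-out set (hypothetical hook).
    baseline_score: score with full recomputation (no cache reuse).
    max_relative_drop: tolerated drop as a fraction of the baseline, e.g. 0.10.
    Returns (None, None) if no candidate meets the tolerance.
    """
    for k in sorted(candidates):
        score = evaluate(k)
        drop = (baseline_score - score) / baseline_score
        if drop <= max_relative_drop:
            return k, score
    return None, None

# Placeholder held-out scores per k, invented purely to exercise the function.
fake_scores = {0: 5.0, 8: 6.5, 32: 7.4, 128: 7.8}

best_k, best_score = choose_k(
    candidates=fake_scores,
    evaluate=fake_scores.__getitem__,
    baseline_score=8.0,
    max_relative_drop=0.10,
)
print(best_k, best_score)  # 32 7.4
```

Searching from the smallest k upward favors latency: a smaller k means fewer recomputed tokens, so the first k that passes the quality bar is also the cheapest one that does.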

Optimization Features

Token Efficiency

Recompute only text + k initial image tokens

Infra Optimization

Layer-wise / parallel transfer from disk to GPU (implemented)

Fallback to full compute when disk load fails

System Optimization

Disk-backed static/dynamic KV libraries

Linker that merges caches position-independently

Inference Optimization

Selective attention partial recompute (MPIC-k)

Single-step linking to avoid two engine passes

Parallel KV cache load and compute
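Two of the features listed above, overlapping cache loads with computation and falling back to full compute when a load fails, can be sketched with a thread pool. This is a toy sketch, not the system's code: the three callables are hypothetical hooks, strings stand in for KV tensors, and the layer-wise disk-to-GPU transfer is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor

def prefill_with_cache(image_ids, load_kv, compute_text, compute_full):
    """Overlap cache loading with text computation, with per-image fallback.

    load_kv(image_id): reads a cached KV entry, raising OSError on failure.
    compute_text(): computes the text-token part of the prefill.
    compute_full(image_id): recomputes an image's KV from scratch.
    Returns (text_state, kv_by_image, fallback_ids).
    """
    kv, fallback = {}, []
    with ThreadPoolExecutor(max_workers=max(1, len(image_ids))) as pool:
        # Start all loads first so they run in the background.
        futures = {img: pool.submit(load_kv, img) for img in image_ids}
        # Text computation proceeds while the loads are in flight.
        text_state = compute_text()
        for img, fut in futures.items():
            try:
                kv[img] = fut.result()
            except OSError:
                # Disk load failed: recompute this image fully instead.
                kv[img] = compute_full(img)
                fallback.append(img)
    return text_state, kv, fallback

store = {"cat.png": "kv(cat.png)"}  # pretend on-disk library

def load_kv(image_id):
    if image_id not in store:
        raise FileNotFoundError(image_id)  # subclass of OSError
    return store[image_id]

text_state, kv, fallback = prefill_with_cache(
    ["cat.png", "dog.png"],
    load_kv,
    compute_text=lambda: "kv(text)",
    compute_full=lambda img: f"recomputed({img})",
)
print(kv)        # {'cat.png': 'kv(cat.png)', 'dog.png': 'recomputed(dog.png)'}
print(fallback)  # ['dog.png']
```

Tracking the fallback list separately makes it easy to monitor how often the slow path is taken, which matters because each fallback gives up the latency benefit for that image.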

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

Supplementary material referenced in paper (authors state code and detailed results provided)

Data URLs

MMDU (referenced; Liu et al. 2024d)

SparklesEval (referenced; Huang et al. 2024)

Risks & Boundaries

Limitations

Reported quality loss up to ~13.6% in evaluated open-question GPT-score settings.

Requires storing large KV caches on disk (single-image KV can reach ~1 GB).
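The ~1 GB storage figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes a Llama-7B-style configuration (32 layers, 32 KV heads, head dimension 128, fp16 values) and a hypothetical count of 2,000 tokens for one high-resolution image. These numbers are assumptions for illustration, not values taken from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of a KV cache: one key and one value vector per token, per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens

# Assumed 7B-class configuration with fp16 storage.
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
image_tokens = 2000  # hypothetical token count for one image
per_image = kv_cache_bytes(image_tokens, LAYERS, KV_HEADS, HEAD_DIM)

print(per_token)                      # 524288 bytes, i.e. 0.5 MiB per token
print(round(per_image / 2**30, 2))    # 0.98 (GiB) for the assumed image
```

Under these assumptions a library of 1,000 cached images would need on the order of 1 TB of disk, so storage cost should be budgeted alongside the latency gains. Models that use grouped-query attention have fewer KV heads and would need proportionally less.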

When Not To Use

Workloads with no repeated multimodal references or low cache hit rates.

Use cases with zero tolerance for any quality change.

Failure Modes

Disk-load failures cause fallback to full recompute, increasing latency.

Underestimating k can leave attention sinks unaddressed and degrade outputs.

Core Entities

Models

LLaVA-1.6-vicuna-7B

LLaVA-1.6-mistral-7B

InternVL-2.5

Metrics

Time-to-First-Token (TTFT)

Throughput (tokens/sec)

GPT score (quality)

Datasets

MMDU

SparklesEval

Benchmarks

GPT-assisted score (GPT score judge)
