Reuse multimodal KV caches at any position to cut first-token latency and double serving throughput

February 4, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

MPIC is a systems-level solution tested on common MLLMs and datasets, with clear runtime gains; adopt in services that reuse multimodal content after validating k and storage costs.

Citations: 1

Evidence Strength: 0.75

Confidence: 0.78

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Shiju Zhao, Junhao Hu, Rongxiao Huang, Jiaqi Zheng, Guihai Chen

Why It Matters For Business

For multimodal services that reuse the same images or files, MPIC cut Time-to-First-Token (TTFT) by up to 54.1% and raised serving throughput by up to 2× in the reported experiments. That lowers per-request compute and improves capacity without changing model weights.

Summary TLDR

MPIC is a system and algorithm that stores KV caches for images and other multimodal content on disk, loads them in parallel with computation, and uses a selective-attention "partial reuse" method that recomputes only the text tokens plus the first k image tokens. This preserves generation quality while reducing Time-to-First-Token (TTFT) by up to 54.1% versus prefix caching and improving throughput by up to 2× in online simulations. Evaluations use LLaVA-1.6 variants on MMDU and SparklesEval; the reported quality loss stays within about 13.6% in the tested settings.

Problem Statement

Existing context caching reuses KV caches only for exact prefixes. Small changes at the start of a prompt force full recomputation, which is costly for multimodal prompts where image KV caches are large. Fully reusing cached KV for arbitrary positions breaks autoregressive attention and hurts quality. The problem: enable position-independent reuse for multimodal inputs without large quality loss or extra engine passes.
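
The prefix-only limitation can be illustrated with a small sketch. It assumes a block-hashed prefix cache in which each block's key depends on every token before it, and contrasts that with keying an image's cache by its content alone. The block size, token lists, and function names are hypothetical and are not taken from MPIC.

```python
import hashlib

BLOCK = 4  # tokens per cache block (tiny, for illustration only)

def prefix_block_keys(tokens):
    """Key each full block by a hash of all tokens up to and including it."""
    keys = []
    running = hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK
    for start in range(0, full, BLOCK):
        running.update(repr(tokens[start:start + BLOCK]).encode())
        keys.append(running.hexdigest())
    return keys

def content_key(image_bytes):
    """Key an image's KV cache by the image content alone (position-independent)."""
    return hashlib.sha256(image_bytes).hexdigest()

image_tokens = ["<img>"] * 8
prompt_a = ["Describe", "this", "photo", ":"] + image_tokens
prompt_b = ["Please", "describe", "this", "photo"] + image_tokens  # reworded opening

# Chained prefix keys: the first block differs, so no later block can match.
shared_prefix_blocks = len(
    set(prefix_block_keys(prompt_a)) & set(prefix_block_keys(prompt_b))
)

# Content key: the same image maps to the same key regardless of the prompt.
same_image_key = content_key(b"raw image bytes") == content_key(b"raw image bytes")

print(shared_prefix_blocks, same_image_key)
```

Here the two prompts contain identical image tokens, yet the chained keys share no block because the opening text differs. A content key still finds the image, but the cached values were computed at a different position, which is the quality problem the partial-reuse method targets.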

Main Contribution

A practical position-independent context caching system (MPIC) for multimodal LLM serving.

A selective-attention algorithm (MPIC-k) that recomputes all text tokens plus the first k image tokens to avoid attention misalignment.
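
A minimal sketch of the token selection this implies, assuming the prompt is described as a list of (kind, length) segments and that k applies to each image separately. Both the representation and the per-image reading of k are assumptions made for illustration, not details confirmed by the summary.

```python
from typing import List, Tuple

def tokens_to_recompute(segments: List[Tuple[str, int]], k: int) -> List[int]:
    """Return prompt positions to recompute: all text tokens plus the
    first k tokens of every image segment. Other image tokens reuse cached KV."""
    positions = []
    offset = 0
    for kind, length in segments:
        if kind == "text":
            positions.extend(range(offset, offset + length))
        elif kind == "image":
            positions.extend(range(offset, offset + min(k, length)))
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
        offset += length
    return positions

# Hypothetical prompt: 3 text tokens, a 10-token image, 2 text tokens, another image.
segments = [("text", 3), ("image", 10), ("text", 2), ("image", 10)]
recompute = tokens_to_recompute(segments, k=2)
print(recompute)  # [0, 1, 2, 3, 4, 13, 14, 15, 16]
```

In this toy prompt only 9 of 25 positions are recomputed. With real image segments of hundreds or thousands of tokens, the reused share is far larger.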

Key Findings

MPIC-32 reduces Time-to-First-Token (TTFT) by up to 54.1% versus prefix caching.

Numbers: TTFT reduced by up to 54.1% (Fig. 9; §5.2)

Practical Use: If you precompute image KV caches and apply MPIC-k, expect prefill latency to drop by up to roughly half compared with prefix-only caching on similar workloads.

Evidence Ref: Fig. 9, §5.2

In online simulation, MPIC improves throughput by up to 2.0× over state-of-the-art baselines.

Numbers: Throughput improved by up to 2.0× (Fig. 10; §5.3)

Practical Use: At high request rates, MPIC sustained up to roughly twice the throughput of prefix caching or CacheBlend in the tested setup.

Evidence Ref: Fig. 10, §5.3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Time-to-First-Token (TTFT) vs prefix caching | down by up to 54.1% | prefix caching | −54.1% TTFT | MMDU / LLaVA-1.6 variants (offline) | Fig. 9; authors report MPIC-32 reduces TTFT by up to 54.1% (§5.2) | Fig. 9, §5.2
Throughput (online simulation) | up to 2.0× | CacheBlend / prefix caching | 2.0× throughput | Simulated request traces from MMDU (vLLM API) (§5.3) | Fig. 10; authors report 2.0× improvement in throughput | Fig. 10, §5.3

What To Try In 7 Days

Precompute and store KV caches for frequently referenced images and set up a disk-backed static library.

Implement parallel disk-to-GPU loading for KV caches and benchmark TTFT vs prefix caching on a small workload.

Calibrate k (the number of initial image tokens to recompute) by measuring quality on a held-out set, then run MPIC-k on shadow traffic before serving live requests.
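
The calibration step can be sketched as picking the smallest k whose held-out quality stays within a tolerance of full recomputation. The function name, the tolerance, and every score below are made up for illustration; they are not results from the paper.

```python
def pick_k(quality_by_k, full_recompute_quality, max_relative_drop):
    """Return the smallest k whose measured quality is within
    max_relative_drop of the full-recompute score, or None if none qualifies."""
    threshold = full_recompute_quality * (1.0 - max_relative_drop)
    for k in sorted(quality_by_k):
        if quality_by_k[k] >= threshold:
            return k
    return None

# Hypothetical held-out scores for several k values (not from the paper).
measured = {0: 5.1, 8: 6.0, 32: 6.6, 128: 6.7}
chosen = pick_k(measured, full_recompute_quality=6.8, max_relative_drop=0.05)
print(chosen)  # 32
```

Choosing the smallest qualifying k keeps recomputation, and therefore TTFT, as low as the quality budget allows.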

Optimization Features

Token Efficiency
Recompute only text + k initial image tokens

Infra Optimization
Layer-wise / parallel transfer from disk to GPU (implemented)
Fallback to full compute when disk load fails

System Optimization
Disk-backed static/dynamic KV libraries
Linker that merges caches position-independently

Inference Optimization
Selective attention partial recompute (MPIC-k)
Single-step linking to avoid two engine passes
Parallel KV cache load and compute
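
The layer-wise overlap of loading and compute can be sketched with a one-layer-ahead prefetch. This is a generic pipelining pattern under stated assumptions, not the authors' implementation: `load_layer` and `compute_layer` are hypothetical stand-ins for the disk read and the per-layer forward pass.

```python
from concurrent.futures import ThreadPoolExecutor

def load_layer(layer_idx):
    """Stand-in for reading one layer's cached KV from disk (hypothetical)."""
    return f"kv[{layer_idx}]"

def compute_layer(layer_idx, kv):
    """Stand-in for the per-layer forward pass over the recomputed tokens."""
    return f"out[{layer_idx}] using {kv}"

def run_pipelined(num_layers):
    """Prefetch layer i+1 in a background thread while layer i is computed."""
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, 0)
        for i in range(num_layers):
            kv = pending.result()  # wait for this layer's cache
            if i + 1 < num_layers:
                pending = pool.submit(load_layer, i + 1)  # start the next load
            outputs.append(compute_layer(i, kv))  # overlaps with that load
    return outputs

print(run_pipelined(3))
```

With real I/O and GPU work, the load of the next layer hides behind the compute of the current one, so total time approaches the larger of the two costs rather than their sum.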

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Code URLs

Supplementary material referenced in paper (authors state code and detailed results provided)

Data URLs

MMDU (referenced; Liu et al. 2024d)
SparklesEval (referenced; Huang et al. 2024)

Risks & Boundaries

Limitations

Reported quality loss up to ~13.6% in evaluated open-question GPT-score settings.

Requires storing large KV caches on disk (single-image KV can reach ~1 GB).
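
A back-of-the-envelope calculation shows how a single image can reach that size. The configuration below (32 layers, 32 KV heads, head dimension 128, 16-bit values) is an assumed 7B-class setup without grouped-query attention, and the 2,000-token image is a hypothetical figure. None of these numbers come from the summary.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of the KV cache for num_tokens tokens.
    The leading factor of 2 counts one key and one value tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens

# Assumed 7B-class configuration (illustrative, not from the paper).
per_token = kv_cache_bytes(1, num_layers=32, num_kv_heads=32, head_dim=128)
image_kv = kv_cache_bytes(2000, num_layers=32, num_kv_heads=32, head_dim=128)

print(per_token)       # 524288 bytes, i.e. 0.5 MiB per token
print(image_kv / 1e9)  # 1.048576, i.e. about 1 GB for a 2,000-token image
```

Models that use grouped-query attention have fewer KV heads and a proportionally smaller cache, so storage planning should use the actual model configuration.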

When Not To Use

Workloads with no repeated multimodal references or low cache hit rates.

Use cases with zero tolerance for any quality change.
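
The hit-rate point can be made concrete with a simple two-outcome latency model: a request either finds its multimodal content cached or pays the uncached cost. The model and the latencies below are illustrative assumptions, not measurements from the paper.

```python
def expected_ttft(hit_rate, ttft_hit, ttft_miss):
    """Average TTFT when a fraction hit_rate of requests reuse cached KV."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * ttft_hit + (1.0 - hit_rate) * ttft_miss

# Hypothetical latencies in seconds: a cache hit halves TTFT.
low = expected_ttft(0.1, ttft_hit=0.5, ttft_miss=1.0)   # about 0.95 s
high = expected_ttft(0.9, ttft_hit=0.5, ttft_miss=1.0)  # about 0.55 s
print(round(low, 2), round(high, 2))
```

At a 10% hit rate the average barely moves, while the storage and engineering costs are paid in full, which is why low-reuse workloads are a poor fit.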

Failure Modes

Disk-load failures cause fallback to full recompute, increasing latency.

Underestimating k can leave attention sinks unaddressed and degrade outputs.
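
The disk-load fallback in the first failure mode can be sketched as a guarded read. The function name, the raw-bytes cache format, and the `recompute` callback are hypothetical; a real system would deserialize tensors and would likely log the miss.

```python
import os
import tempfile

def load_or_recompute(cache_path, recompute):
    """Try to read a cached KV blob from disk; on any I/O error,
    fall back to recomputing it. Returns (data, source)."""
    try:
        with open(cache_path, "rb") as f:
            return f.read(), "cache"
    except OSError:
        return recompute(), "recompute"

with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, "image.kv")
    with open(good_path, "wb") as f:
        f.write(b"cached-kv")

    hit = load_or_recompute(good_path, lambda: b"fresh-kv")
    miss = load_or_recompute(os.path.join(tmp, "missing.kv"), lambda: b"fresh-kv")

print(hit, miss)
```

The fallback keeps requests correct when the disk fails, but every such request pays the full prefill cost, so fallback frequency is worth monitoring alongside TTFT.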

Core Entities

Models

LLaVA-1.6-vicuna-7B
LLaVA-1.6-mistral-7B
InternVL-2.5

Metrics

Time-to-First-Token (TTFT)
Throughput (tokens/sec)
GPT score (quality)

Datasets

MMDU
SparklesEval

Benchmarks

GPT-assisted score (GPT score judge)