Reuse multimodal KV caches at any position to cut first-token latency and double serving throughput

February 4, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

1

Authors

Shiju Zhao, Junhao Hu, Rongxiao Huang, Jiaqi Zheng, Guihai Chen

Links

Abstract / PDF

Why It Matters For Business

For multimodal services that reuse the same images or files, MPIC can cut prefill latency roughly in half and double serving throughput, lowering per-request compute and improving capacity without changing model weights.

Summary TLDR

MPIC is a system and algorithm that stores KV caches for images and other multimodal content on disk, loads them in parallel with computation, and uses a selective-attention "partial reuse" method that recomputes only the text tokens plus the first k image tokens. This preserves most of the generation quality while reducing Time-to-First-Token (TTFT) by up to 54.1% vs. prefix caching and improving throughput by up to 2× in online simulations. Evaluations use LLaVA-1.6 variants on MMDU and SparklesEval; the reported quality loss (GPT-assisted score) stays within ~13.6% in the tested settings.
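The partial-reuse rule described above can be sketched as a per-token mask. This is a minimal illustration, not the authors' implementation: the function name, the `(kind, length)` segment format, and the choice to apply k to each image segment separately are all assumptions made for the example.

```python
from typing import List, Tuple


def recompute_mask(segments: List[Tuple[str, int]], k: int) -> List[bool]:
    """Return one flag per prompt token: True = recompute, False = reuse cached KV.

    segments: ordered (kind, length) pairs, where kind is "text" or "image".
    k: number of leading tokens of each image segment to recompute (assumed k >= 0).
    """
    mask: List[bool] = []
    for kind, length in segments:
        if kind == "text":
            # Text tokens are always recomputed.
            mask.extend([True] * length)
        elif kind == "image":
            # Only the first k image tokens are recomputed; the rest reuse the cache.
            head = min(k, length)
            mask.extend([True] * head + [False] * (length - head))
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return mask


# Hypothetical prompt layout: text, image, text, image.
prompt = [("text", 4), ("image", 10), ("text", 3), ("image", 6)]
mask = recompute_mask(prompt, k=2)
print(sum(mask), "of", len(mask), "tokens recomputed")
```

With this toy layout, 11 of 23 tokens are recomputed; on real prompts the image segments are far longer than the text, so the reused share is much larger.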

Problem Statement

Existing context caching reuses KV caches only for exact prefixes. Small changes at the start of a prompt force full recomputation, which is costly for multimodal prompts where image KV caches are large. Fully reusing cached KV for arbitrary positions breaks autoregressive attention and hurts quality. The problem: enable position-independent reuse for multimodal inputs without large quality loss or extra engine passes.
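To make the exact-prefix limitation concrete, the sketch below counts how many leading tokens a prefix cache could reuse. The token ids and the helper name are made up for illustration; real serving engines match on hashed blocks rather than single tokens, but the effect is the same.

```python
from typing import Sequence


def shared_prefix_len(cached: Sequence[int], request: Sequence[int]) -> int:
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for a, b in zip(cached, request):
        if a != b:
            break
        n += 1
    return n


image = list(range(1000, 1008))        # stand-in ids for one image's tokens
cached = [1, 2, 3] + image             # prompt whose KV cache was stored
same_start = [1, 2, 3] + image + [9]   # same opening, new question appended
new_start = [7, 2, 3] + image + [9]    # first token changed, image unchanged

print(shared_prefix_len(cached, same_start))  # 11: everything cached is reusable
print(shared_prefix_len(cached, new_start))   # 0: the image KV cannot be reused
```

In the second request the image tokens are byte-for-byte the same, yet a prefix cache reuses nothing because the mismatch occurs before them.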

Main Contribution

A practical position-independent context caching system (MPIC) for multimodal LLM serving.

Selective-attention algorithm (MPIC-k) that recomputes all text tokens plus the first k image tokens to avoid attention misalignment.

System design: static/dynamic KV libraries, parallel disk-to-GPU KV transfer, and single-step linking to avoid two engine passes.

Empirical evaluation showing ~54% TTFT reduction and up to 2× throughput gains with modest quality trade-offs on common MLLMs and datasets.

Key Findings

MPIC-32 reduces Time-to-First-Token (TTFT) by up to 54.1% versus prefix caching.

Numbers: TTFT reduced up to 54.1% (Fig. 9; §5.2)

Online simulation shows MPIC doubles throughput versus a state-of-the-art baseline.

Numbers: Throughput improved up to 2.0× (Fig. 10; §5.3)

Full reuse can cut TTFT more (up to 69.4%) but causes large quality degradation.

Numbers: Full reuse reduces TTFT up to 69.4% but degrades generation quality substantially (Fig. 3)

Attention on image tokens is highly concentrated: <5% of image tokens have attention >1e-3; first ~500 tokens account for ~80% of attention.

Numbers: <5% of tokens have attention >1e-3; first 500 tokens ≈ 80% of attention (Fig. 4)
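The concentration finding suggests one way to reason about k: find how many leading tokens are needed to cover a target share of the attention mass. The helper below is a hypothetical sketch with made-up integer weights, not a procedure taken from the paper.

```python
from typing import Sequence


def smallest_k_for_share(attn: Sequence[float], target: float) -> int:
    """Smallest k such that the first k tokens hold at least `target` of total attention."""
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    running = 0.0
    for i, weight in enumerate(attn, start=1):
        running += weight
        if running / total >= target:
            return i
    return len(attn)


# Made-up per-token attention mass for a 5-token "image", heavily front-loaded.
weights = [50, 30, 10, 5, 5]
print(smallest_k_for_share(weights, 0.75))  # 2: the first two tokens hold 80%
```

On real attention maps the same calculation would be averaged over layers, heads, and samples before choosing a value.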

MPIC processes prefill in a single engine pass using 'dummy cache' + selective replacement, avoiding extra engine invocations used by full-reuse approaches.

Numbers: Single-step process versus two-step full reuse (§4.1, §5.2)
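One way to picture the 'dummy cache' plus selective replacement idea is a placeholder buffer for the whole prompt into which cached entries are copied at whatever offset the image lands. The sketch below uses strings in place of KV tensors; the function name and data layout are assumptions for illustration, and the real linker operates inside the serving engine.

```python
from typing import List, Optional, Tuple


def link_caches(
    prompt_len: int,
    placements: List[Tuple[int, List[str]]],
    k: int,
) -> Tuple[List[Optional[str]], List[int]]:
    """Place cached entries into a placeholder buffer in a single pass.

    placements: (offset, cached_entries) pairs, one per cached image.
    k: leading entries of each image left empty so they get recomputed.
    Returns the linked buffer and the positions that still need computing.
    """
    buffer: List[Optional[str]] = [None] * prompt_len  # the "dummy" cache
    for offset, cached in placements:
        if offset < 0 or offset + len(cached) > prompt_len:
            raise ValueError("cached segment does not fit in the prompt")
        for i, entry in enumerate(cached):
            if i >= k:  # skip the first k entries; they are recomputed
                buffer[offset + i] = entry
    todo = [pos for pos, entry in enumerate(buffer) if entry is None]
    return buffer, todo


# One 4-token image cached earlier, now appearing at offset 3 of a 10-token prompt.
buffer, todo = link_caches(10, [(3, ["a0", "a1", "a2", "a3"])], k=1)
print(todo)  # positions to compute: text tokens plus the image's first token
```

The offset is supplied by the current prompt, not by the prompt that originally produced the cache, which is what makes the placement position-independent.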

Results

Time-to-First-Token (TTFT) vs prefix caching

Value: down by up to 54.1%

Baseline: prefix caching

Throughput (online simulation)

Value: up to 2.0×

Baseline: CacheBlend / prefix caching

Quality (GPT-assisted score) loss vs prefix caching

Value: within 13.6% loss

Baseline: prefix caching

Full reuse TTFT vs prefix caching

Value: down by up to 69.4%

Baseline: prefix caching
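A percentage reduction in TTFT can be restated as a speedup factor, which is often easier to compare with the throughput figure. The conversion is plain arithmetic, speedup = 1 / (1 - reduction); the helper name is chosen for this example.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional latency reduction (0.541 = 54.1%) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


for label, reduction in [("MPIC", 0.541), ("full reuse", 0.694)]:
    print(f"{label}: {speedup_from_reduction(reduction):.2f}x faster to first token")
```

The reported best cases work out to roughly 2.18× for MPIC and 3.27× for full reuse, both relative to prefix caching.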

Who Should Care

What To Try In 7 Days

Precompute and store KV caches for frequently referenced images and set up a disk-backed static library.

Implement parallel disk-to-GPU loading for KV caches and benchmark TTFT vs prefix caching on a small workload.

Calibrate k (the number of initial image tokens to recompute) by measuring quality on a held-out set, then deploy MPIC-k in traffic shadowing.
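The calibration step can be reduced to a simple selection rule: take the smallest k whose held-out quality stays within an accepted loss of the prefix-caching baseline. The scores below are invented placeholders, not results from the paper, and `pick_k` is a hypothetical helper.

```python
from typing import Dict, Optional


def pick_k(scores: Dict[int, float], baseline: float, max_loss: float) -> Optional[int]:
    """Smallest k whose score is at least baseline * (1 - max_loss), else None."""
    floor = baseline * (1.0 - max_loss)
    for k in sorted(scores):
        if scores[k] >= floor:
            return k
    return None


# Placeholder held-out scores per candidate k (made up for illustration).
held_out = {0: 5.0, 8: 6.1, 32: 6.6, 128: 6.9}
chosen = pick_k(held_out, baseline=7.0, max_loss=0.10)
print("chosen k:", chosen)
```

Returning `None` when no candidate meets the bar is deliberate: it signals that the workload should stay on prefix caching until a larger k is tested.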

Optimization Features

Token Efficiency

  • Recompute only text + k initial image tokens

Infra Optimization

  • Layer-wise / parallel transfer from disk to GPU (implemented)
  • Fallback to full compute when disk load fails
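The two points above can be sketched with the Python standard library: load several cache files concurrently and fall back to recomputation when a read fails. File contents stand in for KV tensors, and the function names and the `recompute` callback are assumptions for the example; a real system would move data to GPU memory layer by layer.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List


def load_or_recompute(path: Path, recompute: Callable[[Path], bytes]) -> bytes:
    """Read one cache file; if the read fails, fall back to recomputing it."""
    try:
        return path.read_bytes()
    except OSError:
        return recompute(path)


def load_all(paths: List[Path], recompute: Callable[[Path], bytes], workers: int = 4) -> List[bytes]:
    """Load all cache files in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: load_or_recompute(p, recompute), paths))


def fake_recompute(path: Path) -> bytes:
    # Placeholder for a full prefill of the missing item.
    return b"recomputed:" + path.name.encode()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.kv").write_bytes(b"kv-a")
    (root / "b.kv").write_bytes(b"kv-b")
    paths = [root / "a.kv", root / "b.kv", root / "missing.kv"]
    results = load_all(paths, fake_recompute)

print(results)
```

The missing file takes the slow path while the other two are served from disk, which mirrors the latency penalty of a fallback.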

System Optimization

  • Disk-backed static/dynamic KV libraries
  • Linker that merges caches position-independently
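A disk-backed library can be keyed by the content of the image, not by where it appears in a prompt, so the same entry is found regardless of position. The class below is a minimal sketch under that assumption; it does not model the static/dynamic split, eviction, or tensor serialization, and all names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional


class DiskKVLibrary:
    """Stores one cache blob per distinct content item, keyed by content hash."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def key_for(content: bytes) -> str:
        # The key depends only on the content bytes, never on prompt position.
        return hashlib.sha256(content).hexdigest()

    def put(self, content: bytes, kv_blob: bytes) -> str:
        key = self.key_for(content)
        (self.root / f"{key}.kv").write_bytes(kv_blob)
        return key

    def get(self, content: bytes) -> Optional[bytes]:
        path = self.root / f"{self.key_for(content)}.kv"
        return path.read_bytes() if path.exists() else None


library = DiskKVLibrary(Path(tempfile.mkdtemp()))
library.put(b"image-bytes-1", b"serialized-kv-1")
print(library.get(b"image-bytes-1"))  # hit
print(library.get(b"image-bytes-2"))  # miss: None
```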

Inference Optimization

  • Selective-attention partial recompute (MPIC-k)
  • Single-step linking to avoid two engine passes
  • Parallel KV cache load and compute
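Overlapping cache loading with computation can be illustrated with a background thread: the load starts first, the recompute runs in the foreground, and the two results are joined afterwards. The sleeps stand in for disk I/O and GPU work, and the function names are placeholders for this sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_load() -> str:
    time.sleep(0.3)  # stands in for reading cached KV from disk
    return "cached-image-kv"


def fake_compute() -> str:
    time.sleep(0.3)  # stands in for recomputing text and leading image tokens
    return "fresh-text-kv"


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(fake_load)  # load runs in the background
    computed = fake_compute()         # compute proceeds meanwhile
    loaded = pending.result()         # join before linking the two
elapsed = time.perf_counter() - start

print(loaded, computed, f"{elapsed:.2f}s")
```

Run back to back, the two steps would take about the sum of their durations; overlapped, the wall time approaches the longer of the two.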

Reproducibility

Code Urls

  • Supplementary material referenced in paper (authors state code and detailed results provided)

Data Urls

  • MMDU (referenced; Liu et al. 2024d)
  • SparklesEval (referenced; Huang et al. 2024)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Reported quality loss up to ~13.6% in evaluated open-question GPT-score settings.
  • Requires storing large KV caches on disk (single-image KV can reach ~1 GB).
  • Evaluations limited to two LLaVA-1.6 variants and two multimodal datasets.
  • Selecting k (number of initial image tokens to recompute) is a hyperparameter that must be tuned per model/workload.
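The storage limitation can be sanity-checked with a back-of-the-envelope size formula: two tensors (K and V) per layer, each holding one vector per KV head per token. The shape values below (32 layers, 32 KV heads, head dimension 128, 16-bit values) are assumed for a LLaMA-style 7B model, and the 2,048-token image is an illustrative count, not a figure from the paper.

```python
def kv_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of a KV cache in bytes: K and V, per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed shape of a LLaMA-style 7B model with 16-bit cache entries.
per_token = kv_bytes(1, layers=32, kv_heads=32, head_dim=128)
per_image = kv_bytes(2048, layers=32, kv_heads=32, head_dim=128)

print(per_token // 1024, "KiB per token")
print(per_image / 2**30, "GiB for a 2,048-token image")
```

Under these assumptions each token costs 512 KiB and a 2,048-token image reaches 1 GiB, which is in line with the ~1 GB figure above. Models with fewer KV heads than attention heads would store proportionally less.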

When Not To Use

  • Workloads with no repeated multimodal references or low cache hit rates.
  • Use cases with zero tolerance for any quality change.
  • Environments with insufficient disk/IO capacity or where per-item KV storage is infeasible.

Failure Modes

  • Disk-load failures cause fallback to full recompute, increasing latency.
  • Underestimating k can leave attention sinks unaddressed and degrade outputs.
  • Large storage footprint per image raises operational cost and eviction complexity.
  • Incorrect linking across user boundaries could expose data if access controls fail (system assumes per-user separation).
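One defensive measure against cross-user linking is to fold the owner's identity into the cache key, so a lookup by another user can never resolve to the same entry. This is an illustrative sketch, not a mechanism described in the paper; `scoped_key` and its inputs are hypothetical.

```python
import hashlib


def scoped_key(user_id: str, content: bytes) -> str:
    """Cache key bound to both the owner and the content bytes."""
    uid = user_id.encode("utf-8")
    digest = hashlib.sha256()
    # Length-prefix the user id so (user, content) pairs cannot collide by concatenation.
    digest.update(len(uid).to_bytes(4, "big"))
    digest.update(uid)
    digest.update(content)
    return digest.hexdigest()


image = b"same-image-bytes"
print(scoped_key("alice", image) == scoped_key("alice", image))  # True: stable per user
print(scoped_key("alice", image) == scoped_key("bob", image))    # False: isolated per user
```

The trade-off is that identical images uploaded by different users are stored twice, which adds to the storage footprint noted above.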

Core Entities

Models

  • LLaVA-1.6-vicuna-7B
  • LLaVA-1.6-mistral-7B
  • InternVL-2.5

Metrics

  • Time-to-First-Token (TTFT)
  • Throughput (tokens/sec)
  • GPT score (quality)

Datasets

  • MMDU
  • SparklesEval

Benchmarks

  • GPT-assisted score (GPT score judge)