Overview
MPIC is a systems-level solution evaluated on common multimodal LLMs (MLLMs) and datasets, with clear runtime gains. Adopt it in services that reuse multimodal content, after validating k (the number of initial image tokens to recompute) and disk storage costs.
Citations: 1
Evidence Strength: 0.75
Confidence: 0.78
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
For multimodal services that reuse the same images or files, MPIC can cut prefill latency roughly in half and double serving throughput, lowering per-request compute and improving capacity without changing model weights.
Who Should Care
Summary TLDR
MPIC is a system and algorithm that stores KV caches for images and other multimodal content on disk, loads and computes them in parallel, and uses a selective-attention "partial reuse" method to recompute only text tokens plus the first k image tokens. This preserves generation quality while reducing Time-to-First-Token (TTFT) by up to 54.1% vs. prefix caching and improving throughput by up to 2× in online simulations. Evaluations use LLaVA-1.6 variants on MMDU and SparklesEval; accuracy loss is reported within ~13.6% in the tested settings.
Problem Statement
Existing context caching reuses KV caches only for exact prefixes. Small changes at the start of a prompt force full recomputation, which is costly for multimodal prompts where image KV caches are large. Fully reusing cached KV for arbitrary positions breaks autoregressive attention and hurts quality. The problem: enable position-independent reuse for multimodal inputs without large quality loss or extra engine passes.
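The prefix-only limitation can be illustrated with a minimal sketch. The token IDs and the helper name `reusable_prefix_len` are hypothetical and chosen for illustration; this is not taken from any serving engine's code.

```python
from typing import Sequence


def reusable_prefix_len(cached: Sequence[int], request: Sequence[int]) -> int:
    """Return how many leading tokens of `request` match `cached` exactly."""
    n = 0
    for a, b in zip(cached, request):
        if a != b:
            break
        n += 1
    return n


# Hypothetical token IDs: 900-903 stand in for one image's tokens.
image = [900, 901, 902, 903]
cached_prompt = [1, 2] + image + [7, 8]
same_start = [1, 2] + image + [9]      # only the tail differs
new_start = [5, 2] + image + [7, 8]    # first token differs

print(reusable_prefix_len(cached_prompt, same_start))  # 6
print(reusable_prefix_len(cached_prompt, new_start))   # 0
```

In the second request the image tokens are unchanged, yet a single differing token at the start leaves nothing reusable under exact-prefix matching.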
Main Contribution
A practical position-independent context caching system (MPIC) for multimodal LLM serving.
Selective-attention algorithm (MPIC-k) that recomputes all text tokens plus k initial image tokens to avoid attention misalignment.
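The token-selection rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the segment representation and the function name `positions_to_recompute` are hypothetical, and applying k to each image separately is an assumption, since the summary does not say whether k is per image or per prompt.

```python
from typing import List, Tuple

Segment = Tuple[str, int]  # (kind, token_count); kind is "text" or "image"


def positions_to_recompute(segments: List[Segment], k: int) -> List[int]:
    """Return absolute token positions to recompute: every text token
    plus the first k tokens of each image segment (assumed per image)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    selected: List[int] = []
    offset = 0
    for kind, length in segments:
        if kind == "text":
            selected.extend(range(offset, offset + length))
        elif kind == "image":
            selected.extend(range(offset, offset + min(k, length)))
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
        offset += length
    return selected


prompt = [("text", 3), ("image", 6), ("text", 2), ("image", 4)]
print(positions_to_recompute(prompt, k=2))
# [0, 1, 2, 3, 4, 9, 10, 11, 12]
```

All remaining image positions would be served from the stored KV cache, so the recomputed share shrinks as images grow relative to text.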
Key Findings
MPIC-32 reduces Time-to-First-Token (TTFT) by up to 54.1% versus prefix caching.
Online simulation shows MPIC improves throughput by up to 2× versus a state-of-the-art baseline.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Time-to-First-Token (TTFT) vs prefix caching | up to 54.1% lower | prefix caching | up to −54.1% TTFT | MMDU / LLaVA-1.6 variants (offline) | Fig.9; authors report MPIC-32 reduces TTFT by up to 54.1% (§5.2) | Fig.9, §5.2 |
| Throughput (online simulation) | up to 2.0× | CacheBlend / prefix caching | up to 2.0× throughput | Simulated request traces from MMDU (vLLM API) (§5.3) | Fig.10; authors report 2.0× improvement in throughput | Fig.10, §5.3 |
What To Try In 7 Days
Precompute and store KV caches for frequently referenced images, and set up a disk-backed static library.
Implement parallel disk-to-GPU loading for KV caches and benchmark TTFT vs prefix caching on a small workload.
Calibrate k (initial image tokens to recompute) by measuring quality on a held-out set, then deploy MPIC-k in traffic shadowing.
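The calibration step could look like the sketch below: choose the smallest k whose held-out score stays within a relative quality budget of the full-recompute baseline. The function name, the budget parameter, and all scores are hypothetical placeholders, not results from the paper.

```python
from typing import Dict


def smallest_acceptable_k(scores: Dict[int, float], baseline: float,
                          max_rel_loss: float) -> int:
    """Pick the smallest k whose score is within max_rel_loss of baseline."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    for k in sorted(scores):
        rel_loss = (baseline - scores[k]) / baseline
        if rel_loss <= max_rel_loss:
            return k
    raise ValueError("no candidate k meets the quality budget")


# Hypothetical held-out scores per candidate k (not results from the paper).
held_out = {0: 5.1, 8: 6.0, 32: 6.6, 128: 6.8}
print(smallest_acceptable_k(held_out, baseline=7.0, max_rel_loss=0.10))  # 32
```

Preferring the smallest passing k keeps the recomputed token count, and therefore prefill time, as low as the quality budget allows.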
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Inference Optimization
Reproducibility
Code URLs
Data URLs
Risks & Boundaries
Limitations
Reported quality loss of up to ~13.6% in the evaluated open-question, GPT-scored settings.
Requires storing large KV caches on disk (single-image KV can reach ~1 GB).
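A back-of-the-envelope estimate shows how a single image can reach this scale. The model dimensions below (32 layers, 32 KV heads, head dimension 128, fp16 values) are assumed 7B-class values, and the 2,000-token image is hypothetical; none of these numbers come from the source.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes for keys and values (the factor 2) across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Assumed 7B-class dimensions and a hypothetical 2,000-token image.
size = kv_cache_bytes(num_tokens=2000, num_layers=32, num_kv_heads=32, head_dim=128)
print(f"{size / 1e9:.2f} GB")  # 1.05 GB
```

Under these assumptions each token costs 512 KiB of KV cache, so storage grows linearly with the number of cached image tokens.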
When Not To Use
Workloads with no repeated multimodal references or low cache hit rates.
Use cases with zero tolerance for any quality change.
Failure Modes
Disk-load failures cause fallback to full recompute, increasing latency.
Underestimating k can leave attention sinks unaddressed and degrade outputs.
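The disk-load failure mode suggests pairing parallel reads with an explicit recompute fallback. The sketch below covers only the disk-read stage and uses raw bytes as a stand-in for KV tensors; `load_kv_caches`, the `recompute` callback, and the file names are hypothetical, and the transfer to GPU memory is omitted.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List


def load_kv_caches(paths: Dict[str, Path],
                   recompute: Callable[[str], bytes],
                   max_workers: int = 4) -> Dict[str, bytes]:
    """Read cached KV blobs in parallel; recompute any that fail to load."""
    def load_one(item_id: str) -> bytes:
        try:
            return paths[item_id].read_bytes()
        except OSError:
            # Missing file or read error: fall back to a full recompute.
            return recompute(item_id)

    ids: List[str] = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(ids, pool.map(load_one, ids)))


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "img_a.kv"
    good.write_bytes(b"cached-a")
    missing = Path(tmp) / "img_b.kv"  # never written, so the read fails
    result = load_kv_caches(
        {"img_a": good, "img_b": missing},
        recompute=lambda item_id: f"recomputed-{item_id}".encode(),
    )
print(result)
```

Counting how often the fallback path runs is a simple way to spot cache-library problems before they show up as latency regressions.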

