Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
For multimodal services that reuse the same images or files, MPIC can cut Time-to-First-Token (TTFT) by up to 54.1% versus prefix caching and raise serving throughput by up to 2× in online simulation. That lowers per-request compute and improves capacity without changing model weights.
Summary TLDR
MPIC is a system and algorithm that stores KV caches for images and other multimodal content on disk, loads cached KV in parallel with computation, and uses a selective-attention "partial reuse" method that recomputes only text tokens plus the first k image tokens. This preserves generation quality while reducing Time-to-First-Token (TTFT) by up to 54.1% vs. prefix caching and improving throughput by up to 2× in online simulations. Evaluations use LLaVA-1.6 variants on MMDU and SparklesEval; quality loss (GPT-assisted score) is reported within ~13.6% in the tested settings.
Problem Statement
Existing context caching reuses KV caches only for exact prefixes. Small changes at the start of a prompt force full recomputation, which is costly for multimodal prompts where image KV caches are large. Fully reusing cached KV for arbitrary positions breaks autoregressive attention and hurts quality. The problem: enable position-independent reuse for multimodal inputs without large quality loss or extra engine passes.
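The difference between exact-prefix reuse and position-independent reuse can be shown with a toy example. This is a minimal sketch, not the paper's implementation: prompts are modeled as lists of chunk identifiers, and the function names and chunk labels are hypothetical.

```python
from typing import List, Set


def prefix_hit_length(cached: List[str], request: List[str]) -> int:
    """Count the leading chunks shared by a cached prompt and a new request."""
    n = 0
    for a, b in zip(cached, request):
        if a != b:
            break
        n += 1
    return n


def position_independent_hits(library: Set[str], request: List[str]) -> List[int]:
    """Return indices of request chunks found in the library, wherever they sit."""
    return [i for i, chunk in enumerate(request) if chunk in library]


cached_prompt = ["sys", "img_A", "img_B", "question_1"]
# One extra chunk at the start shifts every later chunk by one position.
new_prompt = ["greeting", "sys", "img_A", "img_B", "question_2"]

print(prefix_hit_length(cached_prompt, new_prompt))                # 0
print(position_independent_hits(set(cached_prompt), new_prompt))   # [1, 2, 3]
```

With prefix matching, the single inserted chunk invalidates everything; a content-keyed lookup still finds the system prompt and both images. Reusing those hits safely is what requires the partial recomputation described in this summary.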
Main Contribution
A practical position-independent context caching system (MPIC) for multimodal LLM serving.
Selective-attention algorithm (MPIC-k) that recomputes all text tokens plus k initial image tokens to avoid attention misalignment.
System design: static/dynamic KV libraries, parallel disk-to-GPU KV transfer, and single-step linking to avoid two engine passes.
Empirical evaluation showing up to 54.1% TTFT reduction and up to 2× throughput gains with modest quality trade-offs on common MLLMs and datasets.
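The selective recompute rule (all text tokens plus k initial image tokens) can be sketched as a boolean mask over prompt positions. This is an illustration under stated assumptions: the summary does not say whether k applies per image or once per prompt, so the sketch assumes per image, and the segment representation and function name are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[str, int]  # (kind, token_count); kind is "text" or "image"


def recompute_mask(segments: List[Segment], k: int) -> List[bool]:
    """True where a token is recomputed, False where cached KV is reused.

    Assumption: the first k tokens of EACH image segment are recomputed.
    """
    mask: List[bool] = []
    for kind, n in segments:
        if kind == "text":
            mask.extend([True] * n)          # text is always recomputed
        elif kind == "image":
            head = min(k, n)                 # guard against k > image length
            mask.extend([True] * head + [False] * (n - head))
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return mask


prompt = [("text", 4), ("image", 10), ("text", 3), ("image", 10)]
mask = recompute_mask(prompt, k=2)
print(sum(mask), len(mask))  # 11 27
```

Here 11 of 27 positions are recomputed: 7 text tokens and 2 head tokens from each image. Setting k=0 reuses every image token, which corresponds to the full-reuse behavior the summary reports as harmful to quality.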
Key Findings
MPIC-32 reduces Time-to-First-Token (TTFT) by up to 54.1% versus prefix caching.
Online simulation shows MPIC improves throughput by up to 2× versus a state-of-the-art baseline.
Full reuse can cut TTFT more (up to 69.4%) but causes large quality degradation.
Attention on image tokens is highly concentrated: <5% of image tokens have attention >1e-3; first ~500 tokens account for ~80% of attention.
MPIC processes prefill in a single engine pass using 'dummy cache' + selective replacement, avoiding extra engine invocations used by full-reuse approaches.
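The attention-concentration finding suggests two simple statistics to check on your own model: the share of attention mass held by the first n image tokens, and the fraction of tokens above a threshold. The sketch below computes both on synthetic weights; the numbers are made up for illustration and are not the paper's data, and the function name is hypothetical.

```python
from typing import List, Tuple


def attention_stats(weights: List[float], first_n: int,
                    threshold: float = 1e-3) -> Tuple[float, float]:
    """Return (share of total mass in the first_n tokens, fraction above threshold)."""
    total = sum(weights)
    head_share = sum(weights[:first_n]) / total
    frac_above = sum(1 for w in weights if w > threshold) / len(weights)
    return head_share, frac_above


# Synthetic example: 10 heavy head tokens followed by 1000 near-zero tokens.
weights = [0.08] * 10 + [0.0002] * 1000
head, frac = attention_stats(weights, first_n=10)
print(round(head, 3), round(frac, 4))  # 0.8 0.0099
```

In practice the weights would come from attention maps averaged over layers and heads, which is an additional choice this sketch leaves open.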
Results
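Time-to-First-Token (TTFT) vs prefix caching: up to 54.1% lower (MPIC-32)
Throughput (online simulation): up to 2×
Quality (GPT-assisted score) loss vs prefix caching: within ~13.6%
Full reuse TTFT vs prefix caching: up to 69.4% lower, with large quality degradation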
Who Should Care
What To Try In 7 Days
Precompute and store KV caches for frequently referenced images and set up a disk-backed static library.
Implement parallel disk-to-GPU loading for KV caches and benchmark TTFT vs prefix caching on a small workload.
Calibrate k (initial image tokens to recompute) by measuring quality on a held-out set, then deploy MPIC-k in traffic shadowing.
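One way to run the calibration step is to pick the smallest k whose quality loss against the prefix-caching baseline stays within a tolerance. The sketch below assumes you have already measured a score per candidate k; the scores, tolerance, and function name are placeholders, not measurements from the paper.

```python
from typing import Dict, Optional


def pick_k(scores_by_k: Dict[int, float], baseline: float,
           max_rel_loss: float) -> Optional[int]:
    """Return the smallest k whose relative quality loss is within max_rel_loss."""
    for k in sorted(scores_by_k):
        rel_loss = (baseline - scores_by_k[k]) / baseline
        if rel_loss <= max_rel_loss:
            return k
    return None  # no candidate meets the tolerance


# Placeholder held-out scores for illustration only.
held_out_scores = {0: 5.0, 8: 6.8, 32: 7.4, 128: 7.8}
print(pick_k(held_out_scores, baseline=8.0, max_rel_loss=0.10))  # 32
```

Smaller k means less recomputation and lower TTFT, so taking the smallest acceptable value keeps the latency benefit while bounding quality loss.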
Optimization Features
Token Efficiency
- Recompute only text + k initial image tokens
Infra Optimization
- Layer-wise / parallel transfer from disk to GPU (implemented)
- Fallback to full compute when disk load fails
System Optimization
- Disk-backed static/dynamic KV libraries
- Linker that merges caches position-independently
Inference Optimization
- Selective attention partial recompute (MPIC-k)
- Single-step linking to avoid two engine passes
- Parallel KV cache load and compute
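The overlap of cache loading with computation, together with the fallback to full compute on a failed load, can be sketched with a background thread. This is a simplified stand-in, not the authors' engine: `compute_kv` fakes model prefill, KV caches are pickled lists, and all names are hypothetical.

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def load_kv(path: str) -> list:
    """Read a pickled KV cache from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


def compute_kv(tokens: List[str]) -> list:
    """Stand-in for model prefill: one fake KV entry per token."""
    return [("kv", t) for t in tokens]


def prefill(image_path: str, image_tokens: List[str],
            text_tokens: List[str]) -> Tuple[list, str]:
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(load_kv, image_path)  # disk read runs in background
        text_kv = compute_kv(text_tokens)          # overlaps with the read
        try:
            image_kv, source = future.result(), "cache"
        except (OSError, pickle.UnpicklingError, EOFError):
            # Load failed: fall back to computing the image KV from scratch.
            image_kv, source = compute_kv(image_tokens), "recompute"
    return image_kv + text_kv, source


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "img_A.kv")
    with open(path, "wb") as f:
        pickle.dump(compute_kv(["a0", "a1"]), f)
    hit = prefill(path, ["a0", "a1"], ["q0"])
    miss = prefill(os.path.join(d, "missing.kv"), ["a0", "a1"], ["q0"])

print(hit[1], miss[1])  # cache recompute
```

Both paths return the same KV entries, so the fallback costs latency but not correctness. A real system would transfer tensors layer by layer to the GPU rather than unpickle Python objects.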
Reproducibility
Code Urls
- Supplementary material referenced in paper (authors state code and detailed results provided)
Data Urls
- MMDU (referenced; Liu et al. 2024d)
- SparklesEval (referenced; Huang et al. 2024)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Reported quality loss up to ~13.6% in evaluated open-question GPT-score settings.
- Requires storing large KV caches on disk (single-image KV can reach ~1 GB).
- Evaluations limited to two LLaVA-1.6 variants and two multimodal datasets.
- Selecting k (number of initial image tokens to recompute) is a hyperparameter that must be tuned per model/workload.
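The storage limitation can be sanity-checked with the standard KV cache size formula: 2 (keys and values) × layers × KV heads × head dimension × bytes per value × tokens. The sketch below assumes a 7B Llama-style configuration (32 layers, 32 KV heads, head dimension 128, fp16) and an illustrative count of 2048 image tokens; these figures are assumptions, not values taken from the paper.

```python
def kv_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for `tokens` tokens (factor 2 covers keys and values)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed 7B Llama-style shape in fp16; adjust for your model.
ASSUMED = dict(layers=32, kv_heads=32, head_dim=128, bytes_per_value=2)

per_token = kv_bytes(1, **ASSUMED)
per_image_gib = kv_bytes(2048, **ASSUMED) / 2**30

print(per_token)      # 524288
print(per_image_gib)  # 1.0
```

At roughly 0.5 MiB per token, an image encoded as a couple of thousand tokens lands near the ~1 GB scale cited above, which is why disk capacity and eviction policy matter.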
When Not To Use
- Workloads with no repeated multimodal references or low cache hit rates.
- Use cases with zero tolerance for any quality change.
- Environments with insufficient disk/IO capacity or where per-item KV storage is infeasible.
Failure Modes
- Disk-load failures cause fallback to full recompute, increasing latency.
- Underestimating k can leave attention sinks unaddressed and degrade outputs.
- Large storage footprint per image raises operational cost and eviction complexity.
- Incorrect linking across user boundaries could expose data if access controls fail (system assumes per-user separation).
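One common way to enforce per-user separation at the cache layer is to include the user identity in the cache key, so one user's lookup cannot resolve to another user's entry. The sketch below is a generic pattern, not a mechanism described in the paper; the function name and key layout are hypothetical.

```python
import hashlib


def cache_key(user_id: str, content: bytes) -> str:
    """Derive a cache key bound to both the user and the content bytes."""
    uid = user_id.encode("utf-8")
    h = hashlib.sha256()
    # Length-prefix the user id so (user, content) pairs cannot collide
    # through ambiguous concatenation.
    h.update(len(uid).to_bytes(8, "big"))
    h.update(uid)
    h.update(content)
    return h.hexdigest()


image = b"\x89PNG...example bytes"
print(cache_key("alice", image) == cache_key("alice", image))  # True
print(cache_key("alice", image) == cache_key("bob", image))    # False
```

The trade-off is that identical images uploaded by different users are stored twice; sharing across users would need an explicit access-control check instead.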
Core Entities
Models
- LLaVA-1.6-vicuna-7B
- LLaVA-1.6-mistral-7B
- InternVL-2.5
Metrics
- Time-to-First-Token (TTFT)
- Throughput (tokens/sec)
- GPT score (quality)
Datasets
- MMDU
- SparklesEval
Benchmarks
- GPT-assisted score (GPT score judge)

