Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
You can cut multimodal inference hardware cost by running vision encoding on cheap high-FLOPS GPUs and decoding on expensive high-bandwidth GPUs. Because only small visual embeddings cross between devices, latency is preserved while tokens per dollar improve.
Summary TLDR
Multimodal LLM inference has two opposing phases: vision encoding (compute-bound) and language decoding (memory-bandwidth-bound). Cutting the graph at the modality boundary and transferring only visual embeddings (MB-scale) instead of KV caches (GB-scale) enables cross-tier serving (consumer GPUs + datacenter GPUs) over PCIe. The paper proves that this cut minimizes cross-device transfer under standard KV caching, builds HeteroServe (embedding-only transfer, cross-type work stealing, engine optimizations), and presents a cost model that predicts ~31.4% savings against an observed 40.6%. Engine optimizations raise throughput by up to 54% vs vLLM on identical 4×A100 hardware; a $38k heterogeneous cluster improves Tokens/$ by 37%.
Problem Statement
Current MLLM serving runs both the vision and language phases on the same expensive HBM GPUs. Stage-level partitioning moves GB-scale KV caches between devices and therefore needs NVLink, which rules out cheap consumer GPUs attached over PCIe. We need a partition that minimizes cross-device transfer so consumer GPUs can handle vision encoding and save cost without hurting latency.
Main Contribution
Theorem and analysis showing that the modality boundary minimizes cross-device transfer under standard KV caching, reducing communication from O(L·s_ctx) to O(N_v·d).
Closed-form cost model predicting heterogeneous (consumer+datacenter GPU) clusters can be cost-optimal; model predicts 31.4% savings and experiments observe 40.6%.
HeteroServe runtime: modality-level partitioning, embedding-only transfer over PCIe, cross-type work stealing, and several engine optimizations.
Empirical validation on LLaVA-1.5-7B and Qwen2.5-VL: up to 54% throughput gain vs vLLM on identical hardware and 37% Tokens/$ improvement under a fixed budget.
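The O(L) gap behind the first contribution can be illustrated with a back-of-envelope size comparison. This is a sketch, not the paper's theorem or proof. It reads L as the number of decoder layers, s_ctx as the context length in tokens, N_v as the number of visual tokens, and d as the hidden size; b (bytes per element) and d_kv (per-layer key/value width) are symbols introduced here for illustration.

```latex
% Requires amsmath. Sizes are per request.
\begin{align}
S_{\mathrm{KV}}  &= 2 \, L \, s_{\mathrm{ctx}} \, d_{\mathrm{kv}} \, b
  && \text{(keys and values for every layer)} \\
S_{\mathrm{emb}} &= N_v \, d \, b
  && \text{(one embedding per visual token)} \\
\frac{S_{\mathrm{KV}}}{S_{\mathrm{emb}}}
  &= \frac{2 \, L \, s_{\mathrm{ctx}} \, d_{\mathrm{kv}}}{N_v \, d}
\end{align}
```

Under the additional assumption of plain multi-head attention (d_kv = d), and since the context contains the visual tokens (s_ctx ≥ N_v), the ratio is at least 2L, which matches the claim that the advantage grows linearly with depth.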
Key Findings
Modality-level partition transfers MB-scale embeddings vs GB-scale KV caches.
Analytic transfer ratio grows with model depth L; modality partition advantage is O(L).
Heterogeneous deployment is both predicted (31.4%) and observed (40.6%) to reduce hardware cost.
Engine optimizations raise throughput by up to 54% vs vLLM on identical hardware, independent of the partitioning choice.
Work stealing recovers idle consumer-GPU capacity and boosts throughput.
Results
Per-request transfer size (7B example)
Predicted cost saving (model)
31.4%
Observed cost saving (experiment)
40.6%
Throughput uplift vs vLLM (identical 4×A100)
up to 54%
Tokens per dollar (CER) improvement under fixed budget
37%
PCIe transfer overhead
Who Should Care
What To Try In 7 Days
Measure per-request KV cache size and visual embedding size for your MLLM to confirm MB vs GB gap.
Prototype sending embeddings: run encoder on a consumer GPU and transfer embedding to an A100 decoder over PCIe.
Enable lazy KV allocation and CUDA Graph capture to reduce runtime overhead before full system changes.
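The first step above, checking the MB vs GB gap, needs only arithmetic on the model configuration. The following is a minimal sketch; the shape values (32 layers, hidden size 4096, 576 visual tokens, fp16, a 2048-token context) are assumptions chosen to resemble a 7B LLaVA-style model and should be replaced with the values from your own model config.

```python
def kv_cache_bytes(num_layers, context_tokens, kv_width, bytes_per_elem=2):
    """Size of the KV cache for one request.

    Two tensors (keys and values) per layer, each context_tokens x kv_width.
    """
    return 2 * num_layers * context_tokens * kv_width * bytes_per_elem


def embedding_bytes(num_visual_tokens, hidden_size, bytes_per_elem=2):
    """Size of the visual embeddings handed from the encoder to the decoder."""
    return num_visual_tokens * hidden_size * bytes_per_elem


if __name__ == "__main__":
    # Assumed shape, not taken from the paper: adjust to your model.
    kv = kv_cache_bytes(num_layers=32, context_tokens=2048, kv_width=4096)
    emb = embedding_bytes(num_visual_tokens=576, hidden_size=4096)
    print(f"KV cache:  {kv / 2**30:.2f} GiB")
    print(f"embedding: {emb / 2**20:.2f} MiB")
    print(f"ratio:     {kv / emb:.0f}x")
```

With these assumed values the KV cache comes to 1.00 GiB and the embeddings to 4.50 MiB, a ratio of roughly 228x. Models with grouped-query attention have a smaller kv_width, which narrows the gap.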
Optimization Features
Token Efficiency
- packed prefill reduces padding waste by up to 63%
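To see where padding waste comes from, the sketch below compares a batch padded to its longest sequence against a packed layout described by cumulative offsets, similar in spirit to the cumulative sequence lengths that variable-length attention kernels consume. The function names and the example lengths are illustrative assumptions, not taken from HeteroServe.

```python
def padding_waste(seq_lens):
    """Fraction of token slots that are padding when every sequence
    in the batch is padded to the length of the longest one."""
    if not seq_lens:
        return 0.0
    padded_slots = len(seq_lens) * max(seq_lens)
    return 1.0 - sum(seq_lens) / padded_slots


def cumulative_offsets(seq_lens):
    """Start offset of each sequence in a packed (padding-free) buffer,
    plus a final entry equal to the total token count."""
    offsets = [0]
    for n in seq_lens:
        offsets.append(offsets[-1] + n)
    return offsets


if __name__ == "__main__":
    lens = [100, 300, 1000, 200]  # hypothetical prompt lengths
    print(f"padding waste when padded: {padding_waste(lens):.0%}")
    print("packed offsets:", cumulative_offsets(lens))
```

In the packed layout the buffer holds exactly sum(seq_lens) tokens, so the waste computed by padding_waste is what packing removes.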
Infra Optimization
- use RTX 4090 for vision encoder and A100 for decoder
- embedding transfer over PCIe Gen4 x16
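A rough estimate shows why PCIe Gen4 x16 is adequate for embeddings but not for KV caches. The sketch below derives the theoretical link bandwidth from the PCIe 4.0 signaling rate and applies an assumed 80% efficiency factor; that factor and the payload sizes (fp16 embeddings for 576 visual tokens at hidden size 4096, and a 1 GiB KV cache) are illustrative assumptions, not measurements from the paper.

```python
def pcie_gen4_x16_bytes_per_s():
    """Theoretical one-direction bandwidth of a PCIe 4.0 x16 link.

    16 GT/s per lane, 16 lanes, 128b/130b line encoding, 8 bits per byte.
    """
    return 16e9 * 16 * (128 / 130) / 8


def transfer_ms(num_bytes, bandwidth_bytes_per_s, efficiency=0.8):
    """Transfer time in milliseconds, ignoring fixed per-transfer latency.

    efficiency is an assumed fraction of theoretical bandwidth actually achieved.
    """
    return num_bytes / (bandwidth_bytes_per_s * efficiency) * 1e3


if __name__ == "__main__":
    bw = pcie_gen4_x16_bytes_per_s()
    embedding_bytes = 576 * 4096 * 2  # assumed fp16 visual embeddings
    kv_cache_bytes = 2**30            # assumed 1 GiB KV cache
    print(f"link bandwidth: {bw / 1e9:.1f} GB/s")
    print(f"embedding transfer: {transfer_ms(embedding_bytes, bw):.3f} ms")
    print(f"KV cache transfer:  {transfer_ms(kv_cache_bytes, bw):.1f} ms")
```

Under these assumptions the embedding moves in well under a millisecond, while the KV cache takes tens of milliseconds per request.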
System Optimization
- aligned batch handoff (B_align=32)
- streaming embedding transfer via pinned CPU memory
- timeout for under-filled batches (500 ms)
- bounded work-stealing thresholds (τ=16)
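The aligned handoff and timeout settings above suggest a simple batching policy: release a batch once B_align items are waiting, or flush whatever is waiting once the oldest item has aged past the timeout. The class below is a minimal sketch of that reading; AlignedBatcher and its methods are hypothetical names, and the exact semantics HeteroServe gives to B_align and the 500 ms timeout are assumed rather than confirmed.

```python
import time


class AlignedBatcher:
    """Collects items and hands them off in batches of `align`,
    or flushes an under-filled batch after `timeout_s` seconds."""

    def __init__(self, align=32, timeout_s=0.5, clock=time.monotonic):
        self.align = align
        self.timeout_s = timeout_s
        self._clock = clock    # injectable for testing
        self._pending = []     # list of (arrival_time, item)

    def add(self, item):
        self._pending.append((self._clock(), item))

    def poll(self):
        """Return a batch that is ready for handoff, or None."""
        if len(self._pending) >= self.align:
            head = self._pending[:self.align]
            self._pending = self._pending[self.align:]
            return [item for _, item in head]
        # Under-filled: flush only if the oldest item has waited too long.
        if self._pending and self._clock() - self._pending[0][0] >= self.timeout_s:
            head, self._pending = self._pending, []
            return [item for _, item in head]
        return None
```

A serving loop would call add() as encoder outputs arrive and poll() on each scheduling tick. Passing a fake clock makes the timeout path easy to exercise in tests.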
Inference Optimization
- modality-level partitioning (embed-only transfer)
- cross-type work stealing (bounded assistance)
- CUDA Graph capture
- Flash Attention varlen packed prefill
- lazy KV cache allocation
- tensor-parallel decoding support
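Bounded cross-type work stealing can be pictured as a small admission rule on each vision GPU. The sketch below assumes three things that the summary implies but does not spell out: vision work always has priority, a vision GPU never holds more than τ stolen decode requests at once, and stealing is further capped by free KV cache slots. The function name and parameters are hypothetical.

```python
def steal_count(vision_queue_len, decode_backlog, active_stolen,
                free_kv_slots, tau=16):
    """Number of decode requests an idle vision GPU may take right now.

    vision_queue_len: vision-encoding requests waiting on this GPU
    decode_backlog:   decode requests waiting on the datacenter GPUs
    active_stolen:    stolen decode requests this GPU is already running
    free_kv_slots:    requests that still fit in this GPU's KV cache
    tau:              assumed upper bound on concurrently stolen requests
    """
    if vision_queue_len > 0:
        # Priority rule: never let stolen decode work delay vision tasks.
        return 0
    budget = tau - active_stolen
    return max(0, min(budget, free_kv_slots, decode_backlog))


if __name__ == "__main__":
    print(steal_count(vision_queue_len=0, decode_backlog=100,
                      active_stolen=10, free_kv_slots=50))
```

Keeping the rule as a pure function makes the priority and bounding behavior easy to test in isolation before wiring it into a scheduler.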
Reproducibility
Data Urls
- COCO 2017 validation (mentioned in paper)
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Analysis assumes standard KV caching; activation recomputation or KV offload invalidates that premise.
- Work stealing requires extra VRAM on consumer GPUs (pre-loaded decoder weights) and is bounded by KV cache capacity.
- Benefits shrink for text-only workloads or when embeddings and KV sizes are similar.
When Not To Use
- Text-only LLMs where no compact encoder output exists.
- Deployments that already use activation recomputation or KV offloading to avoid KV transfer.
- Environments with heavily contended PCIe bandwidth or strict single-device policies.
Failure Modes
- Embedding transfer stalls or PCIe congestion increasing latency.
- Misconfigured work stealing that delays vision tasks (if priority rules are broken).
- Dynamic visual token variability causing buffer/packing inefficiencies and padding waste.
Core Entities
Models
- LLaVA-1.5-7B
- Qwen2.5-VL
Metrics
- tokens/s
- Cost-Efficiency Ratio (CER) = tok/s per $1k
- transfer size (MB/GB)
- throughput (tok/s)
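The Cost-Efficiency Ratio defined above is straightforward to compute for your own cluster. In the sketch below the $38k budget comes from the summary, while the throughput figure is a made-up placeholder, not a reported result; the function names are hypothetical.

```python
def cer(tokens_per_s, hardware_cost_usd):
    """Cost-Efficiency Ratio: tokens per second per $1k of hardware."""
    return tokens_per_s / (hardware_cost_usd / 1000.0)


def relative_improvement(new, baseline):
    """Fractional improvement of `new` over `baseline` (0.37 means 37%)."""
    return (new - baseline) / baseline


if __name__ == "__main__":
    # Placeholder throughput for illustration only.
    print(cer(tokens_per_s=3800.0, hardware_cost_usd=38_000))
```

Comparing two clusters under the same budget then reduces to relative_improvement(cer_heterogeneous, cer_homogeneous).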
Datasets
- COCO 2017 validation

