Overview
The idea is validated on real hardware and two MLLMs; it is practical but relies on standard KV caching and pre-loading decoder weights on consumer GPUs.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/6
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
You can cut multimodal inference hardware cost by mixing cheap high-FLOPS GPUs for vision and expensive high-bandwidth GPUs for decoding; this reduces transfer needs and preserves latency while improving tokens per dollar.
Summary TLDR
Multimodal LLM inference has two phases with opposing resource profiles: vision encoding (compute-bound) and language decoding (memory-bandwidth-bound). Cutting the computation graph at the modality boundary and transferring only visual embeddings (MB-scale) instead of KV caches (GB-scale) enables cross-tier serving (consumer GPUs + datacenter GPUs) over PCIe. The paper proves that this transfer is optimal under standard KV caching, builds HeteroServe (embedding-only transfer, cross-type work stealing, engine optimizations), and reports a cost model that predicts ~31.4% savings against an observed 40.6%. Engine optimizations raise throughput by up to 54% vs vLLM on an identical 4×A100 setup; a $38k heterogeneous cluster improves Tokens/$ by 37%.
Problem Statement
Current MLLM serving runs both the vision and language phases on the same expensive HBM GPUs. Stage-level partitioning moves GB-scale KV caches between devices and needs NVLink, which rules out cheap consumer GPUs connected over PCIe. The goal is a partition that minimizes cross-device transfer so that consumer GPUs can handle vision encoding and cut cost without hurting latency.
Main Contribution
Theorem and analysis showing that the modality boundary minimizes cross-device transfer under standard KV caching, reducing communication from O(L·s_ctx) to O(N_v·d).
Closed-form cost model predicting heterogeneous (consumer+datacenter GPU) clusters can be cost-optimal; model predicts 31.4% savings and experiments observe 40.6%.
Key Findings
Modality-level partition transfers MB-scale embeddings vs GB-scale KV caches.
Analytic transfer ratio grows with model depth L; modality partition advantage is O(L).
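The depth dependence can be made explicit with a short derivation. This is a sketch under assumptions the summary does not state: standard multi-head attention (keys and values each of width d, no grouped-query sharing), both payloads stored at the same b bytes per value, and s_ctx counting every cached token. The symbols B_KV and B_emb are introduced here only for illustration.

```latex
\[
\frac{B_{\mathrm{KV}}}{B_{\mathrm{emb}}}
  = \frac{2 \, L \, s_{\mathrm{ctx}} \, d \, b}{N_v \, d \, b}
  = \frac{2 L \, s_{\mathrm{ctx}}}{N_v}
  \;\ge\; 2L
  \quad \text{whenever } s_{\mathrm{ctx}} \ge N_v .
\]
```

Under these assumptions the width d and the precision b cancel, so the ratio depends only on depth and on how many cached tokens there are per vision token, which is why the advantage grows linearly in L.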
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Per-request transfer size (7B example) | KV cache ≈ 350 MB vs visual embedding ≈ 4.5 MB | stage-level (KV cache) | ≈78× reduction | LLaVA-7B (576 vision tokens, 128 output) | Table 2 and Section 4.1 | Table 2 |
| Predicted cost saving (model) | 31.4% predicted saving | homogeneous deployment | — | RTX4090/A100 price assumptions | Section 4.3 | Section 4.3 |
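
The first row of the table can be roughly reproduced with simple arithmetic. The sketch below assumes LLaVA-7B-style dimensions (32 decoder layers, hidden width 4096, fp16 values), which the summary does not state, and counts only the 576 vision tokens plus 128 output tokens, ignoring any text prompt. The function names are hypothetical helpers, not part of HeteroServe.

```python
# Assumed LLaVA-7B-style dimensions (not stated in the summary above).
NUM_LAYERS = 32        # decoder layers L (assumption)
HIDDEN_DIM = 4096      # model width d (assumption)
BYTES_PER_VALUE = 2    # fp16 storage (assumption)
VISION_TOKENS = 576    # from the table row
OUTPUT_TOKENS = 128    # from the table row
MIB = 1024 ** 2


def kv_cache_bytes(num_tokens, num_layers=NUM_LAYERS,
                   hidden_dim=HIDDEN_DIM, bytes_per_value=BYTES_PER_VALUE):
    """Keys plus values for every layer and every cached token."""
    return 2 * num_layers * num_tokens * hidden_dim * bytes_per_value


def embedding_bytes(num_vision_tokens, hidden_dim=HIDDEN_DIM,
                    bytes_per_value=BYTES_PER_VALUE):
    """One hidden-width vector per vision token."""
    return num_vision_tokens * hidden_dim * bytes_per_value


kv = kv_cache_bytes(VISION_TOKENS + OUTPUT_TOKENS)
emb = embedding_bytes(VISION_TOKENS)
print(f"KV cache:         {kv / MIB:.1f} MiB")   # 352.0 MiB
print(f"Visual embedding: {emb / MIB:.1f} MiB")  # 4.5 MiB
print(f"Ratio:            {kv / emb:.1f}x")      # 78.2x
```

These figures land close to the ≈350 MB, ≈4.5 MB, and ≈78× values reported in the table. Models that use grouped-query attention or a different precision would need the formula adjusted.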
What To Try In 7 Days
Measure per-request KV cache size and visual embedding size for your MLLM to confirm MB vs GB gap.
Prototype sending embeddings: run encoder on a consumer GPU and transfer embedding to an A100 decoder over PCIe.
Enable lazy KV allocation and CUDA Graph capture to reduce runtime overhead before full system changes.
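
Before building the prototype, a back-of-envelope estimate can indicate whether the embedding hand-off is likely to matter for latency. The sketch below is illustrative only: the 25 GB/s effective bandwidth (somewhat below the roughly 31.5 GB/s theoretical peak of PCIe 4.0 x16) and the 10 µs fixed latency are assumptions, not measurements from the paper, and `transfer_ms` is a hypothetical helper. The payload sizes are the approximate values from the Results table.

```python
# Assumed effective link bandwidth in GB/s (illustrative, not from the paper).
PCIE4_X16_GBPS = 25.0


def transfer_ms(payload_bytes, bandwidth_gbps, fixed_latency_us=10.0):
    """Rough one-way transfer time: fixed latency plus payload / bandwidth."""
    seconds = fixed_latency_us * 1e-6 + payload_bytes / (bandwidth_gbps * 1e9)
    return seconds * 1e3


embedding_payload = 4.5e6   # approx. visual embedding size from the Results table
kv_cache_payload = 350e6    # approx. KV cache size from the Results table

emb_ms = transfer_ms(embedding_payload, PCIE4_X16_GBPS)
kv_ms = transfer_ms(kv_cache_payload, PCIE4_X16_GBPS)
print(f"Embedding transfer: {emb_ms:.2f} ms")
print(f"KV cache transfer:  {kv_ms:.2f} ms")
```

Replacing the assumed bandwidth with a figure measured on the target machines would make the estimate more trustworthy, since real PCIe throughput depends on lane count, topology, and contention.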
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Analysis assumes standard KV caching; activation recomputation or KV offload removes the premise.
Work stealing requires extra VRAM on consumer GPUs (pre-loaded decoder weights) and is bounded by KV cache capacity.
When Not To Use
Text-only LLMs where no compact encoder output exists.
Deployments that already use activation recomputation or KV offloading to avoid KV transfer.
Failure Modes
Embedding transfer stalls or PCIe congestion that increase latency.
Misconfigured work stealing that delays vision tasks (if priority rules are broken).
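
The work-stealing constraints above (vision tasks keep priority, and stolen decode work is bounded by KV cache capacity) can be illustrated with a toy scheduler. This is a minimal sketch, not HeteroServe's actual scheduler: the class `ConsumerGpuWorker`, its fields, and the token-based capacity accounting are all hypothetical names and simplifications introduced for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ConsumerGpuWorker:
    """Hypothetical consumer-GPU worker: encodes images, may steal decode work."""
    kv_capacity_tokens: int   # KV budget left after decoder weights are pre-loaded
    kv_used_tokens: int = 0
    vision_queue: deque = field(default_factory=deque)

    def next_task(self, decode_queue):
        # Priority rule: pending vision work always runs before stolen work.
        if self.vision_queue:
            return ("vision", self.vision_queue.popleft())
        # Steal the head decode request only if its KV cache fits the budget.
        if decode_queue:
            request_id, kv_tokens = decode_queue[0]
            if self.kv_used_tokens + kv_tokens <= self.kv_capacity_tokens:
                decode_queue.popleft()
                self.kv_used_tokens += kv_tokens
                return ("decode", request_id)
        return None  # idle: nothing to encode and nothing that fits


worker = ConsumerGpuWorker(kv_capacity_tokens=1000)
worker.vision_queue.append("img-1")
shared_decode = deque([("req-a", 800), ("req-b", 400)])

first = worker.next_task(shared_decode)    # vision task wins
second = worker.next_task(shared_decode)   # steals req-a (800 tokens fit)
third = worker.next_task(shared_decode)    # req-b would exceed the budget
print(first, second, third)
```

Checking the vision queue first is what keeps stolen decode work from delaying vision tasks. Reordering the two checks would reproduce the misconfiguration described above.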

