Run vision encoding on cheap GPUs, send small embeddings, decode on A100s to cut multimodal inference cost.

March 13, 2026 | 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

0

Authors

Donglin Yu

Links

Abstract / PDF

Why It Matters For Business

You can cut multimodal inference hardware cost by running vision encoding on cheap high-FLOPS consumer GPUs and decoding on expensive high-bandwidth datacenter GPUs. Because only compact visual embeddings cross devices, PCIe is sufficient for the handoff, latency is preserved, and tokens per dollar improve.

Summary TLDR

Multimodal LLM inference has two phases with opposing hardware demands: vision encoding is compute-bound, while language decoding is memory-bandwidth-bound. Cutting the graph at the modality boundary means transferring only visual embeddings (MB-scale) instead of KV caches (GB-scale), which enables cross-tier serving (consumer GPUs plus datacenter GPUs) over PCIe. The paper proves that this cut minimizes cross-device transfer under standard KV caching, builds HeteroServe (embedding-only transfer, cross-type work stealing, engine optimizations), and presents a cost model that predicts ~31.4% savings against 40.6% observed in experiments. Engine optimizations raise throughput by up to 54% vs vLLM on identical 4×A100 hardware; a $38k heterogeneous cluster improves Tokens/$ by 37% over the $64k homogeneous baseline.

Problem Statement

Current MLLM serving runs both vision and language phases on the same expensive HBM GPUs. Stage-level partitioning moves GB-scale KV caches and needs NVLink, blocking cheap consumer GPUs on PCIe. We need a partition that minimizes cross-device transfer so consumer GPUs can handle vision encoding and save cost without hurting latency.

Main Contribution

Theorem and analysis showing that cutting at the modality boundary minimizes cross-device transfer under standard KV caching, reducing communication from O(L·s_ctx) to O(N_v·d).

Closed-form cost model predicting heterogeneous (consumer+datacenter GPU) clusters can be cost-optimal; model predicts 31.4% savings and experiments observe 40.6%.

HeteroServe runtime: modality-level partitioning, embedding-only transfer over PCIe, cross-type work stealing, and several engine optimizations.

Empirical validation on LLaVA-1.5-7B and Qwen2.5-VL: up to 54% throughput gain vs vLLM on identical hardware and 37% Tokens/$ improvement under a fixed budget.

Key Findings

Modality-level partition transfers MB-scale embeddings vs GB-scale KV caches.

Numbers: 7B example, KV cache ≈ 350 MB vs embedding ≈ 4.5 MB (∼78×)

Analytic transfer ratio grows with model depth L; modality partition advantage is O(L).

Numbers: representative ratios 64×–196× across models; scales ∝ L

Heterogeneous deployment predicted and observed to reduce hardware cost.

Numbers: model predicts 31.4% cost saving; experiments observe 40.6%

Engine optimizations significantly raise throughput independent of architecture choice.

Numbers: identical 4×A100, +54% tok/s vs vLLM after CUDA Graph + Flash Attention + aligned batches

Work stealing recovers idle consumer-GPU capacity and boosts throughput.

Numbers: enabling cross-type stealing gives 1.13× speedup (3,156 vs 2,793 tok/s)
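The transfer-size figures in the findings above can be approximated with a short back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes FP16 tensors, full multi-head attention (no grouped-query sharing), and a LLaVA-1.5-7B-like shape of 32 layers, hidden size 4096, and 576 visual tokens. It also reads L as layer count, s_ctx as context length in tokens, N_v as visual token count, and d as hidden size. The context length of 700 tokens is an assumption chosen because it lands near the quoted 350 MB figure.

```python
MIB = 1024 ** 2


def kv_cache_bytes(num_layers, context_tokens, hidden_dim, bytes_per_elem=2):
    """Size of keys plus values across all layers for one request.

    Assumes full multi-head attention, so each token stores hidden_dim
    elements for K and hidden_dim elements for V in every layer.
    """
    return 2 * num_layers * context_tokens * hidden_dim * bytes_per_elem


def embedding_bytes(visual_tokens, hidden_dim, bytes_per_elem=2):
    """Size of the visual embeddings handed from encoder to decoder."""
    return visual_tokens * hidden_dim * bytes_per_elem


# Assumed LLaVA-1.5-7B-like shape (not values quoted in this summary).
NUM_LAYERS = 32
HIDDEN_DIM = 4096
VISUAL_TOKENS = 576
CONTEXT_TOKENS = 700  # assumption: visual tokens plus a short text prompt

kv = kv_cache_bytes(NUM_LAYERS, CONTEXT_TOKENS, HIDDEN_DIM)
emb = embedding_bytes(VISUAL_TOKENS, HIDDEN_DIM)

print(f"KV cache:  {kv / MIB:.1f} MiB")
print(f"Embedding: {emb / MIB:.1f} MiB")
print(f"Ratio:     {kv / emb:.1f}x")

# With the context equal to the visual tokens only, the ratio reduces to
# 2 * num_layers, which is why the advantage scales with model depth.
ratio_floor = kv_cache_bytes(NUM_LAYERS, VISUAL_TOKENS, HIDDEN_DIM) / emb
print(f"Ratio when s_ctx == N_v: {ratio_floor:.0f}x")
```

Under these assumptions the ratio works out to 2·L·s_ctx / N_v, so deeper models and longer contexts both widen the gap. Swapping in your own model's layer count, hidden size, and typical prompt length gives a quick first check before measuring on real traffic.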

Results

Per-request transfer size (7B example)

Value: KV cache ≈ 350 MB vs visual embedding ≈ 4.5 MB

Baseline: stage-level (KV cache)

Predicted cost saving (model)

Value: 31.4% predicted saving

Baseline: homogeneous deployment

Observed cost saving (experiment)

Value: 40.6% observed hardware cost reduction

Baseline: $64k homogeneous baseline

Throughput uplift vs vLLM (identical 4×A100)

Value: up to +54% tok/s

Baseline: vLLM v0.3.0

Tokens per dollar (CER) improvement under fixed budget

Value: +37% CER

Baseline: homogeneous 4×A100 ($64k)

PCIe transfer overhead

Value: ≈0.18 ms per 4.5 MB; measured 2.5% of end-to-end time

Baseline: vision encoding time
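The PCIe overhead figure above can be sanity-checked with simple bandwidth arithmetic. The sketch below is an illustration, not the paper's measurement method. PCIe Gen4 x16 offers roughly 31.5 GB/s per direction in theory; the 26 GB/s effective rate used here is an assumption picked to show how a figure near 0.18 ms can arise, and real links vary with pinned memory, transfer size, and contention.

```python
MIB = 1024 ** 2


def transfer_ms(payload_bytes, bandwidth_gb_per_s):
    """Idealized transfer time in milliseconds, ignoring launch latency."""
    return payload_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


EFFECTIVE_BW_GB_S = 26.0  # assumption: achievable PCIe Gen4 x16 throughput

embedding_ms = transfer_ms(4.5 * MIB, EFFECTIVE_BW_GB_S)
kv_cache_ms = transfer_ms(350 * MIB, EFFECTIVE_BW_GB_S)

print(f"Embedding (4.5 MiB): {embedding_ms:.2f} ms")
print(f"KV cache (350 MiB):  {kv_cache_ms:.1f} ms")
```

At the same assumed bandwidth, moving the KV cache instead would cost on the order of 14 ms per request, which is the gap that makes embedding-only transfer practical on PCIe.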

Who Should Care

What To Try In 7 Days

Measure per-request KV cache size and visual embedding size for your MLLM to confirm the MB vs GB gap.

Prototype embedding-only transfer: run the encoder on a consumer GPU and send the embedding to an A100 decoder over PCIe.

Enable lazy KV allocation and CUDA Graph capture to reduce runtime overhead before full system changes.

Optimization Features

Token Efficiency

  • packed prefill reduces padding waste up to 63%
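To see where padding waste comes from, the following sketch compares a padded batch with a packed one. The sequence lengths are hypothetical and the helper names are illustrative; the cumulative offsets mimic the style of the `cu_seqlens` argument that variable-length attention kernels typically take, but no specific library API is called here.

```python
def padding_waste(seq_lens):
    """Fraction of a padded batch that is padding rather than real tokens."""
    padded_tokens = len(seq_lens) * max(seq_lens)
    real_tokens = sum(seq_lens)
    return 1 - real_tokens / padded_tokens


def packed_offsets(seq_lens):
    """Cumulative start offsets for sequences concatenated back to back."""
    offsets = [0]
    for length in seq_lens:
        offsets.append(offsets[-1] + length)
    return offsets


# Hypothetical prefill lengths for four requests in one batch.
lengths = [600, 120, 90, 300]

waste = padding_waste(lengths)
offsets = packed_offsets(lengths)

print(f"Padded batch wastes {waste:.1%} of its token slots")
print(f"Packed layout offsets: {offsets}")
```

In the packed layout every slot holds a real token, so the waste computed above is the compute a padded prefill would spend on nothing. The more the request lengths vary, the larger the saving.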

Infra Optimization

  • use RTX 4090 for vision encoder and A100 for decoder
  • embedding transfer over PCIe Gen4 x16

System Optimization

  • aligned batch handoff (B_align=32)
  • streaming embedding transfer via pinned CPU memory
  • timeout for under-filled batches (500 ms)
  • bounded work-stealing thresholds (τ=16)
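The aligned handoff and timeout settings listed above can be pictured as a small batching policy: release a batch once it reaches the alignment size, or flush whatever is pending once the oldest item has waited too long. The class below is a minimal sketch under that reading, not HeteroServe's implementation. The class name and methods are hypothetical, and timestamps are passed in explicitly so the behavior is easy to test.

```python
class AlignedBatcher:
    """Collects items and releases them in aligned or timed-out batches."""

    def __init__(self, align=32, timeout_s=0.5):
        self.align = align          # B_align from the summary
        self.timeout_s = timeout_s  # 500 ms flush for under-filled batches
        self.pending = []
        self.oldest_ts = None       # arrival time of the oldest pending item

    def add(self, item, now):
        """Queue one item; return a batch if one is ready, else None."""
        if not self.pending:
            self.oldest_ts = now
        self.pending.append(item)
        return self.poll(now)

    def poll(self, now):
        """Return a ready batch without adding anything, else None."""
        if not self.pending:
            return None
        full = len(self.pending) >= self.align
        timed_out = now - self.oldest_ts >= self.timeout_s
        if full or timed_out:
            batch, self.pending = self.pending, []
            self.oldest_ts = None
            return batch
        return None


batcher = AlignedBatcher()

# 31 embeddings arrive quickly: nothing is released yet.
early = [batcher.add(i, now=0.0) for i in range(31)]
# The 32nd completes an aligned batch.
full_batch = batcher.add(31, now=0.1)

# A lone straggler is flushed once it has waited 500 ms.
batcher.add("straggler", now=1.0)
not_yet = batcher.poll(now=1.2)
flushed = batcher.poll(now=1.5)
```

The trade-off is between batch efficiency on the decoder and tail latency for requests that arrive during quiet periods; the timeout caps the latter.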

Inference Optimization

  • modality-level partitioning (embed-only transfer)
  • cross-type work stealing (bounded assistance)
  • CUDA Graph capture
  • Flash Attention varlen packed prefill
  • lazy KV cache allocation
  • tensor-parallel decoding support
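Cross-type work stealing with bounded assistance can be sketched as a scheduling rule for a consumer-GPU worker. The function below is one possible reading, not the paper's algorithm. It assumes vision tasks always take priority, that τ is a decode-backlog threshold above which an idle consumer GPU helps, and that a hypothetical `max_stolen` cap stands in for the limit imposed by the consumer GPU's KV cache capacity.

```python
from collections import deque


def pick_work(vision_q, decode_q, active_stolen, tau=16, max_stolen=8):
    """Choose the next task for a consumer-GPU worker.

    vision_q / decode_q: deques of pending requests.
    active_stolen: decode requests this worker is already running.
    tau: assumed backlog threshold before stealing is allowed.
    max_stolen: hypothetical cap standing in for KV cache capacity.
    """
    # Vision encoding is this worker's primary job and is never delayed.
    if vision_q:
        return ("vision", vision_q.popleft())
    # Only assist when the decoders are clearly backed up and there is room.
    if len(decode_q) > tau and active_stolen < max_stolen:
        return ("decode", decode_q.popleft())
    return ("idle", None)


vision = deque(["img-0"])
decode = deque(f"req-{i}" for i in range(20))

first = pick_work(vision, decode, active_stolen=0)   # vision wins
second = pick_work(vision, decode, active_stolen=0)  # backlog 20 > 16: steal
capped = pick_work(vision, decode, active_stolen=8)  # at capacity: idle
light = pick_work(deque(), deque(["req-x"]), active_stolen=0)  # small backlog
```

Checking the vision queue first is what keeps stealing from delaying vision tasks, the failure mode this document flags for misconfigured priority rules.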

Reproducibility

Data Urls

  • COCO 2017 validation (mentioned in paper)

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Analysis assumes standard KV caching; activation recomputation or KV offload removes the premise.
  • Work stealing requires extra VRAM on consumer GPUs (pre-loaded decoder weights) and is bounded by KV cache capacity.
  • Benefits shrink for text-only workloads or when embeddings and KV sizes are similar.

When Not To Use

  • Text-only LLMs where no compact encoder output exists.
  • Deployments that already use activation recomputation or KV offloading to avoid KV transfer.
  • Environments with heavily contended PCIe bandwidth or strict single-device policies.

Failure Modes

  • Embedding transfer stalls or PCIe congestion increasing latency.
  • Misconfigured work stealing that delays vision tasks (if priority rules are broken).
  • Dynamic visual token variability causing buffer/packing inefficiencies and padding waste.

Core Entities

Models

  • LLaVA-1.5-7B
  • Qwen2.5-VL

Metrics

  • tokens/s
  • Cost-Efficiency Ratio (CER) = tok/s per $1k
  • transfer size (MB/GB)
  • throughput (tok/s)
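The cost metrics above reduce to a few lines of arithmetic. The sketch below computes the Cost-Efficiency Ratio as defined in the list and reproduces the observed saving from the $38k and $64k cluster prices quoted in this summary. The implied throughput ratio is a derived illustration that assumes the +37% CER and the two prices describe the same comparison; it is not a figure reported by the paper.

```python
def cer(tok_per_s, cost_usd):
    """Cost-Efficiency Ratio: tokens per second per $1k of hardware."""
    return tok_per_s / (cost_usd / 1000)


def cost_saving(baseline_usd, new_usd):
    """Fractional hardware cost reduction relative to the baseline."""
    return 1 - new_usd / baseline_usd


BASELINE_USD = 64_000  # homogeneous 4xA100
HETERO_USD = 38_000    # heterogeneous cluster
CER_GAIN = 1.37        # +37% Tokens/$

saving = cost_saving(BASELINE_USD, HETERO_USD)

# If CER improves by 37% at this price ratio, the heterogeneous cluster
# delivers this fraction of the baseline's raw throughput (derived value).
implied_throughput_ratio = CER_GAIN * HETERO_USD / BASELINE_USD

print(f"Hardware cost saving:     {saving:.1%}")
print(f"Implied throughput ratio: {implied_throughput_ratio:.3f}")
```

Read this way, the heterogeneous cluster gives up some absolute throughput in exchange for a much lower price, which is why Tokens/$ rather than tok/s is the metric to compare across budgets.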

Datasets

  • COCO 2017 validation