Inf-MLLM: keep multimodal LLMs streaming on a single GPU by caching only recent + relevant tokens

September 11, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

1

Authors

Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo

Links

Abstract / PDF

Why It Matters For Business

Inf-MLLM lets you run continuous multimodal inference on a single GPU, cutting cloud costs and privacy exposure while avoiding out-of-memory failures for long videos and multi-round dialogs.

Summary TLDR

Inf-MLLM is a runtime framework that lets multimodal large language models (MLLMs) perform streaming inference on a single GPU without re-training. It keeps a small, fixed-size KV cache by detecting and preserving "attention saddles" (tokens with high attention), evicting less relevant states, and adding an attention bias to favor recent context. The method reduces memory use, avoids OOM on long video/text streams, preserves long-term dependencies, and shows better perplexity, lower latency at large contexts, and stable multi-round QA up to hundreds of rounds and videos up to an hour on single GPUs (4090D and Orin).

Problem Statement

Streaming multimodal inputs create very long contexts. Standard inference caches all key/value (KV) states, so memory grows with every token and attention slows as the context lengthens (quadratic cost over the whole sequence). Existing eviction or window methods either lose long-term information or fail on multimodal/video streams. The paper seeks a runtime way to keep inference quality while bounding KV memory on a single GPU without re-training.
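To make the memory growth concrete, here is a back-of-the-envelope sketch of KV cache size versus context length. The default shape (32 layers, hidden size 4096, fp16, standard multi-head attention) is an assumption matching a LLaMA-2-7B-style model; it is not taken from the paper, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(num_tokens, num_layers=32, hidden_size=4096, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens.

    Defaults are assumptions for a LLaMA-2-7B-style model in fp16
    (32 layers, hidden size 4096, standard multi-head attention).
    """
    # Factor 2: one key vector and one value vector per token per layer.
    return 2 * num_layers * hidden_size * bytes_per_value * num_tokens


GIB = 1024 ** 3
for tokens in (2_048, 32_768, 4_000_000):
    print(f"{tokens:>9} tokens -> {kv_cache_bytes(tokens) / GIB:8.1f} GiB")
```

Under these assumptions a 2K-token cache costs about 1 GiB, a 32K-token cache about 16 GiB, and an unbounded cache at 4M tokens would need roughly 1,950 GiB, which is why a fixed-size cache is needed for streaming on one GPU.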

Main Contribution

Discovery of "attention saddles": scattered tokens that keep high attention over many decoding steps and matter more than a plain recency window.

A KV-cache eviction policy that keeps recent tokens plus top relevant tokens (attention saddles) using a local-sum attention score over a retrieval window.

An attention-bias term that pushes focus toward newer tokens so the cache keeps updating over multi-round streaming and, when tuned, still preserves long-term dependencies.

Demonstration on several LLMs and MLLMs (Vicuna, Pythia, LLaMA-2, Chat-UniVi, Flash-VStream) that enables streaming across very long text (tested to 4M tokens) and long video QA on a single GPU.
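The eviction policy described above (recent tokens plus the top relevant tokens, scored by a local sum of attention over a retrieval window) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name, the plain-list representation of attention weights, and the tie-breaking are all assumptions.

```python
def select_kept_indices(attn_rows, cache_size, num_recent):
    """Choose which cached token positions survive eviction.

    attn_rows: attention weights from the last few queries (the
        "retrieval window"); one row per query, one column per cached token.
    cache_size: total number of tokens to keep.
    num_recent: how many of the newest tokens are always kept.
    """
    if not 0 <= num_recent <= cache_size:
        raise ValueError("num_recent must be between 0 and cache_size")
    num_tokens = len(attn_rows[0])
    if num_tokens <= cache_size:
        return list(range(num_tokens))  # nothing to evict yet

    # Local-sum score: total attention each token received in the window.
    scores = [sum(row[j] for row in attn_rows) for j in range(num_tokens)]

    recent_start = num_tokens - num_recent
    num_relevant = cache_size - num_recent
    # Rank only the older tokens; the recent ones are kept unconditionally.
    ranked = sorted(range(recent_start), key=lambda j: scores[j], reverse=True)
    relevant = sorted(ranked[:num_relevant])
    return relevant + list(range(recent_start, num_tokens))


# Two queries attending over eight cached tokens (illustrative numbers).
attn_rows = [
    [0.05, 0.40, 0.05, 0.05, 0.15, 0.05, 0.05, 0.20],
    [0.05, 0.35, 0.05, 0.05, 0.15, 0.05, 0.10, 0.20],
]
print(select_kept_indices(attn_rows, cache_size=4, num_recent=2))  # [1, 4, 6, 7]
```

Tokens 6 and 7 are kept for recency; tokens 1 and 4 are kept because they drew the most attention in the window, even though they are not adjacent to each other or to the recent block.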

Key Findings

Inf-MLLM keeps language-model perplexity stable on extremely long text (up to 4 million tokens) and outperforms window attention, H2O, and StreamingLLM over the ranges compared.

Numbers: tested up to 4,000,000 tokens; better PPL than baselines up to 20K (Fig. 5)

Average memory during decoding stays lower: Inf-MLLM ~13.5GB vs H2O/StreamingLLM ~13.7GB on a 4090D.

Numbers: avg memory ≈13.5 GB vs ≈13.7 GB (Fig. 7 / Sec. 4.5)

Inf-MLLM prevents OOM and maintains multi-round video QA: e.g., Chat-UniVi at 300 rounds gives 72.7% accuracy vs OOM without Inf-MLLM.

Numbers: Chat-UniVi 300-round accuracy 72.7% vs OOM baseline (Table 1)

Long-term retrieval-style QA accuracy improves sharply: LLaMA-2-7B-32K reaches ~100% vs 40% for StreamingLLM at a 115-token distance.

Numbers: token distance 115, StreamingLLM 0.40 vs Inf-MLLM 1.00 (Table 2)

Results

Perplexity on long text

Value: stable up to 4,000,000 tokens; lower log-PPL than baselines to 20K

Baseline: window attention / H2O / StreamingLLM

Memory usage during decoding

Value: ≈13.5 GB average

Baseline: H2O and StreamingLLM ≈13.7 GB

Multi-round video QA accuracy

Value: Chat-UniVi w/ Inf-MLLM: 72.7% (MSVD-QA, 300 rounds)

Baseline: Chat-UniVi w/o Inf-MLLM: OOM

Long-term retrieval accuracy

Value: LLaMA-2-7B-32K w/ Inf-MLLM: 1.00 at token distance 115

Baseline: StreamingLLM: 0.40 at token distance 115

VStream-QA long video handling (300 rounds)

Value: Chat-UniVi w/ Inf-MLLM: 37.7% accuracy on 67 min video

Baseline: Chat-UniVi w/o Inf-MLLM: OOM

Who Should Care

What To Try In 7 Days

Prototype Inf-MLLM as a runtime wrapper for your MLLM to bound KV cache (use 2K cache as in paper).

Run a few long-stream tests (text or video) and compare OOM, memory, and latency to your current setup.

Tune the attention-bias parameter: start at 0.0001–0.01 and validate retrieval accuracy for distant facts.
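One way to organize the bias tuning step is a small log-spaced sweep over the suggested range. The sketch below is only scaffolding: `log_spaced`, `pick_best_bias`, and especially `toy_accuracy` are hypothetical names, and `toy_accuracy` is a placeholder to be replaced by a real evaluation (for example, retrieval accuracy on distant facts with your model).

```python
import math


def log_spaced(lo, hi, n):
    """Return n values spaced evenly on a log scale from lo to hi inclusive."""
    if n < 2:
        raise ValueError("n must be at least 2")
    lo_exp, hi_exp = math.log10(lo), math.log10(hi)
    step = (hi_exp - lo_exp) / (n - 1)
    return [10 ** (lo_exp + i * step) for i in range(n)]


def pick_best_bias(candidates, evaluate):
    """Return (bias, score) for the candidate with the highest score."""
    results = [(bias, evaluate(bias)) for bias in candidates]
    return max(results, key=lambda pair: pair[1])


def toy_accuracy(bias):
    # Placeholder only: pretends accuracy peaks near 1e-3.
    # Swap in a real evaluation run with the bias applied.
    return -abs(math.log10(bias) + 3)


candidates = log_spaced(1e-4, 1e-2, 5)
best_bias, best_score = pick_best_bias(candidates, toy_accuracy)
print(candidates)
print("best bias:", best_bias)
```

A log scale is used because the suggested range spans two orders of magnitude; a linear grid would put almost all candidates near the upper end.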

Optimization Features

Token Efficiency

  • Fixed-size KV cache (example: 2K)
  • Selects scattered relevant tokens rather than only recent window

Infra Optimization

  • Lower average memory footprint (~13.5GB) vs baselines
  • Faster per-token decoding at very long contexts (>40K)

System Optimization

  • Single-GPU streaming on 4090D and Orin
  • Does not require model fine-tuning or retraining

Inference Optimization

  • KV cache eviction selecting recent + top-r relevant tokens (attention saddles)
  • Attention bias to favor new tokens and avoid stale accumulation
  • Local-sum attention in a retrieval window for scoring tokens
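The interaction between the relevance score and the attention bias can be illustrated with a toy example. The source does not give the exact form of the bias, so the linear-in-position bias below is an assumption chosen only to show the effect; `biased_scores` and `top_r` are hypothetical helper names.

```python
def biased_scores(scores, bias):
    """Add a recency bias to per-token relevance scores.

    Assumption: the bias grows linearly with position in the cache,
    so newer tokens (higher index) receive a larger boost.
    """
    return [score + bias * position for position, score in enumerate(scores)]


def top_r(scores, r):
    """Return the positions of the r highest-scoring tokens, in order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:r])


# Old tokens 0 and 1 accumulated high scores; tokens 4 and 5 are newer.
scores = [0.9, 0.8, 0.1, 0.1, 0.5, 0.6]
print(top_r(biased_scores(scores, bias=0.0), r=2))  # [0, 1]
print(top_r(biased_scores(scores, bias=0.2), r=2))  # [4, 5]
```

With no bias, the oldest high-scoring tokens keep their slots and the cache stops refreshing; with a bias, newer tokens can displace them. The value 0.2 is exaggerated for this six-token toy and is unrelated to the tuning range suggested elsewhere in this summary.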

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Requires tuning of attention-bias; wrong values can cause model collapse on long contexts (Table 4).
  • Some tasks see small drops or redundant descriptions even if answers remain correct (noted in video QA).
  • Paper shows experiments on certain 7B models; transfer to much larger or very small models is untested.

When Not To Use

  • When you can afford cloud-hosted large models and do not need single-GPU streaming.
  • If you cannot tune attention-bias or wish to avoid runtime hyperparameter control.
  • For tasks that require full exact historical token retention for auditing or debugging.

Failure Modes

  • Model collapse if attention-bias is too small and cache stops updating (Table 4).
  • Missing critical distant tokens if top-r selection omits needed tokens due to scoring noise.
  • Slight degradation or redundant text generation in some video QA cases.

Core Entities

Models

  • Chat-UniVi-7B
  • Flash-VStream-7B
  • Vicuna-7B
  • Pythia-2.8B
  • LLaMA-2-7B-32K

Metrics

  • perplexity
  • accuracy
  • score (GPT-3.5-Turbo evaluator)
  • decoding latency (per-token)
  • memory usage (GB)

Datasets

  • WikiText-103
  • LongEval-LineRetrieval
  • MSVD-QA
  • MSRVTT-QA
  • TGIF-QA
  • VStream-QA

Benchmarks

  • LongEval-LineRetrieval
  • VStream-QA
  • multi-round video QA (concatenated MSVD/MSRVTT/TGIF)