Inf-MLLM: keep multimodal LLMs streaming on a single GPU by caching only recent + relevant tokens

Overview

Decision Snapshot: Needs Validation

The paper provides clear algorithmic steps and experiments on several models and GPUs, but lacks released code and broad third-party replication; tuning (attention bias, cache size) is required for stable results.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo

Links

Abstract / PDF

Why It Matters For Business

Inf-MLLM lets you run continuous multimodal inference on a single GPU, cutting cloud costs and privacy exposure while avoiding out-of-memory failures for long videos and multi-round dialogs.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager

Summary TLDR

Inf-MLLM is a runtime framework that lets multimodal large language models (MLLMs) perform streaming inference on a single GPU without re-training. It keeps a small, fixed-size KV cache by detecting and preserving "attention saddles" (tokens with high attention), evicting less relevant states, and adding an attention bias to favor recent context. The method reduces memory use, avoids OOM on long video/text streams, preserves long-term dependencies, and shows better perplexity, lower latency at large contexts, and stable multi-round QA up to hundreds of rounds and videos up to an hour on single GPUs (4090D and Orin).

Problem Statement

Streaming multimodal inputs create very long contexts. Standard inference caches all key/value (KV) states, so memory grows with context length and attention becomes slower (quadratic cost). Existing eviction or window methods either lose long-term information or fail on multimodal/video streams. The paper seeks a runtime method that preserves inference quality while bounding KV memory on a single GPU, without re-training.
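To make the memory growth concrete, here is a rough back-of-the-envelope calculation. The model shape (32 layers, hidden size 4096, 16-bit values, no grouped-query attention) is an assumption typical of 7B LLaMA-family models such as Vicuna-7B; it is not taken from the paper, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(num_tokens, num_layers=32, hidden_size=4096, bytes_per_value=2):
    """Bytes needed to hold keys and values for num_tokens cached tokens.

    Defaults are assumed values for a 7B LLaMA-style model in fp16.
    """
    # Factor 2: one key vector and one value vector per token per layer.
    return 2 * num_layers * hidden_size * bytes_per_value * num_tokens


GIB = 1024 ** 3
for n in (2_048, 40_960):
    print(f"{n:>6} tokens -> {kv_cache_bytes(n) / GIB:.1f} GiB")
# 2,048 tokens  -> 1.0 GiB
# 40,960 tokens -> 20.0 GiB
```

Under these assumptions an unbounded cache needs about 0.5 MiB per token, so a 40K-token stream alone would take roughly 20 GiB, while a fixed 2K cache stays near 1 GiB.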

Main Contribution

Discovery of "attention saddles": scattered tokens that keep high attention over many decoding steps and matter more than a plain recency window.

A KV-cache eviction policy that keeps recent tokens plus top relevant tokens (attention saddles) using a local-sum attention score over a retrieval window.
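The eviction idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the parameters `num_recent` and `num_relevant`, and the tie-breaking rule are all hypothetical, and a real system would operate per layer and per head on tensors.

```python
def select_kept_positions(attn_rows, num_recent, num_relevant):
    """Return the sorted cache positions to keep after one eviction pass.

    attn_rows: attention weights from the last few query tokens (the
               retrieval window), one row per query, each row covering
               every cached position.
    """
    cache_len = len(attn_rows[0])
    # Always keep the most recent positions.
    recent_start = max(0, cache_len - num_recent)
    keep = set(range(recent_start, cache_len))

    # Local-sum score: total attention each older position received
    # from the queries in the retrieval window.
    scores = [sum(row[pos] for row in attn_rows) for pos in range(recent_start)]

    # Keep the highest-scoring older positions (ties go to the newer position).
    ranked = sorted(range(recent_start), key=lambda pos: (scores[pos], pos), reverse=True)
    keep.update(ranked[:num_relevant])
    return sorted(keep)


# Two queries in the retrieval window, seven cached positions.
attn_rows = [
    [0.05, 0.40, 0.05, 0.05, 0.05, 0.10, 0.30],
    [0.05, 0.35, 0.05, 0.20, 0.05, 0.10, 0.20],
]
print(select_kept_positions(attn_rows, num_recent=2, num_relevant=2))  # [1, 3, 5, 6]
```

Positions 5 and 6 survive because they are recent; positions 1 and 3 survive because they drew the most attention from the window, even though they are scattered rather than contiguous.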

Key Findings

Inf-MLLM keeps language-model perplexity stable on extremely long text (up to 4 million tokens) and achieves lower perplexity than window attention, H2O, and StreamingLLM over the range where the baselines are compared (up to 20K tokens).

Numbers: tested up to 4,000,000 tokens; better PPL than baselines up to 20K (Fig. 5)

Practical Use: You can run long-document or multi-round text streams without retraining; expect lower perplexity than common eviction/window baselines as context grows.

Evidence Ref: Figure 5, Section 4.1

Average memory during decoding stays lower: Inf-MLLM ~13.5GB vs H2O/StreamingLLM ~13.7GB on a 4090D.

Numbers: avg memory ≈13.5GB vs ≈13.7GB (Fig. 7 / Sec 4.5)

Practical Use: On a single high-end GPU you get a small memory saving (about 0.2GB on average) that can help avoid OOM on long multimodal streams.

Evidence Ref: Section 4.5, Fig. 7

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Perplexity on long text	stable up to 4,000,000 tokens; lower log-PPL than baselines to 20K	window attention / H2O / StreamingLLM	lower PPL vs baselines (Fig.5)	Wiki-Text-103 and extended text streams	Figure 5 and Section 4.1	Fig.5
Memory usage during decoding	≈13.5 GB average	H2O and StreamingLLM ≈13.7 GB	≈0.2 GB lower	Vicuna-7B on NVIDIA 4090D (Sec 4.5)	Section 4.5, Fig.7	Section 4.5

What To Try In 7 Days

Prototype Inf-MLLM as a runtime wrapper for your MLLM to bound KV cache (use 2K cache as in paper).

Run a few long-stream tests (text or video) and compare OOM, memory, and latency to your current setup.

Tune the attention-bias parameter: start at 0.0001–0.01 and validate retrieval accuracy for distant facts.
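A small sweep harness for the attention-bias step might look like the sketch below. The helper names are hypothetical, and `evaluate` stands for whatever retrieval-accuracy check you run on your own model (for example, asking for a fact placed far back in the stream); the placeholder scoring function here exists only so the example runs.

```python
def log_spaced(low, high, count):
    """Return `count` values spaced evenly on a log scale from low to high."""
    if count < 2:
        return [low]
    ratio = (high / low) ** (1.0 / (count - 1))
    return [low * ratio ** i for i in range(count)]


def pick_bias(candidates, evaluate):
    """Return (bias, score) with the highest score; ties go to the smaller bias."""
    results = [(evaluate(bias), bias) for bias in candidates]
    best_score, best_bias = max(results, key=lambda r: (r[0], -r[1]))
    return best_bias, best_score


# Candidate values across the suggested 0.0001-0.01 starting range.
candidates = log_spaced(1e-4, 1e-2, 5)


def placeholder_accuracy(bias):
    """Stand-in for a real retrieval-accuracy measurement (assumption)."""
    return 1.0 - abs(bias - 1e-3)


best_bias, best_score = pick_bias(candidates, placeholder_accuracy)
print(candidates, best_bias)
```

A log-spaced grid is used because the suggested range spans two orders of magnitude; replace `placeholder_accuracy` with a real measurement before drawing any conclusion.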

Optimization Features

Token Efficiency

Fixed-size KV cache (example: 2K)

Selects scattered relevant tokens rather than only a recent window

Infra Optimization

Lower average memory footprint (~13.5GB) vs baselines

Faster per-token decoding at very long contexts (>40K)

System Optimization

Single-GPU streaming on 4090D and Orin

Does not require model fine-tuning or retraining

Inference Optimization

KV cache eviction selecting recent + top-r relevant tokens (attention saddles)

Attention bias to favor new tokens and avoid stale accumulation

Local-sum attention in a retrieval window for scoring tokens
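The effect of a recency bias on token scoring can be illustrated with a toy example. The linear form used here (bias multiplied by position) is an assumption made for illustration; this summary does not state the paper's exact formula, and both function names are hypothetical.

```python
def biased_scores(scores, bias):
    """Add a bias that grows linearly with position, so newer tokens rank higher."""
    return [score + bias * pos for pos, score in enumerate(scores)]


def top_positions(scores, count):
    """Positions of the `count` largest scores, in ascending position order."""
    ranked = sorted(range(len(scores)), key=lambda pos: scores[pos], reverse=True)
    return sorted(ranked[:count])


# An early token (position 0) scores slightly higher than a newer one (position 5).
scores = [0.30, 0.02, 0.02, 0.02, 0.02, 0.28]
print(top_positions(biased_scores(scores, 0.0), 1))   # [0]: without bias the early token wins
print(top_positions(biased_scores(scores, 0.01), 1))  # [5]: with bias the newer token wins
```

With no bias, a long-lived early token can hold its slot indefinitely; a small bias lets comparable newer tokens displace it, which matches the stated goal of keeping the cache updating.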

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Requires tuning of attention-bias; wrong values can cause model collapse on long contexts (Table 4).

Some tasks see small drops or redundant descriptions even if answers remain correct (noted in video QA).

When Not To Use

When you can afford cloud-hosted large models and do not need single-GPU streaming.

If you cannot tune attention-bias or wish to avoid runtime hyperparameter control.

Failure Modes

Model collapse if attention-bias is too small and cache stops updating (Table 4).

Missing critical distant tokens if top-r selection omits needed tokens due to scoring noise.

Core Entities

Models

Chat-UniVi-7B, Flash-VStream-7B, Vicuna-7B, Pythia-2.8B, LLaMA-2-7B-32K

Metrics

perplexity, accuracy, score (GPT-3.5-Turbo evaluator), decoding latency (per-token), memory usage (GB)

Datasets

Wiki-Text-103, LongEval-LineRetrieval, MSVD-QA, MSRVTT-QA, TGIF-QA, VStream-QA

Benchmarks

LongEval-LineRetrieval, VStream-QA, multi-round video QA (concatenated MSVD/MSRVTT/TGIF)

You May Also Want to Read

Skip 25–30% of expensive FFN blocks to speed decoding while keeping knowledge accuracy

KV-CoRE: an SVD-based tool and benchmark that measures how compressible LLM KV-caches are, per layer and per dataset.

Share the common KV cache across LoRA-adapted agents and keep tiny low-rank adapters to cut memory and speed up multi-agent inference.

KV-cache compression breaks attention routing: reachability, a 90% safety cliff, and two failure modes

Use per-token unstructured pruning + a bitmap sparse kernel to cut KV cache to ~45% size and speed decoding up to 2.23×