Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
Inf-MLLM lets you run continuous multimodal inference on a single GPU, cutting cloud costs and privacy exposure while avoiding out-of-memory failures for long videos and multi-round dialogs.
Summary TLDR
Inf-MLLM is a runtime framework that lets multimodal large language models (MLLMs) perform streaming inference on a single GPU without retraining. It keeps a small, fixed-size KV cache by detecting and preserving "attention saddles" (tokens with high attention), evicting less relevant states, and adding an attention bias to favor recent context. The method reduces memory use, avoids out-of-memory (OOM) failures on long video and text streams, and preserves long-term dependencies. Reported results include better perplexity, lower latency at large context lengths, and stable multi-round QA over hundreds of rounds and videos up to an hour long on a single GPU (4090D or Orin).
Problem Statement
Streaming multimodal inputs create very long contexts. Standard inference caches all key/value (KV) states, so memory grows with context length and attention slows down (quadratic cost). Existing eviction or window methods either lose long-term information or fail on multimodal/video streams. The paper seeks a runtime method that keeps inference quality while bounding KV memory on a single GPU, without retraining.
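To make the memory growth concrete, here is a rough back-of-the-envelope sketch. The model shape (32 layers, hidden size 4096, 16-bit values, no grouped-query attention) is an assumption typical of 7B-class models, not a figure taken from the paper, and `kv_cache_bytes` is a hypothetical helper.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_tokens, num_layers=32, hidden_size=4096, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens.

    Defaults are assumed values for a 7B-class model in 16-bit precision.
    """
    # Factor 2: one key vector and one value vector per layer per token.
    return 2 * num_layers * hidden_size * bytes_per_value * num_tokens


for n in (2_048, 32_768, 4_000_000):
    print(f"{n} tokens -> {kv_cache_bytes(n) / GIB:.1f} GiB")
# 2048 tokens -> 1.0 GiB
# 32768 tokens -> 16.0 GiB
# 4000000 tokens -> 1953.1 GiB
```

Under these assumptions each token costs 0.5 MiB, so a 2K-token cache stays near 1 GiB, while an unbounded cache over millions of tokens would need terabytes.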
Main Contribution
Discovery of "attention saddles": scattered tokens that keep high attention over many decoding steps and matter more than a plain recency window.
A KV-cache eviction policy that keeps recent tokens plus top relevant tokens (attention saddles) using a local-sum attention score over a retrieval window.
An attention-bias term that shifts focus toward newer tokens so the cache keeps updating over multi-round streaming and, with tuning, preserves long-term dependencies.
Demonstrations on several LLMs and MLLMs (Vicuna, Pythia, LLaMA-2, Chat-UniVi, Flash-VStream) showing streaming over very long text (tested up to 4M tokens) and long video QA on a single GPU.
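The eviction policy described above can be sketched as follows. This is one plausible reading of the description, not the paper's implementation: the function name `select_kept_positions` and the parameters `num_recent` and `top_r` are illustrative, and the attention weights are toy values.

```python
def select_kept_positions(attn_rows, num_recent, top_r):
    """Return the sorted cache positions to keep after eviction.

    attn_rows: attention weights from the last few queries (the retrieval
               window); each row holds one weight per cached position.
    """
    cache_len = len(attn_rows[0])
    # Local sum: total attention each cached position received in the window.
    scores = [sum(row[j] for row in attn_rows) for j in range(cache_len)]
    # Always keep the most recent positions.
    recent_start = max(0, cache_len - num_recent)
    recent = set(range(recent_start, cache_len))
    # Among older positions, keep the top_r highest-scoring ones.
    older = sorted(range(recent_start), key=lambda j: scores[j], reverse=True)
    return sorted(recent | set(older[:top_r]))


# Toy example: 8 cached positions, 2 queries in the retrieval window.
attn_rows = [
    [0.02, 0.30, 0.03, 0.02, 0.25, 0.03, 0.15, 0.20],
    [0.01, 0.35, 0.02, 0.02, 0.20, 0.05, 0.15, 0.20],
]
print(select_kept_positions(attn_rows, num_recent=2, top_r=2))  # [1, 4, 6, 7]
```

Positions 1 and 4 survive even though they are far from the end of the cache, which is the behavior a plain recency window cannot provide.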
Key Findings
Inf-MLLM keeps language-model perplexity stable on extremely long text (up to 4 million tokens) and outperforms window attention, H2O, and StreamingLLM over the tested ranges.
Average memory during decoding is slightly lower: Inf-MLLM ~13.5GB vs H2O/StreamingLLM ~13.7GB on a 4090D.
Inf-MLLM prevents OOM and sustains multi-round video QA: Chat-UniVi reaches 72.7% accuracy at 300 rounds, whereas it runs out of memory without Inf-MLLM.
Long-term retrieval-style QA accuracy improves sharply: LLaMA-2-7B-32K reaches ~100% vs 40% for StreamingLLM at a 115-token distance.
Results
Perplexity on long text
Memory usage during decoding
Accuracy
Accuracy
VStream-QA long video handling (300 rounds)
Who Should Care
What To Try In 7 Days
Prototype Inf-MLLM as a runtime wrapper around your MLLM to bound the KV cache (start with a 2K cache, as in the paper).
Run a few long-stream tests (text or video) and compare OOM, memory, and latency to your current setup.
Tune the attention-bias parameter: start at 0.0001–0.01 and validate retrieval accuracy for distant facts.
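For the tuning step, a small sweep harness might look like the sketch below. It only generates log-spaced bias values between 0.0001 and 0.01 and picks the best one; `evaluate` is a hypothetical callable you would supply (for example, one that runs your retrieval QA set with a given bias and returns accuracy). The toy scoring function is a stand-in so the example runs, not a measured result.

```python
import math


def log_grid(low, high, num):
    """Return num log-spaced values from low to high (inclusive)."""
    ratio = (high / low) ** (1 / (num - 1))
    return [low * ratio ** i for i in range(num)]


def sweep(evaluate, biases):
    """Score every bias with evaluate and return (best_bias, all_results)."""
    results = {b: evaluate(b) for b in biases}
    best = max(results, key=results.get)
    return best, results


def toy_score(bias):
    # Placeholder only: replace with a real retrieval-accuracy measurement.
    return -abs(math.log10(bias) + 3)


grid = log_grid(1e-4, 1e-2, 5)
best, results = sweep(toy_score, grid)
print([f"{b:.5f}" for b in grid], "best:", f"{best:.5f}")
```

A log-spaced grid is used here because the suggested range spans two orders of magnitude; a linear grid would put almost all points near the upper end.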
Optimization Features
Token Efficiency
- Fixed-size KV cache (example: 2K)
- Selects scattered relevant tokens rather than only recent window
Infra Optimization
- Lower average memory footprint (~13.5GB) vs baselines
- Faster per-token decoding at very long contexts (>40K)
System Optimization
- Single-GPU streaming on 4090D and Orin
- Does not require model fine-tuning or retraining
Inference Optimization
- KV cache eviction selecting recent + top-r relevant tokens (attention saddles)
- Attention bias to favor new tokens and avoid stale accumulation
- Local-sum attention in a retrieval window for scoring tokens
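A minimal sketch of how an attention bias could keep the cache from going stale. The source does not give the formula, so the linear form used here (each position j gets an extra `bias * j` added to its relevance score) is an assumption chosen for illustration, and `biased_scores` and `top_r_positions` are hypothetical names.

```python
def biased_scores(scores, bias):
    """Add a position-proportional bonus so newer positions are favored.

    Assumed linear form: position j receives an extra bias * j.
    """
    return [s + bias * j for j, s in enumerate(scores)]


def top_r_positions(scores, r):
    """Return the r highest-scoring positions, in ascending position order."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(order[:r])


# Toy relevance scores: old positions 0 and 1 score slightly higher
# than the newer positions 4 and 5.
scores = [0.50, 0.48, 0.05, 0.04, 0.47, 0.46]

print(top_r_positions(biased_scores(scores, 0.0), 2))   # [0, 1]
print(top_r_positions(biased_scores(scores, 0.01), 2))  # [4, 5]
```

With no bias the two oldest tokens hold their slots indefinitely; a small bias lets comparably relevant newer tokens displace them. In this toy form, a larger bias would push selection toward pure recency.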
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Requires tuning the attention bias; poorly chosen values can cause model collapse on long contexts (Table 4).
- Some tasks see small drops or redundant descriptions even if answers remain correct (noted in video QA).
- Paper shows experiments on certain 7B models; transfer to much larger or very small models is untested.
When Not To Use
- When you can afford cloud-hosted large models and do not need single-GPU streaming.
- If you cannot tune attention-bias or wish to avoid runtime hyperparameter control.
- For tasks that require full exact historical token retention for auditing or debugging.
Failure Modes
- Model collapse if attention-bias is too small and cache stops updating (Table 4).
- Missing critical distant tokens if top-r selection omits needed tokens due to scoring noise.
- Slight degradation or redundant text generation in some video QA cases.
Core Entities
Models
- Chat-UniVi-7B
- Flash-VStream-7B
- Vicuna-7B
- Pythia-2.8B
- LLaMA-2-7B-32K
Metrics
- perplexity
- accuracy
- score (GPT-3.5-Turbo evaluator)
- decoding latency (per-token)
- memory usage (GB)
Datasets
- WikiText-103
- LongEval-LineRetrieval
- MSVD-QA
- MSRVTT-QA
- TGIF-QA
- VStream-QA
Benchmarks
- LongEval-LineRetrieval
- VStream-QA
- multi-round video QA (concatenated MSVD/MSRVTT/TGIF)

