Overview
Production Readiness: 0.7
Novelty Score: 0.6
Cost Impact Score: 0.85
Citation Count: 0
Why It Matters For Business
Routing each multimodal query to a model matched to its complexity can match the strongest single model's accuracy at roughly one-third of its cost on the evaluated workloads, letting teams cut compute spend or serve more queries under a fixed budget.
Summary TLDR
MMR-Bench is an offline, cost-aware benchmark and dataset of 11k instance–model outcomes designed to evaluate per-query model selection (routing) across multimodal large language models (MLLMs). The benchmark supplies precomputed outputs, normalized costs, and utilities for 10 MLLMs across 8 datasets in OCR, VQA, and visual math. Key findings: multimodal features improve routing over unimodal ones; matrix-factorization (MF) style routers give the most stable cost-accuracy trade-offs; and a routed system can match or beat the best single model at about 33% of its cost on evaluated workloads.
Problem Statement
No single multimodal LLM (MLLM) is best for every multimodal input. Using one model for all queries either wastes compute on easy inputs or loses accuracy on hard ones. Existing routing benchmarks focus on text-only LLMs and do not capture modality fusion, diverse model costs, or budget-aware evaluation for MLLMs. We need a standardized, offline benchmark and protocols to measure how well routing policies trade off cost and accuracy in multimodal workloads.
Main Contribution
MMR-Bench: an offline outcome table with precomputed model outputs, normalized costs and utilities for 11,000 instance–model pairs across 10 MLLMs and 8 datasets (OCR, general VQA, visual math).
A standardized, budget-aware evaluation protocol and three summary metrics: normalized AUC (nAUC), peak score (P_s), and Quality-Neutral Cost (QNC).
A comparison of routing families (non-parametric, regression, MF-style) and fusion strategies, showing multimodal fusion and MF routers yield the best, most robust cost-accuracy frontiers.
Analyses showing that adaptive multimodal fusion closes modality gaps, that routers generalize across datasets within the same scenario, and that multimodal-trained routers transfer zero-shot to text-only benchmarks.
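The three summary metrics named above can be illustrated on a toy cost-accuracy curve. This summary does not give their formal definitions, so the sketch below rests on assumptions: nAUC is taken as the trapezoid area under accuracy versus cost divided by the cost span, P_s as the highest accuracy on the curve, and QNC as the cheapest operating point that reaches the best single model's accuracy, expressed as a fraction of that model's cost. The function name `summarize_frontier` and all numbers are hypothetical.

```python
import numpy as np

def summarize_frontier(costs, accs, best_single_cost, best_single_acc):
    """Summarize a router's cost-accuracy curve (assumed metric definitions)."""
    order = np.argsort(costs)
    c = np.asarray(costs, dtype=float)[order]
    a = np.asarray(accs, dtype=float)[order]
    # Trapezoid area under accuracy vs. cost, divided by the cost span,
    # i.e. the average accuracy over the swept budget range.
    area = float(np.sum((c[1:] - c[:-1]) * (a[1:] + a[:-1]) / 2.0))
    nauc = area / float(c[-1] - c[0])
    # Peak score: best accuracy reached anywhere on the curve.
    peak = float(a.max())
    # QNC (assumed): cheapest point that matches the best single model,
    # relative to that model's cost; None if the router never matches it.
    ok = a >= best_single_acc
    qnc = float(c[ok].min() / best_single_cost) if ok.any() else None
    return {"nAUC": nauc, "P_s": peak, "QNC": qnc}

# Toy operating points from sweeping a budget knob (made-up numbers).
result = summarize_frontier(
    costs=[0.1, 0.3, 0.6, 1.0],
    accs=[0.50, 0.70, 0.80, 0.82],
    best_single_cost=1.0,
    best_single_acc=0.78,
)
```

Under these assumed definitions, a QNC below 1 means the router reaches the best single model's accuracy for less than that model's cost.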
Key Findings
Multimodal routing matches or exceeds the accuracy of the strongest single model at roughly one-third of its cost on evaluated benchmarks.
Matrix-factorization style routers give the most stable, high aggregate performance across tasks.
Adaptive multimodal fusion substantially improves routing over equal-weight fusion for many routers.
A multimodal-trained router transfers zero-shot to text-only benchmarks and outperforms the best single model there.
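The contrast between adaptive and equal-weight fusion in the findings above can be made concrete with a small sketch. It assumes text and image embeddings of equal dimension (as frozen CLIP encoders would produce) and a scalar confidence per modality and per query. How those confidences are obtained is not specified in this summary, so they are plain inputs here, and `adaptive_fuse` is a hypothetical name rather than the paper's implementation.

```python
import numpy as np

def adaptive_fuse(text_emb, image_emb, text_conf, image_conf):
    """Fuse two modalities with per-query confidence weights.

    Returns [weighted sum, elementwise product, difference] per query.
    Equal confidences reduce the weighted part to equal-weight fusion.
    """
    t = np.asarray(text_emb, dtype=float)   # (n, d)
    v = np.asarray(image_emb, dtype=float)  # (n, d)
    logits = np.stack([text_conf, image_conf], axis=1).astype(float)  # (n, 2)
    # Softmax over the two modality confidences, stabilized per row.
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    weighted = w[:, :1] * t + w[:, 1:] * v
    # Product and difference features expose cross-modal (dis)agreement.
    return np.hstack([weighted, t * v, t - v])

t = np.array([[1.0, 0.0], [0.0, 1.0]])
v = np.array([[0.0, 1.0], [1.0, 0.0]])
equal = adaptive_fuse(t, v, [0.0, 0.0], [0.0, 0.0])      # equal-weight baseline
adaptive = adaptive_fuse(t, v, [2.0, -2.0], [0.0, 0.0])  # per-query weights
```

Either output can be fed to the same router, which keeps an equal-weight versus adaptive comparison on a held-out set like for like.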
Results
nAUC (full dataset, LinearMFRouter)
Peak score P_s (full dataset, LinearMFRouter)
Accuracy
Adaptive fusion gain (KMeans router)
Who Should Care
What To Try In 7 Days
Assemble an offline outcome table for your models on a small in-domain sample (store outputs, normalized costs, and utilities).
Implement a simple MF-style router (low-rank projection + regressors) on fused text/image embeddings and evaluate nAUC and P_s.
Add adaptive fusion (per-modality confidence + product/difference features) and compare to equal-weight fusion on a held-out set.
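As a starting point for the second step above, here is a minimal NumPy sketch of an MF-style router: a low-rank projection of the fused embeddings followed by one ridge regressor per candidate model, with a cost penalty at selection time. It is an illustration under assumptions and not the paper's LinearMFRouter: the projection is plain PCA, the class name `LowRankRouter`, the trade-off weight `lam`, and the synthetic data are all hypothetical.

```python
import numpy as np

class LowRankRouter:
    """Low-rank projection + per-model ridge regressors (illustrative)."""

    def __init__(self, rank=8, ridge=1e-2):
        self.rank, self.ridge = rank, ridge

    def _project(self, X):
        Z = (X - self.mean_) @ self.P_
        return np.hstack([Z, np.ones((len(Z), 1))])  # append a bias column

    def fit(self, X, Y):
        # X: (n, d) fused query embeddings; Y: (n, m) per-model utilities.
        self.mean_ = X.mean(axis=0)
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.P_ = vt[: self.rank].T                  # (d, rank) projection
        Z = self._project(X)
        # Ridge solution for all models at once (bias is penalized too).
        A = Z.T @ Z + self.ridge * np.eye(Z.shape[1])
        self.W_ = np.linalg.solve(A, Z.T @ Y)        # (rank + 1, m)
        return self

    def predict_utility(self, X):
        return self._project(X) @ self.W_

    def route(self, X, costs, lam=0.0):
        # Pick the model maximizing predicted utility minus lam * cost.
        return np.argmax(self.predict_utility(X) - lam * np.asarray(costs), axis=1)

# Synthetic stand-in for an offline outcome table (not real benchmark data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
Y = np.stack([(X[:, 0] < 0).astype(float),   # cheap model: right on "easy" queries
              np.full(200, 0.9)], axis=1)    # strong model: right 90% of the time
costs = np.array([0.1, 1.0])                 # normalized per-model costs

router = LowRankRouter(rank=8).fit(X, Y)
scores = router.predict_utility(X)
default = router.route(X, costs, lam=0.0)
cheap_only = router.route(X, costs, lam=10.0)
```

Sweeping `lam` from 0 upward traces a cost-accuracy curve, which is the input an nAUC or P_s computation would need. In practice the router should be fit on a training split and evaluated on held-out queries.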
Agent Features
Frameworks
- adaptive fusion
- frozen embeddings (CLIP)
- low-rank projection
Architectures
- k-means
- k-NN
- linear regression
- MLP
- matrix factorization
- cross-modal attention
Optimization Features
Token Efficiency
- cost normalization across token outputs
Inference Optimization
- model routing
- model cascades
- cost-aware selection
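One way to read "cost normalization across token outputs" is sketched below: price each instance-model pair from its token counts, then rescale so the most expensive model has a mean cost of 1. This is an assumed scheme for illustration. The benchmark's actual normalization is not described in this summary, and the prices and token counts are made up.

```python
import numpy as np

def normalized_costs(in_tokens, out_tokens, price_in, price_out):
    """Per (query, model) cost, scaled by the priciest model's mean cost.

    in_tokens: (n,) prompt tokens per query (assumed equal across models).
    out_tokens: (n, m) generated tokens per query and model.
    price_in, price_out: (m,) per-token prices in arbitrary units.
    """
    in_tokens = np.asarray(in_tokens, dtype=float)[:, None]  # (n, 1)
    out_tokens = np.asarray(out_tokens, dtype=float)         # (n, m)
    raw = in_tokens * np.asarray(price_in) + out_tokens * np.asarray(price_out)
    return raw / raw.mean(axis=0).max()

# Two queries, two models (hypothetical prices and token counts).
norm = normalized_costs(
    in_tokens=[100, 200],
    out_tokens=[[10, 50], [20, 100]],
    price_in=[1.0, 10.0],
    price_out=[2.0, 40.0],
)
```

Because the scale is relative, the table has to be recomputed whenever live prices change, which is the cost-drift concern listed under Failure Modes.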
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Offline-only protocol: it does not allow adaptive multi-call routing or online decision feedback.
- Fixed cost normalization approximates real cloud pricing and may mismatch live costs.
- Model pool limited to 10 candidate MLLMs; results may shift with different models.
- Benchmarks focus on OCR, VQA, and visual math; other multimodal tasks (video, audio) are not covered.
When Not To Use
- When you require adaptive, multi-call routing that queries multiple models per instance.
- If your production cost model or model pool differs substantially from the paper's normalization.
- For modalities the benchmark does not represent (audio, long video); MMR-Bench targets text and image inputs.
Failure Modes
- Misrouting when modality signals are misestimated (e.g., low-quality images leading to overconfident cheap-model selection).
- Performance drop if the candidate zoo lacks a truly high-capability model for certain reasoning tasks.
- Cost model drift: normalized offline costs diverge from real deployment pricing and break QNC estimates.
- Clustering or instance-based routers can be brittle under domain shift or small training sets.
Core Entities
Models
- GPT-5 series
- Gemini 2.5 series
- Claude 3.7
- InternVL3-78B
- Qwen2.5-VL-3B
- Qwen2.5-VL-7B
- Qwen2.5-VL-72B
- Gemma3-4B
- GPT-5-Nano
- Gemini 2.5 Flash
Metrics
- nAUC
- Peak score (P_s)
- Quality-Neutral Cost (QNC)
- Accuracy
Datasets
- OCRBench
- SEED-Bench-2-Plus
- MMStar
- RealWorldQA
- MathVista
- MathVerse
- MathVision
Benchmarks
- MMR-Bench
- RouterBench
- RouterEval

