Overview
The benchmark provides offline outcome tables and standard metrics for repeatable comparison. Experiments across 10 MLLMs and 8 datasets support its claims, but how well they transfer depends on how closely your workload matches the evaluated scenarios.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 2/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 85%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
Routing multimodal queries to different models by complexity cuts average inference cost substantially while preserving top-model accuracy, letting teams save compute or serve more queries under fixed budgets.
Summary TLDR
MMR-Bench is an offline, cost-aware benchmark and dataset of 11k instance–model outcomes designed to evaluate per-query model selection (routing) across multimodal large language models (MLLMs). The benchmark supplies precomputed outputs, normalized costs, and utilities for 10 MLLMs across 8 datasets in OCR, VQA, and visual math. Key findings: multimodal features improve routing over unimodal ones; matrix-factorization (MF) style routers give the most stable cost-accuracy trade-offs; and a routed system can match or beat the best single model at about 33% of its cost on evaluated workloads.
Problem Statement
No single multimodal LLM (MLLM) is best for every multimodal input. Using one model for all queries either wastes compute on easy inputs or loses accuracy on hard ones. Existing routing benchmarks focus on text-only LLMs and do not capture modality fusion, diverse model costs, or budget-aware evaluation for MLLMs. We need a standardized, offline benchmark and protocols to measure how well routing policies trade off cost and accuracy in multimodal workloads.
Main Contribution
MMR-Bench: an offline outcome table with precomputed model outputs, normalized costs and utilities for 11,000 instance–model pairs across 10 MLLMs and 8 datasets (OCR, general VQA, visual math).
A standardized, budget-aware evaluation protocol and three summary metrics: normalized AUC (nAUC), peak score (P_s), and Quality-Neutral Cost (QNC).
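The three summary metrics can be illustrated on a small cost-accuracy curve. This summary does not reproduce the paper's exact formulas, so the definitions in this sketch are assumptions: nAUC is taken as the trapezoidal area under the curve divided by the cost range, P_s as the highest accuracy on the curve, and QNC as the lowest cost at which the router reaches the best single model's accuracy, divided by that model's cost. The function name `summarize_curve` and the toy numbers are hypothetical.

```python
def summarize_curve(points, best_single_acc, best_single_cost):
    """Summarize a router's cost-accuracy curve.

    points: list of (normalized_cost, accuracy) pairs, one per budget setting.
    Returns (nauc, peak, qnc) under the assumed definitions described above.
    """
    pts = sorted(points)
    # Trapezoidal area under the accuracy-vs-cost curve.
    area = sum(
        (c1 - c0) * (a0 + a1) / 2.0
        for (c0, a0), (c1, a1) in zip(pts, pts[1:])
    )
    # Normalize by the cost range so a flat curve at accuracy a scores a.
    nauc = area / (pts[-1][0] - pts[0][0])
    # Peak score: best accuracy reached at any budget.
    peak = max(a for _, a in pts)
    # Quality-neutral cost: cheapest budget that matches the best single model.
    matching = [c for c, a in pts if a >= best_single_acc]
    qnc = min(matching) / best_single_cost if matching else None
    return nauc, peak, qnc


# Toy curve with made-up numbers, not results from the benchmark.
curve = [(0.2, 0.5), (0.6, 0.7), (1.0, 0.8)]
nauc, peak, qnc = summarize_curve(curve, best_single_acc=0.7, best_single_cost=1.0)
```

Under these assumed definitions, a QNC below 1 means the router matches the best single model's accuracy for less than that model's cost.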
Key Findings
Multimodal routing matches or exceeds the accuracy of the strongest single model at roughly one-third of its cost on evaluated benchmarks.
Matrix-factorization style routers give the most stable, high aggregate performance across tasks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| nAUC (full dataset, LinearMFRouter) | 0.7042 | — | — | macro-average across 8 datasets (Sec.5.3, Table 2) | Table 2 reports LinearMFRouter nAUC=0.7042 | Table 2 |
| Peak score P_s (full dataset, LinearMFRouter) | 0.7533 | — | — | macro-average across 8 datasets (Sec.5.3, Table 2) | Table 2 reports LinearMFRouter P_s=0.7533 | Table 2 |
What To Try In 7 Days
Assemble an offline outcome table for your models on a small in-domain sample (store outputs, normalized costs, and utilities).
Implement a simple MF-style router (low-rank projection + regressors) on fused text/image embeddings and evaluate nAUC and P_s.
Add adaptive fusion (per-modality confidence + product/difference features) and compare to equal-weight fusion on a held-out set.
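The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: `fuse` is a hypothetical confidence-weighted fusion with product and difference features, `fit_low_rank_router` stands in for an MF-style router by fitting ridge regressors and truncating the coefficient matrix to low rank with an SVD, and `route` trades predicted utility against normalized cost with a hypothetical penalty `lam`. The toy outcome table uses made-up numbers.

```python
import numpy as np


def fuse(text_emb, image_emb, conf_text, conf_image):
    """Hypothetical adaptive fusion: confidence-weighted embeddings plus
    element-wise product and difference features."""
    t = np.asarray(text_emb, dtype=float)
    v = np.asarray(image_emb, dtype=float)
    total = conf_text + conf_image
    w_t, w_v = conf_text / total, conf_image / total
    return np.concatenate([w_t * t, w_v * v, t * v, t - v])


def fit_low_rank_router(X, U, rank, ridge=1e-6):
    """Regress utilities U (n x m) on features X (n x d) with ridge regression,
    then truncate the d x m coefficient matrix to `rank` via SVD.
    Returns (P, Q) such that predicted utilities are X @ P @ Q."""
    d = X.shape[1]
    B = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ U)
    left, sing, right_t = np.linalg.svd(B, full_matrices=False)
    P = left[:, :rank] * sing[:rank]  # d x rank: query-side projection
    Q = right_t[:rank, :]             # rank x m: one latent column per model
    return P, Q


def route(x, P, Q, costs, lam):
    """Pick the model maximizing predicted utility minus lam * normalized cost."""
    scores = x @ P @ Q - lam * np.asarray(costs, dtype=float)
    return int(np.argmax(scores))


# Toy outcome table (made-up): features are [bias, difficulty]; model 0 is
# cheap and degrades with difficulty, model 1 is strong but expensive.
difficulty = np.linspace(0.0, 1.0, 11)
X = np.column_stack([np.ones_like(difficulty), difficulty])
U = np.column_stack([1.0 - difficulty, np.full_like(difficulty, 0.9)])
costs = [0.1, 1.0]

P, Q = fit_low_rank_router(X, U, rank=2)
easy_choice = route(np.array([1.0, 0.2]), P, Q, costs, lam=0.5)
hard_choice = route(np.array([1.0, 0.9]), P, Q, costs, lam=0.5)
fused = fuse([1.0, 2.0], [3.0, 4.0], conf_text=0.75, conf_image=0.25)
```

In this toy setup the easy query goes to the cheap model and the hard query to the strong one. Sweeping `lam` from 0 upward traces out a cost-accuracy curve on a held-out set, which is what nAUC and P_s are meant to summarize. In practice the rows of `X` would come from `fuse` applied to real text and image embeddings.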
Risks & Boundaries
Limitations
Offline-only protocol: it does not cover adaptive multi-call routing or online decision feedback.
Fixed cost normalization only approximates real cloud pricing and may not match live costs.
When Not To Use
When you require adaptive, multi-call routing that queries multiple models per instance.
When your production cost model differs substantially from the paper's cost normalization, or your model pool differs substantially from the benchmarked MLLMs.
Failure Modes
Misrouting when modality signals are misestimated (e.g., low-quality images leading to overconfident cheap-model selection).
Performance drop if the candidate zoo lacks a truly high-capability model for certain reasoning tasks.

