MMR-Bench: measure and optimize per-query model selection for multimodal LLMs under cost budgets

Overview

Decision Snapshot: Needs Validation

The benchmark provides offline outcome tables and standard metrics for repeatable comparison. Experimental results across 10 real models and 8 datasets support its claims, but applicability depends on how closely your workload matches the benchmark's scenarios.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 85%

Production readiness: 70%

Novelty: 60%

Authors

Haoxuan Ma, Guannan Lai, Han-Jia Ye

Links

Abstract / PDF / Code

Why It Matters For Business

Routing multimodal queries to different models by complexity cuts average inference cost substantially while preserving top-model accuracy, letting teams save compute or serve more queries under fixed budgets.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

MMR-Bench is an offline, cost-aware benchmark and dataset of 11k instance–model outcomes designed to evaluate per-query model selection (routing) across multimodal large language models (MLLMs). The benchmark supplies precomputed outputs, normalized costs, and utilities for 10 MLLMs across 8 datasets in OCR, VQA, and visual math. Key findings: multimodal features improve routing over unimodal ones; matrix-factorization (MF) style routers give the most stable cost-accuracy trade-offs; and a routed system can match or beat the best single model at about 33% of its cost on evaluated workloads.

Problem Statement

No single multimodal LLM (MLLM) is best for every multimodal input. Using one model for all queries either wastes compute on easy inputs or loses accuracy on hard ones. Existing routing benchmarks focus on text-only LLMs and do not capture modality fusion, diverse model costs, or budget-aware evaluation for MLLMs. We need a standardized, offline benchmark and protocols to measure how well routing policies trade off cost and accuracy in multimodal workloads.

Main Contribution

MMR-Bench: an offline outcome table with precomputed model outputs, normalized costs and utilities for 11,000 instance–model pairs across 10 MLLMs and 8 datasets (OCR, general VQA, visual math).

A standardized, budget-aware evaluation protocol and three summary metrics: normalized AUC (nAUC), peak score (P_s), and Quality-Neutral Cost (QNC).
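
To make the three summary metrics concrete, here is a minimal sketch under assumed definitions. This summary does not give the formulas, so the paper's exact definitions may differ. The sketch assumes a router is swept over budgets to produce (normalized cost, score) points, takes nAUC as the trapezoidal area under that curve divided by the cost span, P_s as the highest score on the curve, and QNC as the lowest normalized cost at which the router reaches the best single model's score. All function names and the toy curve are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (normalized cost, score)


def n_auc(curve: List[Point]) -> float:
    """Trapezoidal area under the cost-score curve, divided by the cost span.

    Assumed definition; the paper may normalize differently.
    """
    pts = sorted(curve)
    span = pts[-1][0] - pts[0][0]
    if span <= 0:
        raise ValueError("curve must cover a positive cost range")
    area = 0.0
    for (c0, s0), (c1, s1) in zip(pts, pts[1:]):
        area += 0.5 * (s0 + s1) * (c1 - c0)
    return area / span


def peak_score(curve: List[Point]) -> float:
    """Highest score reached anywhere on the curve (assumed reading of P_s)."""
    return max(score for _, score in curve)


def quality_neutral_cost(curve: List[Point], best_single_score: float) -> Optional[float]:
    """Lowest cost at which the router matches the best single model, or None."""
    costs = [cost for cost, score in curve if score >= best_single_score]
    return min(costs) if costs else None


# Toy curve with invented numbers, not results from the paper.
curve = [(0.0, 0.50), (0.5, 0.70), (1.0, 0.80)]
print(n_auc(curve), peak_score(curve), quality_neutral_cost(curve, 0.70))
```

Under these assumptions, a lower QNC means the router reaches single-model quality at a smaller share of the budget, while nAUC rewards good scores across the whole budget range rather than at one operating point.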

Key Findings

Multimodal routing matches or exceeds the accuracy of the strongest single model at roughly one-third of its cost on evaluated benchmarks.

Numbers: Routed system ≈ same accuracy as best single model at ~33% cost (Sec. 5.3, Fig. 4)

Practical Use: Use multimodal routing to cut average inference spend by ~2/3 on comparable workloads while keeping top-model accuracy; build an offline outcome table to test feasibility before switching live traffic.

Evidence Ref: Figure 4; Sec. 5.3
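
The feasibility check can be run entirely offline: once every model's utility and cost are stored per query, any routing policy can be replayed without calling a model. The sketch below illustrates the idea with a made-up four-query table, two hypothetical models ("small" and "large"), and a hypothetical difficulty score; none of these values come from the paper, and the toy numbers do not reproduce its ~33% figure.

```python
from typing import Callable, Dict, List, Tuple

# outcomes[i][model] = (utility, normalized_cost) for query i.
# All values are invented for illustration.
Outcome = Dict[str, Tuple[float, float]]

OUTCOMES: List[Outcome] = [
    {"small": (1.0, 0.1), "large": (1.0, 1.0)},
    {"small": (0.0, 0.1), "large": (1.0, 1.0)},
    {"small": (1.0, 0.1), "large": (1.0, 1.0)},
    {"small": (0.0, 0.1), "large": (0.0, 1.0)},
]
DIFFICULTY = [0.2, 0.9, 0.1, 0.8]  # hypothetical per-query difficulty estimate


def replay(policy: Callable[[int], str], outcomes: List[Outcome]) -> Tuple[float, float]:
    """Return (mean utility, mean normalized cost) of a policy using only the table."""
    total_utility = 0.0
    total_cost = 0.0
    for i, row in enumerate(outcomes):
        utility, cost = row[policy(i)]
        total_utility += utility
        total_cost += cost
    n = len(outcomes)
    return total_utility / n, total_cost / n


def always_large(i: int) -> str:
    return "large"


def by_difficulty(i: int) -> str:
    # Send only queries estimated as hard to the expensive model.
    return "large" if DIFFICULTY[i] > 0.5 else "small"


baseline = replay(always_large, OUTCOMES)
routed = replay(by_difficulty, OUTCOMES)
print("baseline:", baseline, "routed:", routed)
```

In this toy table the routed policy keeps the baseline's mean utility while spending roughly 55% of its cost; running the same replay on an in-domain sample shows whether a comparable saving is plausible for your workload.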

Matrix-factorization style routers give the most stable, high aggregate performance across tasks.

Numbers: LinearMFRouter nAUC=0.7042 and P_s=0.7533 on full dataset average (Table 2)

Practical Use: Start with an MF-style router (low-rank projection + regressors) for mixed multimodal workloads to get steady cost-aware gains rather than brittle instance-only methods.

Evidence Ref: Table 2; Sec. 5.3
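
As a rough illustration of "low-rank projection + regressors", the sketch below projects query embeddings onto their top principal directions and fits one ridge regressor per model to predict utility, then routes by predicted utility minus a cost penalty. This is an assumed, simplified design and not the paper's LinearMFRouter implementation; the class name, the `rank`, `ridge`, and `lam` parameters, and the scoring rule are all hypothetical.

```python
import numpy as np


class LowRankRouter:
    """Low-rank projection of query embeddings plus per-model linear regressors."""

    def __init__(self, rank: int = 4, ridge: float = 1e-2):
        self.rank = rank
        self.ridge = ridge

    def _latent(self, X: np.ndarray) -> np.ndarray:
        # Project centered embeddings to the low-rank space and append a bias column.
        Z = (X - self.mean_) @ self.P_
        return np.hstack([Z, np.ones((Z.shape[0], 1))])

    def fit(self, X: np.ndarray, U: np.ndarray) -> "LowRankRouter":
        # X: (n, d) query embeddings; U: (n, m) observed utilities per model.
        self.mean_ = X.mean(axis=0)
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.P_ = vt[: self.rank].T  # (d, rank) projection matrix
        Z = self._latent(X)
        A = Z.T @ Z + self.ridge * np.eye(Z.shape[1])
        self.W_ = np.linalg.solve(A, Z.T @ U)  # (rank + 1, m) regressor weights
        return self

    def predict_utility(self, X: np.ndarray) -> np.ndarray:
        return self._latent(X) @ self.W_

    def route(self, X: np.ndarray, costs: np.ndarray, lam: float) -> np.ndarray:
        # Pick, per query, the model maximizing predicted utility minus lam * cost.
        return np.argmax(self.predict_utility(X) - lam * costs, axis=1)


# Synthetic data, for shape-checking only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
U = np.column_stack([np.zeros(50), np.ones(50)])  # model 1 always succeeds
router = LowRankRouter(rank=3).fit(X, U)
costs = np.array([0.1, 1.0])
print(router.route(X[:5], costs, lam=0.0))
```

Sweeping `lam` from zero upward traces a cost-accuracy curve from "always pick the best predicted model" to "always pick the cheapest", which is the kind of curve budget-aware metrics are computed on.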

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
nAUC (full dataset, LinearMFRouter)	0.7042	—	—	macro-average across 8 datasets (Sec.5.3, Table 2)	Table 2 reports LinearMFRouter nAUC=0.7042	Table 2
Peak score P_s (full dataset, LinearMFRouter)	0.7533	—	—	macro-average across 8 datasets (Sec.5.3, Table 2)	Table 2 reports LinearMFRouter P_s=0.7533	Table 2

What To Try In 7 Days

Assemble an offline outcome table for your models on a small in-domain sample (store outputs, normalized costs, and utilities).

Implement a simple MF-style router (low-rank projection + regressors) on fused text/image embeddings and evaluate nAUC and P_s.

Add adaptive fusion (per-modality confidence + product/difference features) and compare to equal-weight fusion on a held-out set.
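
The fusion step in the last item can be sketched as follows. This is one possible reading of "per-modality confidence + product/difference features": confidence scores are turned into modality weights with a softmax, and the weighted sum is concatenated with element-wise product and difference features. How the confidences are obtained is not described in this summary, so they are plain inputs here; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def modality_weights(conf_text: float, conf_image: float) -> Tuple[float, float]:
    """Softmax over two confidence scores; equal scores give 0.5 / 0.5."""
    top = max(conf_text, conf_image)
    e_text = math.exp(conf_text - top)
    e_image = math.exp(conf_image - top)
    total = e_text + e_image
    return e_text / total, e_image / total


def fuse(
    text: Sequence[float],
    image: Sequence[float],
    conf_text: float = 0.0,
    conf_image: float = 0.0,
) -> List[float]:
    """Return [weighted sum | element-wise product | element-wise difference]."""
    if len(text) != len(image):
        raise ValueError("text and image embeddings must have the same length")
    w_text, w_image = modality_weights(conf_text, conf_image)
    weighted = [w_text * t + w_image * v for t, v in zip(text, image)]
    product = [t * v for t, v in zip(text, image)]
    difference = [t - v for t, v in zip(text, image)]
    return weighted + product + difference


# Equal confidences reduce the first block to equal-weight fusion.
print(fuse([1.0, 2.0], [3.0, 4.0]))
```

Because equal confidences recover the equal-weight baseline, the same function can produce both arms of the held-out comparison by switching the confidence inputs on and off.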

Agent Features

Frameworks

adaptive fusion, frozen embeddings (CLIP), low-rank projection

Architectures

k-means, k-NN, linear regression, MLP, matrix factorization, cross-modal attention

Optimization Features

Token Efficiency

cost normalization across token outputs

Inference Optimization

model routing, model cascades, cost-aware selection

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/Hunter-Wrynn/MMR-Bench

Risks & Boundaries

Limitations

Offline-only protocol: no adaptive multi-call routing or online decision feedback allowed.

Fixed cost normalization approximates real cloud pricing and may mismatch live costs.
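
If the benchmark's fixed normalization does not match your pricing, costs can be recomputed from your own price table before evaluating a router. The sketch below is an assumption about how such a renormalization might look, not the paper's scheme: it prices each query from token counts and divides by the most expensive model's cost. The model names and prices are hypothetical placeholders.

```python
from typing import Dict, Tuple

# Hypothetical prices in USD per 1M tokens: (input, output). Replace with your own.
PRICES: Dict[str, Tuple[float, float]] = {
    "small": (0.10, 0.40),
    "large": (2.00, 8.00),
}


def query_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    prices: Dict[str, Tuple[float, float]] = PRICES,
) -> float:
    """Raw USD cost of one query from token counts and per-token prices."""
    price_in, price_out = prices[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def normalize_costs(raw: Dict[str, float]) -> Dict[str, float]:
    """Divide by the most expensive model's cost so values fall in (0, 1]."""
    top = max(raw.values())
    if top <= 0:
        raise ValueError("at least one model must have a positive cost")
    return {model: cost / top for model, cost in raw.items()}


raw = {m: query_cost(m, input_tokens=1000, output_tokens=500) for m in PRICES}
print(normalize_costs(raw))
```

Re-running the offline evaluation with renormalized costs gives a quick sense of how sensitive a router's ranking is to the pricing assumptions.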

When Not To Use

When you require adaptive, multi-call routing that queries multiple models per instance.

If your production cost model or model pool differs substantially from the paper's normalization.

Failure Modes

Misrouting when modality signals are misestimated (e.g., low-quality images leading to overconfident cheap-model selection).

Performance drop if the candidate zoo lacks a truly high-capability model for certain reasoning tasks.

Core Entities

Models

GPT-5 series, Gemini 2.5 series, Claude 3.7, InternVL3-78B, Qwen2.5-VL-3B, Qwen2.5-VL-7B, Qwen2.5-VL-72B, Gemma3-4B, GPT-5-Nano, Gemini 2.5 Flash

Metrics

nAUC, Peak score (P_s), Quality-Neutral Cost (QNC), Accuracy

Datasets

OCRBench, SEED-Bench-2-Plus, MMStar, RealWorldQA, MathVista, MathVerse, MathVision

Benchmarks

MMR-Bench, RouterBench, RouterEval
