MMR-Bench: measure and optimize per-query model selection for multimodal LLMs under cost budgets

January 25, 2026 · 8 min

Overview

Decision Snapshot: Needs Validation

The benchmark provides offline outcome tables and standard metrics for repeatable comparison. Experimental results across 10 real models and 8 datasets support its claims, but applicability depends on how closely your workload matches the evaluated scenarios.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 85%

Production readiness: 70%

Novelty: 60%

Authors

Haoxuan Ma, Guannan Lai, Han-Jia Ye

Links

Abstract / PDF / Code

Why It Matters For Business

Routing multimodal queries to different models by complexity cuts average inference cost substantially while preserving top-model accuracy, letting teams save compute or serve more queries under fixed budgets.

Summary TLDR

MMR-Bench is an offline, cost-aware benchmark and dataset of 11k instance–model outcomes designed to evaluate per-query model selection (routing) across multimodal large language models (MLLMs). The benchmark supplies precomputed outputs, normalized costs, and utilities for 10 MLLMs across 8 datasets in OCR, VQA, and visual math. Key findings: multimodal features improve routing over unimodal ones; matrix-factorization (MF) style routers give the most stable cost-accuracy trade-offs; and a routed system can match or beat the best single model at about 33% of its cost on evaluated workloads.

Problem Statement

No single multimodal LLM (MLLM) is best for every multimodal input. Using one model for all queries either wastes compute on easy inputs or loses accuracy on hard ones. Existing routing benchmarks focus on text-only LLMs and do not capture modality fusion, diverse model costs, or budget-aware evaluation for MLLMs. We need a standardized, offline benchmark and protocols to measure how well routing policies trade off cost and accuracy in multimodal workloads.

Main Contribution

MMR-Bench: an offline outcome table with precomputed model outputs, normalized costs and utilities for 11,000 instance–model pairs across 10 MLLMs and 8 datasets (OCR, general VQA, visual math).

A standardized, budget-aware evaluation protocol and three summary metrics: normalized AUC (nAUC), peak score (P_s), and Quality-Neutral Cost (QNC).
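The three summary metrics are named above but not defined in this summary. The sketch below shows one plausible way to compute them from a router's cost-score curve. The definitions are assumptions, not the paper's: nAUC is taken as the trapezoidal area under the curve divided by the cost span, P_s as the highest score on the curve, and QNC as the lowest cost at which the router reaches the best single model's score, divided by that model's cost. The curve values are made up for illustration.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (normalized cost, score)


def n_auc(curve: List[Point]) -> float:
    """Assumed nAUC: trapezoidal area under the curve divided by the cost span."""
    pts = sorted(curve)
    span = pts[-1][0] - pts[0][0]
    if span == 0:
        return pts[0][1]
    area = sum((c1 - c0) * (s0 + s1) / 2
               for (c0, s0), (c1, s1) in zip(pts, pts[1:]))
    return area / span


def peak_score(curve: List[Point]) -> float:
    """Assumed P_s: the highest score reached anywhere on the curve."""
    return max(score for _, score in curve)


def quality_neutral_cost(curve: List[Point], best_single: Point) -> Optional[float]:
    """Assumed QNC: cheapest router cost that matches the best single model's
    score, as a fraction of that model's cost. None if never matched."""
    best_cost, best_score = best_single
    matching = [cost for cost, score in curve if score >= best_score]
    return min(matching) / best_cost if matching else None


# Hypothetical router curve and best single model (illustrative numbers only).
curve = [(0.0, 0.5), (0.5, 0.7), (1.0, 0.8)]
best_single = (1.0, 0.7)

print(n_auc(curve), peak_score(curve), quality_neutral_cost(curve, best_single))
```

Under these assumed definitions, a QNC below 1.0 means the router reaches the best single model's quality for less than that model's cost.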

Key Findings

Multimodal routing matches or exceeds the accuracy of the strongest single model at roughly one-third of its cost on evaluated benchmarks.

Numbers: Routed system ≈ same accuracy as best single model at ~33% cost (Sec. 5.3, Fig. 4)

Practical Use: Use multimodal routing to cut average inference spend by ~2/3 on comparable workloads while keeping top-model accuracy; implement offline outcome tables to test feasibility before live switching.

Evidence Ref: Figure 4; Sec. 5.3
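The offline outcome table mentioned in the practical step above can be sketched as a lookup keyed by (instance, model), which lets any routing policy be replayed without calling a model. This is a minimal illustration, not the benchmark's own format: the field names, the model names `small` and `large`, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Outcome:
    output: str     # stored model response
    cost: float     # normalized cost of this call
    utility: float  # score of the response (here 1.0 = correct, 0.0 = wrong)


# (instance_id, model_name) -> Outcome
OutcomeTable = Dict[Tuple[str, str], Outcome]


def replay(table: OutcomeTable, instances: List[str],
           policy: Callable[[str], str]) -> Tuple[float, float]:
    """Return (mean utility, mean cost) of a policy using only stored outcomes."""
    picks = [table[(i, policy(i))] for i in instances]
    n = len(picks)
    return sum(p.utility for p in picks) / n, sum(p.cost for p in picks) / n


table: OutcomeTable = {
    ("q1", "small"): Outcome("7", 0.1, 1.0),
    ("q1", "large"): Outcome("7", 1.0, 1.0),
    ("q2", "small"): Outcome("cat", 0.1, 0.0),
    ("q2", "large"): Outcome("dog", 1.0, 1.0),
}
instances = ["q1", "q2"]


def oracle(instance: str) -> str:
    """Upper-bound policy: highest utility first, then lowest cost."""
    models = [m for (inst, m) in table if inst == instance]
    return min(models, key=lambda m: (-table[(instance, m)].utility,
                                      table[(instance, m)].cost))


always_large = replay(table, instances, lambda i: "large")
routed = replay(table, instances, oracle)
print(always_large, routed)
```

The oracle policy looks at the stored utilities, so it is only an upper bound on what a learned router could achieve; a real router would have to predict the choice from the query alone.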

Matrix-factorization style routers give the most stable, high aggregate performance across tasks.

Numbers: LinearMFRouter nAUC=0.7042 and P_s=0.7533 on full dataset average (Table 2)

Practical Use: Start with an MF-style router (low-rank projection + regressors) for mixed multimodal workloads to get steady cost-aware gains rather than brittle instance-only methods.

Evidence Ref: Table 2; Sec. 5.3
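To make "low-rank projection + regressors" concrete, here is a minimal sketch of one way such a router could look. It is not the paper's LinearMFRouter: it assumes a PCA projection (via SVD) for the low-rank step, one ridge regressor per model to predict utility, and selection by predicted utility minus a cost penalty `lam`. The class name, parameters, and synthetic data are all hypothetical, and the example requires NumPy.

```python
import numpy as np


class LowRankRouter:
    """Hypothetical MF-style router: PCA projection + one ridge regressor per model."""

    def __init__(self, rank: int, ridge: float = 1e-3):
        self.rank = rank
        self.ridge = ridge

    def fit(self, X: np.ndarray, U: np.ndarray, costs: np.ndarray) -> "LowRankRouter":
        # X: (n, d) query embeddings, U: (n, m) observed utilities, costs: (m,)
        self.mean = X.mean(axis=0)
        _, _, vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.P = vt[: self.rank].T                       # (d, rank) projection
        Z = self._features(X)                            # (n, rank + 1)
        A = Z.T @ Z + self.ridge * np.eye(Z.shape[1])
        self.W = np.linalg.solve(A, Z.T @ U)             # one column per model
        self.costs = np.asarray(costs, dtype=float)
        return self

    def _features(self, X: np.ndarray) -> np.ndarray:
        Z = (X - self.mean) @ self.P
        return np.hstack([Z, np.ones((len(Z), 1))])      # append a bias column

    def route(self, X: np.ndarray, lam: float = 0.0) -> np.ndarray:
        # Pick the model with the best predicted utility minus lam * cost.
        scores = self._features(X) @ self.W - lam * self.costs
        return scores.argmax(axis=1)


# Synthetic data: model 0 is cheap and only right on "easy" queries (x0 < 0);
# model 1 is expensive and always right.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X[:, 0] *= 3.0                                           # dominant difficulty direction
U = np.column_stack([(X[:, 0] < 0).astype(float), np.ones(200)])
costs = np.array([0.1, 1.0])

router = LowRankRouter(rank=2).fit(X, U, costs)
quality_first = router.route(X, lam=0.0)                 # ignore cost
cheap_only = router.route(X, lam=10.0)                   # heavy cost penalty
```

Sweeping `lam` between these extremes traces out a cost-accuracy curve, which is the kind of curve the budget-aware metrics summarize.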

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
nAUC (full dataset, LinearMFRouter) | 0.7042 | - | - | macro-average across 8 datasets (Sec. 5.3, Table 2) | Table 2 reports LinearMFRouter nAUC=0.7042 | Table 2
Peak score P_s (full dataset, LinearMFRouter) | 0.7533 | - | - | macro-average across 8 datasets (Sec. 5.3, Table 2) | Table 2 reports LinearMFRouter P_s=0.7533 | Table 2

What To Try In 7 Days

Assemble an offline outcome table for your models on a small in-domain sample (store outputs, normalized costs, and utilities).

Implement a simple MF-style router (low-rank projection + regressors) on fused text/image embeddings and evaluate nAUC and P_s.

Add adaptive fusion (per-modality confidence + product/difference features) and compare to equal-weight fusion on a held-out set.
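The last step above compares adaptive fusion with equal-weight fusion. The sketch below shows one way to build both feature vectors. It rests on assumptions not stated in the source: each modality's confidence is a scalar supplied by the caller, the two confidences are turned into weights with a softmax, and the fused vector concatenates the weighted sum, the element-wise product, and the absolute difference. Function names and inputs are hypothetical.

```python
import math
from typing import List, Sequence


def equal_weight_fusion(text: Sequence[float], image: Sequence[float]) -> List[float]:
    """Baseline: plain average of the two embeddings."""
    return [(t + v) / 2 for t, v in zip(text, image)]


def adaptive_fusion(text: Sequence[float], image: Sequence[float],
                    text_conf: float, image_conf: float) -> List[float]:
    """Confidence-weighted sum plus product and absolute-difference features."""
    # Softmax over the two confidences (shifted by the max for stability).
    m = max(text_conf, image_conf)
    et, ev = math.exp(text_conf - m), math.exp(image_conf - m)
    wt, wv = et / (et + ev), ev / (et + ev)
    weighted = [wt * t + wv * v for t, v in zip(text, image)]
    product = [t * v for t, v in zip(text, image)]
    difference = [abs(t - v) for t, v in zip(text, image)]
    return weighted + product + difference


text_emb = [1.0, 0.0]
image_emb = [0.0, 2.0]

baseline = equal_weight_fusion(text_emb, image_emb)
balanced = adaptive_fusion(text_emb, image_emb, text_conf=0.0, image_conf=0.0)
text_heavy = adaptive_fusion(text_emb, image_emb, text_conf=5.0, image_conf=0.0)
print(baseline, balanced, text_heavy)
```

With equal confidences the weighted part reduces to the equal-weight baseline, so any difference on the held-out set comes from the confidence weights and the extra interaction features.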

Agent Features

Frameworks
adaptive fusion, frozen embeddings (CLIP), low-rank projection
Architectures
k-means, k-NN, linear regression, MLP, matrix factorization, cross-modal attention

Optimization Features

Token Efficiency
cost normalization across token outputs
Inference Optimization
model routing, model cascades, cost-aware selection

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Offline-only protocol: adaptive multi-call routing and online decision feedback are not supported.

Fixed cost normalization only approximates real cloud pricing and may not match live costs.

When Not To Use

When you require adaptive, multi-call routing that queries multiple models per instance.

If your production cost model or model pool differs substantially from the paper's normalization.

Failure Modes

Misrouting when modality signals are misestimated (e.g., low-quality images leading to overconfident cheap-model selection).

Performance drop if the candidate zoo lacks a truly high-capability model for certain reasoning tasks.

Core Entities

Models

GPT-5 series, Gemini 2.5 series, Claude 3.7, InternVL3-78B, Qwen2.5-VL-3B, Qwen2.5-VL-7B, Qwen2.5-VL-72B, Gemma3-4B, GPT-5-Nano, Gemini 2.5 Flash

Metrics

nAUC, Peak score (P_s), Quality-Neutral Cost (QNC), Accuracy

Datasets

OCRBench, SEED-Bench-2-Plus, MMStar, RealWorldQA, MathVista, MathVerse, MathVision

Benchmarks

MMR-Bench, RouterBench, RouterEval