MMR-Bench: measure and optimize per-query model selection for multimodal LLMs under cost budgets

January 25, 2026 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.85

Citation Count

0

Authors

Haoxuan Ma, Guannan Lai, Han-Jia Ye

Why It Matters For Business

Routing each multimodal query to a model matched to its complexity can cut average inference cost to roughly one-third of the strongest single model's cost while preserving that model's accuracy on the evaluated benchmarks. Teams can use this to save compute or to serve more queries under a fixed budget.

Summary TLDR

MMR-Bench is an offline, cost-aware benchmark and dataset of 11k instance–model outcomes designed to evaluate per-query model selection (routing) across multimodal large language models (MLLMs). The benchmark supplies precomputed outputs, normalized costs, and utilities for 10 MLLMs across 8 datasets in OCR, VQA, and visual math. Key findings: multimodal features improve routing over unimodal ones; matrix-factorization (MF) style routers give the most stable cost-accuracy trade-offs; and a routed system can match or beat the best single model at about 33% of its cost on evaluated workloads.

Problem Statement

No single multimodal LLM (MLLM) is best for every multimodal input. Using one model for all queries either wastes compute on easy inputs or loses accuracy on hard ones. Existing routing benchmarks focus on text-only LLMs and do not capture modality fusion, diverse model costs, or budget-aware evaluation for MLLMs. We need a standardized, offline benchmark and protocols to measure how well routing policies trade off cost and accuracy in multimodal workloads.

Main Contribution

MMR-Bench: an offline outcome table with precomputed model outputs, normalized costs and utilities for 11,000 instance–model pairs across 10 MLLMs and 8 datasets (OCR, general VQA, visual math).

A standardized, budget-aware evaluation protocol and three summary metrics: normalized AUC (nAUC), peak score (P_s), and Quality-Neutral Cost (QNC).

A comparison of routing families (non-parametric, regression, MF-style) and fusion strategies, showing multimodal fusion and MF routers yield the best, most robust cost-accuracy frontiers.

Analyses showing that adaptive multimodal fusion closes modality gaps, that routers generalize across datasets within the same scenario, and that multimodal-trained routers transfer zero-shot to text-only benchmarks.
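The three summary metrics can be illustrated with a small sketch. The definitions below are assumptions made for illustration, since this summary does not reproduce the paper's formulas: nAUC is taken as the trapezoidal area under the score-versus-cost curve divided by the cost span, P_s as the highest score on the curve, and QNC as the cheapest routed cost that matches the best single model's score, divided by that model's cost (infinite if the score is never matched). The function name `summarize_frontier` and all numbers are hypothetical.

```python
import math


def summarize_frontier(points, best_single_score, best_single_cost):
    """Summarize a cost-accuracy frontier given as (cost, score) pairs.

    Metric definitions here are illustrative assumptions, not the
    paper's exact formulas.
    """
    pts = sorted(points)
    costs = [c for c, _ in pts]
    scores = [s for _, s in pts]

    # Trapezoidal area under score-vs-cost, normalized by the cost span.
    area = sum(
        (costs[i + 1] - costs[i]) * (scores[i + 1] + scores[i]) / 2
        for i in range(len(pts) - 1)
    )
    span = costs[-1] - costs[0]
    nauc = area / span if span > 0 else scores[0]

    # Peak score: the best score reached anywhere on the frontier.
    peak = max(scores)

    # Cheapest operating point that is at least as good as the best
    # single model, relative to that model's cost.
    matching = [c for c, s in pts if s >= best_single_score]
    qnc = min(matching) / best_single_cost if matching else math.inf

    return {"nAUC": nauc, "P_s": peak, "QNC": qnc}


# Made-up frontier for illustration only (not results from the paper).
frontier = [(0.1, 0.50), (0.3, 0.70), (0.6, 0.74), (1.0, 0.75)]
summary = summarize_frontier(frontier, best_single_score=0.70, best_single_cost=0.9)
print(summary)
```

Under these assumed definitions, a QNC below 1 means the router reaches the best single model's quality at lower cost, and an infinite QNC means it never reaches that quality at any budget.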

Key Findings

Multimodal routing matches or exceeds the accuracy of the strongest single model at roughly one-third of its cost on evaluated benchmarks.

Numbers: Routed system ≈ same accuracy as best single model at ~33% cost (Sec. 5.3, Fig. 4)

Matrix-factorization style routers give the most stable, high aggregate performance across tasks.

Numbers: LinearMFRouter nAUC = 0.7042 and P_s = 0.7533 on full dataset average (Table 2)

Adaptive multimodal fusion substantially improves routing compared to equal-weight fusion on many routers.

Numbers: KMeans ΔnAUC = +0.3403, with QNC moving from +∞ to 1.0585; smaller gains for Linear/MLP (Table 3)

A multimodal-trained router transfers zero-shot to text-only benchmarks and outperforms the best single model there.

Numbers: P_s, router vs. best single model: GSM8K 96.7 vs 94.5; MMLU 92.4 vs 91.2; ARC 66.7 vs 65.7 (Table 5)

Results

nAUC (full dataset, LinearMFRouter)

Value: 0.7042

Peak score P_s (full dataset, LinearMFRouter)

Value: 0.7533

Relative cost at matched accuracy

Value: ≈0.33 (routed system matches the baseline's accuracy at ~33% of its cost)

Baseline: best single model

Adaptive fusion gain (KMeans router)

Value: ΔnAUC = +0.3403

Baseline: equal-weight fusion

Who Should Care

What To Try In 7 Days

Assemble an offline outcome table for your models on a small in-domain sample (store outputs, normalized costs, and utilities).

Implement a simple MF-style router (low-rank projection + regressors) on fused text/image embeddings and evaluate nAUC and P_s.

Add adaptive fusion (per-modality confidence + product/difference features) and compare to equal-weight fusion on a held-out set.
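The three steps above can be sketched end to end with NumPy. This is a minimal sketch under stated assumptions, not the paper's LinearMFRouter: random arrays stand in for frozen text and image embeddings and for the outcome table, the confidence weights are a crude softmax over embedding norms, and the low-rank projection is a plain truncated SVD followed by least-squares regressors. The names `fuse` and `LowRankRouter` are hypothetical.

```python
import numpy as np


def fuse(text_emb, image_emb):
    """Fuse two modalities with confidence weights and interaction features.

    The confidence proxy (softmax over embedding norms) is an assumption
    made for illustration only.
    """
    norms = np.stack(
        [np.linalg.norm(text_emb, axis=1), np.linalg.norm(image_emb, axis=1)],
        axis=1,
    )
    weights = np.exp(norms - norms.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return np.hstack([
        weights[:, :1] * text_emb,      # confidence-weighted text
        weights[:, 1:] * image_emb,     # confidence-weighted image
        text_emb * image_emb,           # product features
        np.abs(text_emb - image_emb),   # difference features
    ])


class LowRankRouter:
    """Project features to a low rank, then regress per-model utility."""

    def __init__(self, rank=8):
        self.rank = rank

    def _latent(self, features):
        z = (features - self.mean_) @ self.proj_
        return np.hstack([z, np.ones((len(z), 1))])  # append bias column

    def fit(self, features, utilities):
        self.mean_ = features.mean(axis=0)
        # Top right-singular vectors give the low-rank projection.
        _, _, vt = np.linalg.svd(features - self.mean_, full_matrices=False)
        self.proj_ = vt[: self.rank].T
        # One linear regressor per candidate model, solved jointly.
        self.coef_, *_ = np.linalg.lstsq(
            self._latent(features), utilities, rcond=None
        )
        return self

    def predict(self, features):
        return self._latent(features) @ self.coef_

    def route(self, features, costs, cost_weight=0.0):
        # Pick the model with the best predicted utility minus weighted cost.
        return np.argmax(self.predict(features) - cost_weight * costs, axis=1)


# Synthetic stand-ins: 200 instances, 16-dim embeddings, 3 candidate models.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(200, 16))
image_emb = rng.normal(size=(200, 16))
utilities = rng.random(size=(200, 3))   # placeholder outcome table
costs = np.array([0.1, 0.4, 1.0])       # placeholder normalized costs

X = fuse(text_emb, image_emb)
router = LowRankRouter(rank=8).fit(X[:150], utilities[:150])
choices = router.route(X[150:], costs, cost_weight=0.5)
```

To compare against equal-weight fusion, one option is to replace the confidence weights in `fuse` with a constant 0.5 and refit on the same split. Sweeping `cost_weight` on the held-out rows yields the cost-accuracy points needed for nAUC and P_s.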

Agent Features

Frameworks

  • adaptive fusion
  • frozen embeddings (CLIP)
  • low-rank projection

Architectures

  • k-means
  • k-NN
  • linear regression
  • MLP
  • matrix factorization
  • cross-modal attention

Optimization Features

Token Efficiency

  • cost normalization across token outputs

Inference Optimization

  • model routing
  • model cascades
  • cost-aware selection
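
Cost-aware selection can be sketched as picking, for each query, the model that maximizes predicted utility minus a weighted cost, then sweeping the weight to trace a cost-accuracy frontier offline. This is an illustrative sketch, not the paper's protocol: the names `select_model` and `trace_frontier`, the trade-off parameter `cost_weight`, and all numbers are assumptions.

```python
def select_model(predicted_utility, cost, cost_weight):
    """Return the index of the model with the best utility-minus-cost score."""
    scores = [u - cost_weight * c for u, c in zip(predicted_utility, cost)]
    return max(range(len(scores)), key=scores.__getitem__)


def trace_frontier(predicted, actual, cost, cost_weights):
    """Sweep the cost weight and record (weight, avg cost, avg utility).

    `predicted` holds router estimates and `actual` holds the offline
    outcome table; both are lists of per-model values, one row per query.
    """
    points = []
    for w in cost_weights:
        picks = [select_model(p, cost, w) for p in predicted]
        avg_cost = sum(cost[k] for k in picks) / len(picks)
        avg_utility = sum(row[k] for row, k in zip(actual, picks)) / len(picks)
        points.append((w, avg_cost, avg_utility))
    return points


# Toy example: a cheap model (index 0) and an expensive one (index 1).
cost = [0.1, 1.0]
predicted = [[0.90, 0.95], [0.20, 0.90], [0.60, 0.70]]
actual = [[1, 1], [0, 1], [1, 1]]

frontier_points = trace_frontier(predicted, actual, cost, [0.0, 0.5, 2.0])
for w, c, u in frontier_points:
    print(f"cost_weight={w:.1f}  avg_cost={c:.2f}  avg_utility={u:.2f}")
```

In this toy table the middle setting sends only the hard query to the expensive model, so average utility stays at the always-expensive level while average cost drops. The largest weight routes everything to the cheap model and loses accuracy on the hard query.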

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Offline-only protocol: no adaptive multi-call routing or online decision feedback allowed.
  • Fixed cost normalization approximates real cloud pricing and may mismatch live costs.
  • Model pool limited to 10 candidate MLLMs; results may shift with different models.
  • Benchmarks focus on OCR, VQA, and visual math; other multimodal tasks (video, audio) are not covered.

When Not To Use

  • When you require adaptive, multi-call routing that queries multiple models per instance.
  • If your production cost model or model pool differs substantially from the paper's normalization.
  • For modalities that are not represented (audio, long video); MMR-Bench targets text+image inputs.

Failure Modes

  • Misrouting when modality signals are misestimated (e.g., low-quality images leading to overconfident cheap-model selection).
  • Performance drop if the candidate zoo lacks a truly high-capability model for certain reasoning tasks.
  • Cost model drift: normalized offline costs diverge from real deployment pricing and break QNC estimates.
  • Clustering or instance-based routers can be brittle under domain shift or small training sets.

Core Entities

Models

  • GPT-5 series
  • Gemini 2.5 series
  • Claude 3.7
  • InternVL3-78B
  • Qwen2.5-VL-3B
  • Qwen2.5-VL-7B
  • Qwen2.5-VL-72B
  • Gemma3-4B
  • GPT-5-Nano
  • Gemini 2.5 Flash

Metrics

  • nAUC
  • Peak score (P_s)
  • Quality-Neutral Cost (QNC)
  • Accuracy

Datasets

  • OCRBench
  • SEED-Bench-2-Plus
  • MMStar
  • RealWorldQA
  • MathVista
  • MathVerse
  • MathVision

Benchmarks

  • MMR-Bench
  • RouterBench
  • RouterEval