Overview
The paper supplies a usable dataset, a simple scalar metric (AIQ), and concrete baselines. It is practical for prototyping routers, but it does not measure latency or throughput and covers a limited set of models, so it falls short of a production-readiness assessment.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 2/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
Routing can cut serving bills and keep or improve quality by choosing cheaper models per input; RouterBench gives a practical way to measure those trade-offs offline.
Summary TLDR
RouterBench is a purpose-built benchmark and dataset for evaluating multi-LLM routing systems. It contains 405,467 recorded outputs from a mix of open and proprietary models across eight tasks (reasoning, math, coding, conversation, RAG, etc.). The paper adds a simple mathematical framing (a cost-quality plane with a non-decreasing convex hull) and a scalar AIQ metric to compare routers. Baselines (KNN/MLP predictive routers and cascading routers) show that routing can save money while matching top-model accuracy, but the gains depend on judge quality and on the task.
Problem Statement
There is no standard way to compare systems that pick which LLM to call per request. Builders lack a common dataset, a cost-aware evaluation metric, and broad baselines to judge routing decisions.
Main Contribution
A large, public ROUTERBENCH dataset: 405,467 model responses collected across 11+ LLMs and 8 task families to enable router training and offline evaluation.
A clear cost–quality math framework and a single-score metric (AIQ) based on a non-decreasing convex hull to compare routers over a cost range.
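The hull-and-area idea behind AIQ can be sketched in a few lines. This is a minimal illustration, not the paper's reference implementation: it assumes AIQ is the area under the non-decreasing convex hull of a router's (cost, quality) points, divided by the width of the cost range, and that points between hull vertices are reachable by mixing two routers. The function names `nondecreasing_hull` and `aiq` are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (cost, quality)


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); negative means a clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def nondecreasing_hull(points: List[Point]) -> List[Point]:
    """Upper concave envelope of the points that never decreases with cost."""
    # Keep only the best quality seen at each distinct cost.
    best = {}
    for cost, quality in points:
        best[cost] = max(quality, best.get(cost, float("-inf")))
    # Drop points that cost more than a cheaper point without improving quality.
    frontier: List[Point] = []
    for cost, quality in sorted(best.items()):
        if not frontier or quality > frontier[-1][1]:
            frontier.append((cost, quality))
    # Monotone-chain upper hull: pop vertices lying on or below the chord.
    hull: List[Point] = []
    for p in frontier:
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def aiq(hull: List[Point], c_lo: Optional[float] = None,
        c_hi: Optional[float] = None) -> float:
    """Average hull height over [c_lo, c_hi] (assumed reading of AIQ)."""
    c_lo = hull[0][0] if c_lo is None else c_lo
    c_hi = hull[-1][0] if c_hi is None else c_hi
    if not (hull[0][0] <= c_lo < c_hi):
        raise ValueError("need hull[0].cost <= c_lo < c_hi")

    def q_at(c: float) -> float:
        if c >= hull[-1][0]:          # flat extension to the right of the hull
            return hull[-1][1]
        for (c0, q0), (c1, q1) in zip(hull, hull[1:]):
            if c0 <= c <= c1:         # linear interpolation inside a segment
                return q0 + (q1 - q0) * (c - c0) / (c1 - c0)
        raise ValueError("cost outside hull range")

    knots = sorted({c_lo, c_hi, *[c for c, _ in hull if c_lo < c < c_hi]})
    area = sum((b - a) * (q_at(a) + q_at(b)) / 2 for a, b in zip(knots, knots[1:]))
    return area / (c_hi - c_lo)


# Toy numbers, not taken from the paper.
toy_points = [(1.0, 0.5), (2.0, 0.7), (3.0, 0.6), (4.0, 0.9)]
toy_hull = nondecreasing_hull(toy_points)
toy_aiq = aiq(toy_hull)
```

In the toy data the point at cost 3.0 is dropped because a cheaper configuration already achieves higher quality. A router that yields a single hull vertex has no cost range, so `aiq` raises unless an explicit `c_hi` is supplied.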
Key Findings
ROUTERBENCH contains 405,467 labeled LLM outputs across multiple tasks and models.
An Oracle router (perfect per-example selection) attains higher quality at far lower cost than always using an expensive top model: on MMLU it reaches 0.957 at $0.297, versus 0.828 at $4.086 for GPT-4.
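To make "perfect per-example selection" concrete, here is a small sketch of an oracle over recorded outcomes. It assumes the oracle picks the highest-quality response for each example and breaks ties by choosing the cheapest model; the tie-break rule, the model names, and all numbers below are illustrative assumptions, not values from the dataset.

```python
from typing import Dict, List, Tuple

# For one example: model name -> (quality, cost in dollars).
Outcomes = Dict[str, Tuple[float, float]]


def oracle_route(outcomes: Outcomes) -> str:
    """Pick the best-quality model; among ties, the cheapest one."""
    return min(outcomes, key=lambda m: (-outcomes[m][0], outcomes[m][1]))


def evaluate(examples: List[Outcomes], picks: List[str]) -> Tuple[float, float]:
    """Return (mean quality, total cost) for a list of routing decisions."""
    quality = sum(ex[m][0] for ex, m in zip(examples, picks)) / len(examples)
    cost = sum(ex[m][1] for ex, m in zip(examples, picks))
    return quality, cost


# Toy recorded outcomes for two hypothetical models.
examples = [
    {"small": (1.0, 0.01), "large": (1.0, 0.10)},  # both right: take the cheap one
    {"small": (0.0, 0.01), "large": (1.0, 0.10)},  # only the large model is right
    {"small": (1.0, 0.01), "large": (0.0, 0.10)},  # only the small model is right
]

oracle_picks = [oracle_route(ex) for ex in examples]
oracle_quality, oracle_cost = evaluate(examples, oracle_picks)
large_quality, large_cost = evaluate(examples, ["large"] * len(examples))
```

The third toy example shows why an oracle can beat the most expensive model on quality and not only on cost: the cheaper model is sometimes the only one that answers correctly.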
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Dataset size | 405,467 samples | — | — | ROUTERBENCH (all) | Full dataset collected across tasks and models | Section 4.3 |
| Oracle vs GPT-4 on MMLU | Oracle perf 0.957 @ $0.297; GPT-4 perf 0.828 @ $4.086 | GPT-4 | perf +0.129; cost −$3.789 | MMLU (table values) | Table 1 per-model results | A.5 Table 1 |
What To Try In 7 Days
Run ROUTERBENCH on your model fleet to map cost vs. quality per task.
Train a simple KNN router on labeled per-model outcomes and test on a held-out split.
Prototype a cascading flow with a lightweight judge and measure judge error impact.
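The KNN step in the list above could start from something like the sketch below. It is a standard-library illustration under stated assumptions: prompts are already embedded as numeric vectors (the embedding model is out of scope), each training example stores per-model quality labels, and the router scores each model as predicted quality minus `cost_weight` times cost. That scoring rule and the names `knn_route` and `cost_weight` are assumptions; the paper's exact formulation may differ.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# (prompt embedding, {model name: observed quality}) for one training example.
TrainExample = Tuple[Sequence[float], Dict[str, float]]


def knn_route(query: Sequence[float], train: List[TrainExample],
              costs: Dict[str, float], k: int = 3,
              cost_weight: float = 1.0) -> str:
    """Route to the model with the best predicted quality-minus-cost score."""
    # The k training prompts closest to the query in Euclidean distance.
    neighbors = sorted(train, key=lambda ex: math.dist(query, ex[0]))[:k]
    totals: Dict[str, float] = defaultdict(float)
    for _, quality_by_model in neighbors:
        for model, quality in quality_by_model.items():
            totals[model] += quality

    def score(model: str) -> float:
        predicted_quality = totals[model] / len(neighbors)
        return predicted_quality - cost_weight * costs[model]

    return max(costs, key=score)


# Toy data: "easy" prompts cluster near (0, 0), "hard" prompts near (1, 1).
train = [
    ((0.0, 0.0), {"small": 1.0, "large": 1.0}),
    ((0.1, 0.0), {"small": 1.0, "large": 1.0}),
    ((0.0, 0.1), {"small": 1.0, "large": 1.0}),
    ((1.0, 1.0), {"small": 0.0, "large": 1.0}),
    ((0.9, 1.0), {"small": 0.0, "large": 1.0}),
    ((1.0, 0.9), {"small": 0.0, "large": 1.0}),
]
costs = {"small": 0.01, "large": 0.10}

easy_pick = knn_route((0.05, 0.05), train, costs)
hard_pick = knn_route((0.95, 0.95), train, costs)
frugal_pick = knn_route((0.95, 0.95), train, costs, cost_weight=20.0)
```

Sweeping `cost_weight` and recording the resulting (cost, quality) pair on a held-out split gives the set of points needed to draw a router's curve in the cost-quality plane.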
Risks & Boundaries
Limitations
Focuses only on performance and dollar cost; omits latency and throughput.
Not all LLMs and task types are covered; the benchmark will need future updates to stay current.
When Not To Use
When latency or throughput constraints are primary optimization targets.
When you need routers evaluated on a much larger or different model pool than provided.
Failure Modes
Cascading routers break down when the quality judge's error rate exceeds roughly 0.2.
Predictive routers may not generalize across tasks and can underperform on some benchmarks.
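The judge-error failure mode can be explored offline with a toy simulation such as the one below. It assumes a cascade that tries models from cheapest to most expensive and stops as soon as a judge accepts an answer, with the judge modeled as flipping the true correctness label with probability `judge_error`. This is an illustrative model of a noisy judge, not the paper's setup, and it does not reproduce the ~0.2 threshold; the function names and numbers are hypothetical.

```python
import random
from typing import List, Tuple

# One example: [(is_correct, cost), ...] ordered from cheapest to priciest model.
Chain = List[Tuple[bool, float]]


def cascade(chain: Chain, judge_error: float, rng: random.Random) -> Tuple[bool, float]:
    """Return (final answer is correct, total dollars spent) for one example."""
    spent = 0.0
    for i, (is_correct, cost) in enumerate(chain):
        spent += cost
        is_last = i == len(chain) - 1
        # The noisy judge reports the wrong verdict with probability judge_error.
        verdict = is_correct if rng.random() >= judge_error else not is_correct
        if verdict or is_last:
            return is_correct, spent
    raise ValueError("chain must contain at least one model")


def simulate(chains: List[Chain], judge_error: float, seed: int = 0) -> Tuple[float, float]:
    """Mean quality and mean cost of the cascade over a set of examples."""
    rng = random.Random(seed)
    results = [cascade(chain, judge_error, rng) for chain in chains]
    quality = sum(ok for ok, _ in results) / len(results)
    cost = sum(spent for _, spent in results) / len(results)
    return quality, cost


# Toy chains for a cheap and an expensive model.
hard = [(False, 0.01), (True, 0.10)]   # cheap model is wrong, expensive is right
easy = [(True, 0.01), (True, 0.10)]    # both models are right

perfect_judge = cascade(hard, 0.0, random.Random(0))
inverted_judge_hard = cascade(hard, 1.0, random.Random(0))
inverted_judge_easy = cascade(easy, 1.0, random.Random(0))
mean_quality, mean_cost = simulate([hard, easy], judge_error=0.0)
```

A judge that errs hurts in both directions: it accepts wrong cheap answers, which lowers quality, and rejects correct cheap answers, which raises cost. Running `simulate` over a range of `judge_error` values on recorded outcomes shows where the cascade stops paying off for a given model fleet.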

