RouterBench: dataset + math to measure routing choices that trade cost vs. quality across many LLMs

Overview

Decision Snapshot: Ready For Pilot

The paper supplies a usable dataset, a simple math metric (AIQ) and concrete baselines. It is practical for prototyping routers but lacks latency, throughput, and broad model coverage for production readiness.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 40%

Authors

Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, Shriyash Kaustubh Upadhyay

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Routing can cut serving bills and keep or improve quality by choosing cheaper models per input; RouterBench gives a practical way to measure those trade-offs offline.

Who Should Care

Product Manager, ML Engineer, CTO, Founder

Summary TLDR

RouterBench is a purpose-built benchmark and dataset for evaluating multi-LLM routing systems. It packs 405,467 recorded outputs from a mix of open and proprietary models across eight tasks (reasoning, math, coding, conversation, RAG, etc.). The paper adds a simple math view (cost-quality plane, non-decreasing convex hull) and a scalar AIQ metric to compare routers. Baselines (KNN/MLP predictive routers and cascading routers) show routing can save money and match top-model accuracy, but gains depend on judge quality and task.

Problem Statement

There is no standard way to compare systems that pick which LLM to call per request. Builders lack a common dataset, a cost-aware evaluation metric, and broad baselines to judge routing decisions.

Main Contribution

A large, public ROUTERBENCH dataset: 405,467 model responses collected across 11+ LLMs and 8 task families to enable router training and offline evaluation.

A clear cost–quality math framework and a single-score metric (AIQ) based on a non-decreasing convex hull to compare routers over a cost range.
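The cost-quality view can be made concrete with a small sketch. It assumes each router configuration is summarized as one (cost, quality) point, builds the upper convex hull, keeps only the non-decreasing part, and scores it as the area under that curve divided by the cost range. The function names `ndch` and `aiq`, the sample points, and this exact normalization are assumptions for illustration; the paper's own definitions should take precedence.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (cost, quality)


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); negative means a clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def ndch(points: List[Point]) -> List[Point]:
    """Upper convex hull of the points, truncated at its quality peak."""
    if not points:
        raise ValueError("need at least one (cost, quality) point")
    # Keep only the best quality observed at each distinct cost.
    best: Dict[float, float] = {}
    for cost, quality in points:
        best[cost] = max(quality, best.get(cost, quality))
    hull: List[Point] = []
    for p in sorted(best.items()):
        # Drop points that fall on or below the segment to the new point.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    peak = max(range(len(hull)), key=lambda i: hull[i][1])
    return hull[: peak + 1]


def aiq(points: List[Point]) -> float:
    """Area under the NDCH over [min cost, max cost], divided by that range."""
    hull = ndch(points)
    c_min, c_max = hull[0][0], max(c for c, _ in points)
    if c_max == c_min:
        return hull[0][1]
    if hull[-1][0] < c_max:
        hull = hull + [(c_max, hull[-1][1])]  # flat extension past the peak
    area = sum(0.5 * (q0 + q1) * (c1 - c0)
               for (c0, q0), (c1, q1) in zip(hull, hull[1:]))
    return area / (c_max - c_min)


# Hypothetical (cost, quality) points, for illustration only.
points = [(1.0, 0.5), (2.0, 0.8), (2.5, 0.6), (3.0, 0.7), (4.0, 0.9)]
hull_points = ndch(points)
score = aiq(points)
```

A single scalar like this makes two routers comparable over the same cost range, at the price of hiding where on the cost axis each one is strong.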

Key Findings

ROUTERBENCH contains 405,467 labeled LLM outputs across multiple tasks and models.

Numbers: 405,467 samples; 11+ models; 8 datasets

Practical Use: You can train and test routers offline without re-running expensive model calls; use the dataset to prototype routing policies quickly.

Evidence Ref: Section 4.3, A.3, Table 1

An Oracle router (perfect per-example selection) attains much higher quality at far lower cost than always using expensive top models.

Numbers: Example: Oracle MMLU 0.957 @ $0.297 vs GPT-4 0.828 @ $4.086

Practical Use: Per-input model choice can cut service cost dramatically while keeping or improving quality; aim to approximate an Oracle with routing logic.

Evidence Ref: Table 1, Section 4.4
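The Oracle idea can be sketched in a few lines: for each example, pick the model with the best recorded quality. The record schema, the tie-break by lowest cost, and the toy numbers below are assumptions for illustration, not the dataset's actual format or the paper's reported values.

```python
from typing import Dict, List, Tuple

# One example's recorded outcomes: model name -> (quality, cost in USD).
# Hypothetical schema for illustration.
Outcomes = Dict[str, Tuple[float, float]]


def oracle_choice(outcomes: Outcomes) -> str:
    """Best quality wins; ties go to the cheaper model (assumed tie-break)."""
    return min(outcomes, key=lambda m: (-outcomes[m][0], outcomes[m][1]))


def evaluate_oracle(dataset: List[Outcomes]) -> Tuple[float, float]:
    """Return (mean quality, total cost) of per-example oracle selection."""
    picks = [example[oracle_choice(example)] for example in dataset]
    mean_quality = sum(q for q, _ in picks) / len(picks)
    total_cost = sum(c for _, c in picks)
    return mean_quality, total_cost


# Toy data: neither model is right everywhere, so the oracle beats both.
toy = [
    {"cheap": (1.0, 0.001), "expensive": (1.0, 0.03)},
    {"cheap": (0.0, 0.001), "expensive": (1.0, 0.03)},
    {"cheap": (1.0, 0.001), "expensive": (0.0, 0.03)},
]
choices = [oracle_choice(example) for example in toy]
oracle_quality, oracle_cost = evaluate_oracle(toy)
```

In this toy set the expensive model alone is wrong on one example and is paid for on all three, which is the same mechanism that lets an Oracle exceed a single top model on both axes.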

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Dataset size	405,467 samples	—	—	ROUTERBENCH (all)	Full dataset collected across tasks and models	Section 4.3
Oracle vs GPT-4 on MMLU	Oracle perf 0.957 @ $0.297; GPT-4 perf 0.828 @ $4.086	GPT-4	perf +0.129; cost −$3.789	MMLU (table values)	Table 1 per-model results	A.5 Table 1

What To Try In 7 Days

Run ROUTERBENCH on your model fleet to map cost vs. quality per task.

Train a simple KNN router on labeled per-model outcomes and test on a held-out split.

Prototype a cascading flow with a lightweight judge and measure judge error impact.
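As a starting point for the first step, per-task cost and quality can be aggregated from logged outcomes. This is a minimal sketch; the record keys (`task`, `model`, `quality`, `cost`) and the sample rows are hypothetical and do not reflect the dataset's real column names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def cost_quality_map(records: Iterable[dict]) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Map (task, model) -> (mean quality, mean cost) over logged records."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # quality sum, cost sum, count
    for r in records:
        s = sums[(r["task"], r["model"])]
        s[0] += r["quality"]
        s[1] += r["cost"]
        s[2] += 1
    return {key: (q / n, c / n) for key, (q, c, n) in sums.items()}


# Hypothetical rows, for illustration only.
records = [
    {"task": "math", "model": "small", "quality": 0.0, "cost": 0.001},
    {"task": "math", "model": "small", "quality": 1.0, "cost": 0.001},
    {"task": "math", "model": "large", "quality": 1.0, "cost": 0.03},
    {"task": "code", "model": "small", "quality": 1.0, "cost": 0.002},
]
summary = cost_quality_map(records)
```

Each (mean cost, mean quality) pair is one point on the cost-quality plane for that task.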

Optimization Features

Model Optimization

Model routing (per-input model choice)

Model cascades (try cheap -> expensive)
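A cascade can be sketched as a loop over models ordered from cheap to expensive that stops as soon as a judge accepts the answer. The callables `call_model` and `judge`, the threshold of 0.5, and the stub models are hypothetical placeholders, not an API from the paper or its code.

```python
from typing import Callable, List, Tuple


def cascade(prompt: str,
            models: List[Tuple[str, float]],          # (name, cost), cheap first
            call_model: Callable[[str, str], str],    # hypothetical model client
            judge: Callable[[str, str], float],       # hypothetical quality judge
            threshold: float = 0.5) -> Tuple[str, str, float]:
    """Return (model used, answer, total cost spent across all attempts)."""
    if not models:
        raise ValueError("need at least one model")
    spent = 0.0
    name, answer = "", ""
    for name, cost in models:
        answer = call_model(name, prompt)
        spent += cost  # every attempted call is paid for
        if judge(prompt, answer) >= threshold:
            break      # judge accepts: stop escalating
    return name, answer, spent


# Stubs standing in for real model calls and a real judge.
def fake_call(name: str, prompt: str) -> str:
    return f"{name}:{prompt}"


def fake_judge(prompt: str, answer: str) -> float:
    return 1.0 if answer.startswith("large") else 0.0


models = [("small", 0.001), ("large", 0.03)]
result = cascade("2+2?", models, fake_call, fake_judge)
```

Note that a rejected cheap attempt still adds to the bill, so a cascade only saves money when the cheap model is accepted often enough.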

System Optimization

Exploiting cheaper models to lower total inference dollars

Training Optimization

Supervised performance predictors (KNN, MLP trained on embeddings)
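A KNN predictive router can be sketched without any ML library: average each model's observed quality over the nearest training prompts, then pick the model with the best predicted quality after a cost penalty. The two-dimensional embeddings, the penalty weight `lam`, the scoring rule (predicted quality minus `lam` times cost), and all names here are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

# (prompt embedding, model name -> observed quality). Hypothetical schema.
Example = Tuple[Sequence[float], Dict[str, float]]


def knn_predict(train: List[Example], query: Sequence[float], k: int = 3) -> Dict[str, float]:
    """Predict per-model quality as the mean over the k nearest examples."""
    if not train:
        raise ValueError("training set is empty")
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    return {m: sum(ex[1][m] for ex in nearest) / len(nearest)
            for m in nearest[0][1]}


def route(train: List[Example], query: Sequence[float],
          costs: Dict[str, float], lam: float = 1.0, k: int = 3) -> str:
    """Choose the model maximizing predicted quality minus lam * cost."""
    pred = knn_predict(train, query, k)
    return max(pred, key=lambda m: pred[m] - lam * costs[m])


# Toy data: the small model only succeeds near the origin.
train = [
    ([0.0, 0.0], {"small": 1.0, "large": 1.0}),
    ([0.1, 0.0], {"small": 1.0, "large": 1.0}),
    ([0.0, 0.1], {"small": 1.0, "large": 1.0}),
    ([1.0, 1.0], {"small": 0.0, "large": 1.0}),
    ([0.9, 1.0], {"small": 0.0, "large": 1.0}),
    ([1.0, 0.9], {"small": 0.0, "large": 1.0}),
]
costs = {"small": 0.001, "large": 0.03}
easy_pick = route(train, [0.05, 0.05], costs)
hard_pick = route(train, [0.95, 0.95], costs)
```

Sweeping `lam` traces out a family of cost-quality points for the same predictor, which is what a scalar metric over a cost range then summarizes.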

Inference Optimization

Zero-router convex hull interpolation to pick probabilistic mixes
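The interpolation idea can be illustrated as follows: given models that already sit on the hull, a target average cost between two adjacent points is met by sending a fraction of traffic to each, with no look at the input. The hull values, function names, and budget below are hypothetical, and the sketch assumes the hull is sorted by strictly increasing cost.

```python
import random
from typing import Dict, List, Tuple

HullPoint = Tuple[str, float, float]  # (model name, cost, quality)


def mix_for_budget(hull: List[HullPoint], budget: float) -> Dict[str, float]:
    """Routing probabilities whose expected cost equals the budget."""
    if budget <= hull[0][1]:
        return {hull[0][0]: 1.0}   # cannot go below the cheapest model
    if budget >= hull[-1][1]:
        return {hull[-1][0]: 1.0}  # no gain from spending past the peak
    for (m0, c0, _), (m1, c1, _) in zip(hull, hull[1:]):
        if c0 <= budget <= c1:
            p = (budget - c0) / (c1 - c0)  # share sent to the pricier model
            return {m0: 1.0 - p, m1: p}
    raise ValueError("hull must be sorted by strictly increasing cost")


def expected(hull: List[HullPoint], mix: Dict[str, float]) -> Tuple[float, float]:
    """Expected (cost, quality) of a probabilistic mix."""
    by_name = {m: (c, q) for m, c, q in hull}
    cost = sum(p * by_name[m][0] for m, p in mix.items())
    quality = sum(p * by_name[m][1] for m, p in mix.items())
    return cost, quality


def pick(mix: Dict[str, float], rng: random.Random) -> str:
    """Sample one model for a request according to the mix."""
    return rng.choices(list(mix), weights=list(mix.values()), k=1)[0]


# Hypothetical hull, for illustration only.
toy_hull = [("small", 1.0, 0.5), ("large", 3.0, 0.9)]
mix = mix_for_budget(toy_hull, budget=2.0)
mix_cost, mix_quality = expected(toy_hull, mix)
```

Because this mix ignores the prompt entirely, it is a natural floor: a learned router is only useful if it lands above this line at the same cost.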

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/withmartian/routerbench

Data URLs

https://github.com/withmartian/routerbench

Risks & Boundaries

Limitations

Focuses only on performance and dollar cost; omits latency and throughput.

Not all LLMs and task types are covered; the benchmark will need updates to stay current.

When Not To Use

When latency or throughput constraints are primary optimization targets.

When you need routers evaluated on a much larger or different model pool than provided.

Failure Modes

Cascading routers fail when the quality judge's error rate exceeds roughly 0.2.

Predictive routers may not generalize across tasks and can underperform on some benchmarks.
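The sensitivity of a cascade to judge error can be explored with a toy analytic model. It assumes a two-model cascade, independent correctness of the two models, and a judge that flips the true verdict with probability `e`; the accuracies and costs are made up. It is not the paper's experiment and is not expected to reproduce the threshold quoted above.

```python
from typing import Tuple


def cascade_expectation(a: float, b: float, e: float,
                        c_small: float, c_large: float) -> Tuple[float, float]:
    """Expected (quality, cost) of small -> large cascade with a noisy judge.

    a, b: accuracy of the small and large model (assumed independent).
    e:    probability that the judge's accept/reject verdict is wrong.
    """
    # Judge accepts a correct answer w.p. (1 - e) and a wrong one w.p. e.
    p_accept = a * (1 - e) + (1 - a) * e
    # Correct if the small answer is right and accepted, or we escalate
    # and the large model is right.
    quality = a * (1 - e) + (1 - p_accept) * b
    # The small model is always paid for; the large one only on rejection.
    cost = c_small + (1 - p_accept) * c_large
    return quality, cost


# Hypothetical accuracies and per-request costs, for illustration only.
for e in (0.0, 0.1, 0.2, 0.3):
    q, c = cascade_expectation(a=0.6, b=0.9, e=e, c_small=0.001, c_large=0.03)
    print(f"judge error {e:.1f}: quality {q:.3f}, cost ${c:.4f}")
```

Under these assumptions quality falls linearly as `e` grows, so it is worth measuring the judge's error rate before trusting a cascade's offline numbers.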

Core Entities

Models

Llama-70B-chat, Mixtral-8x7B-chat, Yi-34B-chat, Code Llama-34B, Mistral-7B-chat, WizardLM-13B, GPT-4, GPT-3.5-turbo, Claude-instant-v1, Claude-v1, Claude-v2, You.com API, sonar-small-online, sonar-medium-online

Metrics

Accuracy, AIQ (Average Improvement in Quality), Cost (USD per request), Non-decreasing convex hull (NDCH)

Datasets

HellaSwag, Winogrande, ARC Challenge, MMLU, MT-Bench, GSM8K, MBPP, RAG (800 client queries)

Benchmarks

ROUTERBENCH (this work)

