RouterBench: dataset + math to measure routing choices that trade cost vs. quality across many LLMs

March 18, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The paper supplies a usable dataset, a simple math metric (AIQ) and concrete baselines. It is practical for prototyping routers but lacks latency, throughput, and broad model coverage for production readiness.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 40%

Authors

Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, Shriyash Kaustubh Upadhyay

Why It Matters For Business

Routing can cut serving bills and keep or improve quality by choosing cheaper models per input; RouterBench gives a practical way to measure those trade-offs offline.

Who Should Care

Summary TLDR

RouterBench is a purpose-built benchmark and dataset for evaluating multi-LLM routing systems. It contains 405,467 recorded outputs from a mix of open and proprietary models across eight tasks, including reasoning, math, coding, conversation, and RAG. The paper adds a simple mathematical framework (the cost-quality plane and its non-decreasing convex hull) and a scalar AIQ metric for comparing routers. Baselines (KNN/MLP predictive routers and cascading routers) show that routing can save money while matching top-model accuracy, but the gains depend on judge quality and task.

Problem Statement

There is no standard way to compare systems that pick which LLM to call per request. Builders lack a common dataset, a cost-aware evaluation metric, and broad baselines to judge routing decisions.

Main Contribution

A large, public ROUTERBENCH dataset: 405,467 model responses collected across 11+ LLMs and 8 task families to enable router training and offline evaluation.

A clear cost–quality math framework and a single-score metric (AIQ) based on a non-decreasing convex hull to compare routers over a cost range.
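The hull-and-area idea behind AIQ can be sketched in a few lines of Python. This is a minimal illustration, not the paper's reference code. It assumes AIQ is the area under the non-decreasing convex hull of a router's (cost, quality) points, divided by the width of the cost range, with quality held flat beyond the last hull point. The function names and the toy points are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (cost, quality)


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def non_decreasing_convex_hull(points: List[Point]) -> List[Point]:
    """Upper convex hull of (cost, quality) points, cut off at peak quality."""
    best = {}
    for cost, quality in points:
        # Keep only the best quality seen at each distinct cost.
        best[cost] = max(quality, best.get(cost, float("-inf")))
    hull: List[Point] = []
    for p in sorted(best.items()):
        # Drop points that lie on or below the segment from hull[-2] to p.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    # Past the peak the upper hull would fall, so stop there.
    peak = max(range(len(hull)), key=lambda i: hull[i][1])
    return hull[: peak + 1]


def aiq(points: List[Point], c_min: float, c_max: float) -> float:
    """Average hull quality over [c_min, c_max] (assumed AIQ definition)."""
    hull = non_decreasing_convex_hull(points)
    if not (hull[0][0] <= c_min < c_max):
        raise ValueError("need cheapest cost <= c_min < c_max")
    if hull[-1][0] < c_max:
        hull.append((c_max, hull[-1][1]))  # flat extension past the peak
    area = 0.0
    for (c0, q0), (c1, q1) in zip(hull, hull[1:]):
        lo, hi = max(c0, c_min), min(c1, c_max)
        if hi <= lo:
            continue
        slope = (q1 - q0) / (c1 - c0)
        q_lo, q_hi = q0 + slope * (lo - c0), q0 + slope * (hi - c0)
        area += 0.5 * (q_lo + q_hi) * (hi - lo)  # trapezoid rule
    return area / (c_max - c_min)


# Hypothetical router operating points, not taken from the paper.
TOY_POINTS = [(1.0, 0.5), (2.0, 0.7), (3.0, 0.6), (4.0, 0.9)]
print(non_decreasing_convex_hull(TOY_POINTS))
print(aiq(TOY_POINTS, 1.0, 4.0))
```

The point at cost 3.0 drops out because a mix of its neighbours beats it, which is the property that makes the hull a fair basis for comparing routers over a cost range.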

Key Findings

ROUTERBENCH contains 405,467 labeled LLM outputs across multiple tasks and models.

Numbers: 405,467 samples; 11+ models; 8 datasets

Practical Use: You can train and test routers offline without re-running expensive model calls; use the dataset to prototype routing policies quickly.

Evidence Ref: Section 4.3, A.3, Table 1

An Oracle router (perfect per-example selection) attains much higher quality at far lower cost than always using expensive top models.

Numbers: Example: Oracle MMLU 0.957 @ $0.297 vs GPT-4 0.828 @ $4.086

Practical Use: Per-input model choice can cut service cost dramatically while keeping or improving quality; aim to approximate an Oracle with routing logic.

Evidence Ref: Table 1, Section 4.4
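To see why an Oracle can beat the best single model on both axes, here is a small sketch under stated assumptions: each example has a recorded (quality, cost) pair per model, and the Oracle picks the highest-quality answer, breaking ties by lower cost. The model names and numbers are made up for illustration and are not RouterBench values.

```python
from typing import Callable, Dict, List, Tuple

Outcome = Tuple[float, float]  # (quality, cost in USD)
Row = Dict[str, Outcome]       # model name -> recorded outcome for one example


def oracle_choice(row: Row) -> str:
    """Pick the best-quality model; among ties, pick the cheapest."""
    return min(row, key=lambda m: (-row[m][0], row[m][1]))


def evaluate(rows: List[Row], pick: Callable[[Row], str]) -> Tuple[float, float]:
    """Return (average quality, total cost) of a routing policy on recorded rows."""
    chosen = [row[pick(row)] for row in rows]
    avg_quality = sum(q for q, _ in chosen) / len(chosen)
    total_cost = sum(c for _, c in chosen)
    return avg_quality, total_cost


# Hypothetical recorded outcomes for four examples.
ORACLE_ROWS: List[Row] = [
    {"small": (1.0, 0.001), "large": (1.0, 0.03)},
    {"small": (0.0, 0.001), "large": (1.0, 0.03)},
    {"small": (1.0, 0.001), "large": (0.0, 0.03)},
    {"small": (0.0, 0.001), "large": (0.0, 0.03)},
]

print("oracle:      ", evaluate(ORACLE_ROWS, oracle_choice))
print("always large:", evaluate(ORACLE_ROWS, lambda row: "large"))
```

The Oracle wins on quality because the cheap model sometimes answers correctly where the expensive one does not, and it wins on cost because it only pays for the expensive model when that model is strictly better.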

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Dataset size | 405,467 samples | | | ROUTERBENCH (all) | Full dataset collected across tasks and models | Section 4.3 |
| Oracle vs GPT-4 on MMLU | Oracle perf 0.957 @ $0.297; GPT-4 perf 0.828 @ $4.086 | GPT-4 | perf +0.129; cost −$3.789 | MMLU (table values) | Table 1 per-model results | A.5 Table 1 |

What To Try In 7 Days

Run ROUTERBENCH on your model fleet to map cost vs. quality per task.

Train a simple KNN router on labeled per-model outcomes and test on a held-out split.

Prototype a cascading flow with a lightweight judge and measure judge error impact.
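As a starting point for the KNN step above, the sketch below routes a query by averaging each model's recorded quality over the k most similar training prompts and then subtracting a cost penalty. It is a pure-Python illustration under assumptions: the embeddings, model names, costs, and the `quality - lam * cost` scoring rule are all hypothetical choices, and a real prototype would use a proper embedding model and a held-out split.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def knn_route(
    query: Sequence[float],
    train_embeddings: List[Sequence[float]],
    train_quality: List[Dict[str, float]],
    cost: Dict[str, float],
    k: int = 3,
    lam: float = 0.0,
) -> str:
    """Pick the model with the best predicted quality minus lam * cost."""
    # Indices of the k training prompts most similar to the query.
    ranked = sorted(
        range(len(train_embeddings)),
        key=lambda i: cosine(query, train_embeddings[i]),
        reverse=True,
    )[:k]
    scores = {}
    for model, model_cost in cost.items():
        predicted = sum(train_quality[i][model] for i in ranked) / len(ranked)
        scores[model] = predicted - lam * model_cost
    return max(scores, key=scores.get)


# Hypothetical 2-D embeddings: first three are "math-like", last three "chat-like".
TRAIN_EMB = [(1.0, 0.1), (0.9, 0.2), (1.0, 0.0), (0.1, 1.0), (0.2, 0.9), (0.0, 1.0)]
TRAIN_QUALITY = [{"cheap": 0.0, "strong": 1.0}] * 3 + [{"cheap": 1.0, "strong": 1.0}] * 3
MODEL_COST = {"cheap": 0.001, "strong": 0.03}

print(knn_route((0.95, 0.05), TRAIN_EMB, TRAIN_QUALITY, MODEL_COST, k=3, lam=5.0))
print(knn_route((0.05, 0.95), TRAIN_EMB, TRAIN_QUALITY, MODEL_COST, k=3, lam=5.0))
```

Sweeping `lam` from zero upward traces out a set of (cost, quality) operating points, which is what the cost-quality comparison needs as input.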

Optimization Features

Model Optimization
Model routing (per-input model choice)
Model cascades (try cheap -> expensive)
System Optimization
Exploiting cheaper models to lower total inference dollars
Training Optimization
Supervised performance predictors (KNN, MLP trained on embeddings)
Inference Optimization
Zero-router convex hull interpolation to pick probabilistic mixes
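The convex hull interpolation listed under Inference Optimization can be illustrated with a short sketch. It assumes the Zero router ignores the prompt and, for a target average cost, randomly mixes the two adjacent hull models that bracket that cost. The model names and (cost, quality) values are hypothetical.

```python
import random
from typing import List, Tuple

HullPoint = Tuple[str, float, float]  # (model name, cost, quality)


def mix_for_budget(hull: List[HullPoint], target_cost: float) -> Tuple[str, str, float]:
    """Return (cheaper model, pricier model, probability of the pricier one)."""
    hull = sorted(hull, key=lambda h: h[1])
    if target_cost <= hull[0][1]:
        return hull[0][0], hull[0][0], 0.0
    for lo, hi in zip(hull, hull[1:]):
        if lo[1] <= target_cost <= hi[1]:
            # Linear interpolation: expected cost equals target_cost.
            p_hi = (target_cost - lo[1]) / (hi[1] - lo[1])
            return lo[0], hi[0], p_hi
    return hull[-1][0], hull[-1][0], 1.0  # budget above the priciest point


def zero_route(hull: List[HullPoint], target_cost: float, rng: random.Random) -> str:
    """Sample one model for a request without looking at the prompt."""
    lo, hi, p_hi = mix_for_budget(hull, target_cost)
    return hi if rng.random() < p_hi else lo


def expected_quality(hull: List[HullPoint], target_cost: float) -> float:
    """Expected quality of the probabilistic mix at the target cost."""
    quality = {name: q for name, _, q in hull}
    lo, hi, p_hi = mix_for_budget(hull, target_cost)
    return (1 - p_hi) * quality[lo] + p_hi * quality[hi]


# Hypothetical hull points, assumed to already lie on the non-decreasing hull.
TOY_HULL: List[HullPoint] = [("small", 1.0, 0.60), ("large", 5.0, 0.80)]

print(mix_for_budget(TOY_HULL, 2.0))
print(expected_quality(TOY_HULL, 2.0))
print(zero_route(TOY_HULL, 2.0, random.Random(0)))
```

Because this baseline needs no training and no prompt features, a learned router only earns its keep if its operating points sit above the line this mix traces.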

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Focuses only on performance and dollar cost; omits latency and throughput.

Not all LLMs and task types are covered; future updates are needed to keep the benchmark current.

When Not To Use

When latency or throughput constraints are primary optimization targets.

When you need routers evaluated on a much larger or different model pool than provided.

Failure Modes

Cascading routers fail when the quality judge's error rate exceeds roughly 0.2.

Predictive routers may not generalize across tasks and can underperform on some benchmarks.
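The judge-error failure mode for cascades can be explored offline with a toy simulation. The sketch below is an assumption-laden illustration: the judge is modeled as flipping the true correctness of an answer with probability `judge_error`, the last model's answer is always accepted, and the rows, model names, and costs are invented. It does not reproduce the roughly 0.2 threshold reported above; it only shows the mechanism.

```python
import random
from typing import Dict, List, Tuple

Model = Tuple[str, float]  # (model name, cost in USD per call)


def cascade_once(
    correct: Dict[str, bool],
    models: List[Model],
    judge_error: float,
    rng: random.Random,
) -> Tuple[bool, float]:
    """Try models cheap -> expensive; stop when the noisy judge accepts."""
    spent = 0.0
    is_correct = False
    for i, (name, cost) in enumerate(models):
        spent += cost
        is_correct = correct[name]
        if i == len(models) - 1:
            break  # nothing left to escalate to
        flipped = rng.random() < judge_error
        judged_ok = (not is_correct) if flipped else is_correct
        if judged_ok:
            break
    return is_correct, spent


def simulate(
    rows: List[Dict[str, bool]],
    models: List[Model],
    judge_error: float,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (average quality, average cost) of the cascade over all rows."""
    rng = random.Random(seed)
    results = [cascade_once(row, models, judge_error, rng) for row in rows]
    n = len(results)
    return sum(ok for ok, _ in results) / n, sum(c for _, c in results) / n


# Hypothetical per-example correctness for three models.
CASCADE_MODELS: List[Model] = [("small", 0.001), ("medium", 0.01), ("large", 0.03)]
CASCADE_ROWS = [
    {"small": True, "medium": True, "large": True},
    {"small": False, "medium": True, "large": True},
    {"small": False, "medium": False, "large": True},
    {"small": False, "medium": False, "large": False},
]

for err in (0.0, 0.1, 0.2, 0.4):
    print(err, simulate(CASCADE_ROWS, CASCADE_MODELS, err))
```

With a perfect judge the cascade stops at the first correct answer; as the error rate grows it both accepts wrong cheap answers and pays to escalate past correct ones, so quality and cost can worsen together.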

Core Entities

Models

Llama-70B-chat, Mixtral-8x7B-chat, Yi-34B-chat, Code Llama-34B, Mistral-7B-chat, WizardLM-13B, GPT-4, GPT-3.5-turbo, Claude-instant-v1, Claude-v1, Claude-v2, You.com API, sonar-small-online, sonar-medium-online

Metrics

Accuracy, AIQ (Average Improvement in Quality), Cost (USD per request), Non-decreasing convex hull (NDCH)

Datasets

HellaSwag, Winogrande, ARC Challenge, MMLU, MT-Bench, GSM8K, MBPP, RAG (800 client queries)

Benchmarks

ROUTERBENCH (this work)