Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.8
Citation Count
1
Why It Matters For Business
Routing can cut serving bills and keep or improve quality by choosing cheaper models per input; RouterBench gives a practical way to measure those trade-offs offline.
Summary TLDR
RouterBench is a purpose-built benchmark and dataset for evaluating multi-LLM routing systems. It contains 405,467 recorded outputs from a mix of open and proprietary models across eight tasks (reasoning, math, coding, conversation, RAG, etc.). The paper adds a simple mathematical framing (a cost-quality plane with a non-decreasing convex hull) and a scalar AIQ metric to compare routers. Baselines (KNN/MLP predictive routers and cascading routers) show that routing can save money while matching top-model accuracy, but the gains depend on judge quality and on the task.
Problem Statement
There is no standard way to compare systems that pick which LLM to call per request. Builders lack a common dataset, a cost-aware evaluation metric, and broad baselines to judge routing decisions.
Main Contribution
A large, public ROUTERBENCH dataset: 405,467 model responses collected across 11+ LLMs and 8 task families to enable router training and offline evaluation.
A clear cost–quality math framework and a single-score metric (AIQ) based on a non-decreasing convex hull to compare routers over a cost range.
Empirical baselines (predictive KNN/MLP routers and cascading routers) and a pilot study showing where routing helps and where judge accuracy limits gains.
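To make the cost-quality framing concrete, here is a minimal sketch of a non-decreasing upper hull and an area-based score over a cost range. It is one plausible reading of the description above, not the paper's reference implementation: the function names (`non_decreasing_hull`, `aiq`), the normalization by the cost range, and the toy points are all assumptions.

```python
def _cross(o, a, b):
    """Z-component of (a - o) x (b - o); negative means a clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def non_decreasing_hull(points):
    """Vertices of a non-decreasing upper hull over (cost, quality) points."""
    # Keep only the best quality observed at each distinct cost.
    best = {}
    for cost, quality in points:
        best[cost] = max(quality, best.get(cost, quality))
    pts = sorted(best.items())

    # Upper convex hull, scanning left to right (monotone chain).
    hull = []
    for p in pts:
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)

    # Cut the hull at its peak, then extend flat to the largest cost:
    # spending more should never be scored as lowering achievable quality.
    peak = max(range(len(hull)), key=lambda i: hull[i][1])
    kept = hull[: peak + 1]
    max_cost = pts[-1][0]
    if kept[-1][0] < max_cost:
        kept.append((max_cost, kept[-1][1]))
    return kept


def aiq(points):
    """Area under the hull divided by the cost range (assumed normalization)."""
    hull = non_decreasing_hull(points)
    c_min, c_max = hull[0][0], hull[-1][0]
    if c_max == c_min:
        return hull[0][1]
    area = sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(hull, hull[1:])
    )
    return area / (c_max - c_min)


# Hypothetical router operating points: (cost in USD, accuracy).
toy_points = [(1, 0.5), (2, 0.8), (3, 0.7), (4, 0.9)]
print(non_decreasing_hull(toy_points))
print(round(aiq(toy_points), 4))
```

In the toy data, the point at cost 3 falls below the line joining its neighbors, so it is dropped from the hull; a router that lands there is dominated by a mix of the points on either side.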
Key Findings
ROUTERBENCH contains 405,467 labeled LLM outputs across multiple tasks and models.
An Oracle router (perfect per-example selection) attains much higher quality at far lower cost than always using expensive top models.
Monetary costs for similar-quality outputs often differ by large multiples between models.
Simple predictive routers (KNN and MLP) match top-model performance at lower or similar cost on several tasks but do not consistently beat the Zero router baseline.
Cascading routers can approach Oracle performance when the judge is accurate, but degrade quickly as judge error rises.
In a practical RAG setting, routers improve results by detecting time-sensitive queries and sending them to online, retrieval-enabled models.
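The Oracle comparison can be illustrated with a small sketch. It assumes an Oracle that, for each prompt, picks the highest-quality response and breaks ties by lowest cost; the tie-break rule, the function names, and the toy records are assumptions for illustration, not data from the benchmark.

```python
def oracle_choice(outcomes):
    """Pick the best model for one prompt.

    `outcomes` maps model name -> (quality, cost). Highest quality wins;
    ties go to the cheaper model (assumed tie-break).
    """
    return min(outcomes, key=lambda m: (-outcomes[m][0], outcomes[m][1]))


def oracle_summary(dataset):
    """Return per-prompt picks plus mean quality and mean cost."""
    picks = [oracle_choice(o) for o in dataset]
    n = len(dataset)
    quality = sum(o[m][0] for o, m in zip(dataset, picks)) / n
    cost = sum(o[m][1] for o, m in zip(dataset, picks)) / n
    return picks, quality, cost


# Hypothetical recorded outcomes for three prompts and two models.
toy = [
    {"cheap": (1, 0.1), "big": (1, 1.0)},  # both correct: take the cheap one
    {"cheap": (0, 0.1), "big": (1, 1.0)},  # only the big model is correct
    {"cheap": (0, 0.1), "big": (0, 1.0)},  # neither correct: do not overpay
]
picks, quality, cost = oracle_summary(toy)
print(picks, round(quality, 3), round(cost, 3))
```

On this toy data, always calling the big model reaches the same mean quality (2/3) at a mean cost of 1.0, while the Oracle gets there at 0.4. That gap is the headroom a real router is trying to capture.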
Results
Dataset size
Oracle vs GPT-4 on MMLU
Judge sensitivity for cascading routers
What To Try In 7 Days
Run ROUTERBENCH on your model fleet to map cost vs. quality per task.
Train a simple KNN router on labeled per-model outcomes and test on a held-out split.
Prototype a cascading flow with a lightweight judge and measure judge error impact.
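As a starting point for the KNN step, the sketch below predicts each model's quality for a new prompt by averaging the recorded quality of its nearest training prompts, then picks the model with the best quality-minus-cost score. The scoring rule, the `quality_weight` knob, the two-dimensional toy embeddings, and all names are assumptions; a real setup would use a sentence-embedding model and a held-out split.

```python
import math


def knn_route(query, train_embeddings, train_quality, costs, k=3, quality_weight=1.0):
    """Route one query embedding to a model name.

    train_embeddings: list of vectors for labeled prompts.
    train_quality:    list of {model: quality} dicts, aligned with the vectors.
    costs:            {model: expected cost per request}.
    quality_weight:   hypothetical knob trading quality against cost.
    """
    # Rank training prompts by Euclidean distance to the query.
    order = sorted(
        range(len(train_embeddings)),
        key=lambda i: math.dist(query, train_embeddings[i]),
    )
    neighbors = order[:k]

    scores = {}
    for model, cost in costs.items():
        predicted = sum(train_quality[i][model] for i in neighbors) / len(neighbors)
        scores[model] = quality_weight * predicted - cost
    return max(scores, key=scores.get)


# Toy data: prompts near (0, 0) are easy, prompts near (5, 5) are hard.
embeddings = [(0, 0), (0, 1), (5, 5), (5, 6)]
quality = [
    {"small": 1, "large": 1},
    {"small": 1, "large": 1},
    {"small": 0, "large": 1},
    {"small": 0, "large": 1},
]
costs = {"small": 0.1, "large": 0.5}

print(knn_route((0, 0.5), embeddings, quality, costs, k=2))
print(knn_route((5, 5.5), embeddings, quality, costs, k=2))
```

Sweeping `quality_weight` over a range of values yields a set of (cost, quality) operating points for the router, which is what a cost-quality comparison needs.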
Optimization Features
Model Optimization
- Model routing (per-input model choice)
- Model cascades (try cheap -> expensive)
System Optimization
- Exploiting cheaper models to lower total inference dollars
Training Optimization
- Supervised performance predictors (KNN, MLP trained on embeddings)
Inference Optimization
- Zero-router convex hull interpolation to pick probabilistic mixes
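The probabilistic mix can be sketched as a linear interpolation between two adjacent hull models: call the pricier model with probability p and the cheaper one otherwise, so the expected cost hits a target budget. The function name, the argument layout, and the example numbers are assumptions for illustration.

```python
def mix_for_budget(cheap, pricey, target_cost):
    """Return (p, expected_quality) for a two-model random mix.

    cheap and pricey are (cost, quality) pairs; p is the probability of
    calling the pricier model so that expected cost equals target_cost.
    """
    (c0, q0), (c1, q1) = cheap, pricey
    if not c0 < c1:
        raise ValueError("cheap model must cost less than pricey model")
    if not c0 <= target_cost <= c1:
        raise ValueError("target cost must lie between the two model costs")
    p = (target_cost - c0) / (c1 - c0)
    expected_quality = (1 - p) * q0 + p * q1
    return p, expected_quality


# Hypothetical hull vertices: (cost, accuracy).
p, q = mix_for_budget((1.0, 0.5), (4.0, 0.9), target_cost=2.5)
print(p, round(q, 3))
```

Because this baseline needs no information about the prompt, any learned router should be compared against it at the same expected cost.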
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Focuses only on performance and dollar cost; omits latency and throughput.
- Not all LLMs and task types are covered; future updates are needed to keep the benchmark current.
- RAG evaluation limited to models with built-in retrieval; two-stage retriever+LLM routing is not fully explored.
When Not To Use
- When latency or throughput constraints are primary optimization targets.
- When you need routers evaluated on a much larger or different model pool than provided.
- When you rely on an imperfect judge and cannot measure judge error.
Failure Modes
- Cascading routers degrade sharply once the quality judge's error rate exceeds roughly 0.2.
- Predictive routers may not generalize across tasks and can underperform on some benchmarks.
- Some overly aligned models refuse to answer, which skews routing statistics.
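To measure how sensitive a cascade is to its judge, a small simulation helps. The sketch below walks models from cheap to expensive, flips the judge's verdict with a given probability, and stops at the first accepted answer. It assumes a binary correct/incorrect signal, ignores the judge's own cost, and uses made-up names and numbers.

```python
import random


def run_cascade(correct, costs, judge_error, rng):
    """Simulate one prompt through a cheap-to-expensive cascade.

    correct:     list of bools, whether each model's answer is right.
    costs:       per-call cost of each model, in the same order.
    judge_error: probability that the judge's verdict is flipped.
    Returns (quality, total_cost) for the answer finally returned.
    """
    total = 0.0
    for i, (ok, cost) in enumerate(zip(correct, costs)):
        total += cost
        verdict = ok
        if rng.random() < judge_error:
            verdict = not verdict  # the judge makes a mistake
        is_last = i == len(costs) - 1
        if verdict or is_last:
            # Accepted, or no model left to escalate to.
            return int(ok), total


def sweep(correct_rows, costs, judge_error, seed=0):
    """Mean quality and mean cost over many prompts at one judge error rate."""
    rng = random.Random(seed)
    results = [run_cascade(row, costs, judge_error, rng) for row in correct_rows]
    n = len(results)
    return sum(q for q, _ in results) / n, sum(c for _, c in results) / n


# Hypothetical outcomes: the cheapest model fails, the other two succeed.
rows = [[False, True, True]] * 1000
for err in (0.0, 0.1, 0.2, 0.4):
    print(err, sweep(rows, [1, 5, 20], err))
```

Running the sweep on recorded per-model outcomes shows how quickly mean quality falls and mean cost shifts as the judge's error rate grows.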
Core Entities
Models
- Llama-70B-chat
- Mixtral-8x7B-chat
- Yi-34B-chat
- Code Llama-34B
- Mistral-7B-chat
- WizardLM-13B
- GPT-4
- GPT-3.5-turbo
- Claude-instant-v1
- Claude-v1
- Claude-v2
- You.com API
- sonar-small-online
- sonar-medium-online
Metrics
- Accuracy
- AIQ (Average Improvement in Quality)
- Cost (USD per request)
- Non-decreasing convex hull (NDCH)
Datasets
- HellaSwag
- Winogrande
- ARC Challenge
- MMLU
- MT-Bench
- GSM8K
- MBPP
- RAG (800 client queries)
Benchmarks
- ROUTERBENCH (this work)

