RouterBench: dataset + math to measure routing choices that trade cost vs. quality across many LLMs

March 18, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.8

Citation Count

1

Authors

Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, Shriyash Kaustubh Upadhyay

Links

Abstract / PDF

Why It Matters For Business

Routing can cut serving bills and keep or improve quality by choosing cheaper models per input; RouterBench gives a practical way to measure those trade-offs offline.

Summary TLDR

RouterBench is a purpose-built benchmark and dataset for evaluating multi-LLM routing systems. It packs 405,467 recorded outputs from a mix of open and proprietary models across eight tasks (reasoning, math, coding, conversation, RAG, etc.). The paper adds a simple math view (cost-quality plane, non-decreasing convex hull) and a scalar AIQ metric to compare routers. Baselines (KNN/MLP predictive routers and cascading routers) show routing can save money and match top-model accuracy, but gains depend on judge quality and task.
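The cost-quality view can be sketched in a few lines. The block below is a minimal illustration, not the authors' code: it assumes each router or model is a `(cost, quality)` point, builds a non-decreasing upper hull, and computes AIQ as the average hull height over a cost range. The function names, the toy points, and the flat extension beyond the last hull vertex are assumptions made for this example.

```python
def cross(a, b, p):
    """Z-component of (b - a) x (p - a); positive means a left turn."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def non_decreasing_convex_hull(points):
    """Return the upper frontier of (cost, quality) points, sorted by cost."""
    # Step 1: keep only points that raise quality as cost grows.
    frontier, best = [], float("-inf")
    for cost, quality in sorted(set(points)):
        if quality > best:
            if frontier and frontier[-1][0] == cost:
                frontier.pop()  # same cost, better quality: replace
            frontier.append((cost, quality))
            best = quality
    # Step 2: drop points on or below the chord between their neighbours.
    hull = []
    for p in frontier:
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def aiq(hull, c_max):
    """Average hull height over [cheapest hull cost, c_max]."""
    c_min, last_cost, last_quality = hull[0][0], hull[-1][0], hull[-1][1]
    if c_max < last_cost or c_max <= c_min:
        raise ValueError("c_max must cover the hull and exceed its cheapest cost")
    # Assumption: quality stays flat past the last hull vertex.
    pts = hull + [(c_max, last_quality)]
    area = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
    return area / (c_max - c_min)


# Toy points (made up for illustration, not taken from the paper).
toy_points = [(1.0, 0.5), (2.0, 0.6), (3.0, 0.9), (4.0, 0.8)]
toy_hull = non_decreasing_convex_hull(toy_points)
print(toy_hull)                         # [(1.0, 0.5), (3.0, 0.9)]
print(round(aiq(toy_hull, c_max=4.0), 3))  # 0.767
```

The point `(2.0, 0.6)` is dropped because mixing its two neighbours achieves better quality at the same cost, and `(4.0, 0.8)` is dropped because it costs more and scores lower than `(3.0, 0.9)`.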

Problem Statement

There is no standard way to compare systems that pick which LLM to call per request. Builders lack a common dataset, a cost-aware evaluation metric, and broad baselines to judge routing decisions.

Main Contribution

A large, public ROUTERBENCH dataset: 405,467 model responses collected across 11+ LLMs and 8 task families to enable router training and offline evaluation.

A clear cost–quality math framework and a single-score metric (AIQ) based on a non-decreasing convex hull to compare routers over a cost range.

Empirical baselines (predictive KNN/MLP routers and cascading routers) and a pilot study showing where routing helps and where judge accuracy limits gains.

Key Findings

ROUTERBENCH contains 405,467 labeled LLM outputs across multiple tasks and models.

Numbers: 405,467 samples; 11+ models; 8 datasets

An Oracle router (perfect per-example selection) attains much higher quality at far lower cost than always using expensive top models.

Numbers: e.g., Oracle on MMLU 0.957 @ $0.297 vs. GPT-4 0.828 @ $4.086

Monetary costs for similar-quality outputs often differ by multiples between models.

Numbers: Authors report cost differences of ~2–5× for comparable performance

Simple predictive routers (KNN and MLP) match top-model performance at lower or similar cost on several tasks but do not consistently beat the Zero baseline.

Numbers: KNN/MLP trained on 70% of the data and evaluated on the remaining 30%; higher AIQ than Zero on MMLU and Winogrande; mixed or worse results on ARC Challenge/MB

Cascading routers can approach Oracle performance when the judge is accurate, but degrade quickly as judge error rises.

Numbers: Good performance up to judge error ≈0.1; performance drops sharply above ≈0.2

In a practical RAG setting, routers improve results by detecting time-sensitive queries and sending them to online, retrieval-enabled models.

Numbers: RAG split of 800 client queries; routers outperform the Zero router on RAG
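The Oracle finding above is easy to reproduce offline once per-model outcomes are recorded. The sketch below is illustrative only: the record layout (`model -> (quality, cost)`), the model names `cheap` and `big`, and the toy numbers are assumptions, not RouterBench's actual schema or results.

```python
def oracle_choice(outcomes):
    """Pick the model with the best quality, breaking ties by lower cost."""
    return min(outcomes, key=lambda m: (-outcomes[m][0], outcomes[m][1]))


def evaluate(records, choose):
    """Return (mean quality, total cost) for a per-example selection rule."""
    picks = [choose(r) for r in records]
    quality = sum(r[m][0] for r, m in zip(records, picks)) / len(records)
    cost = sum(r[m][1] for r, m in zip(records, picks))
    return quality, cost


# Hypothetical toy data: one dict per example, model -> (quality, cost in USD).
toy_records = [
    {"cheap": (1.0, 0.01), "big": (1.0, 0.10)},
    {"cheap": (0.0, 0.01), "big": (1.0, 0.12)},
    {"cheap": (1.0, 0.02), "big": (0.0, 0.09)},
]

oracle_quality, oracle_cost = evaluate(toy_records, oracle_choice)
big_quality, big_cost = evaluate(toy_records, lambda r: "big")
print(oracle_quality, round(oracle_cost, 2))      # 1.0 0.15
print(round(big_quality, 3), round(big_cost, 2))  # 0.667 0.31
```

Even in this tiny example the Oracle is both cheaper and more accurate than always calling the expensive model, because the cheap model sometimes succeeds where the expensive one fails.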

Results

Dataset size

Value: 405,467 samples

Oracle vs GPT-4 on MMLU

Value: Oracle perf 0.957 @ $0.297; GPT-4 perf 0.828 @ $4.086

Baseline: GPT-4

Judge sensitivity for cascading routers

Value: Effective at error ≤0.1; degrades sharply >0.2

Baseline: Zero router
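Judge sensitivity can be probed with a small simulation. The sketch below is an assumption-laden toy, not the paper's setup: it models the judge as a binary accept/reject check that is wrong with probability `judge_error`, tries models from cheapest to most expensive, and always accepts the last model's answer. All names and numbers are hypothetical.

```python
import random


def run_cascade(attempts, judge_error, rng):
    """Run one example through a cascade.

    attempts: list of (is_correct, cost), ordered cheap to expensive.
    Returns (quality, total_cost), where cost accumulates over every call made.
    """
    total_cost = 0.0
    for index, (is_correct, cost) in enumerate(attempts):
        total_cost += cost
        verdict = is_correct
        if rng.random() < judge_error:
            verdict = not verdict  # the judge makes a mistake
        if verdict or index == len(attempts) - 1:
            return (1.0 if is_correct else 0.0), total_cost
    raise ValueError("attempts must not be empty")


def average_quality(dataset, judge_error, seed=0):
    """Mean quality of the cascade over a dataset at a given judge error rate."""
    rng = random.Random(seed)
    results = [run_cascade(a, judge_error, rng) for a in dataset]
    return sum(q for q, _ in results) / len(results)


# Hypothetical toy dataset: two examples, two models each.
toy_dataset = [
    [(True, 0.01), (True, 0.10)],
    [(False, 0.01), (True, 0.10)],
]
print(average_quality(toy_dataset, judge_error=0.0))  # 1.0
print(average_quality(toy_dataset, judge_error=1.0))  # 0.5
```

Sweeping `judge_error` between 0 and 1 on your own recorded outcomes gives a rough picture of how much judge accuracy a cascade needs before it pays off.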

Who Should Care

What To Try In 7 Days

Run ROUTERBENCH on your model fleet to map cost vs. quality per task.

Train a simple KNN router on labeled per-model outcomes and test on a held-out split.

Prototype a cascading flow with a lightweight judge and measure judge error impact.
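As a starting point for the KNN step, the sketch below routes a query by averaging each model's recorded quality over the query's nearest training neighbours, then trading predicted quality against cost. It is a dependency-free illustration under stated assumptions: the embeddings are plain tuples, and the scoring rule `willingness_to_pay * predicted_quality - cost`, the parameter names, and the toy data are choices made for this example, not a confirmed reproduction of the paper's router.

```python
import math


def knn_route(query, train, model_costs, k=3, willingness_to_pay=1.0):
    """Pick a model for `query` from its k nearest labeled neighbours.

    train: list of (embedding, {model: quality}) pairs.
    model_costs: {model: expected cost per request}.
    """
    neighbours = sorted(train, key=lambda row: math.dist(query, row[0]))[:k]

    def score(model):
        predicted = sum(row[1][model] for row in neighbours) / len(neighbours)
        return willingness_to_pay * predicted - model_costs[model]

    return max(model_costs, key=score)


# Hypothetical toy data: 2-D "embeddings" with per-model correctness labels.
toy_train = [
    ((0.0, 0.0), {"cheap": 1.0, "big": 1.0}),
    ((0.1, 0.0), {"cheap": 1.0, "big": 1.0}),
    ((1.0, 1.0), {"cheap": 0.0, "big": 1.0}),
    ((0.9, 1.0), {"cheap": 0.0, "big": 1.0}),
]
toy_costs = {"cheap": 0.01, "big": 0.10}

print(knn_route((0.05, 0.0), toy_train, toy_costs, k=2))  # cheap
print(knn_route((1.0, 0.9), toy_train, toy_costs, k=2))   # big
```

In practice the embeddings would come from a sentence encoder and the neighbours from a held-out training split; sweeping `willingness_to_pay` traces out the router's own curve on the cost-quality plane.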

Optimization Features

Model Optimization

  • Model routing (per-input model choice)
  • Model cascades (try cheap -> expensive)

System Optimization

  • Exploiting cheaper models to lower total inference dollars

Training Optimization

  • Supervised performance predictors (KNN, MLP trained on embeddings)

Inference Optimization

  • Zero-router convex hull interpolation to pick probabilistic mixes
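The interpolation idea can be sketched as follows: for a target average cost that falls between two neighbouring hull points, send a fraction of traffic to each so that the expected cost hits the target. This is a hedged illustration; the tuple layout `(cost, quality, model_name)`, the function name, and the toy hull are assumptions for this example.

```python
def zero_router_mix(hull, target_cost):
    """Return {model: probability} whose expected cost matches target_cost.

    hull: list of (cost, quality, model_name) sorted by increasing cost.
    Targets outside the hull's cost range are clamped to the nearest endpoint.
    """
    if not hull:
        raise ValueError("hull must not be empty")
    if target_cost <= hull[0][0]:
        return {hull[0][2]: 1.0}
    if target_cost >= hull[-1][0]:
        return {hull[-1][2]: 1.0}
    for (c1, _, m1), (c2, _, m2) in zip(hull, hull[1:]):
        if c1 <= target_cost <= c2:
            p = (target_cost - c1) / (c2 - c1)  # share sent to the pricier model
            return {m1: 1.0 - p, m2: p}
    raise ValueError("hull must be sorted by cost")


# Hypothetical two-point hull.
toy_hull = [(1.0, 0.5, "small"), (3.0, 0.9, "large")]
mix = zero_router_mix(toy_hull, target_cost=2.0)
print(mix)  # {'small': 0.5, 'large': 0.5}
```

Because the mix ignores the input entirely, its expected quality is just the linear interpolation between the two hull points (0.7 in this toy case), which is why it serves as a baseline that input-aware routers should beat.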

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Focuses only on performance and dollar cost; omits latency and throughput.
  • Not all LLMs and task types are covered; the benchmark will need updates to stay current.
  • RAG evaluation limited to models with built-in retrieval; two-stage retriever+LLM routing is not fully explored.

When Not To Use

  • When latency or throughput constraints are primary optimization targets.
  • When you need routers evaluated on a much larger or different model pool than provided.
  • When you rely on an imperfect judge and cannot measure judge error.

Failure Modes

  • Cascading routers degrade sharply once the quality judge's error rate exceeds roughly 0.2.
  • Predictive routers may not generalize across tasks and can underperform on some benchmarks.
  • Some overly aligned models refuse to answer, which skews routing statistics.

Core Entities

Models

  • Llama-70B-chat
  • Mixtral-8x7B-chat
  • Yi-34B-chat
  • Code Llama-34B
  • Mistral-7B-chat
  • WizardLM-13B
  • GPT-4
  • GPT-3.5-turbo
  • Claude-instant-v1
  • Claude-v1
  • Claude-v2
  • You.com API
  • sonar-small-online
  • sonar-medium-online

Metrics

  • Accuracy
  • AIQ (Average Improvement in Quality)
  • Cost (USD per request)
  • Non-decreasing convex hull (NDCH)

Datasets

  • HellaSwag
  • Winogrande
  • ARC Challenge
  • MMLU
  • MT-Bench
  • GSM8K
  • MBPP
  • RAG (800 client queries)

Benchmarks

  • ROUTERBENCH (this work)