A large, open benchmark (400K+ instances) that re-evaluates LLM routing and finds many routers match each other while leaving a big gap to a

January 12, 2026 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.4

Cost Impact Score

0.7

Citation Count

0

Authors

Hao Li, Yiqun Zhang, Zhaoyan Guo, Chenxu Wang, Shengji Tang, Qiaosheng Zhang, Yang Chen, Biqing Qi, Peng Ye, Lei Bai, Zhen Wang, Shuyue Hu

Why It Matters For Business

Routing can improve accuracy or cut API cost, but many published routers give similar gains; practical wins come from curated model pools and simple, cheap routers. Always test routers against your Best Single baseline and measure cost/latency together.

Summary TLDR

LLMRouterBench is an open benchmark and framework for selecting which LLM to run per user query. It aggregates 21 datasets, 33 models, and 391.6K instances (~1.8B tokens) and evaluates 10 routing methods under unified metrics for accuracy, cost, and Pareto tradeoffs. Key takeaways: models are complementary; many routing methods give similar practical performance; commercial routers can fail to beat a single best model; a large gap remains to an Oracle, mainly from model-recall failures. Code and data are public on GitHub.

Problem Statement

Researchers and practitioners lack a large, unified, and reproducible benchmark for comparing methods that route queries across multiple LLMs while trading off accuracy, cost, and latency. Existing work uses varying model pools and datasets, which makes apples-to-apples comparisons and deployment analysis hard.

Main Contribution

A large unified benchmark and framework for LLM routing: 21 datasets, 33 models, 391,645 instances (~1.8B tokens).

Standardized metrics and evaluation code that support both performance-oriented and performance-cost routing, plus adapters to run 10 representative routers.

A systematic re-evaluation showing model complementarity, limited differentiation among many routing methods, a large gap to the Oracle, and practical findings on embeddings, ensemble size, and latency.

Key Findings

Models show clear complementarity: no single model dominates all tasks.

Numbers: Table 11 (dataset-level bests); many datasets are led by different models

Many modern routers produce nearly indistinguishable accuracy under unified evaluation.

Numbers: top routers all reach AvgAcc ≈ 70–72%

The commercial router OpenRouter underperformed the Best Single model by a large margin.

Numbers: OpenRouter performance = −24.7% vs Best Single on the evaluated pool

Large gap to the Oracle is driven mainly by model-recall failures.

Numbers: Oracle AvgAcc ≈ 91.6% vs top routers ≈ 71.2%, a gap of ≈ 20.4 pp

Embedding backbone choice has limited impact on router outcomes for current methods.

Numbers: Embed tests: GraphRouter 70.29 vs nli-bert 69.6 vs MiniLM 68.05

Adding more models yields diminishing returns; careful curation of a moderate set helps more.

Numbers: Oracle growth flattens as pool size increases; best-k selection outperforms random larger pools

Avengers-Pro attains near-Pareto-optimal tradeoffs between accuracy and cost.

Numbers: Avengers-Pro ParetoDist ≈ 0; dominates the frontier in Fig. 8
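
The Best Single, Oracle, and gap figures quoted in these findings can be reproduced from a per-query correctness table. The following is a minimal sketch, not the benchmark's own evaluation code: the model names, the toy correctness values, and the `routed` decisions are all hypothetical, and the gap is taken here as a plain percentage-point difference between Oracle and router accuracy.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data: correct[model][i] is 1 if the model answered query i correctly.
correct: Dict[str, List[int]] = {
    "model_a": [1, 1, 0, 0, 1, 0],
    "model_b": [0, 1, 1, 0, 1, 0],
    "model_c": [0, 0, 0, 1, 1, 0],
}
# Hypothetical router output: the model chosen for each query.
routed: List[str] = ["model_a", "model_a", "model_b", "model_a", "model_c", "model_b"]


def accuracy(hits: List[int]) -> float:
    """Accuracy in percent over a list of 0/1 outcomes."""
    return 100.0 * sum(hits) / len(hits)


def best_single(table: Dict[str, List[int]]) -> Tuple[str, float]:
    """The one model with the highest accuracy when used for every query."""
    name = max(table, key=lambda m: sum(table[m]))
    return name, accuracy(table[name])


def oracle_accuracy(table: Dict[str, List[int]]) -> float:
    """A query counts as solved if at least one model in the pool solves it."""
    n = len(next(iter(table.values())))
    return accuracy([int(any(table[m][i] for m in table)) for i in range(n)])


def router_accuracy(table: Dict[str, List[int]], choices: List[str]) -> float:
    """Accuracy obtained when each query is answered by the routed model."""
    return accuracy([table[m][i] for i, m in enumerate(choices)])


best_name, best_acc = best_single(correct)
oracle_acc = oracle_accuracy(correct)
router_acc = router_accuracy(correct, routed)
gap_to_oracle = oracle_acc - router_acc  # percentage points

print(f"Best Single: {best_name} at {best_acc:.1f}%")
print(f"Router: {router_acc:.1f}%  Oracle: {oracle_acc:.1f}%  Gap: {gap_to_oracle:.1f} pp")
```

In this toy pool the router beats the Best Single model but still misses queries that only one specialist answers, which is the same pattern the findings describe at scale.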

Results

Total instances

Value: 391,645

Total tokens

Value: ~1.8B tokens

Data collection cost

Value: $2,771.84 (API) + ~1K GPU hours

Accuracy gap to Oracle

Value: ≈ 20.4 percentage points

Baseline: Oracle AvgAcc ≈ 91.64%

Worst commercial router vs Best Single

Value: OpenRouter −24.7% (relative performance)

Baseline: Best Single model

Max reported PerfGain and CostSave

Value: PerfGain up to +4%; CostSave up to 31.7%

Baseline: Best Single model

Embedding sensitivity

Value: Small changes (≈1–2 pts) across backbones

Baseline: gte-qwen2-7B-instruct

Avengers-Pro Pareto performance

Value: Near Pareto-optimal (ParetoDist ≈ 0)

Baseline: other routing methods / single models

Hard-query failure (≤3 correct models)

Value: 410 queries (11.9% of the test set); Avengers 24.6%, EmbedLLM 23.2%

Baseline: Oracle or ideal selection

Who Should Care

What To Try In 7 Days

Run Best Single baseline on your workload to set a benchmark.

Use a cheap embedding+clustering router (Avengers variant) on a curated subset of 5–10 models and compare cost/accuracy.

Measure per-model latency and cost; add latency to the decision metric if response time matters.
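
As a starting point for the embedding+clustering step, the sketch below clusters training-query embeddings with a small k-means and routes each new query to the model that scored best in its nearest cluster. It is a simplified illustration of the general idea, not the Avengers implementation: the 2-D "embeddings", the model names, and every function name are hypothetical, and a real setup would use embeddings from an actual encoder.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def nearest(centroids: List[Vector], x: Vector) -> int:
    """Index of the centroid closest to x (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(centroids[c], x))


def kmeans(points: List[Vector], k: int, iters: int = 10) -> List[Vector]:
    """Plain k-means with a deterministic init (first k points)."""
    centroids: List[Vector] = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centroids, p)].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def fit_router(
    embeddings: List[Vector], correct: Dict[str, List[int]], k: int
) -> Tuple[List[Vector], Dict[int, str]]:
    """Cluster training queries, then record the most accurate model per cluster."""
    centroids = kmeans(embeddings, k)
    assignment = [nearest(centroids, e) for e in embeddings]
    best: Dict[int, str] = {}
    for c in range(k):
        idx = [i for i, a in enumerate(assignment) if a == c]
        if idx:
            best[c] = max(correct, key=lambda m: sum(correct[m][i] for i in idx))
    return centroids, best


def route(centroids: List[Vector], best: Dict[int, str], fallback: str, x: Vector) -> str:
    """Pick the per-cluster best model, or the fallback (e.g. Best Single)."""
    return best.get(nearest(centroids, x), fallback)


# Hypothetical training set: two well-separated query groups.
train_embeddings = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.2), (5.1, 4.9), (0.2, 0.1), (4.9, 5.2)]
train_correct = {
    "math_model": [1, 0, 1, 0, 1, 0],
    "code_model": [0, 1, 0, 1, 0, 1],
}
centroids, best = fit_router(train_embeddings, train_correct, k=2)
choice_near_origin = route(centroids, best, "math_model", (0.3, 0.0))
choice_far = route(centroids, best, "math_model", (4.8, 5.0))
```

Passing the Best Single model as `fallback` keeps the router from doing worse than the baseline on queries that land in a cluster with no training data.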

Optimization Features

Token Efficiency

  • Latency proxies via tokens-per-second
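
A tokens-per-second proxy of this kind can be sketched as output tokens divided by throughput. The numbers and model names below are hypothetical, and the formula ignores queueing, network time, and time to first token, so it is only a rough estimate.

```python
from typing import Dict


def estimate_latency_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Approximate generation time as output tokens divided by throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Hypothetical provider throughput (tokens per second); measure your own.
throughput: Dict[str, float] = {"fast_model": 100.0, "slow_model": 25.0}
# Hypothetical average output length per model on the same workload.
avg_output_tokens: Dict[str, int] = {"fast_model": 400, "slow_model": 500}

latency = {
    name: estimate_latency_seconds(avg_output_tokens[name], throughput[name])
    for name in throughput
}
for name, seconds in latency.items():
    print(f"{name}: ~{seconds:.1f} s per response")
```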

System Optimization

  • Pareto analysis for cost-performance tradeoffs
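
To make the Pareto idea concrete, the sketch below finds which methods are not dominated on (cost, accuracy): a method is dominated if another one is at least as cheap and at least as accurate, and strictly better on one of the two. The method names and values are hypothetical, and the benchmark's ParetoDist metric is not reproduced here because its exact definition is not given in this summary.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (cost, accuracy in percent)

# Hypothetical results for illustration only.
results: Dict[str, Point] = {
    "router_x": (1.0, 70.0),
    "router_y": (2.0, 72.0),
    "router_z": (2.5, 69.0),
    "single_model": (3.0, 71.0),
}


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    return a[0] <= b[0] and a[1] >= b[1] and (a[0] < b[0] or a[1] > b[1])


def pareto_frontier(points: Dict[str, Point]) -> List[str]:
    """Names of all methods that no other method dominates, sorted by name."""
    return sorted(
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )


frontier = pareto_frontier(results)
print("Pareto-optimal methods:", frontier)
```

Anything off the frontier can be replaced by a frontier method that costs no more and is at least as accurate.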

Inference Optimization

  • Model Routing
  • Model Cascades
  • Cost-Aware Selection

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Not exhaustive: covers 10 open-source/flagship routers but not every published method.
  • Dataset scope excludes domain-specific, very long-context, and multimodal tasks.
  • Latency estimates are approximate (based on token counts + provider throughput).

When Not To Use

  • If your application requires multimodal or very long-context routing (not covered).
  • If your model pool or cost profile differs greatly from the evaluated providers.
  • If you need production-grade latency numbers: benchmark in your serving environment.

Failure Modes

  • Model-recall failures: routers often miss the rare specialist that alone answers correctly.
  • Judge bias: some datasets use LLM-based judging which can introduce bias.
  • Provider mismatch: third-party routers (OpenRouter) may use different pools and underperform on a custom pool.
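
Model-recall failures can be checked on your own logs by isolating queries that only a few models solve and measuring how often the router picks one of them. The sketch below uses hypothetical data and names. It assumes "hard" means between 1 and `max_correct` correct models, excluding queries no model solves; the summary's "≤3 correct models" does not say whether zero is included.

```python
from typing import Dict, List

# Hypothetical correctness table: correct[model][i] is 1 if the model solved query i.
correct: Dict[str, List[int]] = {
    "m1": [1, 1, 1, 0, 1, 0],
    "m2": [1, 1, 0, 0, 1, 0],
    "m3": [1, 0, 0, 0, 1, 0],
    "m4": [1, 0, 0, 1, 1, 0],
    "m5": [1, 0, 0, 0, 0, 0],
}
# Hypothetical router decisions, one model name per query.
routed: List[str] = ["m1", "m3", "m1", "m2", "m1", "m1"]


def hard_query_indices(table: Dict[str, List[int]], max_correct: int = 3) -> List[int]:
    """Queries solved by at least one but at most max_correct models (assumed definition)."""
    n = len(next(iter(table.values())))
    return [i for i in range(n) if 1 <= sum(table[m][i] for m in table) <= max_correct]


def accuracy_on(table: Dict[str, List[int]], choices: List[str], idx: List[int]) -> float:
    """Router accuracy in percent, restricted to the given query indices."""
    return 100.0 * sum(table[choices[i]][i] for i in idx) / len(idx)


hard = hard_query_indices(correct)
hard_acc = accuracy_on(correct, routed, hard)
print(f"{len(hard)} hard queries; router accuracy on them: {hard_acc:.1f}%")
```

A low score on this subset next to a high overall accuracy suggests the router is defaulting to generally strong models and missing the specialists.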

Core Entities

Models

  • GPT-5
  • GPT-5-Chat
  • Gemini-Flash
  • Gemini-Pro
  • Claude-v4
  • GLM-4.6
  • Qwen3-235B
  • Qwen3-Thinking
  • DeepSeek-R1
  • DeepSeek-V3
  • DS-V3.1-Tms
  • Intern-S1
  • Qwen3-8B
  • Qwen-Coder
  • Intern-S1-mini
  • MiniCPM
  • NVIDIA-Nemo
  • Cogito-v1
  • Gemma-2-it
  • Llama-3.1-it
  • DH-Llama3-it
  • Fin-R1
  • GLM-Z1
  • OpenThinker
  • MiMo-RL
  • Granite-3.3-it
  • Internlm3-it
  • Kimi-K2
  • DeepHermes-3

Metrics

  • AvgAcc
  • Gain@R
  • Gain@B
  • Gap@O
  • PerfGain
  • CostSave
  • ParetoDist
  • Inference cost ($/1M tokens)
  • Tokens/latency proxies

Datasets

  • AIME
  • MATH500
  • MathBench
  • MBPP
  • HumanEval
  • LiveCodeBench
  • KORBench
  • KnightsAndKnaves
  • BBH
  • MMLU-Pro
  • GPQA
  • FinQA
  • MedQA
  • EmoryNLP
  • MELD
  • LiveMathBench
  • SWE-Bench
  • HLE
  • SimpleQA
  • ArenaHard
  • τ2-Bench

Benchmarks

  • LLMRouterBench
  • RouterBench
  • EmbedLLM
  • RouterEval
  • FusionFactory
  • RouterArena