A large, open benchmark (400K+ instances) that re-evaluates LLM routing and finds many routers match each other while leaving a big gap to a

January 12, 2026 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The benchmark is large and reproducible, enabling direct comparisons and cost/latency analyses; evidence is broad but confined to the included 21 datasets and 33 models.

Citations: 0

Evidence Strength: 0.90

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 6/9

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 40%

Authors

Hao Li, Yiqun Zhang, Zhaoyan Guo, Chenxu Wang, Shengji Tang, Qiaosheng Zhang, Yang Chen, Biqing Qi, Peng Ye, Lei Bai, Zhen Wang, Shuyue Hu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Routing can improve accuracy or cut API cost, but many published routers give similar gains; practical wins come from curated model pools and simple, cheap routers. Always test routers against your Best Single baseline and measure cost/latency together.

Who Should Care

Summary TLDR

LLMRouterBench is an open benchmark and framework for selecting which LLM to run per user query. It aggregates 21 datasets, 33 models, and 391.6K instances (~1.8B tokens) and evaluates 10 routing methods under unified metrics for accuracy, cost, and Pareto tradeoffs. Key takeaways: models are complementary; many routing methods give similar practical performance; commercial routers can fail to beat a single best model; a large gap remains to an Oracle, mainly from model-recall failures. Code and data are public on GitHub.

Problem Statement

Researchers and practitioners lack a large, unified, and reproducible benchmark for comparing methods that route queries across multiple LLMs while trading off accuracy, cost, and latency. Existing work uses varying model pools and datasets, which makes apples-to-apples comparisons and deployment analysis hard.

Main Contribution

A large unified benchmark and framework for LLM routing: 21 datasets, 33 models, 391,645 instances (~1.8B tokens).

Standardized metrics and evaluation code that support both performance-oriented and performance-cost routing, plus adapters to run 10 representative routers.

Key Findings

Models show clear complementarity: no single model dominates all tasks.

Numbers: Table 11 (dataset-level bests); many datasets are led by different models

Practical Use: Build small curated ensembles of specialists rather than relying on one model; routing can exploit those differences to improve accuracy.

Evidence Ref: Table 11, Fig. 3
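To make the complementarity claim concrete, the sketch below compares a Best Single baseline with an Oracle that picks a correct model whenever one exists. The model names and correctness flags are hypothetical and are not taken from the benchmark; the gap between the two numbers is the headroom a router could in principle capture.

```python
# Hypothetical correctness records: results[model][i] is True when that model
# answered instance i correctly. Names and values are illustrative only.
results = {
    "model_a": [True, True, False, False, True, True],
    "model_b": [False, True, True, False, False, True],
    "model_c": [True, False, False, True, False, False],
}


def accuracy(flags):
    """Fraction of instances marked correct."""
    return sum(flags) / len(flags)


def best_single(results):
    """Return (name, accuracy) of the model with the highest overall accuracy."""
    name = max(results, key=lambda m: accuracy(results[m]))
    return name, accuracy(results[name])


def oracle_accuracy(results):
    """Accuracy of an ideal router: an instance counts if any model solves it."""
    solved = [any(row) for row in zip(*results.values())]
    return accuracy(solved)


best_name, best_acc = best_single(results)
oracle_acc = oracle_accuracy(results)
print(f"Best Single: {best_name} at {best_acc:.2f}; Oracle: {oracle_acc:.2f}")
```

In this toy pool no model solves every instance, yet every instance is solved by at least one model, which is the pattern that makes routing worthwhile.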

Many modern routers produce nearly indistinguishable accuracy under unified evaluation.

Numbers: Top routers reach AvgAcc ≈ 70–72%, nearly level with each other

Practical Use: Start with simple embedding- or clustering-based routers (cheap to run). Heavy router training often gives little extra accuracy on standard benchmarks.

Evidence Ref: Fig. 4, Table 11
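A minimal sketch of what a cheap clustering-based router could look like, assuming query embeddings and cluster centroids are already available. The two-dimensional embeddings, the centroids, and the model names are hypothetical, and this is not the implementation of any router evaluated in the benchmark: it simply sends each query to the model that was most often correct in the nearest cluster.

```python
import math

# Hypothetical training data: (query embedding, {model: answered correctly?}).
train = [
    ((0.0, 0.1), {"small": True, "large": True}),
    ((0.2, 0.0), {"small": True, "large": False}),
    ((5.0, 5.1), {"small": False, "large": True}),
    ((5.2, 4.9), {"small": False, "large": True}),
]
# Assumed to come from a prior clustering step such as k-means.
centroids = [(0.0, 0.0), (5.0, 5.0)]


def nearest(vec, centroids):
    """Index of the centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(vec, centroids[k]))


def fit_router(train, centroids):
    """For each cluster, pick the model with the most correct training answers.

    Assumes every cluster receives at least one training point.
    """
    wins = [dict() for _ in centroids]
    for vec, outcome in train:
        k = nearest(vec, centroids)
        for model, ok in outcome.items():
            wins[k][model] = wins[k].get(model, 0) + int(ok)
    return [max(w, key=w.get) for w in wins]


def route(vec, centroids, table):
    """Route a new query embedding to the model chosen for its nearest cluster."""
    return table[nearest(vec, centroids)]


table = fit_router(train, centroids)
print(route((0.1, 0.1), centroids, table), route((4.8, 5.3), centroids, table))
```

The only per-query cost is one embedding call plus a nearest-centroid lookup, which is why this family of routers is inexpensive to operate.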

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Total instances | 391,645 | - | - | all | Combined performance-oriented and performance-cost pools | §3.5, Table 2
Total tokens | ~1.8B tokens | - | - | all | Sum of both settings | §3.5, Table 2

What To Try In 7 Days

Run the Best Single baseline on your workload to establish a reference point.

Use a cheap embedding+clustering router (Avengers variant) on a curated subset of 5–10 models and compare cost/accuracy.

Measure per-model latency and cost; add latency to the decision metric if response time matters.
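One way to compare a router against the Best Single baseline on accuracy, cost, and latency together is a single weighted score. The sketch below is an assumption for illustration: the weights, the measurements, and the scoring formula are hypothetical and are not metrics defined by the benchmark. Cost is derived from token counts and a price in $/1M tokens.

```python
def total_cost_usd(usage):
    """usage: list of (tokens, usd_per_1m_tokens) pairs, one per model used."""
    return sum(tokens / 1_000_000 * price for tokens, price in usage)


def score(candidate, cost_weight, latency_weight):
    """Hypothetical decision metric: accuracy minus weighted cost and latency."""
    return (
        candidate["accuracy"]
        - cost_weight * total_cost_usd(candidate["usage"])
        - latency_weight * candidate["latency_s"]
    )


# Illustrative measurements from a pilot workload (made-up numbers).
best_single = {
    "accuracy": 0.70,
    "usage": [(2_000_000, 10.0)],  # every query goes to one expensive model
    "latency_s": 4.0,
}
router = {
    "accuracy": 0.71,
    "usage": [(500_000, 10.0), (1_500_000, 1.0)],  # most traffic on a cheap model
    "latency_s": 2.5,
}

# The weights express how much accuracy you would give up per dollar or second.
weights = {"cost_weight": 0.01, "latency_weight": 0.02}
for name, candidate in (("best_single", best_single), ("router", router)):
    print(name, round(total_cost_usd(candidate["usage"]), 2), round(score(candidate, **weights), 3))
```

Setting `latency_weight` to zero recovers a pure cost/accuracy comparison for workloads where response time does not matter.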

Optimization Features

Token Efficiency: Latency proxies via tokens-per-second
System Optimization: Pareto analysis for cost-performance tradeoffs
Inference Optimization: Model Routing, Model Cascades, Cost-Aware Selection
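The Pareto analysis mentioned above can be sketched as a filter that keeps only the configurations no other configuration beats on both cost and accuracy. The configuration names and numbers below are hypothetical, and this shows only the frontier computation, not the benchmark's own Pareto-distance metric.

```python
def pareto_front(points):
    """Return names of non-dominated configurations, sorted by cost.

    points maps a name to (cost, accuracy). A configuration is dominated when
    another one is no more expensive and no less accurate, and strictly better
    on at least one of the two.
    """
    front = []
    for name, (cost, acc) in points.items():
        dominated = any(
            (c2 <= cost and a2 >= acc) and (c2 < cost or a2 > acc)
            for other, (c2, a2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front, key=lambda n: points[n][0])


# Illustrative (cost in USD, accuracy) pairs for four candidate setups.
configs = {
    "cheap": (1.0, 0.55),
    "mid": (3.0, 0.68),
    "wasteful": (8.0, 0.60),
    "pricey": (10.0, 0.70),
}
front = pareto_front(configs)
print(front)
```

Here `wasteful` drops out because `mid` is both cheaper and more accurate; the remaining three are genuine tradeoffs.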

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Not exhaustive: covers 10 open-source/flagship routers but not every published method.

Dataset scope excludes domain-specific, very long-context, and multimodal tasks.

When Not To Use

If your application requires multimodal or very long-context routing (not covered).

If your model pool or cost profile differs greatly from the evaluated providers.

Failure Modes

Model-recall failures: routers often miss the rare specialist that alone answers correctly.
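To check whether model-recall failures matter on your own workload, you can split router errors by how many models in the pool could have answered correctly. The sketch below uses hypothetical correctness flags and router picks; the category names are illustrative and do not come from the benchmark.

```python
from collections import Counter


def classify_outcomes(correct, picks):
    """Bucket each instance by what the router did relative to the pool.

    correct maps model name to a list of per-instance correctness flags;
    picks[i] is the model the router chose for instance i.
    """
    counts = Counter()
    for i, picked in enumerate(picks):
        solvers = [m for m, flags in correct.items() if flags[i]]
        if picked in solvers:
            counts["router_correct"] += 1
        elif not solvers:
            counts["unsolvable"] += 1  # no model in the pool gets it right
        elif len(solvers) == 1:
            counts["missed_sole_solver"] += 1  # the rare-specialist case
        else:
            counts["missed_other"] += 1
    return counts


# Illustrative data: three models, four instances.
correct = {
    "a": [True, False, False, True],
    "b": [True, True, False, False],
    "c": [False, False, False, True],
}
picks = ["a", "a", "b", "b"]
outcome = classify_outcomes(correct, picks)
print(dict(outcome))
```

A large `missed_sole_solver` share suggests the router is failing to recall specialists rather than choosing badly among several adequate models.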

Judge bias: some datasets use LLM-based judging which can introduce bias.

Core Entities

Models

GPT-5, GPT-5-Chat, Gemini-Flash, Gemini-Pro, Claude-v4, GLM-4.6, Qwen3-235B, Qwen3-Thinking, DeepSeek-R1, DeepSeek-V3, DS-V3.1-Tms, Intern-S1, Qwen3-8B, Qwen-Coder, Intern-S1-mini, MiniCPM, NVIDIA-Nemo, Cogito-v1, Gemma-2-it, Llama-3.1-it, DH-Llama3-it, Fin-R1, GLM-Z1, OpenThinker, MiMo-RL, Granite-3.3-it, Internlm3-it, Kimi-K2, DeepHermes-3, MiMo-RL (duplicate?)

Metrics

AvgAcc, Gain@R, Gain@B, Gap@O, PerfGain, CostSave, ParetoDist, Inference cost ($/1M tokens), Tokens/latency proxies

Datasets

AIME, MATH500, MathBench, MBPP, HumanEval, LiveCodeBench, KORBench, KnightsAndKnaves, BBH, MMLU-Pro, GPQA, FinQA, MedQA, EmoryNLP, MELD, LiveMathBench, SWE-Bench, HLE, SimpleQA, ArenaHard, τ2-Bench

Benchmarks

LLMRouterBench, RouterBench, EmbedLLM, RouterEval, FusionFactory, RouterArena