A large, open benchmark (400K+ instances) that re-evaluates LLM routing and finds many routers match each other while leaving a big gap to a

Overview

Decision Snapshot: Ready For Pilot

The benchmark is large and reproducible, enabling direct comparisons and cost/latency analyses; evidence is broad but confined to the included 21 datasets and 33 models.

Citations: 0

Evidence Strength: 0.90

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 6/9

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 40%

Authors

Hao Li, Yiqun Zhang, Zhaoyan Guo, Chenxu Wang, Shengji Tang, Qiaosheng Zhang, Yang Chen, Biqing Qi, Peng Ye, Lei Bai, Zhen Wang, Shuyue Hu

Why It Matters For Business

Routing can improve accuracy or cut API cost, but many published routers give similar gains; practical wins come from curated model pools and simple, cheap routers. Always test routers against your Best Single baseline and measure cost/latency together.

Who Should Care

Product Manager, ML Engineer, CTO, Founder, Engineering Lead

Summary TLDR

LLMRouterBench is an open benchmark and framework for selecting which LLM to run per user query. It aggregates 21 datasets, 33 models, and 391.6K instances (~1.8B tokens) and evaluates 10 routing methods under unified metrics for accuracy, cost, and Pareto tradeoffs. Key takeaways: models are complementary; many routing methods give similar practical performance; commercial routers can fail to beat a single best model; a large gap remains to an Oracle, mainly from model-recall failures. Code and data are public on GitHub.

Problem Statement

Researchers and practitioners lack a large, unified, and reproducible benchmark for comparing methods that route queries across multiple LLMs while trading off accuracy, cost, and latency. Existing work uses varying model pools and datasets, which makes apples-to-apples comparisons and deployment analysis hard.

Main Contribution

A large unified benchmark and framework for LLM routing: 21 datasets, 33 models, 391,645 instances (~1.8B tokens).

Standardized metrics and evaluation code that support both performance-oriented and performance-cost routing, plus adapters to run 10 representative routers.

Key Findings

Models show clear complementarity: no single model dominates all tasks.

Numbers: Table 11 (dataset-level bests); many datasets are led by different models

Practical Use: Build small curated ensembles of specialists rather than relying on one model; routing can exploit those differences to improve accuracy.

Evidence Ref: Table 11, Fig. 3
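To make the complementarity finding concrete, the sketch below compares a Best Single baseline with an Oracle that always picks a correct model when one exists. The correctness matrix, the model names, and the helper functions are hypothetical toy values for illustration, not data or code from the benchmark.

```python
# Hypothetical toy data: correct[model][i] is 1 if that model answered
# instance i correctly, else 0. These are not benchmark results.
correct = {
    "model_a": [1, 1, 0, 0, 1, 1],
    "model_b": [0, 1, 1, 0, 0, 1],
    "model_c": [0, 0, 0, 1, 1, 0],
}


def accuracy(flags):
    """Fraction of instances answered correctly."""
    return sum(flags) / len(flags)


def best_single(correct):
    """Return (name, accuracy) of the model with the highest overall accuracy."""
    name = max(correct, key=lambda m: accuracy(correct[m]))
    return name, accuracy(correct[name])


def oracle_accuracy(correct):
    """Accuracy of an ideal router: an instance counts as solved
    if at least one model in the pool answers it correctly."""
    solved = [max(flags) for flags in zip(*correct.values())]
    return sum(solved) / len(solved)


name, acc = best_single(correct)
print(f"Best Single: {name} at {acc:.2f}")
print(f"Oracle:      {oracle_accuracy(correct):.2f}")
```

In this toy pool no model exceeds 4 of 6 correct, yet every instance is solved by some model, which is the kind of headroom a router tries to capture.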

Many modern routers produce nearly indistinguishable accuracy under unified evaluation.

Numbers: the top routers all score AvgAcc ≈ 70–72%

Practical Use: Start with simple embedding- or clustering-based routers (cheap to run). Heavy router training often gives little extra accuracy on standard benchmarks.

Evidence Ref: Fig. 4, Table 11

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Total instances	391,645	—	—	all	Combined performance-oriented and performance-cost pools	§3.5, Table 2
Total tokens	~1.8B tokens	—	—	all	Sum of both settings	§3.5, Table 2

What To Try In 7 Days

Run the Best Single baseline on your workload to set a reference point.

Use a cheap embedding+clustering router (Avengers variant) on a curated subset of 5–10 models and compare cost/accuracy.

Measure per-model latency and cost; add latency to the decision metric if response time matters.
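The second step above can be sketched as a minimal embedding+clustering router. This is an illustrative sketch under stated assumptions, not the Avengers implementation: the 2-D "embeddings", the model names, and the functions `kmeans`, `fit_router`, and `route` are all hypothetical, and a real setup would use a sentence-embedding model and a library clustering routine.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))


def kmeans(points, k, iters=10):
    """Plain Lloyd's k-means with a deterministic init (the first k points)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(c) / len(group) for c in zip(*group)]
    return centroids


def fit_router(embeddings, correct, k):
    """Cluster training queries, then record the most accurate model per cluster."""
    centroids = kmeans(embeddings, k)
    assign = [nearest(e, centroids) for e in embeddings]
    table = {}
    for j in range(k):
        idx = [i for i, a in enumerate(assign) if a == j]
        if idx:
            table[j] = max(correct, key=lambda m: sum(correct[m][i] for i in idx))
    return centroids, table


def route(embedding, centroids, table, default):
    """Send a new query to the model that won its nearest cluster."""
    return table.get(nearest(embedding, centroids), default)


# Hypothetical toy data: two well-separated query types.
embeddings = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
correct = {
    "math_model": [1, 0, 1, 0, 1, 0],
    "code_model": [0, 1, 0, 1, 0, 1],
}
centroids, table = fit_router(embeddings, correct, k=2)
print(route((0.05, 0.05), centroids, table, default="math_model"))
print(route((5.0, 5.0), centroids, table, default="math_model"))
```

The router itself makes no LLM calls: routing a query costs one embedding lookup plus a nearest-centroid search, which is why this family of routers is cheap to run.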

Optimization Features

Token Efficiency

Latency proxies via tokens-per-second
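A tokens-per-second latency proxy can be sketched as follows. The formulas are a simple assumption (generation time only, ignoring network and queueing delay), and the throughput and price figures are hypothetical placeholders rather than values from the benchmark.

```python
def estimated_latency_seconds(output_tokens, tokens_per_second):
    """Approximate generation time from output length and decoding throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def request_cost_usd(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Cost of one request given prices quoted in $ per 1M tokens."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000


# Hypothetical example: 600 output tokens at 60 tokens/s,
# priced at $2 per 1M input tokens and $8 per 1M output tokens.
print(estimated_latency_seconds(600, 60.0))
print(request_cost_usd(1000, 500, 2.0, 8.0))
```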

System Optimization

Pareto analysis for cost-performance tradeoffs
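As a sketch of what a cost-performance Pareto analysis involves, the snippet below keeps only the configurations that no other configuration beats on both cost and accuracy. The configuration names and numbers are hypothetical, and `pareto_frontier` is an illustrative helper, not part of the benchmark code.

```python
def pareto_frontier(points):
    """Return names of configurations not dominated on (cost, accuracy).

    A point is dominated if another point is no more expensive and no less
    accurate, and strictly better on at least one of the two.
    """
    frontier = []
    for name, cost, acc in points:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for _, c, a in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical (name, cost in $ per 1K queries, accuracy) tuples.
points = [
    ("cheap_small", 0.2, 0.55),
    ("mid", 1.0, 0.68),
    ("router", 0.8, 0.70),
    ("flagship", 5.0, 0.72),
    ("overpriced", 6.0, 0.69),
]
print(pareto_frontier(points))
```

Here "mid" drops out because the router is both cheaper and more accurate, and "overpriced" drops out because the flagship dominates it.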

Inference Optimization

Model Routing, Model Cascades, Cost-Aware Selection

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/ynulihao/LLMRouterBench

Data URLs

https://github.com/ynulihao/LLMRouterBench

Risks & Boundaries

Limitations

Not exhaustive: covers 10 open-source/flagship routers but not every published method.

Dataset scope excludes domain-specific, very long-context, and multimodal tasks.

When Not To Use

If your application requires multimodal or very long-context routing (not covered).

If your model pool or cost profile differs greatly from the evaluated providers.

Failure Modes

Model-recall failures: routers often miss the rare specialist that alone answers correctly.

Judge bias: some datasets use LLM-based judging which can introduce bias.
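One way to look for model-recall failures in your own logs is to count instances where exactly one model in the pool was correct and the router chose a different one. The sketch below assumes you have per-model correctness flags and the router's choices; the data and the `recall_failures` helper are hypothetical, and this sole-solver definition is an assumption rather than the benchmark's own metric.

```python
def recall_failures(correct, choices):
    """List (instance, sole_solver, chosen) where the router missed
    the only model that answered the instance correctly."""
    failures = []
    for i, chosen in enumerate(choices):
        solvers = [m for m in correct if correct[m][i]]
        if len(solvers) == 1 and chosen != solvers[0]:
            failures.append((i, solvers[0], chosen))
    return failures


# Hypothetical toy data: the specialist alone solves instance 2.
correct = {
    "generalist": [1, 1, 0, 1],
    "specialist": [0, 1, 1, 0],
}
choices = ["generalist", "generalist", "generalist", "generalist"]
print(recall_failures(correct, choices))
```

A router that always picks the generalist looks reasonable on average here but never recalls the specialist on the one instance only it can solve.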

Core Entities

Models

GPT-5, GPT-5-Chat, Gemini-Flash, Gemini-Pro, Claude-v4, GLM-4.6, Qwen3-235B, Qwen3-Thinking, DeepSeek-R1, DeepSeek-V3, DS-V3.1-Tms, Intern-S1, Qwen3-8B, Qwen-Coder, Intern-S1-mini, MiniCPM, NVIDIA-Nemo, Cogito-v1, Gemma-2-it, Llama-3.1-it, DH-Llama3-it, Fin-R1, GLM-Z1, OpenThinker, MiMo-RL, Granite-3.3-it, Internlm3-it, Kimi-K2, DeepHermes-3, MiMo-RL (duplicate?)

Metrics

AvgAcc, Gain@R, Gain@B, Gap@O, PerfGain, CostSave, ParetoDist, Inference cost ($/1M tokens), Tokens/latency proxies

Datasets

AIME, MATH500, MathBench, MBPP, HumanEval, LiveCodeBench, KORBench, KnightsAndKnaves, BBH, MMLU-Pro, GPQA, FinQA, MedQA, EmoryNLP, MELD, LiveMathBench, SWE-Bench, HLE, SimpleQA, ArenaHard, τ2-Bench

Benchmarks

LLMRouterBench, RouterBench, EmbedLLM, RouterEval, FusionFactory, RouterArena
