Overview
Production Readiness
0.7
Novelty Score
0.4
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Routing can improve accuracy or cut API cost, but many published routers give similar gains; practical wins come from curated model pools and simple, cheap routers. Always test routers against your Best Single baseline and measure cost/latency together.
Summary TLDR
LLMRouterBench is an open benchmark and framework for selecting which LLM to run per user query. It aggregates 21 datasets, 33 models, and 391.6K instances (~1.8B tokens) and evaluates 10 routing methods under unified metrics for accuracy, cost, and Pareto tradeoffs. Key takeaways: models are complementary; many routing methods give similar practical performance; commercial routers can fail to beat a single best model; a large gap remains to an Oracle, mainly from model-recall failures. Code and data are public on GitHub.
Problem Statement
Researchers and practitioners lack a large, unified, reproducible benchmark for comparing methods that route queries across multiple LLMs while trading off accuracy, cost, and latency. Existing work uses varying model pools and datasets, which makes apples-to-apples comparison and deployment analysis hard.
Main Contribution
A large unified benchmark and framework for LLM routing: 21 datasets, 33 models, 391,645 instances (~1.8B tokens).
Standardized metrics and evaluation code that support both performance-oriented and performance-cost routing, plus adapters to run 10 representative routers.
A systematic re-evaluation showing model complementarity, limited differentiation among many routing methods, a large gap to the Oracle, and practical findings on embeddings, ensemble size, and latency.
Key Findings
Models show clear complementarity: no single model dominates all tasks.
Many modern routers produce nearly indistinguishable accuracy under unified evaluation.
The commercial router OpenRouter did not beat Best Single and fell short of it by a large margin.
The large gap to the Oracle is driven mainly by model-recall failures.
Embedding backbone choice has limited impact on router outcomes for current methods.
Adding more models yields diminishing returns; careful curation of a moderate set helps more.
Avengers-Pro attains near-Pareto-optimal tradeoffs between accuracy and cost.
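To make the Best Single and Oracle reference points concrete, the sketch below computes both from a toy correctness matrix and compares a router against them. The data, model names, and function names are hypothetical, and the benchmark's own metric definitions (Gain@B, Gap@O, and so on) are not reproduced here; this only illustrates the two baselines in their simplest form.

```python
# Hypothetical toy data: correct[q][m] is 1 if model m answers query q correctly.
MODELS = ["model_a", "model_b", "model_c"]
correct = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]


def best_single_accuracy(correct):
    """Accuracy of the single model with the highest average accuracy."""
    n_queries = len(correct)
    n_models = len(correct[0])
    per_model = [sum(row[m] for row in correct) / n_queries for m in range(n_models)]
    return max(per_model)


def oracle_accuracy(correct):
    """Accuracy of an ideal router that picks a correct model whenever one exists."""
    return sum(1 for row in correct if any(row)) / len(correct)


def router_accuracy(correct, choices):
    """Accuracy of a router whose pick for query q is model index choices[q]."""
    return sum(row[c] for row, c in zip(correct, choices)) / len(correct)


choices = [0, 0, 1, 0, 2]  # hypothetical router decisions, one per query
best_single = best_single_accuracy(correct)
oracle = oracle_accuracy(correct)
routed = router_accuracy(correct, choices)
print(f"best single={best_single:.2f} router={routed:.2f} oracle={oracle:.2f}")
```

In this toy case the router beats Best Single but still misses a query that only one model can answer, which is the kind of remaining gap to the Oracle that the findings describe.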
Results
Total instances
Total tokens
Data collection cost
Accuracy
Worst commercial router vs Best Single
Max reported PerfGain and CostSave
Embedding sensitivity
Avengers-Pro Pareto performance
Hard-query failure (≤3 correct models)
Who Should Care
What To Try In 7 Days
Run Best Single baseline on your workload to set a benchmark.
Use a cheap embedding+clustering router (Avengers variant) on a curated subset of 5–10 models and compare cost/accuracy.
Measure per-model latency and cost; add latency to the decision metric if response time matters.
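The embedding+clustering idea in the steps above can be sketched in a few lines. This is a minimal illustration and not the Avengers implementation: it assumes query embeddings and cluster centroids already exist (for example from any embedding model plus k-means), and all names and toy values are hypothetical.

```python
import math


def nearest(centroids, vec):
    """Index of the centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], vec))


def fit_cluster_router(centroids, train_vecs, train_correct):
    """For each cluster, pick the model with the most correct training answers."""
    n_models = len(train_correct[0])
    wins = [[0] * n_models for _ in centroids]
    for vec, row in zip(train_vecs, train_correct):
        cluster = nearest(centroids, vec)
        for m, ok in enumerate(row):
            wins[cluster][m] += ok
    return [max(range(n_models), key=lambda m: w[m]) for w in wins]


def route(centroids, cluster_to_model, vec):
    """Route a new query embedding to the model assigned to its nearest cluster."""
    return cluster_to_model[nearest(centroids, vec)]


# Hypothetical 2-D embeddings; real embeddings have hundreds of dimensions.
centroids = [(0.0, 0.0), (10.0, 10.0)]
train_vecs = [(0.1, 0.2), (0.3, -0.1), (9.8, 10.1), (10.2, 9.9)]
train_correct = [[1, 0], [1, 0], [0, 1], [1, 1]]  # columns: model 0, model 1

cluster_to_model = fit_cluster_router(centroids, train_vecs, train_correct)
print(route(centroids, cluster_to_model, (9.0, 9.0)))
```

A cost-aware variant could replace the per-cluster win count with a score that subtracts a weighted cost or latency term; the weighting is a design choice to tune on your own workload.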
Optimization Features
Token Efficiency
- Latency proxies via tokens-per-second
System Optimization
- Pareto analysis for cost-performance tradeoffs
Inference Optimization
- Model Routing
- Model Cascades
- Cost-Aware Selection
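The Pareto analysis listed above can be illustrated with a small frontier computation over (cost, accuracy) pairs. The configurations and numbers below are made up for illustration, and the benchmark's ParetoDist metric is not implemented here.

```python
def pareto_frontier(points):
    """Return names of configurations not dominated on (cost, accuracy).

    One point dominates another if it is no more expensive and no less
    accurate, and strictly better on at least one of the two.
    """
    frontier = []
    for name, cost, acc in points:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for _, c, a in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical (name, cost in USD per 1K queries, accuracy) triples.
points = [
    ("cheap_model", 1.0, 0.60),
    ("mid_model", 3.0, 0.70),
    ("pricey_model", 10.0, 0.72),
    ("bad_deal", 5.0, 0.65),
    ("router", 2.5, 0.74),
]
frontier = pareto_frontier(points)
print(frontier)
```

A router earns its place only if it lands on this frontier, meaning no single model or other router is both cheaper and at least as accurate.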
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Not exhaustive: covers 10 open-source/flagship routers but not every published method.
- Dataset scope excludes domain-specific, very long-context, and multimodal tasks.
- Latency estimates are approximate: they are derived from token counts and provider throughput, not measured end to end.
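A token-based latency proxy of the kind described in the limitations can be sketched as output tokens divided by throughput. The optional time-to-first-token term and all numbers are assumptions added for illustration; the benchmark's exact estimation procedure may differ.

```python
def estimate_latency_seconds(output_tokens, tokens_per_second, ttft_seconds=0.0):
    """Rough latency proxy: optional time-to-first-token plus generation time.

    ttft_seconds is a hypothetical extra term; set it to 0.0 to use the
    plain tokens / throughput proxy.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_seconds + output_tokens / tokens_per_second


plain = estimate_latency_seconds(500, 50.0)
with_ttft = estimate_latency_seconds(500, 50.0, ttft_seconds=0.5)
print(plain, with_ttft)
```

Because provider throughput varies with load and region, numbers from such a proxy are best used for ranking models, not for capacity planning.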
When Not To Use
- If your application requires multimodal or very long-context routing (not covered).
- If your model pool or cost profile differs greatly from the evaluated providers.
- If you need production-grade latency numbers: benchmark in your serving environment.
Failure Modes
- Model-recall failures: routers often miss the rare specialist that alone answers correctly.
- Judge bias: some datasets use LLM-based judging which can introduce bias.
- Provider mismatch: third-party routers (OpenRouter) may use different pools and underperform on a custom pool.
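To check for model-recall failures on your own data, one option is to isolate hard queries (those that only a few models answer correctly) and measure how often the router picks one of those models. The sketch below uses the "at most 3 correct models" threshold mentioned in the results; excluding queries that no model solves is an assumption of this sketch, and the data is hypothetical.

```python
def hard_query_recall(correct, choices, max_correct=3):
    """Fraction of hard queries where the router picked a correct model.

    A query is treated as hard if between 1 and max_correct models answer it
    correctly. Queries that no model solves are skipped, since no router
    could get them right. Returns None if there are no hard queries.
    """
    hard = [q for q, row in enumerate(correct) if 0 < sum(row) <= max_correct]
    if not hard:
        return None
    hits = sum(correct[q][choices[q]] for q in hard)
    return hits / len(hard)


# Hypothetical correctness matrix: 5 queries x 5 models.
correct = [
    [1, 1, 1, 1, 1],  # easy: every model is correct
    [0, 0, 1, 0, 0],  # hard: one specialist
    [1, 0, 0, 0, 0],  # hard: one specialist
    [0, 1, 1, 1, 0],  # hard: three correct models
    [0, 0, 0, 0, 0],  # unsolvable: skipped
]
choices = [0, 2, 1, 0, 3]  # hypothetical router picks
recall = hard_query_recall(correct, choices)
print(recall)
```

A low value here alongside a high overall accuracy suggests the router is leaning on strong generalists and missing the specialists.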
Core Entities
Models
- GPT-5
- GPT-5-Chat
- Gemini-Flash
- Gemini-Pro
- Claude-v4
- GLM-4.6
- Qwen3-235B
- Qwen3-Thinking
- DeepSeek-R1
- DeepSeek-V3
- DS-V3.1-Tms
- Intern-S1
- Qwen3-8B
- Qwen-Coder
- Intern-S1-mini
- MiniCPM
- NVIDIA-Nemo
- Cogito-v1
- Gemma-2-it
- Llama-3.1-it
- DH-Llama3-it
- Fin-R1
- GLM-Z1
- OpenThinker
- MiMo-RL
- Granite-3.3-it
- Internlm3-it
- Kimi-K2
- DeepHermes-3
Metrics
- AvgAcc
- Gain@R
- Gain@B
- Gap@O
- PerfGain
- CostSave
- ParetoDist
- Inference cost ($/1M tokens)
- Tokens/latency proxies
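Inference cost quoted in $/1M tokens converts to a per-query cost as follows. The prices and token counts are hypothetical, and separate input and output prices are an assumption that matches how many providers bill.

```python
def query_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one query in USD, given prices in USD per 1M tokens."""
    return (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000


# Hypothetical query: 2,000 input tokens at $1/1M and 500 output tokens at $4/1M.
cost = query_cost_usd(2000, 500, 1.0, 4.0)
print(f"${cost:.4f} per query")
```

Summing this over a workload for each candidate model gives the cost axis needed for a cost-accuracy comparison.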
Datasets
- AIME
- MATH500
- MathBench
- MBPP
- HumanEval
- LiveCodeBench
- KORBench
- KnightsAndKnaves
- BBH
- MMLU-Pro
- GPQA
- FinQA
- MedQA
- EmoryNLP
- MELD
- LiveMathBench
- SWE-Bench
- HLE
- SimpleQA
- ArenaHard
- τ2-Bench
Benchmarks
- LLMRouterBench
- RouterBench
- EmbedLLM
- RouterEval
- FusionFactory
- RouterArena

