Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
8
Why It Matters For Business
You can often get similar or better task performance while spending less on inference by routing to smaller LLMs selected from past benchmark outputs, and you only need a few labeled examples to improve reliability.
Summary TLDR
The paper shows how to reuse per-example outputs and scores from existing benchmarks to train lightweight "correctness" predictors for many LLMs and then route (select) the best model for a new task. Predictors are simple kNN classifiers on sentence embeddings; three selection scores are proposed, including one that models predictor accuracy on the new task. On 29 HELM datasets and on MixInstruct, routing improves over picking the single best model-on-average and often picks smaller, cheaper models. The main limits are out-of-distribution gaps and the need for benchmark coverage or a small number of labeled task examples.
Problem Statement
There are many open LLMs and many benchmarks, but no single model wins on all tasks. Practitioners need a cheap way to pick the best model for a new task without running every candidate LLM on every input.
Main Contribution
Formalize LLM routing as learning per-model binary correctness predictors from benchmark by-products (per-sample performance).
Propose three practical routing scores (S1, S2, S3), including an OOD-aware score (S3) that models predictor accuracy on a new task, together with a simple way to estimate that accuracy.
Empirically show routing improves model choice on 29 HELM datasets and MixInstruct, often choosing smaller models and needing far fewer model calls at inference.
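To make the routing scores concrete, here is a minimal sketch of how S1 and an OOD-aware S3 might be computed. The exact formulas are not given in this summary, so the forms below are assumptions: S1 is taken as the fraction of task inputs the predictor marks as correct, and S3 reweights S1 by an estimate of the predictor's accuracy p(d,m). The function names and model names are hypothetical.

```python
from typing import Dict, List


def s1_score(predicted_correct: List[int]) -> float:
    """Assumed S1: fraction of task inputs the predictor expects the model to get right."""
    if not predicted_correct:
        raise ValueError("need at least one prediction")
    return sum(predicted_correct) / len(predicted_correct)


def s3_score(s1: float, predictor_accuracy: float) -> float:
    """Assumed S3: expected model accuracy if the predictor is right with
    probability `predictor_accuracy` (the p(d,m) of the text) on this task."""
    return s1 * predictor_accuracy + (1.0 - s1) * (1.0 - predictor_accuracy)


def route(predictions: Dict[str, List[int]], accuracy: Dict[str, float]) -> str:
    """Pick the model with the highest S3 score for the task."""
    scores = {
        model: s3_score(s1_score(preds), accuracy[model])
        for model, preds in predictions.items()
    }
    return max(scores, key=scores.get)


# Hypothetical predictor outputs (1 = predicted correct) on four task inputs.
predictions = {"small-model": [1, 1, 1, 0], "large-model": [1, 1, 1, 1]}
# Hypothetical estimates of predictor accuracy on the new task.
accuracy = {"small-model": 0.9, "large-model": 0.5}

print(route(predictions, accuracy))
```

In this toy case S1 alone would favour "large-model" (1.0 versus 0.75), but its predictor is no better than chance on the new task, so the S3 sketch discounts it and selects "small-model".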
Key Findings
OOD-aware score S3 improves selection over best-model-on-average (BMA) on HELM
Knowing true predictor accuracy gives a strong win
Correctness predictors are imperfect across held-out tasks
Small amounts of in-task labeling reduce OOD gap and help routing
Per-instance routing on MixInstruct is efficient and competitive
Routing often picks smaller models and reduces average model size
Results
Accuracy
MixInstruct per-instance BERTScore
Inference cost (model calls per instance)
Kernel smoother estimation error (p(d,m))
Who Should Care
What To Try In 7 Days
Collect the per-sample benchmark outputs and scores you already have, and compute sentence embeddings of the inputs with all-mpnet-base-v2.
Train per-model kNN correctness predictors (k=5–10) on those embeddings and implement S1 and S3 scoring.
If possible, label 10–50 examples from your task and measure predictor accuracy to improve S3 estimates.
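The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: it uses tiny hand-written 2-D vectors in place of real all-mpnet-base-v2 embeddings, cosine distance, majority voting, and hypothetical function names. It shows one model's kNN correctness predictor and how a handful of labeled task examples could be used to measure that predictor's accuracy.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Example = Tuple[Vector, int]  # (embedding, 1 if the model was correct else 0)


def cosine_distance(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def knn_predict(train: List[Example], query: Vector, k: int = 5) -> int:
    """Majority vote over the k nearest benchmark examples."""
    neighbours = sorted(train, key=lambda ex: cosine_distance(ex[0], query))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if 2 * votes > len(neighbours) else 0


def predictor_accuracy(train: List[Example], labeled_task: List[Example], k: int = 5) -> float:
    """Share of labeled task examples on which the predictor agrees with the label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in labeled_task)
    return hits / len(labeled_task)


# Toy benchmark by-products for one model (stand-ins for real embeddings).
benchmark = [
    ([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.8, 0.2], 1),
    ([0.0, 1.0], 0), ([0.1, 0.9], 0), ([0.2, 0.8], 0),
]
# A few labeled examples from the new task.
task_labels = [([1.0, 0.05], 1), ([0.05, 1.0], 0), ([0.9, 0.2], 0), ([0.1, 0.8], 0)]

print(knn_predict(benchmark, [1.0, 0.1], k=3))
print(predictor_accuracy(benchmark, task_labels, k=3))
```

The measured accuracy can then stand in for p(d,m) when computing S3. With only 10–50 labels the estimate will be noisy, so it is best treated as a rough correction rather than a precise value.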
Optimization Features
Inference Optimization
- Model Routing
- Model Cascades
- Cost-aware model selection
Reproducibility
Data Urls
- HELM (Liang et al., 2022)
- MixInstruct (Jiang et al., 2023)
- MMLU subset
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Quality depends on benchmark coverage: routing fails if the new task is far from benchmark datasets.
- Correctness predictors are noisy out of distribution (average accuracy ≈ 0.59 on held-out tasks), so model selection can be unstable without adaptation or labels.
- Paper uses simple kNN and a 1-D task-distance estimator; more advanced classifiers or descriptors may be required in harder settings.
- The approach assumes available per-sample benchmark outputs and consistent evaluation metrics across datasets.
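As an illustration of the 1-D task-distance estimator mentioned above, the following is a minimal sketch of a kernel smoother for p(d,m). The Gaussian kernel, the bandwidth value, the chance-level fallback, and the function names are all assumptions made for this example. The summary does not specify how the task distance is computed, so it is treated here as a given number.

```python
import math
from typing import List, Tuple


def gaussian_kernel(z: float) -> float:
    return math.exp(-0.5 * z * z)


def smooth_accuracy(
    observations: List[Tuple[float, float]],
    distance: float,
    bandwidth: float = 0.1,
) -> float:
    """Nadaraya-Watson estimate of predictor accuracy at a given task distance.

    `observations` holds (task_distance, observed_predictor_accuracy) pairs,
    e.g. gathered by holding out each benchmark dataset in turn.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    weights = [gaussian_kernel((distance - d) / bandwidth) for d, _ in observations]
    total = sum(weights)
    if total == 0.0:
        # Assumed fallback: no nearby evidence, so return chance level.
        return 0.5
    return sum(w * acc for w, (_, acc) in zip(weights, observations)) / total


# Hypothetical held-out observations: accuracy drops as tasks get farther away.
observations = [(0.1, 0.9), (0.5, 0.6), (0.9, 0.5)]

print(smooth_accuracy(observations, 0.1))
print(smooth_accuracy(observations, 0.9))
```

A smaller bandwidth trusts only the closest benchmark datasets, while a larger one averages more broadly. Either way, a new task far from every observation gets little real evidence, which is the coverage limitation listed above.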
When Not To Use
- When no benchmark data is remotely similar to your task.
- When you cannot compute embedding distances (e.g., non-text modalities without matching embeddings).
- When per-input generation quality matters beyond a binary correct/incorrect signal.
Failure Modes
- If the predictor accuracy p(d,m) is mis-estimated, the router picks an underperforming model for the task.
- Benchmark metric mismatch causes predictors to learn the wrong notion of correctness.
- Sparse benchmark coverage leads to overconfident but wrong router recommendations.
Core Entities
Models
- codegen-16b-mono
- dial-flant5-xl
- falcon-40b
- flan-t5-xl
- flan-t5-xxl
- flan-ul2
- gpt-jt-6b-v1
- gpt-neox-20b
- mpt-7b-instruct
- mt0-xxl
- llama-2-13b
- llama-2-13b-chat
- llama-2-13b-chat-beam
- llama-2-70b
- llama-2-70b-chat
- llama-2-7b
- llama-2-7b-chat
- starcoder
Metrics
- Accuracy
- BERTScore
- BARTScore
- BLEURT
- log-likelihood
Datasets
- HELM (29 selected datasets)
- MixInstruct
- MMLU (subset used within HELM)
Benchmarks
- HELM
- MixInstruct

