Overview
The method is simple and practical: embed each input, use kNN over benchmark examples to predict whether each model will answer it correctly, then choose a model with one of three scores. It works well when new tasks resemble benchmark tasks or when a few labeled examples are available.
Citations: 8
Evidence Strength: 0.80
Confidence: 0.86
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 4/6
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
You can often get similar or better task performance while spending less on inference by routing to smaller LLMs selected from past benchmark outputs, and you only need a few labeled examples to improve reliability.
Summary TLDR
The paper shows how to reuse per-example outputs and scores from existing benchmarks to train lightweight "correctness" predictors for many LLMs and then route each new task to the model expected to perform best. The predictors are simple kNN classifiers on sentence embeddings. Three selection scores are proposed, including one that models predictor accuracy on the new task. On 29 HELM datasets and on MixInstruct, routing improves over picking the single best model on average (BMA) and often picks smaller, cheaper models. The main limits are out-of-distribution gaps and the need for benchmark coverage or a small number of labeled task examples.
Problem Statement
There are many open LLMs and many benchmarks, but no single model wins on all tasks. Practitioners need a cheap way to pick the best model for a new task without running every candidate LLM on every input.
Main Contribution
Formalize LLM routing as learning per-model binary correctness predictors from benchmark by-products (per-sample performance).
Propose three practical routing scores (S1, S2, S3), including an OOD-aware score that models predictor accuracy on a new task, plus a simple way to estimate that accuracy.
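The idea of per-model correctness predictors can be made concrete with a small sketch. It assumes embeddings are already computed and stored as plain lists of floats, and it scores each model by the mean predicted correctness over the new task's inputs. That scoring rule, the toy data, and every function and model name here are illustrative assumptions; they are not taken from the paper and may differ from its S1, S2, and S3 definitions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_correctness(query, bench_embs, bench_correct, k=5):
    """Fraction of the k most similar benchmark inputs the model answered correctly."""
    ranked = sorted(range(len(bench_embs)),
                    key=lambda i: cosine(query, bench_embs[i]),
                    reverse=True)
    top = ranked[:k]
    return sum(bench_correct[i] for i in top) / len(top)

def mean_score(task_embs, bench_embs, bench_correct, k=5):
    """Assumed score: average predicted correctness over the task inputs."""
    preds = [knn_correctness(q, bench_embs, bench_correct, k) for q in task_embs]
    return sum(preds) / len(preds)

def route(task_embs, bench_embs, correct_by_model, k=5):
    """Return the highest-scoring model name and all per-model scores."""
    scores = {name: mean_score(task_embs, bench_embs, labels, k)
              for name, labels in correct_by_model.items()}
    return max(scores, key=scores.get), scores

# Toy 2-D "embeddings": three benchmark inputs per cluster (hypothetical data).
bench_embs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
              [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
# Per-sample correctness (1 = correct) for two hypothetical models.
correct_by_model = {
    "small-model": [1, 1, 1, 0, 0, 0],
    "large-model": [1, 0, 0, 1, 1, 1],
}
# The new task's inputs sit near the first cluster.
task_embs = [[0.95, 0.05], [0.85, 0.15]]

best, scores = route(task_embs, bench_embs, correct_by_model, k=3)
print(best, scores)
```

In this toy setup the new task resembles the inputs the smaller model handles well, so the router prefers it even though the larger model is correct on more benchmark inputs overall.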
Key Findings
OOD-aware score S3 improves selection over best-model-on-average (BMA) on HELM
Knowing true predictor accuracy gives a strong win
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | S1=0.662, S2=0.676, S3=0.694 | BMA (llama-2-70b)=0.688 | S3 +0.006 vs BMA | HELM (29 datasets, leave-one-out averaged) | Table 1: S1,S2,S3 averaged across 29 held-out tasks | Table 1 |
| Accuracy | Oracle=0.773 | BMA=0.688 | Oracle +0.085 vs BMA | HELM | Table 1 Oracle row | Table 1 |
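The deltas in the table can be recomputed from the reported accuracies, which also gives the S1 and S2 gaps that the Delta column omits. The snippet below is only arithmetic on the table's numbers; the variable names are illustrative.

```python
# Accuracies copied from the Results table (HELM, leave-one-out average).
bma = 0.688
reported = {"S1": 0.662, "S2": 0.676, "S3": 0.694, "Oracle": 0.773}

# Difference of each score from the best-model-on-average baseline.
deltas = {name: round(acc - bma, 3) for name, acc in reported.items()}

# Share of the BMA-to-Oracle gap that S3 recovers.
gap_closed = deltas["S3"] / deltas["Oracle"]

for name, delta in deltas.items():
    print(f"{name}: {delta:+.3f} vs BMA")
print(f"S3 closes {gap_closed:.0%} of the BMA-to-Oracle gap")
```

On these numbers S1 and S2 fall below BMA (by 0.026 and 0.012), and S3 recovers roughly 7% of the gap between BMA and the Oracle row.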
What To Try In 7 Days
Collect per-sample benchmark outputs you already have and build simple embeddings with all-mpnet-base-v2.
Train per-model kNN correctness predictors (k=5–10) on those embeddings and implement S1 and S3 scoring.
If possible, label 10–50 examples from your task and measure predictor accuracy to improve S3 estimates.
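The last step, measuring predictor accuracy on a few labeled examples, might look like the sketch below. The adjustment formula is an assumption for illustration: it treats predictor errors as symmetric and is not necessarily the paper's S3. The function names and the ten toy labels are hypothetical.

```python
def predictor_accuracy(predicted, actual):
    """Share of labeled examples where the predicted correctness bit matches the observed one."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(int(p == a) for p, a in zip(predicted, actual)) / len(predicted)

def accuracy_adjusted_score(predicted_rate, accuracy):
    """Assumed adjustment: expected true correctness rate if the predictor is
    right with probability `accuracy`, with symmetric errors."""
    return predicted_rate * accuracy + (1.0 - predicted_rate) * (1.0 - accuracy)

# Ten labeled task examples for one candidate model (toy data).
predicted = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]   # predictor's correctness guesses
actual    = [1, 0, 1, 0, 1, 1, 1, 1, 0, 1]   # observed correctness after labeling

acc = predictor_accuracy(predicted, actual)          # 7 of 10 guesses match
predicted_rate = sum(predicted) / len(predicted)     # predictor says 80% correct
adjusted = accuracy_adjusted_score(predicted_rate, acc)
print(f"accuracy={acc:.2f} predicted_rate={predicted_rate:.2f} adjusted={adjusted:.2f}")
```

Under this assumed formula, a less accurate predictor pulls every model's score toward 0.5: with accuracy near 0.59, a predicted rate of 0.8 shrinks to about 0.55, so differences between models narrow and selection becomes less decisive.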
Optimization Features
Inference Optimization
Risks & Boundaries
Limitations
Quality depends on benchmark coverage: routing fails if the new task is far from benchmark datasets.
Correctness predictors are noisy out of distribution (average accuracy ≈ 0.59), so model selection can be unstable without adaptation or labels.
When Not To Use
When no benchmark data is remotely similar to your task.
When you cannot compute embedding distances (e.g., non-text modalities without matching embeddings).
Failure Modes
Estimator p(d,m) mis-estimated → router picks underperforming model for the task.
Benchmark metric mismatch causes predictors to learn the wrong notion of correctness.

