Use past benchmark results to learn cheap routers that pick the best LLM for a new task

Overview

Decision Snapshot: Ready For Pilot

The method is simple and practical: use embeddings + kNN to predict per-model correctness, then choose models with one of three scores; it works well when new tasks are similar to benchmark tasks or when you have a few labeled examples.

Citations: 8

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 50%

Authors

Tal Shnitzer, Anthony Ou, Mírian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, Mikhail Yurochkin

Links

Abstract / PDF / Data

Why It Matters For Business

You can often get similar or better task performance while spending less on inference by routing to smaller LLMs selected from past benchmark outputs, and you only need a few labeled examples to improve reliability.

Who Should Care

Product Manager, CTO, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

The paper shows how to reuse per-example outputs and scores from existing benchmarks to train lightweight "correctness" predictors for many LLMs and then route (select) the best model for a new task. Predictors are simple kNN classifiers on sentence embeddings; three selection scores are proposed, including one that models predictor accuracy on the new task. On 29 HELM datasets and on MixInstruct, routing improves over picking the single best model-on-average and often picks smaller, cheaper models. The main limits are out-of-distribution gaps and the need for benchmark coverage or a small number of labeled task examples.

Problem Statement

There are many open LLMs and many benchmarks, but no single model wins on all tasks. Practitioners need a cheap way to pick the best model for a new task without running every candidate LLM on every input.

Main Contribution

Formalize LLM routing as learning per-model binary correctness predictors from benchmark by-products (per-sample performance).

Propose three practical routing scores (S1, S2, S3), including an OOD-aware score that models predictor accuracy on a new task, and a simple way to estimate that accuracy.
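The three scores can be sketched roughly as follows. This is a minimal illustration, not the paper's exact definitions: it assumes S1 is the mean predicted probability of correctness, S2 is the share of inputs predicted correct after thresholding, and S3 reweights S2 by an estimate of how often the predictor itself is right on the new task. The function names, the 0.5 threshold, and the toy numbers are all hypothetical.

```python
from statistics import mean


def s1(probs):
    """Assumed S1: mean predicted probability that the model is correct."""
    return mean(probs)


def s2(probs, threshold=0.5):
    """Assumed S2: share of inputs the predictor labels 'correct'."""
    return mean(1.0 if p > threshold else 0.0 for p in probs)


def s3(probs, predictor_acc, threshold=0.5):
    """Assumed S3: S2 adjusted for predictor accuracy on the new task.

    A 'correct' call is trusted with weight predictor_acc; an 'incorrect'
    call still counts with weight 1 - predictor_acc, because the predictor
    may be wrong about it.
    """
    frac = s2(probs, threshold)
    return frac * predictor_acc + (1.0 - frac) * (1.0 - predictor_acc)


def route(scores_by_model):
    """Pick the model with the highest score."""
    return max(scores_by_model, key=scores_by_model.get)


# Toy predicted correctness probabilities for two hypothetical models.
small = [0.9, 0.8, 0.7, 0.6]
large = [0.55, 0.6, 0.52, 0.4]

print(route({"small": s1(small), "large": s1(large)}))
# Made-up predictor accuracies: 0.6 for "small", 0.9 for "large".
print(route({"small": s3(small, 0.6), "large": s3(large, 0.9)}))
```

With these toy numbers S1 favors the hypothetical "small" model, while S3 favors "large" because its predictor is assumed to be much more reliable on the task. That is the kind of reversal an accuracy-aware score is meant to capture.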

Key Findings

OOD-aware score S3 improves selection over best-model-on-average (BMA) on HELM

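Numbers: S3 acc=0.694 vs BMA (llama-2-70b) acc=0.688 (Table 1 averages)

Practical Use: Use the S3 score, which combines the correctness predictions with an estimate of predictor accuracy on the new task, for a modest accuracy gain (+0.006 on HELM) over always using the best model on average (here llama-2-70b, the largest candidate).

Evidence Ref: Table 1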

Knowing true predictor accuracy gives a strong win

Numbers: S3 with true p acc=0.735 vs S3 with estimated p acc=0.694 (Table 1)

Practical Use: If you can measure predictor accuracy on a few task samples, router quality rises substantially (0.694 to 0.735 on HELM), so collect a small labeled set when possible.

Evidence Ref: Table 1
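Measuring predictor accuracy on a handful of labeled task samples can be sketched as below. This is an illustrative helper, not the paper's estimation procedure: it assumes you have, for each labeled example, the predictor's 0/1 call and the model's true 0/1 correctness. The Wilson score interval is a standard addition, included here only to show how wide the uncertainty is at small sample sizes. Function names are hypothetical.

```python
import math


def predictor_accuracy(predicted, actual):
    """Share of labeled examples where the predictor's 0/1 call was right."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(int(p == a) for p, a in zip(predicted, actual))
    return hits / len(predicted)


def wilson_interval(hits, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = hits / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Toy case: the predictor was right on 14 of 20 labeled examples.
low, high = wilson_interval(14, 20)
print(f"accuracy=0.70, 95% interval=({low:.2f}, {high:.2f})")
```

With 20 labels and an observed accuracy of 0.70, the interval runs from roughly 0.48 to 0.85, so an estimate from a very small labeled set should be treated as a rough signal and not as a precise value.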

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	S1=0.662, S2=0.676, S3=0.694	BMA (llama-2-70b)=0.688	S3 +0.006 vs BMA	HELM (29 datasets, leave-one-out averaged)	Table 1: S1,S2,S3 averaged across 29 held-out tasks	Table 1
Accuracy	Oracle=0.773	BMA (llama-2-70b)=0.688	Oracle +0.085 vs BMA	HELM	Table 1 Oracle row	Table 1
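The deltas in the table follow directly from the reported Table 1 averages. The short calculation below reproduces them and also expresses the S3 gain as a share of the gap between BMA and the oracle. That last figure is derived here for illustration and is not reported in the source.

```python
# Accuracy values as reported in the table above (HELM, Table 1 averages).
bma = 0.688
scores = {"S1": 0.662, "S2": 0.676, "S3": 0.694, "Oracle": 0.773}

# Delta of each router score against the best model on average.
deltas = {name: round(acc - bma, 3) for name, acc in scores.items()}

# Share of the BMA-to-oracle gap that S3 recovers (derived, not reported).
gap_closed = (scores["S3"] - bma) / (scores["Oracle"] - bma)

for name, delta in deltas.items():
    print(f"{name}: {delta:+.3f} vs BMA")
print(f"S3 closes about {gap_closed:.1%} of the oracle gap")
```

On these numbers S1 and S2 land below BMA (-0.026 and -0.012), S3 lands slightly above it (+0.006), and S3 recovers about 7% of the headroom that a perfect router would capture.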

What To Try In 7 Days

Collect per-sample benchmark outputs you already have and build simple embeddings with all-mpnet-base-v2.

Train per-model kNN correctness predictors (k=5–10) on those embeddings and implement S1 and S3 scoring.

If possible, label 10–50 examples from your task and measure predictor accuracy to improve S3 estimates.
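The predictor step above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the 2-D vectors stand in for real sentence embeddings (for example from all-mpnet-base-v2), cosine similarity is an assumed distance choice, and the function names and toy correctness labels are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_correctness(query, bench_embeddings, bench_correct, k=5):
    """Estimate P(model is correct on query) as the mean 0/1 correctness
    of the model on the k most similar benchmark inputs."""
    if not bench_embeddings:
        raise ValueError("benchmark set is empty")
    ranked = sorted(
        range(len(bench_embeddings)),
        key=lambda i: cosine(query, bench_embeddings[i]),
        reverse=True,
    )
    nearest = ranked[:k]
    return sum(bench_correct[i] for i in nearest) / len(nearest)


# Toy benchmark: embeddings of six inputs and per-model 0/1 correctness.
bench = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.2), (0.0, 1.0), (0.1, 0.9), (0.2, 0.8)]
correct = {
    "model-a": [1, 1, 1, 0, 0, 0],  # strong on the first cluster
    "model-b": [0, 0, 1, 1, 1, 1],  # strong on the second cluster
}

query = (1.0, 0.05)  # a new input that resembles the first cluster
for name, labels in correct.items():
    print(name, round(knn_correctness(query, bench, labels, k=3), 3))
```

Each candidate model gets its own predictor over the same benchmark embeddings, so adding a model only requires its per-sample correctness labels, not a new embedding pass.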

Optimization Features

Inference Optimization

Model Routing, Model Cascades, Cost-aware model selection

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

HELM (Liang et al., 2022), MixInstruct (Jiang et al., 2023), MMLU subset

Risks & Boundaries

Limitations

Quality depends on benchmark coverage: routing fails if the new task is far from benchmark datasets.

Correctness predictors are noisy out of distribution (average accuracy ≈ 0.59), so model selection can be unstable without adaptation or labels.

When Not To Use

When no benchmark data is remotely similar to your task.

When you cannot compute embedding distances (e.g., non-text modalities without matching embeddings).

Failure Modes

The predictor-accuracy estimate p(d,m) is off, so the router picks an underperforming model for the task.

Benchmark metric mismatch causes predictors to learn the wrong notion of correctness.
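One possible guard against the first failure mode is to trust the router only when the estimated predictor accuracy looks usable, and otherwise fall back to the best model on average. The sketch below is an assumption for illustration, not a procedure confirmed by this summary: the `choose_model` name, the 0.6 cut-off, and the toy scores are all hypothetical.

```python
def choose_model(s3_scores, est_accuracy, bma_model, min_accuracy=0.6):
    """Route by S3 among models whose predictor seems reliable enough.

    s3_scores:    model name -> S3 score on the new task
    est_accuracy: model name -> estimated predictor accuracy p(d, m)
    bma_model:    fallback, the best model on average across benchmarks
    min_accuracy: hypothetical cut-off below which a predictor is ignored
    """
    trusted = {
        model: score
        for model, score in s3_scores.items()
        if est_accuracy.get(model, 0.0) >= min_accuracy
    }
    if not trusted:
        return bma_model  # no predictor is reliable enough to act on
    return max(trusted, key=trusted.get)


# Toy inputs: model-a scores highest but its predictor is near chance.
scores = {"model-a": 0.70, "model-b": 0.65}
accuracy = {"model-a": 0.55, "model-b": 0.80}
print(choose_model(scores, accuracy, bma_model="llama-2-70b"))
```

In this toy case the guard skips the model whose predictor is close to chance and routes to the other one. If no predictor clears the cut-off, the router returns the fallback model.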

Core Entities

Models

codegen-16b-mono, dial-flant5-xl, falcon-40b, flan-t5-xl, flan-t5-xxl, flan-ul2, gpt-jt-6b-v1, gpt-neox-20b, mpt-7b-instruct, mt0-xxl, llama-2-13b, llama-2-13b-chat, llama-2-13b-chat-beam, llama-2-70b, llama-2-70b-chat, llama-2-7b, llama-2-7b-chat, starcoder

Metrics

Accuracy, BERTScore, BARTScore, BLEURT, log-likelihood

Datasets

HELM (29 selected datasets), MixInstruct, MMLU (subset used within HELM)

Benchmarks

HELM, MixInstruct
