Use past benchmark results to learn cheap routers that pick the best LLM for a new task

September 27, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is simple and practical: use embeddings + kNN to predict per-model correctness, then choose models with one of three scores; it works well when new tasks are similar to benchmark tasks or when you have a few labeled examples.

Citations: 8

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 50%

Authors

Tal Shnitzer, Anthony Ou, Mírian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, Mikhail Yurochkin

Links

Abstract / PDF / Data

Why It Matters For Business

You can often get similar or better task performance while spending less on inference by routing to smaller LLMs selected from past benchmark outputs, and you only need a few labeled examples to improve reliability.

Summary TLDR

The paper shows how to reuse per-example outputs and scores from existing benchmarks to train lightweight "correctness" predictors for many LLMs and then route (select) the best model for a new task. Predictors are simple kNN classifiers on sentence embeddings; three selection scores are proposed, including one that models predictor accuracy on the new task. On 29 HELM datasets and on MixInstruct, routing improves over picking the single best model-on-average and often picks smaller, cheaper models. The main limits are out-of-distribution gaps and the need for benchmark coverage or a small number of labeled task examples.

Problem Statement

There are many open LLMs and many benchmarks, but no single model wins on all tasks. Practitioners need a cheap way to pick the best model for a new task without running every candidate LLM on every input.

Main Contribution

Formalize LLM routing as learning per-model binary correctness predictors from benchmark by-products (per-sample performance).

Propose three practical routing scores (S1, S2, S3), including an out-of-distribution (OOD)-aware score that models predictor accuracy on a new task, plus a simple way to estimate that accuracy.
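A minimal sketch of what "benchmark by-products" could look like as training data, assuming each benchmark run left behind records holding an input, a model name, and a per-sample score. The record fields, the model names `model-a` and `model-b`, and the 0.5 threshold used to turn a continuous metric into a binary label are all hypothetical and not taken from the paper.

```python
from collections import defaultdict


def build_correctness_tables(records, threshold=0.5):
    """Group per-sample benchmark records into one binary-labelled table per model.

    Each record is a dict with hypothetical keys "input", "model", "score".
    A score at or above `threshold` counts as correct (label 1), else 0.
    """
    tables = defaultdict(list)
    for rec in records:
        label = 1 if rec["score"] >= threshold else 0
        tables[rec["model"]].append((rec["input"], label))
    return dict(tables)


# Toy records standing in for stored benchmark outputs (made-up values).
records = [
    {"input": "What is 2 + 2?", "model": "model-a", "score": 1.0},
    {"input": "What is 2 + 2?", "model": "model-b", "score": 0.0},
    {"input": "Name the capital of France.", "model": "model-a", "score": 1.0},
    {"input": "Name the capital of France.", "model": "model-b", "score": 0.8},
]

tables = build_correctness_tables(records)
for model, rows in tables.items():
    print(model, rows)
```

Each table is then the training set for that model's correctness predictor: the inputs get embedded, and the labels say whether that model handled the input well.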

Key Findings

OOD-aware score S3 improves selection over best-model-on-average (BMA) on HELM

Numbers: S3 acc=0.694 vs BMA (llama-2-70b) acc=0.688 (Table 1 averages)

Practical Use: Use the S3 score (correctness predictions combined with estimated predictor accuracy) to get modest accuracy gains versus always using the largest model.

Evidence Ref: Table 1
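This summary does not spell out the score formulas, so the sketch below is an illustration under assumptions rather than the paper's definitions. It assumes S1 is the mean predicted probability of correctness over the task's inputs, S2 is the fraction of inputs predicted correct at a 0.5 threshold, and S3 adjusts S2 by the predictor's accuracy p on the task: a prediction of "correct" is right with probability p, and a prediction of "incorrect" is wrong with probability 1 - p.

```python
def s1(probs):
    """Assumed S1: mean predicted probability that the model is correct."""
    return sum(probs) / len(probs)


def s2(probs, threshold=0.5):
    """Assumed S2: fraction of inputs the predictor labels as correct."""
    return sum(1 for p in probs if p > threshold) / len(probs)


def s3(probs, predictor_accuracy, threshold=0.5):
    """Assumed S3: S2 corrected for how often the predictor itself is right.

    Expected fraction truly correct =
        (predicted correct) * p + (predicted incorrect) * (1 - p)
    """
    frac_pred_correct = s2(probs, threshold)
    return (frac_pred_correct * predictor_accuracy
            + (1 - frac_pred_correct) * (1 - predictor_accuracy))


# Made-up predictor outputs for one candidate model on four task inputs.
probs = [0.9, 0.8, 0.4, 0.7]
print(s1(probs), s2(probs), s3(probs, predictor_accuracy=0.6))
```

Under this form, S3 equals S2 when the predictor is perfect (p = 1) and collapses to 0.5 for every model when p = 0.5, so an unreliable predictor stops separating the candidates.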

Knowing true predictor accuracy gives a strong win

Numbers: S3 with true p acc=0.735 vs S3 est. acc=0.694 (Table 1)

Practical Use: If you can measure predictor accuracy on a few task samples, router quality rises substantially, so collect a small labeled set when possible.

Evidence Ref: Table 1
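One way to measure predictor accuracy on a few task samples is sketched below. It assumes you have run the candidate model on a handful of labeled examples, so you know whether each answer was actually correct, and that you compare this against what the correctness predictor said. The function name, the 0.5 threshold, and the optional shrinkage toward a prior are illustrative choices, not the paper's estimator.

```python
def estimate_predictor_accuracy(pred_probs, observed_correct,
                                threshold=0.5, prior=0.5, prior_weight=0.0):
    """Share of labeled examples where the predictor's call matched reality.

    pred_probs:       predicted probability the model is correct, per example
    observed_correct: 1 if the model's answer was actually correct, else 0
    prior_weight:     pseudo-count pulling the estimate toward `prior`;
                      0 gives plain accuracy (shrinkage is an assumption here)
    """
    if len(pred_probs) != len(observed_correct):
        raise ValueError("pred_probs and observed_correct must align")
    n = len(pred_probs)
    if n + prior_weight == 0:
        raise ValueError("need at least one example or a positive prior_weight")
    hits = sum(1 for p, y in zip(pred_probs, observed_correct)
               if (p > threshold) == bool(y))
    return (hits + prior * prior_weight) / (n + prior_weight)


# Five labeled task examples with made-up values.
pred_probs = [0.9, 0.2, 0.7, 0.6, 0.1]
observed_correct = [1, 0, 0, 1, 1]

plain = estimate_predictor_accuracy(pred_probs, observed_correct)
shrunk = estimate_predictor_accuracy(pred_probs, observed_correct, prior_weight=5)
print(plain, shrunk)
```

With only a few labels the plain estimate is noisy, which is why the sketch offers a pseudo-count: it keeps a lucky or unlucky handful of examples from swinging the estimate too far.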

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | S1=0.662, S2=0.676, S3=0.694 | BMA (llama-2-70b)=0.688 | S3 +0.006 vs BMA | HELM (29 datasets, leave-one-out averaged) | Table 1: S1, S2, S3 averaged across 29 held-out tasks | Table 1 |
| Accuracy | 0.773 | BMA=0.688 | Oracle +0.085 vs BMA | HELM | Table 1 Oracle row | Table 1 |

What To Try In 7 Days

Collect per-sample benchmark outputs you already have and build simple embeddings with all-mpnet-base-v2.

Train per-model kNN correctness predictors (k=5–10) on those embeddings and implement S1 and S3 scoring.

If possible, label 10–50 examples from your task and measure predictor accuracy to improve S3 estimates.
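The steps above can be sketched as a small router. This is a pure-Python illustration under stated assumptions: the embeddings are taken as given (in practice they might come from something like `SentenceTransformer("all-mpnet-base-v2").encode(texts)` in the sentence-transformers library), the 2-D vectors, model names, and labels are toy values, and the selection rule is a mean-probability score used as a stand-in for S1.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_correct_prob(query_vec, train_vecs, train_labels, k=5):
    """kNN estimate: share of the k most similar benchmark inputs answered correctly."""
    order = sorted(range(len(train_vecs)),
                   key=lambda i: cosine(query_vec, train_vecs[i]),
                   reverse=True)
    nearest = order[:k]
    return sum(train_labels[i] for i in nearest) / len(nearest)


def route_by_mean_prob(task_vecs, train_vecs, labels_by_model, k=5):
    """Score each model by mean predicted correctness on the task; return the best."""
    scores = {}
    for model, labels in labels_by_model.items():
        probs = [predict_correct_prob(v, train_vecs, labels, k) for v in task_vecs]
        scores[model] = sum(probs) / len(probs)
    return max(scores, key=scores.get), scores


# Toy benchmark embeddings: three inputs in one cluster, three in another.
train_vecs = [(1.0, 0.0), (0.9, 0.1), (0.95, 0.05),
              (0.0, 1.0), (0.1, 0.9), (0.05, 0.95)]
# Hypothetical per-input correctness for two candidate models.
labels_by_model = {
    "small-model": [1, 1, 1, 0, 0, 0],
    "large-model": [1, 0, 1, 1, 1, 1],
}
# New task inputs that resemble the first cluster.
task_vecs = [(1.0, 0.02), (0.9, 0.05)]

best, scores = route_by_mean_prob(task_vecs, train_vecs, labels_by_model, k=3)
print(best, scores)
```

In this toy setup the smaller model wins on the new task because its benchmark record is stronger on the inputs nearest to the task, even though the larger model is better on average across all six benchmark inputs.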

Optimization Features

Inference Optimization
Model Routing, Model Cascades, Cost-aware model selection

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

HELM (Liang et al., 2022), MixInstruct (Jiang et al., 2023), MMLU subset

Risks & Boundaries

Limitations

Quality depends on benchmark coverage: routing fails if the new task is far from benchmark datasets.

Correctness predictors are noisy out of distribution (average accuracy ≈ 0.59), so model selection can be unstable without adaptation or labels.

When Not To Use

When no benchmark data is remotely similar to your task.

When you cannot compute embedding distances (e.g., non-text modalities without matching embeddings).

Failure Modes

If the predictor-accuracy estimate p(d, m) is wrong, the router can pick an underperforming model for the task.

Benchmark metric mismatch causes predictors to learn the wrong notion of correctness.

Core Entities

Models

codegen-16b-mono, dial-flant5-xl, falcon-40b, flan-t5-xl, flan-t5-xxl, flan-ul2, gpt-jt-6b-v1, gpt-neox-20b, mpt-7b-instruct, mt0-xxl, llama-2-13b, llama-2-13b-chat, llama-2-13b-chat-beam, llama-2-70b, llama-2-70b-chat, llama-2-7b, llama-2-7b-chat, starcoder

Metrics

Accuracy, BERTScore, BARTScore, BLEURT, log-likelihood

Datasets

HELM (29 selected datasets), MixInstruct, MMLU (subset used within HELM)

Benchmarks

HELM, MixInstruct