Use past benchmark results to learn cheap routers that pick the best LLM for a new task

September 27, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

8

Authors

Tal Shnitzer, Anthony Ou, Mírian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, Mikhail Yurochkin

Why It Matters For Business

You can often get similar or better task performance while spending less on inference by routing to smaller LLMs selected from past benchmark outputs, and you only need a few labeled examples to improve reliability.

Summary TLDR

The paper shows how to reuse per-example outputs and scores from existing benchmarks to train lightweight "correctness" predictors for many LLMs and then route (select) the best model for a new task. Predictors are simple kNN classifiers on sentence embeddings; three selection scores are proposed, including one that models predictor accuracy on the new task. On 29 HELM datasets and on MixInstruct, routing improves over picking the single best model-on-average and often picks smaller, cheaper models. The main limits are out-of-distribution gaps and the need for benchmark coverage or a small number of labeled task examples.

Problem Statement

There are many open LLMs and many benchmarks, but no single model wins on all tasks. Practitioners need a cheap way to pick the best model for a new task without running every candidate LLM on every input.

Main Contribution

Formalizes LLM routing as learning per-model binary correctness predictors from benchmark by-products (per-sample performance records).

Proposes three practical routing scores (S1, S2, S3), including an OOD-aware score (S3) that models predictor accuracy on a new task, along with a simple way to estimate that accuracy.

Shows empirically that routing improves model choice on 29 HELM datasets and MixInstruct, often choosing smaller models and needing far fewer model calls at inference.
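
The scoring idea can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulas: the function names, the 0.5 threshold, and the toy numbers are all assumptions. It takes per-input correctness probabilities from some predictor and shows how adjusting for an estimate of the predictor's own accuracy on the new task can change which model is picked.

```python
from statistics import mean

def mean_probability_score(pred_probs):
    """Average predicted probability that the model answers correctly."""
    return mean(pred_probs)

def thresholded_score(pred_probs, threshold=0.5):
    """Fraction of inputs the predictor labels 'correct' after thresholding."""
    return mean(1.0 if p > threshold else 0.0 for p in pred_probs)

def accuracy_adjusted_score(pred_probs, predictor_acc, threshold=0.5):
    """Expected model accuracy if each thresholded prediction is right with
    probability predictor_acc and flipped otherwise (an assumed, S3-style form)."""
    s = thresholded_score(pred_probs, threshold)
    return s * predictor_acc + (1.0 - s) * (1.0 - predictor_acc)

def route(scores):
    """Return the candidate model with the highest score."""
    return max(scores, key=scores.get)

# Hypothetical predictor outputs on four inputs from a new task.
pred_probs = {
    "small": [0.7, 0.7, 0.7, 0.7],
    "large": [0.9, 0.8, 0.7, 0.3],
}
# Hypothetical estimates of how reliable each predictor is on this task.
predictor_acc = {"small": 0.55, "large": 0.90}

raw = {m: thresholded_score(p) for m, p in pred_probs.items()}
adjusted = {
    m: accuracy_adjusted_score(p, predictor_acc[m]) for m, p in pred_probs.items()
}

print("raw pick:", route(raw))
print("adjusted pick:", route(adjusted))
```

In this toy case the raw score favors "small" because its predictor is optimistic on every input, while the adjusted score favors "large" because the "small" predictor is barely better than chance on the new task.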

Key Findings

OOD-aware score S3 improves selection over best-model-on-average (BMA) on HELM

Numbers: S3 acc=0.694 vs BMA (llama-2-70b) acc=0.688 (Table 1 averages)

Knowing true predictor accuracy gives a strong win

Numbers: S3 with true p acc=0.735 vs S3 est. acc=0.694 (Table 1)

Correctness predictors are imperfect across held-out tasks

Numbers: Average correctness-predictor accuracy = 0.59; kernel smoother MAE=0.116

Small amounts of in-task labeling reduce OOD gap and help routing

Numbers: α=0.05 (few samples) raised predictor acc ≈ 0.59 → 0.65 and improved routing (Figure 2)

Per-instance routing on MixInstruct is efficient and competitive

Numbers: Routing BERTScore=74.75 vs Oracle=77.67, using 2 model calls per instance vs N, one per candidate model (Table 2)

Routing often picks smaller models and reduces average model size

Numbers: S3 avg chosen params=49.8B vs BMA=70.0B; S3 true-p avg chosen params=33.8B (Table 1)
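
To put the model-size numbers in perspective, the short calculation below turns the quoted averages into relative reductions. It uses only the figures listed above; treating parameter count as a stand-in for inference cost is an assumption, since real cost also depends on hardware, batching, and output length.

```python
# Figures quoted above (Table 1 averages); parameter counts are in billions.
BMA_PARAMS_B = 70.0
S3_PARAMS_B = 49.8
S3_TRUE_P_PARAMS_B = 33.8

def relative_reduction(baseline: float, value: float) -> float:
    """Fractional decrease of value relative to baseline."""
    return (baseline - value) / baseline

s3_reduction = relative_reduction(BMA_PARAMS_B, S3_PARAMS_B)
s3_true_p_reduction = relative_reduction(BMA_PARAMS_B, S3_TRUE_P_PARAMS_B)
accuracy_gain = 0.694 - 0.688  # S3 vs BMA accuracy

print(f"S3 vs BMA: {s3_reduction:.1%} smaller average model")
print(f"S3 (true p) vs BMA: {s3_true_p_reduction:.1%} smaller average model")
print(f"S3 accuracy gain over BMA: {accuracy_gain:.3f}")
```

The average chosen model is roughly 29% smaller with estimated predictor accuracy and roughly 52% smaller when the true predictor accuracy is known, while the accuracy gain of S3 over BMA is modest at 0.006.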

Results

Accuracy

Value: S1=0.662, S2=0.676, S3=0.694

Baseline: BMA (llama-2-70b)=0.688

Accuracy

Value: 0.773

Baseline: BMA=0.688

Correctness-predictor accuracy (average over held-out tasks)

Value: 0.59

MixInstruct per-instance BERTScore

Value: Routing method=74.75

Baseline: Oracle=77.67

Inference cost (model calls per instance)

Value: Routing typically needs 1 model generation plus embeddings; reported MCPI=2 for MixInstruct

Baseline: Other scoring methods require N model generations (one per candidate model)

Kernel smoother estimation error (p(d,m))

Value: MAE = 0.116
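
A kernel smoother for p(d,m) can be sketched as a Nadaraya-Watson average over benchmark datasets, weighted by how close each dataset is to the new task. Everything specific here is an assumption made for illustration: the Gaussian kernel, the bandwidth, the scalar "task distance" (for example, an average embedding distance from the new task's inputs to their nearest benchmark neighbors), and the toy numbers. The leave-one-out MAE at the end mirrors the kind of error measure reported above, but it is computed on made-up data.

```python
import math

def gaussian_kernel(u: float) -> float:
    return math.exp(-0.5 * u * u)

def estimate_predictor_accuracy(query_dist, bench_dists, bench_accs, bandwidth=0.1):
    """Nadaraya-Watson estimate of predictor accuracy at a given task distance."""
    weights = [gaussian_kernel((query_dist - d) / bandwidth) for d in bench_dists]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed: fall back to the plain average.
        return sum(bench_accs) / len(bench_accs)
    return sum(w * a for w, a in zip(weights, bench_accs)) / total

def leave_one_out_mae(bench_dists, bench_accs, bandwidth=0.1):
    """Mean absolute error when each dataset is predicted from the others."""
    errors = []
    for i in range(len(bench_dists)):
        dists = bench_dists[:i] + bench_dists[i + 1:]
        accs = bench_accs[:i] + bench_accs[i + 1:]
        est = estimate_predictor_accuracy(bench_dists[i], dists, accs, bandwidth)
        errors.append(abs(est - bench_accs[i]))
    return sum(errors) / len(errors)

# Hypothetical (task distance, measured predictor accuracy) pairs for one model.
bench_dists = [0.1, 0.2, 0.4, 0.6, 0.8]
bench_accs = [0.80, 0.74, 0.65, 0.58, 0.52]

p_near = estimate_predictor_accuracy(0.1, bench_dists, bench_accs)
p_far = estimate_predictor_accuracy(0.8, bench_dists, bench_accs)
loo_mae = leave_one_out_mae(bench_dists, bench_accs)

print(f"estimated accuracy, near task: {p_near:.3f}")
print(f"estimated accuracy, far task:  {p_far:.3f}")
print(f"leave-one-out MAE: {loo_mae:.3f}")
```

With these toy values the estimate drops as the new task moves away from the benchmark data, which is the behavior an OOD-aware score relies on.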

What To Try In 7 Days

Collect per-sample benchmark outputs you already have and build simple embeddings with all-mpnet-base-v2.

Train per-model kNN correctness predictors (k=5–10) on those embeddings and implement S1 and S3 scoring.

If possible, label 10–50 examples from your task and measure predictor accuracy to improve S3 estimates.
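
The steps above can be prototyped without any heavy dependencies. The sketch below is a pure-Python kNN correctness predictor over toy 2-D vectors; in a real setup the vectors would be sentence embeddings (for example from all-mpnet-base-v2), k would be in the suggested 5 to 10 range, and the 0/1 labels would come from per-sample benchmark scores. The model names, vectors, and labels here are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_correctness_probability(query, bench_embeddings, bench_correct, k=5):
    """Share of the k most similar benchmark inputs the model got right."""
    ranked = sorted(
        range(len(bench_embeddings)),
        key=lambda i: cosine_similarity(query, bench_embeddings[i]),
        reverse=True,
    )
    neighbors = ranked[:k]
    return sum(bench_correct[i] for i in neighbors) / len(neighbors)

# Toy stand-ins for benchmark input embeddings (two loose clusters).
bench_embeddings = [
    [1.0, 0.1], [0.9, 0.2], [1.0, 0.0],   # cluster 1
    [0.1, 1.0], [0.2, 0.9], [0.0, 1.0],   # cluster 2
]
# Hypothetical per-sample correctness (1 = correct) for two candidate models.
bench_correct = {
    "model_a": [1, 1, 1, 0, 0, 1],
    "model_b": [0, 1, 0, 1, 1, 1],
}

query = [0.95, 0.1]  # embedding of a new-task input, close to cluster 1
probs = {
    name: knn_correctness_probability(query, bench_embeddings, labels, k=3)
    for name, labels in bench_correct.items()
}
best_model = max(probs, key=probs.get)

print(probs)
print("route to:", best_model)
```

Averaging these per-input probabilities over a sample of task inputs gives an S1-style score per model. Labeling a small set of task examples and checking how often the thresholded predictions match gives a direct estimate of predictor accuracy for an S3-style adjustment.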

Optimization Features

Inference Optimization

  • Model Routing
  • Model Cascades
  • Cost-aware model selection

Reproducibility

Data Urls

  • HELM (Liang et al., 2022)
  • MixInstruct (Jiang et al., 2023)
  • MMLU subset

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Quality depends on benchmark coverage: routing fails if the new task is far from benchmark datasets.
  • Correctness predictors are noisy out of distribution (average accuracy ≈ 0.59), so model selection can be unstable without adaptation or labels.
  • Paper uses simple kNN and a 1-D task-distance estimator; more advanced classifiers or descriptors may be required in harder settings.
  • The approach assumes available per-sample benchmark outputs and consistent evaluation metrics across datasets.

When Not To Use

  • When no benchmark data is remotely similar to your task.
  • When you cannot compute embedding distances (e.g., non-text modalities without matching embeddings).
  • When per-input generation quality matters beyond a binary correct/incorrect signal.

Failure Modes

  • If the predictor-accuracy estimate p(d,m) is wrong, the router can pick an underperforming model for the task.
  • Benchmark metric mismatch causes predictors to learn the wrong notion of correctness.
  • Sparse benchmark coverage leads to overconfident but wrong router recommendations.

Core Entities

Models

  • codegen-16b-mono
  • dial-flant5-xl
  • falcon-40b
  • flan-t5-xl
  • flan-t5-xxl
  • flan-ul2
  • gpt-jt-6b-v1
  • gpt-neox-20b
  • mpt-7b-instruct
  • mt0-xxl
  • llama-2-13b
  • llama-2-13b-chat
  • llama-2-13b-chat-beam
  • llama-2-70b
  • llama-2-70b-chat
  • llama-2-7b
  • llama-2-7b-chat
  • starcoder

Metrics

  • Accuracy
  • BERTScore
  • BARTScore
  • BLEURT
  • log-likelihood

Datasets

  • HELM (29 selected datasets)
  • MixInstruct
  • MMLU (subset used within HELM)

Benchmarks

  • HELM
  • MixInstruct