MetaLLM: route each query to the cheapest LLM likely to be correct, cutting cost up to 60% while keeping or improving accuracy

July 15, 2024 (7 min)

Overview

Decision Snapshot: Needs Validation

The approach is simple and deployable: embed inputs, predict a linear reward (correctness − p·cost), and use UCB to choose models; experiments across two datasets and two API providers show consistent cost-accuracy benefits.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Quang H. Nguyen, Thinh Dao, Duy C. Hoang, Juliette Decugis, Saurav Manchanda, Nitesh V. Chawla, Khoa D. Doan

Why It Matters For Business

MetaLLM reduces API spend by routing easy queries to cheaper models and hard queries to stronger ones, giving modest accuracy gains and up to ~60% cost reductions versus always using the priciest API.

Summary TLDR

MetaLLM is a lightweight wrapper that inspects each input, embeds it with Sentence-BERT, and uses a contextual multi-armed bandit (linear reward + UCB) to pick which LLM to call. The reward trades accuracy against monetary cost (reward = correctness − p·cost). Experiments on SST-2 (OpenAI models) and MMLU (Together AI models) show MetaLLM can slightly improve accuracy over the best single model (≈+0.5–1.3%) while reducing API spend (up to ~60% vs. the most expensive OpenAI model, ~10% on Together AI). The system supports offline training, online updates, and hybrid modes.
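The reward trade-off above can be illustrated with a minimal sketch. The model names, expected-correctness values, and normalized costs below are hypothetical and chosen only to show how the penalty p shifts the preferred model; they are not figures from the paper.

```python
def reward(expected_correct: float, cost: float, p: float) -> float:
    """Reward = correctness - p * cost (correctness given here as a probability)."""
    return expected_correct - p * cost


# Hypothetical estimates for one query: (expected correctness, normalized cost).
estimates = {
    "cheap_model": (0.80, 0.1),
    "expensive_model": (0.92, 1.0),
}


def best_model(estimates: dict, p: float) -> str:
    """Return the model name with the highest reward for a given cost penalty p."""
    return max(estimates, key=lambda name: reward(*estimates[name], p))


low_penalty_choice = best_model(estimates, p=0.05)   # accuracy dominates
high_penalty_choice = best_model(estimates, p=0.5)   # cost dominates
```

With a small p the stronger model wins because its accuracy edge outweighs its price; with a larger p the cheaper model wins.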

Problem Statement

Users have access to many LLMs with different costs and strengths. Committing to one fixed model either overspends on easy queries or loses accuracy on hard ones. The problem: route each query to the LLM that maximizes correctness while respecting a cost budget.

Main Contribution

MetaLLM: a general wrapper that routes each query to one LLM chosen from a pool to optimize the accuracy-cost trade-off.

A practical routing algorithm: use SBERT embeddings, linear reward models trained to predict (accuracy − p·cost), and an online contextual bandit with UCB.
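The routing algorithm described above might look roughly like the following LinUCB-style sketch. It is an illustration under stated assumptions, not the authors' implementation: the class name `LinUCBRouter`, the `alpha` and `ridge` parameters, and the two-dimensional toy contexts (standing in for SBERT embeddings) are all hypothetical. It is written in pure Python with a Sherman-Morrison update so it runs without dependencies; a real router would more likely use NumPy.

```python
import math


class LinUCBRouter:
    """One linear reward model per LLM ("arm") plus a UCB exploration bonus."""

    def __init__(self, n_models, dim, alpha=1.0, ridge=1.0):
        self.alpha = alpha  # exploration strength (assumed hyperparameter)
        # A_inv[k] is the inverse of (ridge * I + sum of x x^T) for model k.
        self.A_inv = [[[1.0 / ridge if i == j else 0.0 for j in range(dim)]
                       for i in range(dim)] for _ in range(n_models)]
        # b[k] accumulates sum(reward * x) for model k.
        self.b = [[0.0] * dim for _ in range(n_models)]

    @staticmethod
    def _matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]

    def score(self, k, x):
        """Predicted reward for model k on context x, plus the UCB bonus."""
        theta = self._matvec(self.A_inv[k], self.b[k])  # ridge estimate
        Ax = self._matvec(self.A_inv[k], x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(max(sum(a * xi for a, xi in zip(Ax, x)), 0.0))
        return mean + self.alpha * width

    def select(self, x):
        """Index of the model with the highest upper confidence bound."""
        scores = [self.score(k, x) for k in range(len(self.b))]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, k, x, r):
        """Sherman-Morrison rank-one update after observing reward r."""
        Ax = self._matvec(self.A_inv[k], x)
        denom = 1.0 + sum(a * xi for a, xi in zip(Ax, x))
        A = self.A_inv[k]
        for i in range(len(x)):
            for j in range(len(x)):
                A[i][j] -= Ax[i] * Ax[j] / denom
        self.b[k] = [bi + r * xi for bi, xi in zip(self.b[k], x)]


# Toy offline data: context vector and whether each model answered correctly.
P = 0.3              # cost penalty p (hypothetical)
COSTS = [0.1, 1.0]   # normalized cost of model 0 (cheap) and model 1 (expensive)
offline = [
    ([1.0, 0.0], [True, True]),    # "easy" query: both models are right
    ([0.0, 1.0], [False, True]),   # "hard" query: only the expensive model is right
]

router = LinUCBRouter(n_models=2, dim=2)
for _ in range(20):
    for x, correct in offline:
        for k in range(2):
            router.update(k, x, float(correct[k]) - P * COSTS[k])

easy_choice = router.select([1.0, 0.0])
hard_choice = router.select([0.0, 1.0])
```

Feeding offline samples through `update` yields the same weights as a batch ridge-regression fit, so the same class covers offline initialization and online updates. In online mode the loop would be `k = router.select(x)`, call model `k`, then `router.update(k, x, r)` with the observed reward.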

Key Findings

MetaLLM can outperform a mid-tier baseline (text-babbage-001) while keeping the same budget.

Numbers: SST-2: 84.06% vs 82.80% (text-babbage-001); cost 0.12 per 10k queries

Practical Use: If you currently use a mid-cost model, MetaLLM can give ~+1.3% accuracy for the same API spend by routing queries to cheaper models when they suffice.

Evidence Ref: Table 3 (OpenAI, babbage budget)

MetaLLM matches or slightly improves top-model accuracy while cutting cost versus the priciest model.

Numbers: SST-2: MetaLLM 91.97% vs text-davinci-002 91.51%; cost 2.03 vs 4.80 (~57.7% lower)

Practical Use: You can keep or improve accuracy relative to your most expensive model and lower API expenses by ~50–60% using MetaLLM routing.

Evidence Ref: Table 3 (OpenAI, davinci budget)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | MetaLLM 84.06% | text-babbage-001 82.80% | +1.26% | SST-2 test | Table 3: MetaLLM (text-babbage-001 budget) | Table 3
Accuracy | MetaLLM 91.97%; cost 2.03 | text-davinci-002 91.51%; cost 4.80 | +0.46% acc; −57.7% cost | SST-2 test | Table 3: MetaLLM (text-davinci-002 budget) | Table 3
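
The deltas reported above follow directly from the listed values. A short check, using only the numbers in the two result rows (the variable names are illustrative):

```python
# Accuracy (%) and cost figures copied from the two result rows.
metallm_small, babbage = 84.06, 82.80
metallm_large, davinci = 91.97, 91.51
metallm_cost, davinci_cost = 2.03, 4.80

# Accuracy deltas are absolute differences in percentage points.
gain_vs_babbage = round(metallm_small - babbage, 2)
gain_vs_davinci = round(metallm_large - davinci, 2)

# Cost reduction is relative to the baseline's cost.
cost_reduction_pct = round((davinci_cost - metallm_cost) / davinci_cost * 100, 1)
```

Note that the accuracy deltas are differences in percentage points, while the cost delta is a relative reduction.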

What To Try In 7 Days

Run a quick prototype: collect 1–2k labeled samples and compute SBERT embeddings.

Train the linear reward model (reward = correctness − p·cost) and pick p to match your budget.

Deploy MetaLLM in online mode with UCB updates and measure cost/accuracy trade-offs for a week.
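
One way to "pick p to match your budget" in the steps above is a simple grid sweep on validation data. This is a sketch under assumptions: the predicted-correctness values, costs, grid, and function names are hypothetical, and the source does not specify how p is tuned.

```python
def route(pred_correct, costs, p):
    """Index of the model with the highest predicted reward for one query."""
    rewards = [c - p * cost for c, cost in zip(pred_correct, costs)]
    return max(range(len(rewards)), key=rewards.__getitem__)


def average_cost(val_preds, costs, p):
    """Mean per-query cost when every validation query is routed with penalty p."""
    chosen = [route(preds, costs, p) for preds in val_preds]
    return sum(costs[k] for k in chosen) / len(chosen)


def pick_p(val_preds, costs, budget, grid):
    """Smallest p on the grid whose routed average cost fits the budget."""
    for p in sorted(grid):
        if average_cost(val_preds, costs, p) <= budget:
            return p
    return None  # no grid value meets the budget


# Hypothetical validation set: predicted correctness of [cheap, expensive] per query.
costs = [0.1, 1.0]
val_preds = [[0.90, 0.95], [0.50, 0.90], [0.85, 0.90], [0.30, 0.95]]
grid = [0.0, 0.1, 0.2, 0.5, 1.0]

p_loose = pick_p(val_preds, costs, budget=0.6, grid=grid)
p_tight = pick_p(val_preds, costs, budget=0.2, grid=grid)
```

Choosing the smallest feasible p keeps as much weight on accuracy as the budget allows; a tighter budget forces a larger p and sends more queries to the cheap model.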

Agent Features

Memory
online updates (contextual bandit)
Tool Use
model selection per query
Frameworks
contextual multi-armed bandit, UCB, linear reward model (ridge)

Optimization Features

Token Efficiency
cost normalization of token-based API pricing in reward
Model Optimization
reward-guided routing to reduce calls to expensive models
System Optimization
mix of offline initialization and online updates to adapt to distribution shifts
Training Optimization
closed-form ridge regression initialization for linear reward models
Inference Optimization
single-model per query saves token/API cost
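
As a rough illustration of cost normalization for token-based pricing, the sketch below scales each call's cost into [0, 1] so it is comparable to a 0/1 correctness signal in the reward. The prices, model names, `max_tokens` ceiling, and the normalization scheme itself are assumptions; the source does not spell out the exact formula.

```python
# Hypothetical prices per 1,000 tokens; not real provider pricing.
PRICE_PER_1K = {"small": 0.0004, "large": 0.02}


def raw_cost(model: str, n_tokens: int) -> float:
    """Token-based cost of one call: price per 1K tokens times token count."""
    return PRICE_PER_1K[model] * n_tokens / 1000.0


def normalized_cost(model: str, n_tokens: int, max_tokens: int = 1000) -> float:
    """Scale cost into [0, 1] relative to the priciest model at max_tokens."""
    ceiling = max(PRICE_PER_1K.values()) * max_tokens / 1000.0
    return min(raw_cost(model, n_tokens) / ceiling, 1.0)


short_small = normalized_cost("small", 500)
long_large = normalized_cost("large", 1000)
```

Because cost grows with token count under this scheme, longer queries carry a larger penalty for every model, and the gap between cheap and expensive models widens with length.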

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluated only on zero-shot classification and multiple-choice QA, not free-form generation.

Router is a simple linear model on SBERT embeddings; richer features or nonlinear models might improve routing.

When Not To Use

When task output quality needs human judgment or complex free-text evaluation.

When no labeled training or validation data is available and online initialization is undesirable.

Failure Modes

Poor initialization: without good offline data, the online bandit can default to the cheapest model (observed with Gemma).

Dynamic cost penalty can push long, complex queries to cheaper models and reduce accuracy.

Core Entities

Models

text-ada-001, text-babbage-001, text-curie-001, text-davinci-002, Gemma, Llama, Mistral, Qwen

Metrics

Accuracy, api_cost

Datasets

SST-2, MMLU

Context Entities

Metrics

average cost per 10,000 queries