MetaLLM: route each query to the cheapest LLM likely to be correct, cutting cost up to 60% while keeping or improving accuracy

July 15, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

2

Authors

Quang H. Nguyen, Thinh Dao, Duy C. Hoang, Juliette Decugis, Saurav Manchanda, Nitesh V. Chawla, Khoa D. Doan

Links

Abstract / PDF

Why It Matters For Business

MetaLLM reduces API spend by routing easy queries to cheaper models and hard queries to stronger ones. In the reported experiments this yields modest accuracy gains and up to ~60% lower cost than always calling the priciest API.

Summary TLDR

MetaLLM is a lightweight wrapper that inspects each input, embeds it with Sentence-BERT, and uses a contextual multi-armed bandit (linear reward + UCB) to pick which LLM to call. The reward trades accuracy against monetary cost (reward = correctness − p·cost). Experiments on SST-2 (OpenAI models) and MMLU (Together AI models) show MetaLLM can slightly improve accuracy over the best single model (≈+0.5–1.3%) while reducing API spend (up to ~60% vs. the most expensive OpenAI model, ~11% on Together AI). The system supports offline training, online updates, and hybrid modes.
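The routing loop described above can be sketched as a small LinUCB-style bandit. This is a minimal illustration under stated assumptions, not the authors' implementation: the 2-dimensional context stands in for an SBERT embedding, the two arms, their costs, the value of p, and the simulated correctness rule are all hypothetical, and `LinUCBRouter` is a name invented for this sketch.

```python
import math
import random


class LinUCBRouter:
    """Contextual bandit with one linear reward model per LLM (arm)."""

    def __init__(self, n_arms, dim, alpha=1.0, ridge=1.0):
        self.alpha = alpha  # exploration strength (assumed hyperparameter)
        # Per arm: inverse of (ridge * I + sum of x x^T) and b = sum of reward * x.
        self.A_inv = [[[(1.0 / ridge if i == j else 0.0) for j in range(dim)]
                       for i in range(dim)] for _ in range(n_arms)]
        self.b = [[0.0] * dim for _ in range(n_arms)]

    @staticmethod
    def _matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]

    def ucb_scores(self, x):
        scores = []
        for A_inv, b in zip(self.A_inv, self.b):
            theta = self._matvec(A_inv, b)  # ridge estimate of the reward weights
            mean = sum(t * xi for t, xi in zip(theta, x))
            Ax = self._matvec(A_inv, x)
            var = sum(a * xi for a, xi in zip(Ax, x))
            scores.append(mean + self.alpha * math.sqrt(max(var, 0.0)))
        return scores

    def select(self, x):
        scores = self.ucb_scores(x)
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, x, reward):
        # Sherman-Morrison rank-1 update of the inverse after observing x.
        A_inv = self.A_inv[arm]
        Ax = self._matvec(A_inv, x)
        denom = 1.0 + sum(a * xi for a, xi in zip(Ax, x))
        for i in range(len(x)):
            for j in range(len(x)):
                A_inv[i][j] -= Ax[i] * Ax[j] / denom
        for i in range(len(x)):
            self.b[arm][i] += reward * x[i]


random.seed(0)
P = 0.5             # cost penalty p (hypothetical value)
COSTS = [0.1, 1.0]  # hypothetical normalized costs: cheap model, strong model


def simulate_reward(arm, hard):
    # Toy world: the strong model is always right, the cheap one only on easy inputs.
    correct = 1.0 if (arm == 1 or not hard) else 0.0
    return correct - P * COSTS[arm]  # reward = correctness - p * cost


router = LinUCBRouter(n_arms=2, dim=2)
for _ in range(500):
    hard = random.random() < 0.5
    x = [1.0, 1.0 if hard else 0.0]  # stand-in for an SBERT embedding
    arm = router.select(x)
    router.update(arm, x, simulate_reward(arm, hard))

router.alpha = 0.0  # turn off exploration to inspect the learned policy
print(router.select([1.0, 0.0]), router.select([1.0, 1.0]))
```

In this toy setup the learned policy should send easy inputs to the cheap arm and hard inputs to the strong arm. With real embeddings the same update applies per arm, only with a higher-dimensional `x`.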

Problem Statement

Users have access to many LLMs with different costs and strengths. Committing to one fixed model either overpays on easy queries or loses accuracy on hard ones. The problem: route each query to the LLM that maximizes correctness while respecting a cost budget.

Main Contribution

MetaLLM: a general wrapper that routes each query to one LLM chosen from a pool to optimize the accuracy-cost trade-off.

A practical routing algorithm: use SBERT embeddings, linear reward models trained to predict (accuracy − p·cost), and an online contextual bandit with UCB.

Empirical validation across OpenAI and Together AI models on SST-2 and MMLU showing modest accuracy gains and substantial cost savings.

Key Findings

MetaLLM can outperform a mid-tier baseline (text-babbage-001) while keeping the same budget.

Numbers (SST-2): MetaLLM 84.06% vs text-babbage-001 82.80%; cost 0.12 per 10k queries

MetaLLM matches or slightly improves top-model accuracy while cutting cost versus the priciest model.

Numbers (SST-2): MetaLLM 91.97% vs text-davinci-002 91.51%; cost 2.03 vs 4.80 (~57.7% lower)

MetaLLM improves MMLU accuracy and reduces cost on a heterogeneous provider set.

Numbers (MMLU): MetaLLM 80.90% vs Qwen 80.24%; cost 2304.5 vs 2598.56 (~11.3% lower)
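The savings percentages follow directly from the reported costs. As a quick check, the short script below recomputes them from the figures quoted in the findings; `percent_saving` is a helper name introduced here for illustration.

```python
def percent_saving(router_cost, baseline_cost):
    """Relative cost reduction of the router versus a single-model baseline."""
    return 100.0 * (baseline_cost - router_cost) / baseline_cost


# (MetaLLM cost, baseline cost, MetaLLM accuracy, baseline accuracy), as reported above.
results = {
    "SST-2 vs text-davinci-002": (2.03, 4.80, 91.97, 91.51),
    "MMLU vs Qwen": (2304.5, 2598.56, 80.90, 80.24),
}

for name, (r_cost, b_cost, r_acc, b_acc) in results.items():
    saving = percent_saving(r_cost, b_cost)
    print(f"{name}: {saving:.1f}% cheaper, {r_acc - b_acc:+.2f} accuracy points")
```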

Results

Accuracy

Value: MetaLLM 84.06%

Baseline: text-babbage-001 82.80%

Accuracy

Value: MetaLLM 91.97%; cost 2.03

Baseline: text-davinci-002 91.51%; cost 4.80

Accuracy

Value: MetaLLM 80.90%; cost 2304.5

Baseline: Qwen 80.24%; cost 2598.56

Who Should Care

What To Try In 7 Days

Run a quick prototype: collect 1–2k labeled samples and compute SBERT embeddings.

Train the linear reward model (reward = correctness − p·cost) and pick p to match your budget.

Deploy MetaLLM in online mode with UCB updates and measure cost/accuracy trade-offs for a week.
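One way to approach the "pick p to match your budget" step is a simple sweep on held-out data. The sketch below assumes you already have per-model correctness estimates for each validation query (the numbers here are made up) and applies p at selection time, which is a simplification of training the reward model on correctness − p·cost directly. The function names, costs, and grid are all hypothetical.

```python
def route(pred_correct, costs, p):
    """Pick the model with the highest estimated correctness minus p * cost."""
    scores = [a - p * c for a, c in zip(pred_correct, costs)]
    return max(range(len(scores)), key=scores.__getitem__)


def average_cost(preds, costs, p):
    return sum(costs[route(row, costs, p)] for row in preds) / len(preds)


def pick_p(preds, costs, budget, grid):
    """Smallest p on the grid whose average routed cost fits the budget."""
    for p in sorted(grid):
        if average_cost(preds, costs, p) <= budget:
            return p
    return None  # no grid value meets the budget


# Hypothetical validation set: estimated correctness for [cheap model, strong model].
preds = [
    [0.90, 0.95],
    [0.40, 0.90],
    [0.85, 0.90],
    [0.30, 0.95],
]
costs = [0.1, 1.0]  # hypothetical normalized per-query costs
grid = [0.0, 0.1, 0.5, 1.0]

for p in grid:
    print(f"p={p}: average cost {average_cost(preds, costs, p):.2f}")
print("chosen p:", pick_p(preds, costs, budget=0.6, grid=grid))
```

Choosing the smallest p that fits the budget keeps the router as accuracy-focused as the budget allows; a finer grid gives a tighter fit.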

Agent Features

Memory

  • online updates (contextual bandit)

Tool Use

  • model selection per query

Frameworks

  • contextual multi-armed bandit
  • UCB
  • linear reward model (ridge)

Optimization Features

Token Efficiency

  • cost normalization of token-based API pricing in reward
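As a rough illustration of how token-based pricing could enter the reward, the sketch below converts a query's token count into a cost and scales it by a reference cost so that it is comparable with a 0/1 correctness signal. The exact normalization used by MetaLLM is not specified in this summary; the prices, the `reference_tokens` parameter, and the function names are assumptions made for the example.

```python
def query_cost(n_tokens, price_per_1k_tokens):
    """Monetary cost of one call under per-1k-token pricing."""
    return n_tokens / 1000.0 * price_per_1k_tokens


def normalized_costs(n_tokens, prices_per_1k, reference_tokens=100):
    """Scale costs so the priciest model at the reference length costs 1.0."""
    reference = query_cost(reference_tokens, max(prices_per_1k))
    return [query_cost(n_tokens, price) / reference for price in prices_per_1k]


def reward(correct, norm_cost, p):
    """reward = correctness - p * cost, with cost already normalized."""
    return float(correct) - p * norm_cost


# Placeholder prices per 1k tokens for a four-model pool (not authoritative).
prices = [0.0004, 0.0005, 0.002, 0.02]

short_query = normalized_costs(100, prices)
long_query = normalized_costs(200, prices)
print([round(c, 3) for c in short_query])
print([round(c, 3) for c in long_query])
print(reward(correct=True, norm_cost=long_query[-1], p=0.1))
```

Because the cost grows with token count under this scheme, long queries are penalized more heavily on expensive models, which is the behavior behind the "dynamic cost penalty" failure mode listed later.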

Model Optimization

  • reward-guided routing to reduce calls to expensive models

System Optimization

  • mix of offline initialization and online updates to adapt to distribution shifts

Training Optimization

  • closed-form ridge regression initialization for linear reward models
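A closed-form ridge fit for one model's reward weights can be written in a few lines. This is a dependency-free sketch under assumptions: the 2-dimensional features replace real SBERT embeddings, the rewards are toy values for a cheap model (correct on easy inputs only, cost 0.1, p = 0.5), and `solve` and `ridge_fit` are helper names introduced here. In practice a linear-algebra library would do the solve.

```python
def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ridge_fit(X, rewards, lam=1.0):
    """theta = (X^T X + lam * I)^-1 X^T r, returned with A and b."""
    d = len(X[0])
    A = [[(lam if i == j else 0.0) for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for x, r in zip(X, rewards):
        for i in range(d):
            b[i] += r * x[i]
            for j in range(d):
                A[i][j] += x[i] * x[j]
    return solve(A, b), A, b


# Toy offline data: features [bias, is_hard]; reward = correctness - 0.5 * 0.1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
rewards = [0.95, -0.05, 0.95, -0.05]

theta, A, b = ridge_fit(X, rewards, lam=1e-6)
print([round(t, 3) for t in theta])
```

The returned `A` and `b` are the sufficient statistics of the fit, so they can also serve as the starting state for later online updates instead of starting from scratch.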

Inference Optimization

  • calling a single model per query saves token/API cost

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluated only on zero-shot classification and multiple-choice QA, not free-form generation.
  • Router is a simple linear model on SBERT embeddings; richer features or nonlinear models might improve routing.
  • Reward combines only correctness and monetary cost; latency, robustness, and other metrics are not included.
  • Offline training can fail if auxiliary training distribution differs strongly from test distribution.

When Not To Use

  • When task output quality needs human judgment or complex free-text evaluation.
  • When no labeled training or validation data is available and online initialization is undesirable.
  • When you must optimize for latency or other non-monetary trade-offs not in the reward.

Failure Modes

  • Poor initialization: without good offline data, the online bandit can default to the cheapest model (observed with Gemma).
  • Dynamic cost penalty can push long, complex queries to cheaper models and reduce accuracy.
  • Distribution mismatch between auxiliary training data and production queries can lead to suboptimal routing.

Core Entities

Models

  • text-ada-001
  • text-babbage-001
  • text-curie-001
  • text-davinci-002
  • Gemma
  • Llama
  • Mistral
  • Qwen

Metrics

  • Accuracy
  • api_cost

Datasets

  • SST-2
  • MMLU

Context Entities

Metrics

  • average cost per 10,000 queries