Train a cheap router to send 'easy' queries to small models and save cloud cost while keeping quality

April 22, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

6

Authors

Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V. S. Lakshmanan, Ahmed Hassan Awadallah

Links

Abstract / PDF

Why It Matters For Business

Route cheap queries to local or smaller models to cut cloud inference costs while keeping user-facing quality high; thresholds let operators trade cost vs quality on demand.

Summary TLDR

The paper introduces a lightweight router that predicts when a small, cheap LLM can match a larger, costly LLM. The router scores each query by the estimated quality gap between the two models and sends 'easy' queries to the small model. Three router designs are evaluated: deterministic, probabilistic (samples multiple responses), and probabilistic with a label transformation to handle large model gaps. On a broad MixInstruct testbed the router cuts calls to large models by ~20–40% with little quality loss in many cases. Router inference is cheap (≈0.036s per query), so overhead is negligible. Use the transformation when the small model is much weaker.
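The routing decision described above can be sketched as a single threshold comparison. This is a minimal illustration, assuming the router emits a score in [0, 1] where higher means the small model is more likely to match the large one; the names `route` and `RoutingDecision` and the example scores are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class RoutingDecision:
    model: str    # "small" or "large"
    score: float  # router score that produced the decision


def route(score: float, threshold: float) -> RoutingDecision:
    """Send the query to the small model when the router score clears the threshold.

    Assumption: a higher score means the small model is more likely to
    match the large model's quality on this query.
    """
    model = "small" if score >= threshold else "large"
    return RoutingDecision(model=model, score=score)


# Hypothetical router scores for four queries.
scores = [0.91, 0.15, 0.62, 0.40]
decisions = [route(s, threshold=0.5) for s in scores]

# Cost advantage = fraction of queries routed to the small model.
cost_advantage = sum(d.model == "small" for d in decisions) / len(decisions)
print(cost_advantage)
```

Raising the threshold routes fewer queries to the small model (lower cost advantage, higher quality); lowering it does the opposite.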

Problem Statement

Large LLMs give better answers but cost a lot to run. Smaller models are cheaper but often worse. Can we automatically decide, per query, which model to call so we reduce cost without hurting answer quality?

Main Contribution

A practical hybrid inference setup that routes each query to either a small or large LLM to trade cost for quality.

Three router training methods: deterministic, probabilistic (samples multiple outputs), and probabilistic with a data-label transformation to strengthen learning when gaps are large.
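The three labeling schemes can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes quality is a scalar score such as BART score (higher is better), that the probabilistic label is the fraction of sampled response pairs where the small model is at least as good, and that the transformation relaxes that comparison by a margin t. The grid-search criterion in `choose_t` (maximize label spread) is an assumed choice; all function names are hypothetical.

```python
import numpy as np


def deterministic_label(q_small: float, q_large: float) -> float:
    """1.0 if a single small-model response scores at least as well as the large one."""
    return float(q_small >= q_large)


def probabilistic_label(q_small_samples, q_large_samples, t: float = 0.0) -> float:
    """Fraction of (small, large) sample pairs where small >= large - t.

    t = 0 gives the plain probabilistic label; t > 0 is the relaxed
    (transformed) label, useful when the small model almost never wins.
    """
    s = np.asarray(q_small_samples, dtype=float)[:, None]
    l = np.asarray(q_large_samples, dtype=float)[None, :]
    return float(np.mean((s - l) >= -t))


def choose_t(small_samples, large_samples, grid):
    """Grid-search t by maximizing mean pairwise label difference (assumed criterion)."""
    best_t, best_spread = grid[0], -1.0
    for t in grid:
        y = np.array([probabilistic_label(s, l, t)
                      for s, l in zip(small_samples, large_samples)])
        spread = float(np.mean(np.abs(y[:, None] - y[None, :])))
        if spread > best_spread:
            best_t, best_spread = t, spread
    return best_t


# Hypothetical BART scores for one query, three samples per model.
small = [-3.0, -2.5, -2.8]
large = [-2.0, -2.2, -1.9]
print(probabilistic_label(small, large))          # small never wins
print(probabilistic_label(small, large, t=0.55))  # relaxed label is non-zero
```

With a large gap, untransformed labels collapse toward zero and give the router little signal; the margin t spreads them out again.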

Empirical evidence on MixInstruct showing substantial reductions in calls to large models with small quality loss, and guidance on threshold selection and evaluation metrics.

Key Findings

The router can route many queries to the small model and keep quality nearly unchanged.

Numbers: 22% fewer large-model calls with <1% BART drop (Llama-2 13b vs GPT-3.5-turbo)

Using probabilistic labels plus a data transformation (r_trans) helps when the small model is much weaker.

Numbers: For FLAN-T5 (800m) → Llama-2 (13b): 40% cost advantage → 10.3% BART drop; r_trans improves quality drop by ~2.8–3.5% vs.

Router adds negligible latency compared to LLM inference.

Numbers: Router ≈0.036s vs FLAN-T5 0.46s, Llama-2 7.99–14.61s

Training with BART score is a cost-effective proxy but routing quality depends on metric alignment.

Numbers: Routing performs well when BART and GPT-4 scores are strongly correlated (Pearson r up to 0.76); performance decays as r
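Before trusting BART score as a training signal, it may help to check how well it tracks a stronger judge on your own data. A minimal sketch, assuming you have paired BART and GPT-4 scores for the same responses; the sample values below are invented for illustration.

```python
import numpy as np


def pearson_r(a, b) -> float:
    """Pearson correlation between two equal-length score lists."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.corrcoef(a, b)[0, 1])


# Hypothetical paired scores for five responses.
bart_scores = [-3.1, -2.4, -2.9, -1.8, -2.2]  # BART score, higher is better
gpt4_scores = [4, 7, 5, 9, 8]                 # GPT-4 rating on a 1–10 scale

r = pearson_r(bart_scores, gpt4_scores)
print(f"Pearson r = {r:.2f}")
```

A low correlation on your data suggests that a router trained on BART gaps may not preserve the quality your users actually perceive.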

Results

Cost advantage vs BART drop (small gap pairs)

Value: 20% cost advantage with ≤0.1% BART drop; 40% → ≤0.2% drop (Llama-2 7b vs 13b)

Baseline: All-at-large

Cost advantage vs BART drop (medium gap pairs)

Value: 20% cost advantage with ≤1% BART drop; 40% with ≤4% drop (Llama-2 13b vs GPT-3.5)

Baseline: All-at-large

Router latency

Value: Router 0.036s per query (±0.002s)

Baseline: Llama-2 (13b) 14.61s
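To see why the router overhead is negligible, the reported latencies can be plugged into a back-of-the-envelope estimate. This sketch assumes the figures are per-query averages, that the router runs before every query, and that FLAN-T5 is paired with Llama-2 (13b); the function name is hypothetical.

```python
ROUTER_S = 0.036  # router latency per query (reported)
SMALL_S = 0.46    # FLAN-T5 latency (reported)
LARGE_S = 14.61   # Llama-2 (13b) latency (reported)


def expected_latency(frac_small: float) -> float:
    """Average end-to-end latency when frac_small of queries go to the small model."""
    return ROUTER_S + frac_small * SMALL_S + (1.0 - frac_small) * LARGE_S


# Router cost relative to one large-model call.
overhead = ROUTER_S / LARGE_S
print(f"router overhead vs large model: {overhead:.2%}")
print(f"avg latency at 40% cost advantage: {expected_latency(0.4):.3f}s")
```

Under these assumptions the router adds well under 1% to a large-model call, and any non-trivial routing fraction more than pays for it.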

What To Try In 7 Days

Collect 500–1k calibration queries and compute the BART score gap between your small and large models on each one.

Train a DeBERTa-based router on those examples using probabilistic labels (sample multiple outputs per query) and tune a threshold on a validation set.

Deploy the router in front of your LLM stack and measure the percentage of queries routed to the small model and the end-to-end latency; start with a conservative threshold (≈10–20% cost advantage).
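The threshold-tuning step can be sketched as picking the score quantile that yields a target cost advantage, then measuring the quality drop against the all-at-large baseline. This is an assumed procedure for illustration: the function name, the synthetic validation data, and the definition of drop as a percentage of the baseline mean are not taken from the paper.

```python
import numpy as np


def evaluate_threshold(scores, q_small, q_large, target_cost_adv: float):
    """Return (threshold, achieved cost advantage, quality drop in % vs all-at-large)."""
    scores = np.asarray(scores, dtype=float)
    q_small = np.asarray(q_small, dtype=float)
    q_large = np.asarray(q_large, dtype=float)

    # Route the top `target_cost_adv` fraction of queries (by router score) to the small model.
    threshold = float(np.quantile(scores, 1.0 - target_cost_adv))
    to_small = scores >= threshold

    # Quality of the hybrid system versus always calling the large model.
    mixed = np.where(to_small, q_small, q_large)
    base = q_large.mean()
    drop_pct = float((base - mixed.mean()) / abs(base) * 100.0)
    return threshold, float(to_small.mean()), drop_pct


# Synthetic validation set (illustrative only).
rng = np.random.default_rng(0)
q_large = rng.normal(-2.0, 0.5, 1000)           # BART-like scores for the large model
gap = rng.normal(-0.3, 0.4, 1000)               # small minus large quality
q_small = q_large + gap
scores = 1.0 / (1.0 + np.exp(-(5.0 * gap + rng.normal(0.0, 0.5, 1000))))  # noisy router

for target in (0.1, 0.2, 0.4):
    thr, adv, drop = evaluate_threshold(scores, q_small, q_large, target)
    print(f"target={target:.0%} threshold={thr:.3f} cost_adv={adv:.1%} drop={drop:.2f}%")
```

Running this on real validation data gives a cost-versus-quality curve from which an operator can pick the threshold that fits their budget.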

Optimization Features

System Optimization

  • Cheap encoder router (DeBERTa) in front of autoregressive LLMs

Training Optimization

  • Probabilistic labeling via sampling multiple outputs

Inference Optimization

  • Per-query model routing to avoid unnecessary large-model calls

Reproducibility

Data Urls

  • MixInstruct (public; Jiang et al. 2023)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Router relies on an automatic metric (BART) for training; poor metric alignment can degrade real-world quality.
  • When the small model is much weaker, routing becomes harder and quality drop can be substantial without careful transformation.
  • Generalization to new model pairs or out-of-distribution data is limited unless quality-gap correlations hold.

When Not To Use

  • If your small model consistently performs much worse across tasks (large, unbridgeable gap).
  • If you cannot afford to sample multiple outputs per example (probabilistic labels require sampling).
  • If you lack a validation signal that correlates with human judgment (BART poorly aligned).

Failure Modes

  • Router misroutes many hard queries to the small model when the small model is far weaker.
  • Poor choice of transformation parameter t (grid search noise) leads to suboptimal routing.
  • Router trained on one model pair or data distribution fails on different pairs or OOD queries.

Core Entities

Models

  • FLAN-T5 (800m)
  • FLAN-T5 (11b)
  • Llama-2 (7b)
  • Llama-2 (13b)
  • GPT-3.5-turbo
  • DeBERTa-v3-large (router backbone)

Metrics

  • BART score
  • GPT-4 evaluation score (1–10)
  • Cost advantage (fraction routed to small model)
  • Latency (seconds)

Datasets

  • MixInstruct (sampled 20k; 10k train, 5k val, 5k test)

Context Entities

Models

  • Reference to other hybrid/cascade/speculative works (Jiang et al., Kag et al., Leviathan et al.)