SelectLLM routes each query to a small subset of LLMs to keep accuracy high while cutting inference latency.

August 16, 2024 · 7 min read

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

1

Authors

Kaushal Kumar Maurya, KV Aditya Srivatsa, Ekaterina Kochmar

Why It Matters For Business

You can match or exceed all-model ensemble accuracy while calling fewer models per query, which cuts GPU time, latency, and inference cost for reasoning-heavy applications.

Summary TLDR

SelectLLM uses a lightweight multi-label classifier (MLC) to predict which LLMs in a pool can solve a given query, then applies a confidence-weighted selection policy (WEIGHTEDMAXCONF) to call only that subset. On two reasoning benchmarks the method matches or beats ensemble baselines while lowering latency. Against the All-LLMs baseline it gains 1.90 accuracy points on GSM8K and 4.89 on MMLU; against the top-performing ensemble baselines it cuts inference time by about 13% (GSM8K) and about 70% (MMLU). Limits: it targets discrete-answer reasoning tasks, relies on a classifier with modest weighted F1 (0.71 on GSM8K, 0.68 on MMLU), and depends on extracting discrete answers from LLM outputs.

Problem Statement

Using many LLMs can improve accuracy, but querying all of them for every input is slow and costly. The goal is a fast, query-aware way to pick a small subset of LLMs that together produce a correct answer at lower inference latency.

Main Contribution

A query-aware selection algorithm (SELECTLLM) that uses a multi-label classifier to predict which LLMs can solve each input and a confidence-based policy to pick a small subset.

A new confidence-weighted ensembling policy (WEIGHTEDMAXCONF) that adjusts majority-vote counts by model confidences to reduce bias.

Evaluation on GSM8K and MMLU showing competitive accuracy with much lower latency and analysis of an Oracle upper bound and linguistic failure modes.
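
The routing step can be sketched as follows. This is a minimal illustration, assuming the classifier returns one confidence per model in the pool; the function name `select_models`, the `threshold` of 0.5, and the confidence values are hypothetical and not taken from the paper.

```python
from typing import Dict, List


def select_models(confidences: Dict[str, float], k: int = 3,
                  threshold: float = 0.5) -> List[str]:
    """Return up to k model names whose confidence is >= threshold.

    Falls back to the single most confident model when none clears the
    threshold, so every query is still sent to at least one model.
    """
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    chosen = [name for name in ranked if confidences[name] >= threshold][:k]
    return chosen or ranked[:1]


# Illustrative (made-up) classifier outputs for one query.
scores = {
    "metamath-7b-lm": 0.91,
    "mistral-7b-it": 0.64,
    "gemma-7b-it": 0.58,
    "llama2-7b": 0.22,
}
print(select_models(scores, k=2))  # ['metamath-7b-lm', 'mistral-7b-it']
```

Capping the subset at `k` bounds the number of model calls per query; the fallback is a design choice of this sketch, not something the source specifies.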

Key Findings

SELECTLLM (WEIGHTEDMAXCONF) improves accuracy vs. All-LLMs ensembles on two reasoning benchmarks.

Numbers: GSM8K accuracy 76.04→77.94 (+1.90); MMLU accuracy 60.92→65.81 (+4.89)

SELECTLLM cuts inference latency substantially compared to top-performing ensemble baselines.

Numbers: GSM8K latency 19.00→16.50 s per query (≈13% drop); MMLU latency 16.40→4.78 s per query (≈70% drop)
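
The percentage drops follow directly from the latency figures above. A short check, with `relative_drop` as a helper name introduced here for illustration:

```python
def relative_drop(baseline: float, value: float) -> float:
    """Percentage reduction of value relative to baseline."""
    return (baseline - value) / baseline * 100.0


# Latencies in seconds per query, as reported above.
gsm8k_drop = relative_drop(19.00, 16.50)
mmlu_drop = relative_drop(16.40, 4.78)
print(f"GSM8K: {gsm8k_drop:.1f}%  MMLU: {mmlu_drop:.1f}%")
# GSM8K: 13.2%  MMLU: 70.9%
```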

A small classifier is the core routing component but its quality limits gains.

Numbers: MLC weighted F1 0.71 (GSM8K), 0.68 (MMLU)
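
For reference, weighted F1 on multi-hot labels is usually the support-weighted mean of per-label F1 (the convention behind scikit-learn's `average="weighted"`). The source does not say which implementation was used, so the following pure-Python version is a sketch under that assumption, with made-up labels:

```python
from typing import Sequence


def weighted_f1(y_true: Sequence[Sequence[int]],
                y_pred: Sequence[Sequence[int]]) -> float:
    """Support-weighted mean of per-label F1 for multi-hot label rows."""
    n_labels = len(y_true[0])
    total_support = 0
    weighted_sum = 0.0
    for j in range(n_labels):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        support = tp + fn  # queries where this label is truly 1
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        weighted_sum += f1 * support
        total_support += support
    return weighted_sum / total_support if total_support else 0.0


# Made-up example: 3 queries, 2 candidate models (one column per model).
y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [0, 1], [0, 1]]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.8333
```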

The Oracle upper bound is far above the accuracy of the current method, indicating room for improvement.

Numbers: Oracle accuracy 90.52 (GSM8K), 90.46 (MMLU); SELECTLLM upper bound (via MLC labels) 78.77 (GSM8K), 76.20 (MMLU)
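
An Oracle score of this kind can be computed from per-model correctness flags. The sketch below assumes the Oracle counts a query as solved when at least one model in the pool answers it correctly; that definition, the function name `oracle_accuracy`, and the flags are assumptions for illustration.

```python
from typing import Dict, List


def oracle_accuracy(correct: Dict[str, List[bool]]) -> float:
    """Fraction of queries answered correctly by at least one model."""
    per_query = zip(*correct.values())  # one tuple of flags per query
    solved = [any(flags) for flags in per_query]
    return sum(solved) / len(solved) if solved else 0.0


# Made-up correctness flags for four queries and three models.
flags = {
    "metamath-7b-lm": [True, False, True, False],
    "mistral-7b-it": [False, False, True, False],
    "gemma-7b-it": [False, True, False, False],
}
print(oracle_accuracy(flags))  # 0.75
```

The gap between this number and a router's accuracy shows how much a better selector could still recover.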

SELECTLLM struggles with linguistic phenomena that increase difficulty across models.

Numbers: hard question groups include quantifiers, age and other units, ordinals, and complex ratios (analysis in Section 6.1)

Results

Accuracy (GSM8K): 77.94% (baseline: All LLMs, 76.04%)

Accuracy (MMLU): 65.81% (baseline: All LLMs, 60.92%)

Latency, sec per query (GSM8K): 16.50s (baseline: Top-s LLMs, 19.00s)

Latency, sec per query (MMLU): 4.78s (baseline: Top-s LLMs, 16.40s)

MLC weighted F1: 0.71 (GSM8K) / 0.68 (MMLU)

Oracle accuracy: 90.52% (GSM8K) / 90.46% (MMLU) (compared with the SELECTLLM upper bound: 78.77% / 76.20%)

Who Should Care

What To Try In 7 Days

Build a small multi-label classifier (e.g., RoBERTa) that maps each query to the models able to solve it, trained on model outputs you have already labeled as correct or incorrect.

Deploy WEIGHTEDMAXCONF: pick top-k models by confidence and apply confidence-weighted majority voting.

Measure per-query latency and accuracy vs your current ensemble; tune k to trade latency and accuracy.
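
A minimal sketch of confidence-weighted voting over the selected models. Summing classifier confidences per distinct answer is one simple way to weight votes; the source does not spell out the exact WEIGHTEDMAXCONF formula, so treat `weighted_vote` and the sample values as assumptions.

```python
from collections import defaultdict
from typing import Dict, Optional


def weighted_vote(answers: Dict[str, Optional[str]],
                  confidences: Dict[str, float]) -> Optional[str]:
    """Return the answer with the largest summed confidence.

    answers maps model name -> extracted answer (None if extraction failed).
    confidences maps model name -> classifier confidence for that model.
    """
    totals: Dict[str, float] = defaultdict(float)
    for model, answer in answers.items():
        if answer is not None:
            totals[answer] += confidences.get(model, 0.0)
    if not totals:
        return None  # no model produced a usable answer
    return max(totals, key=totals.get)


# Made-up example: a plain majority vote would pick "40" (two votes),
# but the single high-confidence model outweighs the other two.
answers = {"metamath-7b-lm": "42", "mistral-7b-it": "40", "gemma-7b-it": "40"}
confidences = {"metamath-7b-lm": 0.91, "mistral-7b-it": 0.40, "gemma-7b-it": 0.35}
print(weighted_vote(answers, confidences))  # 42
```

To tune k, rerun the same evaluation set with different subset sizes and record accuracy and seconds per query for each setting.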

Optimization Features

System Optimization

  • lower_per-query_model_calls
  • reduced_GPU_time

Inference Optimization

  • model_routing
  • subset_selection
  • confidence_weighted_voting

Reproducibility

Data Urls

  • GSM8K (public)
  • MMLU (public)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluated only on discrete-answer reasoning benchmarks (GSM8K, MMLU), not open-ended generation.
  • Multi-label classifier trained on limited data (~7K GSM8K, ~14K MMLU), constraining selection accuracy.
  • Only 92–95% of LLM outputs yield an extractable answer; the remaining invalid outputs reduce usable data and can bias results.
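
To illustrate the extraction dependency, here is a minimal sketch of a numeric answer extractor for GSM8K-style outputs. It assumes the final answer is the last number in the text, which is a simplification chosen for this example and not the extraction rule used in the paper.

```python
import re

# Optional sign, digits with optional thousands separators, optional decimals.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_numeric_answer(text: str) -> str:
    """Return the last number found in text, or 'INVALID' if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return "INVALID"
    return matches[-1].replace(",", "")


print(extract_numeric_answer("So the total is 1,250 apples."))  # 1250
print(extract_numeric_answer("I cannot solve this."))           # INVALID
```

Tracking the share of `INVALID` results per model gives a quick check on how much data the extraction step discards.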

When Not To Use

  • For open-ended text generation tasks without easy discrete voting rules.
  • When you cannot reliably extract discrete answers from model outputs.
  • When the model pool is tiny (two models or fewer) and routing overhead outweighs the benefit.

Failure Modes

  • Classifier bias toward dominant labels (e.g., metamath-7b-lm) causes poor recall for other capable models.
  • Incorrect or non-extractable LLM outputs are counted as INVALID, which shrinks the effective dataset and hurts selection.
  • Complex linguistic phenomena (quantifiers, units, ordinals, fractions) lead to errors across models and to routing mistakes.
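
Bias toward a dominant label shows up as uneven per-label recall. A small diagnostic sketch, with `per_label_recall` and the toy labels introduced here for illustration only:

```python
from typing import Dict, Sequence


def per_label_recall(y_true: Sequence[Sequence[int]],
                     y_pred: Sequence[Sequence[int]],
                     labels: Sequence[str]) -> Dict[str, float]:
    """Recall per label; labels with no positive examples are omitted."""
    recalls: Dict[str, float] = {}
    for j, name in enumerate(labels):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        positives = sum(t[j] == 1 for t in y_true)
        if positives:
            recalls[name] = tp / positives
    return recalls


# Made-up case: the classifier always predicts the dominant model only.
labels = ["metamath-7b-lm", "gemma-7b-it"]
y_true = [[1, 1], [1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]
print(per_label_recall(y_true, y_pred, labels))
# {'metamath-7b-lm': 1.0, 'gemma-7b-it': 0.0}
```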

Core Entities

Models

  • gemma-7b-lm
  • metamath-7b-lm
  • mistral-7b-lm
  • mistral-7b-it
  • llama2-7b
  • llama2-13b-chat
  • gemma-7b-it
  • RoBERTa (MLC)
  • BERT (MLC)
  • T5 (MLC)

Metrics

  • Accuracy
  • Latency (seconds per query)
  • Weighted F1 (MLC)

Datasets

  • GSM8K
  • MMLU
  • SLDATA (constructed labels per LLM majority vote)
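
One plausible reading of the SLDATA construction is sketched below: sample each LLM several times, take its majority-vote answer, and set that model's label to 1 when the majority answer matches the gold answer. The sampling setup, the function name `build_label_row`, and the data are assumptions, not details confirmed by the source.

```python
from collections import Counter
from typing import Dict, List


def build_label_row(samples: Dict[str, List[str]], gold: str,
                    models: List[str]) -> List[int]:
    """Multi-hot row: 1 where a model's majority-vote answer equals gold."""
    row = []
    for name in models:
        # Ignore outputs from which no answer could be extracted.
        votes = Counter(a for a in samples.get(name, []) if a != "INVALID")
        majority = votes.most_common(1)[0][0] if votes else None
        row.append(int(majority == gold))
    return row


# Made-up sampled answers for one query whose gold answer is "18".
samples = {
    "metamath-7b-lm": ["18", "18", "20"],
    "llama2-7b": ["20", "INVALID", "20"],
}
print(build_label_row(samples, "18", ["metamath-7b-lm", "llama2-7b"]))  # [1, 0]
```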

Benchmarks

  • Accuracy
  • Latency per query (sec)