SelectLLM routes each query to a small subset of LLMs to keep accuracy high while cutting inference latency.

August 16, 2024 | 7 min

Overview

Decision Snapshot: Ready For Pilot

Method is practical: a small classifier plus confidence-based selection gives measurable latency and accuracy gains on two public reasoning benchmarks, but gains depend on classifier quality and discrete-answer extraction.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Kaushal Kumar Maurya, KV Aditya Srivatsa, Ekaterina Kochmar

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can get close-to-ensemble accuracy while calling far fewer models per query, which reduces GPU time and latency and cuts inference costs for reasoning-heavy applications.

Who Should Care

Summary TLDR

SelectLLM is a practical system that uses a lightweight multi-label classifier to predict which LLMs in a pool can solve a given query, then applies a confidence-weighted selection policy (WEIGHTEDMAXCONF) to call only that subset. On two reasoning benchmarks the method matches or beats ensemble baselines while lowering latency. Against the All-LLMs baseline, accuracy rises by +1.90 points on GSM8K and +4.89 points on MMLU. Against the top-performing ensemble baselines, inference time falls by ~13% (GSM8K) and ~70% (MMLU). Limits: the method focuses on discrete-answer reasoning tasks, and it relies on a classifier with modest F1 (0.71/0.68) and on extracting discrete answers from LLM outputs.

Problem Statement

Using many LLMs can improve accuracy but querying all models for every input is slow and costly. We need a fast, query-aware way to pick a small subset of LLMs that together give a correct answer while reducing inference latency.

Main Contribution

A query-aware selection algorithm (SELECTLLM) that uses a multi-label classifier to predict which LLMs can solve each input and a confidence-based policy to pick a small subset.

A new confidence-weighted ensembling policy (WEIGHTEDMAXCONF) that adjusts majority-vote counts by model confidences to reduce bias.
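One way to read the confidence-weighted voting idea is sketched below. The exact WEIGHTEDMAXCONF formula is not given in this summary, so the scoring rule used here (summing the classifier confidences of the models that agree on an answer) is an assumption, as are the function name `weighted_vote` and the placeholder model names.

```python
from collections import defaultdict

def weighted_vote(answers, confidences):
    """Pick the answer with the largest confidence-weighted vote.

    answers: dict of model name -> extracted discrete answer (None if invalid)
    confidences: dict of model name -> classifier confidence in [0, 1]
    Assumption: an answer's score is the sum of its supporters' confidences.
    """
    scores = defaultdict(float)
    for model, answer in answers.items():
        if answer is None:  # skip outputs with no extractable answer
            continue
        scores[answer] += confidences.get(model, 0.0)
    if not scores:
        return None
    # Iterate in sorted order so ties resolve deterministically.
    return max(sorted(scores), key=lambda a: scores[a])

# Hypothetical example: plain majority voting would pick "17" (two votes),
# but the single high-confidence model outweighs the two low-confidence ones.
answers = {"model_a": "42", "model_b": "17", "model_c": "17"}
confidences = {"model_a": 0.9, "model_b": 0.3, "model_c": 0.4}
picked = weighted_vote(answers, confidences)  # -> "42"
```

The example shows the kind of bias a weighted vote is meant to reduce: a count-only vote treats every model as equally reliable for every query.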

Key Findings

SELECTLLM (WEIGHTEDMAXCONF) improves accuracy vs. All-LLMs ensembles on two reasoning benchmarks.

Numbers: GSM8K: 76.04 → 77.94 (+1.90); MMLU: 60.92 → 65.81 (+4.89)

Practical Use: If you already ensemble many LLMs, replacing full ensembles with SELECTLLM can raise accuracy modestly while calling fewer models.

Evidence Ref: Table 2

SELECTLLM cuts inference latency substantially compared to top-performing ensemble baselines.

Numbers: GSM8K latency 19.00 → 16.50 (≈13% drop); MMLU latency 16.40 → 4.78 (≈70% drop)

Practical Use: Use SELECTLLM when you need similar or better accuracy but want lower per-query latency; on these reasoning tasks the measured reduction ranged from roughly 13% to roughly 70%.

Evidence Ref: Table 2
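The percentage drops quoted for latency follow directly from the reported figures (19.00 to 16.50 on GSM8K, 16.40 to 4.78 on MMLU, in seconds per query). A short check, with `relative_drop` as an illustrative helper name:

```python
def relative_drop(baseline, new):
    """Return the percentage reduction going from baseline to new."""
    return 100.0 * (baseline - new) / baseline

gsm8k_drop = relative_drop(19.00, 16.50)
mmlu_drop = relative_drop(16.40, 4.78)
print(f"GSM8K: {gsm8k_drop:.1f}%  MMLU: {mmlu_drop:.1f}%")
# prints "GSM8K: 13.2%  MMLU: 70.9%"
```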

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 77.94% | All LLMs 76.04% | +1.90 | GSM8K test | SELECTLLM MLC + WEIGHTEDMAXCONF vs All LLMs | Table 2
Accuracy | 65.81% | All LLMs 60.92% | +4.89 | MMLU test | SELECTLLM MLC + WEIGHTEDMAXCONF vs All LLMs | Table 2

What To Try In 7 Days

Build a small multi-label classifier (RoBERTa) mapping queries to capable models using existing labeled outputs.

Deploy WEIGHTEDMAXCONF: pick top-k models by confidence and apply confidence-weighted majority voting.

Measure per-query latency and accuracy vs your current ensemble; tune k to trade latency and accuracy.
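The pilot steps above could be prototyped roughly as follows. This is a sketch under stated assumptions: the models and the classifier are replaced by stubs, the vote sums classifier confidences per answer (the exact WEIGHTEDMAXCONF rule is not specified here), and all names (`select_top_k`, `answer_query`, `sweep_k`, `model_a`, ...) are hypothetical.

```python
import time
from collections import defaultdict

def select_top_k(confidences, k):
    """Return the k model names with the highest classifier confidence."""
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    return ranked[:k]

def answer_query(query, models, confidences, k):
    """Call only the top-k models and combine answers by summed confidence."""
    scores = defaultdict(float)
    for name in select_top_k(confidences, k):
        answer = models[name](query)
        if answer is not None:
            scores[answer] += confidences[name]
    return max(scores, key=scores.get) if scores else None

def sweep_k(dataset, models, score_fn, ks):
    """Measure accuracy and mean wall-clock latency per query for each k."""
    results = {}
    for k in ks:
        correct = 0
        start = time.perf_counter()
        for query, gold in dataset:
            pred = answer_query(query, models, score_fn(query), k)
            correct += int(pred == gold)
        elapsed = time.perf_counter() - start
        results[k] = {
            "accuracy": correct / len(dataset),
            "latency_s": elapsed / len(dataset),
        }
    return results

# Stub models and stub classifier scores, purely illustrative.
models = {
    "model_a": lambda q: "4",
    "model_b": lambda q: "5",
    "model_c": lambda q: "5",
}
score_fn = lambda q: {"model_a": 0.9, "model_b": 0.5, "model_c": 0.45}
dataset = [("2 + 2 = ?", "4")]
results = sweep_k(dataset, models, score_fn, ks=[1, 2, 3])
```

In this toy run, k=1 and k=2 return the correct answer, while k=3 lets two weaker models outvote the strongest one. That is the trade-off the k sweep is meant to surface on real data.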

Optimization Features

System Optimization
lower_per-query_model_calls, reduced_GPU_time
Inference Optimization
model_routing, subset_selection, confidence_weighted_voting

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

GSM8K (public), MMLU (public)

Risks & Boundaries

Limitations

Evaluated only on discrete-answer reasoning benchmarks (GSM8K, MMLU), not open-ended generation.

Multi-label classifier trained on limited data (~7K GSM8K, ~14K MMLU), constraining selection accuracy.

When Not To Use

For open-ended text generation tasks without easy discrete voting rules.

When you cannot reliably extract discrete answers from model outputs.

Failure Modes

Classifier bias toward dominant labels (e.g., metamath-7b-lm) causing poor recall for other capable models.

Incorrect or non-extractable LLM outputs are counted as INVALID, which shrinks the effective dataset and hurts selection.
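Because both the voting step and the INVALID failure mode hinge on answer extraction, it helps to see how fragile a simple extractor can be. The sketch below is a heuristic of our own, not the extraction rule used by the authors: it takes the last number in a response for GSM8K-style answers and the last standalone letter A-D for MMLU-style answers, and returns `INVALID` when nothing matches.

```python
import re

INVALID = "INVALID"

# Optional sign, digits with optional thousands separators, optional decimals.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
# A standalone capital letter A-D, e.g. "(B)" or "Answer: C".
_CHOICE = re.compile(r"\b([A-D])\b")

def extract_numeric_answer(text):
    """Return the last number in the text with commas removed, or INVALID."""
    matches = _NUMBER.findall(text)
    if not matches:
        return INVALID
    return matches[-1].replace(",", "")

def extract_choice(text):
    """Return the last standalone option letter A-D, or INVALID."""
    matches = _CHOICE.findall(text)
    return matches[-1] if matches else INVALID

numeric = extract_numeric_answer("She sells the rest, so the total is 1,250 dollars.")
choice = extract_choice("The answer is (B).")
missing = extract_numeric_answer("I am not sure how to solve this.")
```

A heuristic like this will misread some outputs (for example, a response that restates an intermediate number after the final answer), so it is worth tracking the INVALID rate per model before trusting vote results.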

Core Entities

Models

gemma-7b-lm, metamath-7b-lm, mistral-7b-lm, mistral-7b-it, llama2-7b, llama2-13b-chat, gemma-7b-it, RoBERTa (MLC), BERT (MLC), T5 (MLC)

Metrics

Accuracy, Latency (seconds per query), Weighted F1 (MLC)

Datasets

GSM8K, MMLU, SLDATA (constructed labels per LLM majority vote)
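The SLDATA description suggests one binary label per LLM per query, derived from a majority vote over that model's outputs. A minimal sketch of such label construction follows. It assumes several sampled outputs per model and marks a model as capable when its majority answer equals the gold answer; the details of the actual construction are not given here, and the names `majority_answer`, `build_label_row`, and `model_a` to `model_c` are hypothetical.

```python
from collections import Counter

def majority_answer(samples):
    """Most common valid answer among a model's sampled outputs, or None."""
    valid = [s for s in samples if s != "INVALID"]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]

def build_label_row(samples_by_model, gold, model_order):
    """Multi-hot vector: 1 where a model's majority answer matches gold."""
    return [
        int(majority_answer(samples_by_model[m]) == gold)
        for m in model_order
    ]

model_order = ["model_a", "model_b", "model_c"]
samples = {
    "model_a": ["18", "18", "20"],                 # majority "18": correct
    "model_b": ["20", "INVALID", "20"],            # majority "20": wrong
    "model_c": ["INVALID", "INVALID", "INVALID"],  # no usable answer
}
row = build_label_row(samples, gold="18", model_order=model_order)  # -> [1, 0, 0]
```

Rows like this, one per query, are the kind of target a multi-label classifier can be trained on.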

Benchmarks

Accuracy, Latency per query (sec)