SelectLLM routes each query to a small subset of LLMs to keep accuracy high while cutting inference latency.

Overview

Decision Snapshot: Ready For Pilot

Method is practical: a small classifier plus confidence-based selection gives measurable latency and accuracy gains on two public reasoning benchmarks, but gains depend on classifier quality and discrete-answer extraction.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Kaushal Kumar Maurya, KV Aditya Srivatsa, Ekaterina Kochmar

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can get close-to-ensemble accuracy while calling far fewer models per query, which reduces GPU time and latency and cuts inference costs for reasoning-heavy applications.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

SelectLLM is a practical system that uses a lightweight multi-label classifier to predict which LLMs in a pool can solve a given query, then applies a confidence-weighted selection policy (WEIGHTEDMAXCONF) to call only that subset. On two reasoning benchmarks the method matches or beats ensemble baselines while lowering latency: accuracy rises by 1.90 points on GSM8K and 4.89 points on MMLU vs. the All-LLMs baseline, and inference time falls by ~13% (GSM8K) and ~70% (MMLU) compared to top-performing ensemble baselines. Limits: the method is evaluated only on discrete-answer reasoning tasks, and it relies both on a classifier with modest F1 (0.71/0.68) and on extracting discrete answers from LLM outputs.

Problem Statement

Using many LLMs can improve accuracy but querying all models for every input is slow and costly. We need a fast, query-aware way to pick a small subset of LLMs that together give a correct answer while reducing inference latency.

Main Contribution

A query-aware selection algorithm (SELECTLLM) that uses a multi-label classifier to predict which LLMs can solve each input and a confidence-based policy to pick a small subset.

A new confidence-weighted ensembling policy (WEIGHTEDMAXCONF) that adjusts majority-vote counts by model confidences to reduce bias.
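
The idea of adjusting vote counts by model confidence can be illustrated with a minimal sketch. This is not the authors' exact WEIGHTEDMAXCONF formula, which this summary does not give. It assumes each model contributes its classifier confidence (instead of a count of 1) to the answer it produced; the function name `weighted_vote` and the model names are hypothetical.

```python
from collections import defaultdict

def weighted_vote(answers, confidences):
    """Return the answer with the largest summed confidence.

    answers:     model name -> extracted discrete answer (None if not extractable)
    confidences: model name -> classifier confidence in [0, 1]
    Assumption: a vote's weight is simply the model's confidence.
    """
    scores = defaultdict(float)
    for model, answer in answers.items():
        if answer is None:  # skip outputs with no usable answer
            continue
        scores[answer] += confidences.get(model, 0.0)
    if not scores:
        return None
    return max(scores, key=scores.get)

# Hypothetical example: two low-confidence models agree, one confident model differs.
answers = {"model_a": "42", "model_b": "41", "model_c": "41"}
confidences = {"model_a": 0.9, "model_b": 0.3, "model_c": 0.4}
winner = weighted_vote(answers, confidences)
```

In this toy case a plain majority vote would return "41" (two votes to one), while the weighted vote returns "42" because 0.9 exceeds 0.3 + 0.4.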

Key Findings

SELECTLLM (WEIGHTEDMAXCONF) improves accuracy vs. All-LLMs ensembles on two reasoning benchmarks.

Numbers: GSM8K accuracy 76.04→77.94 (+1.90); MMLU accuracy 60.92→65.81 (+4.89)

Practical Use: If you already ensemble many LLMs, replacing full ensembles with SELECTLLM can raise accuracy modestly while calling fewer models.

Evidence Ref: Table 2

SELECTLLM cuts inference latency substantially compared to top-performing ensemble baselines.

Numbers: GSM8K latency 19.00→16.50 s per query (≈13% drop); MMLU latency 16.40→4.78 s per query (≈70% drop)

Practical Use: Use SELECTLLM when you need similar or better accuracy and want roughly 13% to 70% lower per-query latency on these reasoning tasks.

Evidence Ref: Table 2
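
The deltas quoted in the two findings above can be checked with a few lines of arithmetic. The snippet uses only the figures reported in this summary; the helper name `percent_drop` is illustrative.

```python
def percent_drop(baseline, new):
    """Relative reduction from baseline to new, in percent."""
    return (baseline - new) / baseline * 100

# Latency figures (seconds per query) as reported in this summary.
gsm8k_drop = percent_drop(19.00, 16.50)
mmlu_drop = percent_drop(16.40, 4.78)

# Accuracy gains in percentage points vs. the All-LLMs baseline.
gsm8k_gain = 77.94 - 76.04
mmlu_gain = 65.81 - 60.92

print(f"GSM8K: latency -{gsm8k_drop:.1f}%, accuracy +{gsm8k_gain:.2f}")
print(f"MMLU:  latency -{mmlu_drop:.1f}%, accuracy +{mmlu_gain:.2f}")
```

The computed drops are about 13.2% and 70.9%, in line with the approximate ~13% and ~70% figures quoted.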

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	77.94%	All LLMs 76.04%	+1.90	GSM8K test	SELECTLLM MLC + WEIGHTEDMAXCONF vs All LLMs	Table 2
Accuracy	65.81%	All LLMs 60.92%	+4.89	MMLU test	SELECTLLM MLC + WEIGHTEDMAXCONF vs All LLMs	Table 2

What To Try In 7 Days

Build a small multi-label classifier (RoBERTa) mapping queries to capable models using existing labeled outputs.

Deploy WEIGHTEDMAXCONF: pick top-k models by confidence and apply confidence-weighted majority voting.

Measure per-query latency and accuracy vs your current ensemble; tune k to trade latency and accuracy.
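
The three steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the released SelectLLM code: `classifier` stands in for the trained multi-label classifier and is assumed to return a confidence per model, `models` maps hypothetical model names to callables that return an extracted answer (or `None`), and the vote simply sums confidences.

```python
import time
from collections import defaultdict

def select_top_k(confidences, k):
    """Names of the k models with the highest classifier confidence."""
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    return ranked[:k]

def route_query(query, classifier, models, k):
    """Call only the top-k models and combine answers by summed confidence."""
    confidences = classifier(query)      # model name -> confidence (assumed interface)
    chosen = select_top_k(confidences, k)
    start = time.perf_counter()
    scores = defaultdict(float)
    for name in chosen:
        answer = models[name](query)     # only the chosen models are queried
        if answer is not None:
            scores[answer] += confidences[name]
    latency = time.perf_counter() - start
    answer = max(scores, key=scores.get) if scores else None
    return {"answer": answer, "called": chosen, "latency_s": latency}

# Stub classifier and models, purely for illustration.
classifier = lambda q: {"a": 0.8, "b": 0.6, "c": 0.1}
models = {"a": lambda q: "7", "b": lambda q: "7", "c": lambda q: "9"}
result = route_query("What is 3 + 4?", classifier, models, k=2)
```

Logging `latency_s` and correctness per query for several values of `k`, alongside the same measurements for the current full ensemble, gives the latency/accuracy trade-off curve the last step asks for.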

Optimization Features

System Optimization

lower_per-query_model_calls, reduced_GPU_time

Inference Optimization

model_routing, subset_selection, confidence_weighted_voting

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/kaushal0494/SelectLLM

Data URLs

GSM8K (public), MMLU (public)

Risks & Boundaries

Limitations

Evaluated only on discrete-answer reasoning benchmarks (GSM8K, MMLU), not open-ended generation.

Multi-label classifier trained on limited data (~7K GSM8K, ~14K MMLU), constraining selection accuracy.

When Not To Use

For open-ended text generation tasks without easy discrete voting rules.

When you cannot reliably extract discrete answers from model outputs.

Failure Modes

Classifier bias toward dominant labels (e.g., metamath-7b-lm) causing poor recall for other capable models.

Incorrect or non-extractable LLM outputs counted as INVALID reduce effective dataset and hurt selection.
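
To make the INVALID failure mode concrete, here is a small sketch of discrete-answer extraction. The extraction rules are assumptions for illustration (last number in the text for GSM8K-style outputs, an "answer is X" pattern for MMLU-style multiple choice); the rules used in the paper's code may differ.

```python
import re

INVALID = "INVALID"

def extract_numeric(text):
    """Take the last number in the output as the answer (assumed GSM8K-style rule)."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return INVALID
    return matches[-1].replace(",", "")

def extract_choice(text):
    """Take the letter after 'answer is' (assumed MMLU-style rule)."""
    match = re.search(r"answer is \(?([A-D])\)?", text)
    return match.group(1) if match else INVALID

numeric = extract_numeric("So the total is 1,250 apples.")
choice = extract_choice("The answer is (C).")
unparsed = extract_numeric("I am not sure.")
```

Tracking the share of outputs that come back as `INVALID` per model is a cheap way to see how much of the data is lost to extraction before any voting takes place.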

Core Entities

Models

gemma-7b-lm, metamath-7b-lm, mistral-7b-lm, mistral-7b-it, llama2-7b, llama2-13b-chat, gemma-7b-it, RoBERTa (MLC), BERT (MLC), T5 (MLC)

Metrics

Accuracy, Latency (seconds per query), Weighted F1 (MLC)

Datasets

GSM8K, MMLU, SLDATA (constructed labels per LLM majority vote)
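
One plausible reading of "constructed labels per LLM majority vote" is sketched below: each model is sampled several times per query, its majority answer is compared with the gold answer, and the result becomes that model's binary label for the multi-label classifier. This interpretation, the function names, and the sample data are assumptions; the summary does not spell out the construction procedure.

```python
from collections import Counter

def majority_answer(samples):
    """Most common extractable answer among one model's samples, or None."""
    valid = [s for s in samples if s != "INVALID"]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]

def build_labels(samples_by_model, gold):
    """Binary label per model: 1 if its majority answer equals the gold answer."""
    return {
        model: int(majority_answer(samples) == gold)
        for model, samples in samples_by_model.items()
    }

# Hypothetical sampled answers for a single query whose gold answer is "18".
samples_by_model = {
    "model_a": ["18", "18", "20"],
    "model_b": ["20", "INVALID", "20"],
    "model_c": ["INVALID", "INVALID", "INVALID"],
}
labels = build_labels(samples_by_model, gold="18")
```

Each query then yields one multi-hot target vector (here 1, 0, 0), which is the shape of label a multi-label classifier such as the RoBERTa MLC would be trained on.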

Benchmarks

Accuracy, Latency per query (sec)
