Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
You can get close-to-ensemble accuracy while calling far fewer models per query, which reduces GPU time and latency and cuts inference costs for reasoning-heavy applications.
Summary TLDR
SelectLLM is a practical system that uses a lightweight multi-label classifier to predict which LLMs in a pool can solve a given query, then applies a confidence-weighted selection policy (WEIGHTEDMAXCONF) to call only that subset. On two reasoning benchmarks the method matches or beats ensemble baselines while lowering latency: it gains +1.90 accuracy on GSM8K and +4.89 on MMLU over the All-LLMs baseline, and it cuts inference time by ~13% (GSM8K) and ~70% (MMLU) relative to the top-performing ensemble baselines. Limits: it targets discrete-answer reasoning tasks, relies on a classifier with modest weighted F1 (0.71/0.68), and depends on extracting discrete answers from LLM outputs.
Problem Statement
Using many LLMs can improve accuracy, but querying every model for every input is slow and costly. We need a fast, query-aware way to pick a small subset of LLMs that together give a correct answer while reducing inference latency.
Main Contribution
A query-aware selection algorithm (SELECTLLM) that uses a multi-label classifier to predict which LLMs can solve each input and a confidence-based policy to pick a small subset.
A new confidence-weighted ensembling policy (WEIGHTEDMAXCONF) that adjusts majority-vote counts by model confidences to reduce bias.
Evaluation on GSM8K and MMLU showing competitive accuracy with much lower latency and analysis of an Oracle upper bound and linguistic failure modes.
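To make the confidence-weighted voting idea concrete, here is a minimal sketch. It assumes each selected model's vote is weighted by the classifier's confidence for that model and that the answer with the largest summed weight wins; the exact WEIGHTEDMAXCONF formula is not given in this summary, so the function name `weighted_vote`, the `INVALID` sentinel, and the tie-breaking rule are illustrative assumptions.

```python
from collections import defaultdict

INVALID = "INVALID"  # hypothetical sentinel for outputs with no extractable answer


def weighted_vote(answers, confidences):
    """Return the answer with the largest summed confidence.

    answers:     dict of model name -> extracted answer (or INVALID)
    confidences: dict of model name -> classifier confidence in [0, 1]
    """
    scores = defaultdict(float)
    for model, answer in answers.items():
        if answer == INVALID:
            continue  # invalid outputs do not vote
        scores[answer] += confidences.get(model, 0.0)
    if not scores:
        return INVALID
    # On a tie, max() keeps the answer that was inserted first.
    return max(scores, key=scores.get)


answers = {"metamath-7b-lm": "42", "gemma-7b-it": "40", "mistral-7b-it": "40"}
confidences = {"metamath-7b-lm": 0.9, "gemma-7b-it": 0.3, "mistral-7b-it": 0.4}
print(weighted_vote(answers, confidences))  # prints 42
```

In this toy case a plain majority vote would return "40" (two votes to one), while the weighted vote follows the single high-confidence model.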
Key Findings
SELECTLLM (WEIGHTEDMAXCONF) improves accuracy vs. All-LLMs ensembles on two reasoning benchmarks.
SELECTLLM cuts inference latency substantially compared to top-performing ensemble baselines.
A small classifier is the core routing component but its quality limits gains.
The Oracle upper bound is far higher than the current method's accuracy, indicating room for improvement.
SELECTLLM struggles with linguistic phenomena that increase difficulty across models.
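The Oracle comparison can be reproduced on your own logs with a few lines. The sketch below assumes the Oracle counts a query as solved when at least one model in the pool answers it correctly; that definition, the function names, and the toy data are assumptions for illustration, not details taken from the paper.

```python
def oracle_accuracy(correct_by_query):
    """Fraction of queries that at least one model in the pool gets right.

    correct_by_query: list of dicts, model name -> bool, one dict per query.
    """
    if not correct_by_query:
        return 0.0
    solved = sum(1 for per_model in correct_by_query if any(per_model.values()))
    return solved / len(correct_by_query)


def single_model_accuracy(correct_by_query, model):
    """Accuracy of one model over the same queries, for comparison."""
    if not correct_by_query:
        return 0.0
    return sum(1 for per_model in correct_by_query if per_model[model]) / len(correct_by_query)


rows = [
    {"a": True, "b": False},
    {"a": False, "b": True},
    {"a": False, "b": False},
    {"a": True, "b": True},
]
print(oracle_accuracy(rows))             # prints 0.75
print(single_model_accuracy(rows, "a"))  # prints 0.5
```

The gap between the Oracle number and the router's accuracy is the headroom a better classifier could recover.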
Results
Accuracy
Latency (sec per query)
MLC weighted F1
Who Should Care
What To Try In 7 Days
Build a small multi-label classifier (RoBERTa) mapping queries to capable models using existing labeled outputs.
Deploy WEIGHTEDMAXCONF: pick top-k models by confidence and apply confidence-weighted majority voting.
Measure per-query latency and accuracy against your current ensemble; tune k to trade off latency against accuracy.
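The steps above can be prototyped offline before any deployment. The sketch below replays logged results for each value of k and reports accuracy and mean latency. It assumes you already have, per query, the classifier confidences, each model's extracted answer, and each model's latency; it also assumes the selected models are called sequentially, so their latencies add up. The record layout and function names are hypothetical.

```python
def select_top_k(confidences, k):
    """Return the k model names with the highest classifier confidence."""
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    return ranked[:k]


def sweep_k(records, max_k):
    """Map each k to (accuracy, mean latency) over the logged records."""
    results = {}
    for k in range(1, max_k + 1):
        hits = 0
        total_latency = 0.0
        for r in records:
            chosen = select_top_k(r["confidences"], k)
            # Confidence-weighted vote over the chosen models only.
            scores = {}
            for m in chosen:
                answer = r["answers"][m]
                scores[answer] = scores.get(answer, 0.0) + r["confidences"][m]
            prediction = max(scores, key=scores.get)
            hits += prediction == r["gold"]
            # Assumption: sequential calls, so latencies are summed.
            total_latency += sum(r["latency"][m] for m in chosen)
        results[k] = (hits / len(records), total_latency / len(records))
    return results


latency = {"a": 1.0, "b": 2.0, "c": 1.5}
records = [
    {"confidences": {"a": 0.8, "b": 0.5, "c": 0.2},
     "answers": {"a": "7", "b": "7", "c": "9"}, "latency": latency, "gold": "7"},
    {"confidences": {"a": 0.6, "b": 0.5, "c": 0.4},
     "answers": {"a": "3", "b": "5", "c": "5"}, "latency": latency, "gold": "5"},
]
for k, (acc, lat) in sweep_k(records, 3).items():
    print(k, acc, lat)
```

If your serving stack calls models in parallel, replace the sum with the maximum latency of the chosen models.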
Optimization Features
System Optimization
- lower_per-query_model_calls
- reduced_GPU_time
Inference Optimization
- model_routing
- subset_selection
- confidence_weighted_voting
Reproducibility
Data Urls
- GSM8K (public)
- MMLU (public)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluated only on discrete-answer reasoning benchmarks (GSM8K, MMLU), not open-ended generation.
- Multi-label classifier trained on limited data (~7K GSM8K, ~14K MMLU), constraining selection accuracy.
- A discrete answer can be extracted from only 92–95% of LLM outputs; the remaining invalid outputs reduce usable data and can bias results.
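Because the method depends on extracting a discrete answer from free-form output, it helps to measure the extraction rate on your own data. The sketch below is one possible heuristic, not the paper's extractor: it takes the last number in a response for GSM8K-style questions and a letter following "answer" for MMLU-style questions. The regular expressions, function names, and `INVALID` sentinel are assumptions.

```python
import re

INVALID = "INVALID"  # hypothetical sentinel for non-extractable outputs

# A number with optional sign, thousands separators, and decimals.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
# A choice letter A-D after "answer", e.g. "The answer is (C)" or "Answer: B".
_CHOICE = re.compile(r"[Aa]nswer\s*(?:is|:)?\s*\(?([A-D])\b")


def extract_numeric(text):
    """Return the last number in the text with commas removed, else INVALID."""
    matches = _NUMBER.findall(text)
    if not matches:
        return INVALID
    return matches[-1].replace(",", "")


def extract_choice(text):
    """Return the choice letter stated after 'answer', else INVALID."""
    match = _CHOICE.search(text)
    return match.group(1) if match else INVALID


def valid_rate(extracted):
    """Fraction of extracted answers that are not INVALID."""
    if not extracted:
        return 0.0
    return sum(1 for a in extracted if a != INVALID) / len(extracted)


outputs = [
    "She pays 1,250 dollars, so the answer is 1,250.",
    "The total is 18.",
    "I cannot solve this problem.",
    "Adding up gives 3.5",
]
extracted = [extract_numeric(o) for o in outputs]
print(extracted)
print(valid_rate(extracted))
```

A rate well below the 92–95% range reported here would suggest the prompts or the extractor need work before routing results can be trusted.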
When Not To Use
- For open-ended text generation tasks without easy discrete voting rules.
- When you cannot reliably extract discrete answers from model outputs.
- When the model pool is tiny (two models or fewer) and routing overhead outweighs the benefits.
Failure Modes
- Classifier bias toward dominant labels (e.g., metamath-7b-lm) causing poor recall for other capable models.
- Incorrect or non-extractable LLM outputs counted as INVALID reduce the effective dataset size and hurt selection.
- Complex linguistic phenomena (quantifiers, units, ordinals, fractions) lead to errors across models and routing mistakes.
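The first failure mode, bias toward a dominant label, is easy to check with per-label recall on a held-out set. The sketch below assumes binary indicator rows with one column per model; the function name and the toy predictions are illustrative only.

```python
def per_label_recall(y_true, y_pred, labels):
    """Recall for each label (model) in a multi-label prediction task.

    y_true, y_pred: lists of 0/1 rows, one column per label.
    Returns None for labels that have no positive examples.
    """
    recalls = {}
    for j, label in enumerate(labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        recalls[label] = tp / (tp + fn) if (tp + fn) else None
    return recalls


labels = ["metamath-7b-lm", "gemma-7b-it"]
y_true = [[1, 1], [1, 0], [1, 1], [0, 1]]
# A degenerate classifier that always predicts only the dominant model.
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]
print(per_label_recall(y_true, y_pred, labels))
```

A recall near 1.0 for one model and near 0.0 for the others indicates the classifier has collapsed onto the dominant label, so the router rarely picks the other capable models.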
Core Entities
Models
- gemma-7b-lm
- metamath-7b-lm
- mistral-7b-lm
- mistral-7b-it
- llama2-7b
- llama2-13b-chat
- gemma-7b-it
- RoBERTa (MLC)
- BERT (MLC)
- T5 (MLC)
Metrics
- Accuracy
- Latency (seconds per query)
- Weighted F1 (MLC)
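For reference, weighted F1 in a multi-label setting is usually the per-label F1 averaged with weights equal to each label's number of true positives in the data (its support). The sketch below implements that standard definition in plain Python; whether the paper computes it exactly this way is an assumption.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-label F1 for 0/1 indicator rows."""
    n_labels = len(y_true[0])
    weighted_sum = 0.0
    total_support = 0
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        support = tp + fn  # number of true instances of this label
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        weighted_sum += support * f1
        total_support += support
    return weighted_sum / total_support if total_support else 0.0


y_true = [[1, 0], [1, 1], [0, 1], [1, 0]]
y_pred = [[1, 0], [1, 0], [0, 1], [0, 0]]
print(round(weighted_f1(y_true, y_pred), 3))  # prints 0.747
```

Here label 0 has F1 0.8 with support 3 and label 1 has F1 2/3 with support 2, giving (3 × 0.8 + 2 × 2/3) / 5 ≈ 0.747.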
Datasets
- GSM8K
- MMLU
- SLDATA (constructed labels per LLM majority vote)
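One plausible reading of "labels per LLM majority vote" is that each LLM is sampled several times per query, its majority answer is compared with the gold answer, and the result becomes that model's 0/1 label. The sketch below follows that reading; the sampling setup, function names, and `INVALID` sentinel are assumptions rather than the paper's documented procedure.

```python
from collections import Counter

INVALID = "INVALID"  # hypothetical sentinel for non-extractable outputs


def majority_answer(samples):
    """Most common valid answer among one model's samples, else INVALID."""
    valid = [s for s in samples if s != INVALID]
    if not valid:
        return INVALID
    return Counter(valid).most_common(1)[0][0]


def build_label_row(samples_by_model, gold, models):
    """Return one multi-label target row: 1 where a model's majority answer is correct."""
    return [1 if majority_answer(samples_by_model[m]) == gold else 0 for m in models]


models = ["metamath-7b-lm", "llama2-7b"]
samples_by_model = {
    "metamath-7b-lm": ["18", "18", "20"],
    "llama2-7b": ["20", INVALID, "20"],
}
print(build_label_row(samples_by_model, "18", models))  # prints [1, 0]
```

Rows built this way, paired with the query text, form the training set for the multi-label classifier.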
Benchmarks
- Accuracy
- Latency per query (sec)

