Overview
The method is practical: a small classifier plus confidence-based selection yields measurable latency and accuracy gains on two public reasoning benchmarks, though the gains depend on classifier quality and on reliable discrete-answer extraction.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
You can get close-to-ensemble accuracy while calling far fewer models per query, which reduces GPU time and latency and cuts inference costs for reasoning-heavy applications.
Who Should Care
Summary TLDR
SELECTLLM is a practical system that uses a lightweight multi-label classifier to predict which LLMs in a pool can solve a given query, then applies a confidence-weighted selection policy (WEIGHTEDMAXCONF) to call only that subset. On two reasoning benchmarks the method matches or beats ensemble baselines while lowering latency: accuracy rises by 1.90 points on GSM8K and 4.89 points on MMLU vs. the All-LLMs baseline, and inference time drops by ~13% (GSM8K) and ~70% (MMLU) compared to the top-performing ensemble baselines. Limits: it targets discrete-answer reasoning tasks, relies on a classifier with modest F1 (0.71/0.68), and depends on extracting discrete answers from LLM outputs.
Problem Statement
Using many LLMs can improve accuracy but querying all models for every input is slow and costly. We need a fast, query-aware way to pick a small subset of LLMs that together give a correct answer while reducing inference latency.
Main Contribution
A query-aware selection algorithm (SELECTLLM) that uses a multi-label classifier to predict which LLMs can solve each input and a confidence-based policy to pick a small subset.
A new confidence-weighted ensembling policy (WEIGHTEDMAXCONF) that adjusts majority-vote counts by model confidences to reduce bias.
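A minimal sketch of what confidence-weighted voting could look like. The summary only says that majority-vote counts are adjusted by model confidences, so the exact weighting here (summing each model's confidence per answer) is an assumption, and the function and model names are hypothetical.

```python
from collections import defaultdict


def weighted_vote(answers, confidences):
    """Pick the answer with the largest summed confidence.

    answers: dict of model name -> extracted answer (None if unusable).
    confidences: dict of model name -> confidence in [0, 1].
    The summed-confidence rule is an illustrative assumption, not the
    paper's exact WEIGHTEDMAXCONF formula.
    """
    scores = defaultdict(float)
    for model, answer in answers.items():
        if answer is None:
            continue  # skip models whose output could not be used
        scores[answer] += confidences.get(model, 0.0)
    if not scores:
        return None
    return max(scores, key=scores.get)


# Hypothetical example: two low-confidence models agree, one
# high-confidence model disagrees.
answers = {"model_a": "42", "model_b": "41", "model_c": "41"}
confidences = {"model_a": 0.9, "model_b": 0.3, "model_c": 0.4}
winner = weighted_vote(answers, confidences)
```

In this toy case a plain majority vote would return "41", while the confidence-weighted vote favors the single high-confidence model.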
Key Findings
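SELECTLLM (WEIGHTEDMAXCONF) improves accuracy vs. the All-LLMs ensemble on two reasoning benchmarks: +1.90 points on GSM8K (77.94% vs. 76.04%) and +4.89 points on MMLU (65.81% vs. 60.92%).
SELECTLLM cuts inference latency compared to top-performing ensemble baselines: by ~13% on GSM8K and ~70% on MMLU.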
Results
| Metric | Value | Baseline | Delta (pts) | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 77.94% | All LLMs 76.04% | +1.90 | GSM8K test | SELECTLLM MLC + WEIGHTEDMAXCONF vs All LLMs | Table 2 |
| Accuracy | 65.81% | All LLMs 60.92% | +4.89 | MMLU test | SELECTLLM MLC + WEIGHTEDMAXCONF vs All LLMs | Table 2 |
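The deltas in the table are absolute differences in accuracy (percentage points), not relative improvements. A short check using only the values in the table; the relative figures it computes are derived here and are not reported in the source.

```python
# Accuracy values copied from the Results table above.
results = {
    "GSM8K": {"selectllm": 77.94, "all_llms": 76.04},
    "MMLU": {"selectllm": 65.81, "all_llms": 60.92},
}


def deltas(row):
    """Return (absolute delta in points, relative delta in percent)."""
    absolute = round(row["selectllm"] - row["all_llms"], 2)
    relative = round(100 * absolute / row["all_llms"], 2)
    return absolute, relative


for name, row in results.items():
    absolute, relative = deltas(row)
    print(f"{name}: +{absolute:.2f} points ({relative:.2f}% relative)")
```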
What To Try In 7 Days
Build a small multi-label classifier (RoBERTa) mapping queries to capable models using existing labeled outputs.
Deploy WEIGHTEDMAXCONF: pick top-k models by confidence and apply confidence-weighted majority voting.
Measure per-query latency and accuracy vs your current ensemble; tune k to trade latency and accuracy.
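The steps above can be prototyped offline before calling any model. This is a sketch under stated assumptions: classifier confidences and per-model answers are already logged for each query, the number of models called per query stands in for latency, and all names (`select_top_k`, `sweep_k`, `m1`..`m3`) are hypothetical.

```python
from collections import defaultdict


def select_top_k(classifier_conf, k):
    """Return the k model names with the highest classifier confidence."""
    ranked = sorted(classifier_conf, key=classifier_conf.get, reverse=True)
    return ranked[:k]


def weighted_vote(answers, conf):
    """Sum confidences per answer and return the top-scoring answer."""
    scores = defaultdict(float)
    for model, answer in answers.items():
        scores[answer] += conf[model]
    return max(scores, key=scores.get)


def sweep_k(examples, ks):
    """Replay logged answers for each k; report accuracy and calls per query."""
    report = {}
    for k in ks:
        correct = 0
        calls = 0
        for ex in examples:
            chosen = select_top_k(ex["conf"], k)
            calls += len(chosen)
            answers = {m: ex["answers"][m] for m in chosen}
            correct += weighted_vote(answers, ex["conf"]) == ex["gold"]
        report[k] = {
            "accuracy": correct / len(examples),
            "calls_per_query": calls / len(examples),
        }
    return report


# Hypothetical logged data for two queries and three models.
examples = [
    {"conf": {"m1": 0.9, "m2": 0.6, "m3": 0.2},
     "answers": {"m1": "8", "m2": "8", "m3": "7"}, "gold": "8"},
    {"conf": {"m1": 0.4, "m2": 0.5, "m3": 0.45},
     "answers": {"m1": "3", "m2": "5", "m3": "3"}, "gold": "3"},
]
report = sweep_k(examples, ks=[1, 2, 3])
```

Plotting accuracy against calls per query for each k gives a first view of the trade-off; measured wall-clock latency should replace the call count once real model calls are in place.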
Optimization Features
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Evaluated only on discrete-answer reasoning benchmarks (GSM8K, MMLU), not open-ended generation.
Multi-label classifier trained on limited data (~7K GSM8K, ~14K MMLU), constraining selection accuracy.
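Because the classifier's training data is limited, it can help to inspect the multi-label targets before training. The sketch below assumes each training record stores the gold answer and each model's extracted answer; the record layout and the names `model_a` and `model_b` are hypothetical and not taken from the paper.

```python
def build_multilabel_targets(records, models):
    """One binary vector per query: 1 if that model's answer matched gold."""
    targets = []
    for rec in records:
        preds = rec["predictions"]
        # A model with no logged prediction is treated as unable to solve.
        targets.append([1 if preds.get(m) == rec["gold"] else 0 for m in models])
    return targets


def label_positive_rates(targets, models):
    """Fraction of queries each model solves; a skewed rate flags a dominant label."""
    n = len(targets)
    if n == 0:
        return {m: 0.0 for m in models}
    return {m: sum(row[i] for row in targets) / n for i, m in enumerate(models)}


# Hypothetical logged outputs for three queries.
records = [
    {"gold": "8", "predictions": {"model_a": "8", "model_b": "7"}},
    {"gold": "B", "predictions": {"model_a": "B", "model_b": "B"}},
    {"gold": "3", "predictions": {"model_a": "3"}},
]
models = ["model_a", "model_b"]
targets = build_multilabel_targets(records, models)
rates = label_positive_rates(targets, models)
```

A model whose positive rate is far above the others is a candidate for the dominant-label bias described under Failure Modes.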
When Not To Use
For open-ended text generation tasks without easy discrete voting rules.
When you cannot reliably extract discrete answers from model outputs.
Failure Modes
Classifier bias toward dominant labels (e.g., metamath-7b-lm) causes poor recall for other capable models.
Incorrect or non-extractable LLM outputs are counted as INVALID, which shrinks the effective dataset and hurts selection.
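Since voting depends on discrete answers, the extraction step is worth testing on its own. The sketch below shows one simple way to pull a final number (GSM8K-style) or a choice letter (MMLU-style) from free text and return INVALID otherwise. The regular expressions are illustrative assumptions, not the paper's extraction rules, and they will miss many real output formats.

```python
import re

INVALID = "INVALID"

# Last number in the text, allowing thousands separators and decimals.
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
# A choice letter A-D following the word "answer" (case-insensitive keyword only).
CHOICE_RE = re.compile(r"(?i:answer)\s*(?:is|:)?\s*\(?([A-D])\b\)?")


def extract_numeric(text):
    """Return the last number in the text as a string, or INVALID."""
    matches = NUMBER_RE.findall(text)
    if not matches:
        return INVALID
    return matches[-1].replace(",", "")


def extract_choice(text):
    """Return the choice letter stated after 'answer', or INVALID."""
    match = CHOICE_RE.search(text)
    return match.group(1) if match else INVALID


def invalid_rate(extracted):
    """Share of outputs that could not be mapped to a discrete answer."""
    if not extracted:
        return 0.0
    return sum(a == INVALID for a in extracted) / len(extracted)
```

Tracking `invalid_rate` per model shows how much of the dataset is lost to extraction before any selection or voting happens.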

