Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
1
Why It Matters For Business
Adaptive per-question solver selection cuts cloud API bills and lets teams trade a small latency increase for big cost or accuracy gains on reasoning workloads.
Summary TLDR
This paper introduces Adaptive-Solver (AS), a runtime framework that inspects an LLM's answer quality and, when needed, adapts the solving strategy (model, sample size, prompt, decomposition granularity). On eight reasoning datasets the method cuts API cost by 46–85% relative to always using GPT-4 while matching GPT-4 accuracy, or raises accuracy by ≈4.5% at equal cost. The system uses a small validation set to build a short pipeline of solvers and a consistency-based stopping rule to avoid extra calls.
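The runtime loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Solver` dataclass, the `sample` callable, the per-solver `threshold`, and the use of majority-vote agreement as the consistency score are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Solver:
    """Hypothetical solver: one (model, prompt, sample size) combination."""
    name: str
    sample: Callable[[str, int], List[str]]  # (question, n) -> n candidate answers
    n_samples: int
    threshold: float  # minimum agreement needed to accept the answer


def consistency(answers: List[str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree with it."""
    top, count = Counter(answers).most_common(1)[0]
    return top, count / len(answers)


def adaptive_solve(question: str, pipeline: List[Solver]) -> Tuple[str, int]:
    """Try solvers in order; stop at the first one whose answers agree enough.

    The last solver's majority answer is returned even if it misses its
    threshold, since there is nothing stronger to fall back to.
    """
    if not pipeline:
        raise ValueError("pipeline must contain at least one solver")
    for rounds, solver in enumerate(pipeline, start=1):
        answers = solver.sample(question, solver.n_samples)
        answer, score = consistency(answers)
        if score >= solver.threshold:
            break
    return answer, rounds
```

In practice `sample` would wrap an LLM API call; the number of rounds returned corresponds to the "average solving rounds" metric once averaged over questions.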
Problem Statement
Most LLM reasoning systems use one fixed solver (same model, prompt, sample size, decomposition) for all problems. That wastes budget on easy items and under-solves hard ones. The paper asks: can we dynamically allocate test-time compute—switching models, sample sizes, prompts, and decomposition—per question to reduce cost and improve accuracy?
Main Contribution
Adaptive-Solver (AS) framework: runtime evaluation + adaptation that selects solvers per question.
Four concrete adaptation levers: model routing, sample-size scheduling, prompting-method switching, and decomposition-granularity control.
Automatic pipeline-configuration algorithm that builds a short sequence of increasingly costly solvers using a small validation set and cached responses.
Extensive experiments across 8 reasoning datasets showing large cost savings and better cost–accuracy tradeoffs versus static baselines.
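One way to picture the pipeline-configuration step is an offline search over cached validation results. The sketch below is an assumption-laden stand-in for the paper's heuristic search: it exhaustively enumerates short, cost-ordered pipelines, uses a single shared consistency threshold, and expects the cache to already hold one (majority answer, consistency) pair per solver and validation question. The names `evaluate`, `configure`, `cached`, and `cost` are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Optional, Sequence, Tuple

# solver name -> per-question (majority answer, consistency), from cached responses
Cached = Dict[str, List[Tuple[str, float]]]


def evaluate(pipeline: Sequence[str], cached: Cached, cost: Dict[str, float],
             gold: List[str], threshold: float) -> Tuple[float, float]:
    """Replay a pipeline on cached responses; return (accuracy, average cost)."""
    correct, spent = 0, 0.0
    for i, truth in enumerate(gold):
        for pos, name in enumerate(pipeline):
            answer, score = cached[name][i]
            spent += cost[name]
            # Accept when consistent enough, or when no stronger solver remains.
            if score >= threshold or pos == len(pipeline) - 1:
                break
        correct += answer == truth
    return correct / len(gold), spent / len(gold)


def configure(cached: Cached, cost: Dict[str, float], gold: List[str],
              target_acc: float, threshold: float = 0.6,
              max_len: int = 3) -> Optional[Tuple[Tuple[str, ...], float, float]]:
    """Pick the cheapest pipeline (up to max_len solvers) that meets target_acc."""
    names = sorted(cost, key=cost.get)  # cheapest first
    best = None
    for k in range(1, max_len + 1):
        # combinations() preserves input order, so pipelines stay cost-ordered.
        for pipeline in combinations(names, k):
            acc, avg_cost = evaluate(pipeline, cached, cost, gold, threshold)
            if acc >= target_acc and (best is None or avg_cost < best[2]):
                best = (pipeline, acc, avg_cost)
    return best
```

Because every candidate is scored from the cache, the search itself makes no API calls; the cost is paid once, when the cache is filled.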
Key Findings
Adaptive-Solver cuts API cost by 46–85% relative to always using GPT-4 while keeping GPT-4-level accuracy.
At the same cost budget, AS improves accuracy over static baselines.
Multi-round adaptation adds little interactive overhead on average.
Each adaptation dimension matters; model adaptation is most critical for accuracy, and sample-size adaptation helps control cost.
Decomposition granularity should match problem difficulty.
Results
API cost reduction vs GPT-4
Accuracy
Average solving rounds (API calls) per question
Relative inference time
Effect of decomposition granularity
Who Should Care
What To Try In 7 Days
Run a small validation set (50–200 samples) and cache multiple responses per (model,prompt) pair.
Implement a consistency-based stopping rule (self-consistency) and tune thresholds to your budget.
Build a 2–4 step pipeline (cheap model + small sample → medium prompt/sample → strong model fallback), then measure the cost/accuracy trade-offs.
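For the caching step in the list above, a small on-disk cache keyed by (model, prompt, question) is usually enough to start. The following is a minimal sketch under stated assumptions: `ResponseCache`, `get_or_call`, and the JSON file layout are hypothetical, and `call` stands in for whatever function requests additional samples from your LLM API.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Union


class ResponseCache:
    """Hypothetical JSON-backed cache of sampled answers per (model, prompt, question)."""

    def __init__(self, path: Union[str, Path]) -> None:
        self.path = Path(path)
        self.data: Dict[str, List[str]] = (
            json.loads(self.path.read_text()) if self.path.exists() else {}
        )

    def get_or_call(self, model: str, prompt: str, question: str, n: int,
                    call: Callable[[int], List[str]]) -> List[str]:
        """Return n answers, requesting only the samples not already cached."""
        key = json.dumps([model, prompt, question])
        have = self.data.setdefault(key, [])
        if len(have) < n:
            have.extend(call(n - len(have)))  # pay only for the missing samples
            self.path.write_text(json.dumps(self.data))
        return have[:n]
```

Caching more samples than any single solver needs lets smaller sample sizes be evaluated later by slicing, without new API calls.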
Agent Features
Memory
- caches validation responses to avoid extra API calls
Planning
- automatic pipeline configuration (heuristic search)
Tool Use
- uses multiple LLM APIs (model switching)
Frameworks
- Adaptive-Solver (AS)
Architectures
- pipeline of solvers (sequential routing)
Optimization Features
Token Efficiency
- use smaller models and smaller sample sizes where possible
Model Optimization
- model routing (cheap→strong fallback)
System Optimization
- pre-saved response cache to avoid repeated API calls during pipeline search
Inference Optimization
- adaptive sample sizing
- prompt switching
- early stopping via consistency thresholds
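The cost effect of early stopping can be reasoned about with a generic cascade accounting; this is not a formula taken from the paper. Assume a pipeline of K solvers where c_k is the per-question cost of solver k and p_k is the fraction of questions that reach it (p_1 = 1, since every question hits the first solver):

```latex
\mathbb{E}[\text{cost}] = \sum_{k=1}^{K} p_k \, c_k
% Illustrative numbers only (not from the paper):
% two solvers with c_1 = 1, c_2 = 10, and 30% of questions falling back (p_2 = 0.3):
\mathbb{E}[\text{cost}] = 1 \cdot 1 + 0.3 \cdot 10 = 4
% versus 10 for always calling the strong solver, i.e. a 60% reduction.
```

The fallback fraction p_k depends directly on the consistency threshold, which is why the threshold is the main knob for trading cost against accuracy.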
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Requires a validation set and pre-saved responses to configure pipelines.
- Pipeline search needs many cached responses initially; configuration has nontrivial setup cost.
- Multi-round solving can increase latency for single-request, low-latency use cases.
When Not To Use
- When strict single-call low latency is required (real-time single-turn APIs).
- If you cannot call multiple models or lack access to cheaper model variants.
- When validation data is unavailable or too small to represent task difficulty.
Failure Modes
- Mis-calibrated consistency thresholds cause premature stops or unnecessary fallbacks.
- Pipeline overfits the validation set and underperforms on different test distributions.
- Cheaper model(s) may systematically fail on categories unseen in validation, forcing frequent costly fallbacks.
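The threshold mis-calibration failure mode listed above can be checked offline before deployment. The sketch below assumes validation records of the form (cheap-solver answer, consistency score, gold answer); the function name `threshold_report` and the record layout are hypothetical. For each candidate threshold it counts premature stops (a wrong answer accepted) and unnecessary fallbacks (a correct answer escalated anyway).

```python
from typing import Dict, List, Sequence, Tuple, Union

Record = Tuple[str, float, str]  # (cheap-solver answer, consistency, gold answer)


def threshold_report(records: Sequence[Record],
                     thresholds: Sequence[float]) -> List[Dict[str, Union[float, int]]]:
    """Count both error types of a consistency-based stopping rule per threshold."""
    rows = []
    for t in thresholds:
        # Accepted (score >= t) although the cheap answer was wrong.
        premature = sum(1 for ans, score, gold in records
                        if score >= t and ans != gold)
        # Escalated (score < t) although the cheap answer was already right.
        unnecessary = sum(1 for ans, score, gold in records
                          if score < t and ans == gold)
        rows.append({"threshold": t,
                     "premature_stops": premature,
                     "unnecessary_fallbacks": unnecessary})
    return rows
```

Raising the threshold generally trades premature stops for unnecessary fallbacks, so the report gives a quick view of where the two counts balance for a given budget.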
Core Entities
Models
- GPT-4
- GPT-3.5-turbo
Metrics
- Accuracy
- API cost
- inference time
- average solving rounds
- consistency (self-consistency metric)
Datasets
- GSM8K
- SVAMP
- AQuA
- AddSub
- SingleEq
- MultiArith
- CSQA
- LLC
Context Entities
Models
- GPT-4
- GPT-3.5-turbo
Metrics
- Accuracy
- relative API cost
Datasets
- GSM8K
- SVAMP

