Pick cheaper or stronger solvers per question to cut inference cost while keeping or improving reasoning accuracy.

October 1, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

1

Authors

Jianpeng Zhou, Wanjun Zhong, Yanlin Wang, Jiahai Wang

Why It Matters For Business

Adaptive per-question solver selection cuts cloud API bills and lets teams trade a small latency increase for big cost or accuracy gains on reasoning workloads.

Summary TLDR

This paper introduces Adaptive-Solver (AS), a runtime framework that evaluates the quality of an LLM's answer and, when needed, adapts the solving strategy (model, sample size, prompting method, decomposition granularity). On eight reasoning datasets the method cuts API cost by 46–85% relative to always using GPT-4 while matching GPT-4 accuracy, or raises accuracy by about 4.5 percentage points at equal cost. The system uses a small validation set to build a short pipeline of solvers and a consistency-based stopping rule to avoid extra calls.
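The evaluate-then-adapt loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Solver` dataclass, its `sample` callable, and the 0.8 threshold are assumptions made for the example, and the two solvers are stubs that return canned answers instead of calling a model.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Solver:
    name: str
    sample_size: int
    # Hypothetical callable: returns `n` sampled final answers for a question.
    sample: Callable[[str, int], List[str]]


def consistency(answers: List[str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples agreeing with it."""
    top, count = Counter(answers).most_common(1)[0]
    return top, count / len(answers)


def adaptive_solve(question: str, pipeline: List[Solver],
                   threshold: float = 0.8) -> Tuple[str, str, int]:
    """Try solvers in order; stop at the first one whose answers agree enough.

    The last solver's majority answer is returned even if it is below threshold.
    """
    if not pipeline:
        raise ValueError("pipeline must contain at least one solver")
    for rounds, solver in enumerate(pipeline, start=1):
        answers = solver.sample(question, solver.sample_size)
        answer, score = consistency(answers)
        if score >= threshold:
            break
    return answer, solver.name, rounds


# Stub solvers (assumed for illustration): the cheap one disagrees with itself.
cheap = Solver("cheap", 3, lambda q, n: ["4", "5", "4"][:n])
strong = Solver("strong", 1, lambda q, n: ["4"] * n)

print(adaptive_solve("2 + 2 = ?", [cheap, strong]))
```

In this toy run the cheap solver reaches only 2/3 agreement, so the question escalates to the strong solver in a second round. Lowering the threshold to 0.6 would accept the cheap answer after one round.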

Problem Statement

Most LLM reasoning systems use one fixed solver (same model, prompt, sample size, decomposition) for all problems. That wastes budget on easy items and under-solves hard ones. The paper asks: can we dynamically allocate test-time compute per question (switching models, sample sizes, prompts, and decomposition granularity) to reduce cost and improve accuracy?

Main Contribution

Adaptive-Solver (AS) framework: runtime evaluation + adaptation that selects solvers per question.

Four concrete adaptation levers: model routing, sample-size scheduling, prompting-method switching, and decomposition-granularity control.

Automatic pipeline-configuration algorithm that builds a short sequence of increasingly costly solvers using a small validation set and cached responses.

Extensive experiments across 8 reasoning datasets showing large cost savings and better cost–accuracy tradeoffs versus static baselines.
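One way to picture pipeline configuration from cached validation responses is the sketch below. It uses plain exhaustive enumeration over short pipelines, which stands in for the paper's search procedure and is not claimed to match it. The `Cached` record, the solver names, the costs, and the accuracy target are all hypothetical.

```python
from itertools import combinations
from typing import Dict, List, NamedTuple, Optional, Sequence, Tuple


class Cached(NamedTuple):
    answer: str         # majority answer over the cached samples
    consistency: float  # agreement ratio among those samples
    cost: float         # cost of producing them (arbitrary units)


def simulate(pipeline: Sequence[str], cache: Dict[str, List[Cached]],
             gold: List[str], threshold: float) -> Tuple[float, float]:
    """Replay a pipeline on cached responses; return (accuracy, mean cost)."""
    correct, cost = 0, 0.0
    for i, truth in enumerate(gold):
        for pos, name in enumerate(pipeline):
            r = cache[name][i]
            cost += r.cost
            # Accept if consistent enough, or if there is no solver left to try.
            if r.consistency >= threshold or pos == len(pipeline) - 1:
                correct += r.answer == truth
                break
    return correct / len(gold), cost / len(gold)


def configure(cache: Dict[str, List[Cached]], gold: List[str], target_acc: float,
              threshold: float = 0.8, max_len: int = 3
              ) -> Optional[Tuple[Tuple[str, ...], float, float]]:
    """Return the cheapest pipeline (up to max_len solvers) meeting target_acc."""
    # Sort by total cached cost so every candidate pipeline escalates in cost.
    names = sorted(cache, key=lambda n: sum(r.cost for r in cache[n]))
    best = None
    for k in range(1, max_len + 1):
        for pipe in combinations(names, k):  # keeps the sorted order
            acc, cost = simulate(pipe, cache, gold, threshold)
            if acc >= target_acc and (best is None or cost < best[2]):
                best = (pipe, acc, cost)
    return best


# Hypothetical cached validation responses for four questions.
gold = ["1", "2", "3", "4"]
cache = {
    "cheap": [Cached("1", 1.0, 1.0), Cached("2", 1.0, 1.0),
              Cached("9", 0.4, 1.0), Cached("4", 0.6, 1.0)],
    "strong": [Cached(a, 1.0, 10.0) for a in gold],
}
print(configure(cache, gold, target_acc=0.95))
```

Because every candidate is scored against the cache, the search itself makes no further model calls. On this toy data the cheap-then-strong pipeline meets the target at a lower mean cost than the strong solver alone.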

Key Findings

The Adaptive-Solver can cut API costs by a large margin while keeping GPT-4-level accuracy.

Numbers: 46%–85% cost reduction vs. GPT-4

At the same cost budget, AS improves accuracy over static baselines.

Numbers: GSM8K: 92.49% (AS) vs. 87.93% (best static), a gain of 4.56 percentage points

Multi-round adaptation adds little interactive overhead on average.

Numbers: Average calls ≈1.68; relative inference time ≈1.45× vs. G4-1-ZeroCoT
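To see how an average-calls figure of this kind arises, the sketch below computes expected rounds and expected cost for a sequential pipeline from per-stage acceptance probabilities. All numbers are hypothetical: the 0.32 acceptance rate and the 1:10 cost ratio are chosen only to make the arithmetic easy to follow and do not describe the paper's actual pipeline.

```python
from typing import List, Tuple


def expected_rounds_and_cost(accept_probs: List[float],
                             stage_costs: List[float]) -> Tuple[float, float]:
    """Expected number of rounds and expected cost per question.

    accept_probs[i] is the probability that stage i accepts its answer, given
    that the question reached stage i. The last stage always accepts, so its
    probability should be 1.0.
    """
    reach = 1.0           # probability that a question reaches the current stage
    rounds = cost = 0.0
    for p, c in zip(accept_probs, stage_costs):
        rounds += reach   # every question that reaches a stage uses one round
        cost += reach * c
        reach *= 1.0 - p  # only rejected questions move on
    return rounds, cost


# Hypothetical two-stage pipeline: cheap stage (cost 1), strong stage (cost 10).
rounds, cost = expected_rounds_and_cost([0.32, 1.0], [1.0, 10.0])
print(round(rounds, 2), round(cost, 2))
```

Under these assumed numbers, 68% of questions fall through to the second stage, so the expected round count is 1 + 0.68 and the expected cost is 1 + 0.68 × 10, compared with 10 for always using the strong stage.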

Each adaptation dimension matters; model adaptation is most critical for accuracy, and sample-size adaptation helps control cost.

Numbers: Ablation shows AS variant without model switches (AS-SPD) performs worst; fixing sample size at 10 (no sample adaptation

Decomposition granularity should match problem difficulty.

Numbers: Coarse decomposition works better for problems with fewer than 5 steps; finer granularity helps with harder problems

Results

API cost reduction vs GPT-4

Value: 46%–85% lower API cost

Baseline: always use GPT-4

Accuracy (GSM8K)

Value: 92.49% (AS-MSPD)

Baseline: 87.93% (G3.5-10-ZeroCoT)

Average solving rounds (API calls) per question

Value: 1.68

Baseline: single-round methods (1 call)

Relative inference time

Value: ≈1.45×

Baseline: G4-1-ZeroCoT

Effect of decomposition granularity

Value: coarse is better for problems with fewer than 5 steps; fine is better for harder problems

Baseline: single fixed granularity

Who Should Care

What To Try In 7 Days

Run a small validation set (50–200 samples) and cache multiple responses per (model, prompt) pair.

Implement a consistency-based stopping rule (self-consistency) and tune thresholds to your budget.

Build a 2–4 step pipeline (cheap model + small sample → medium prompt/sample → strong model fallback), then measure the cost/accuracy trade-offs.
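Tuning the stopping threshold to a budget can be done offline on the cached validation responses. The sketch below sweeps candidate thresholds for a two-stage cheap-then-strong pipeline and picks the most accurate setting that fits a cost budget. The row layout, costs, and budget are assumptions for illustration only.

```python
from typing import List, Optional, Tuple

# One validation row: (cheap_answer, cheap_consistency, strong_answer, gold_answer)
Row = Tuple[str, float, str, str]
Result = Tuple[float, float, float]  # (threshold, accuracy, mean cost)


def sweep(rows: List[Row], thresholds: List[float],
          cheap_cost: float, strong_cost: float) -> List[Result]:
    """Score each threshold on cached rows without making new model calls."""
    results = []
    for t in thresholds:
        correct, cost = 0, 0.0
        for cheap_ans, cons, strong_ans, gold in rows:
            if cons >= t:          # cheap answer is consistent enough: stop
                correct += cheap_ans == gold
                cost += cheap_cost
            else:                  # otherwise pay for the strong fallback too
                correct += strong_ans == gold
                cost += cheap_cost + strong_cost
        results.append((t, correct / len(rows), cost / len(rows)))
    return results


def pick(results: List[Result], budget: float) -> Optional[Result]:
    """Most accurate threshold within budget; ties go to the cheaper one."""
    affordable = [r for r in results if r[2] <= budget]
    return max(affordable, key=lambda r: (r[1], -r[2]), default=None)


# Hypothetical cached validation rows.
rows = [("7", 1.0, "7", "7"), ("3", 0.8, "3", "3"),
        ("5", 0.6, "4", "4"), ("2", 0.4, "9", "9")]
results = sweep(rows, [0.5, 0.7, 0.9], cheap_cost=1.0, strong_cost=10.0)
for r in results:
    print(r)
print("chosen:", pick(results, budget=7.0))
```

A threshold that is too low accepts wrong cheap answers, and one that is too high sends everything to the fallback. The sweep makes that trade-off visible before any production traffic is routed.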

Agent Features

Memory

  • caches validation responses to avoid extra API calls

Planning

  • automatic pipeline configuration (heuristic search)

Tool Use

  • uses multiple LLM APIs (model switching)

Frameworks

  • Adaptive-Solver (AS)

Architectures

  • pipeline of solvers (sequential routing)

Optimization Features

Token Efficiency

  • use smaller models and smaller sample sizes where possible

Model Optimization

  • model routing (cheap→strong fallback)

System Optimization

  • pre-saved response cache to avoid repeated API calls during pipeline search
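A pre-saved response cache of this kind might look like the sketch below, which stores sampled answers per (model, prompt, question) key in a JSON file and calls the model only for samples it does not already hold. The `ResponseCache` class and the `call` signature are hypothetical, and `fake_call` stands in for a real API client.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


class ResponseCache:
    """Disk-backed cache of sampled answers keyed by (model, prompt, question)."""

    def __init__(self, path: Path):
        self.path = path
        self.data: Dict[str, List[str]] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    @staticmethod
    def key(model: str, prompt: str, question: str) -> str:
        # A JSON-encoded list makes an unambiguous string key.
        return json.dumps([model, prompt, question])

    def get_or_call(self, model: str, prompt: str, question: str, n: int,
                    call: Callable[[str, str, str, int], List[str]]) -> List[str]:
        """Return n samples, requesting only the ones not cached yet."""
        cached = self.data.setdefault(self.key(model, prompt, question), [])
        if len(cached) < n:
            cached.extend(call(model, prompt, question, n - len(cached)))
            self.path.write_text(json.dumps(self.data))
        return cached[:n]


calls: List[int] = []


def fake_call(model: str, prompt: str, question: str, n: int) -> List[str]:
    """Stand-in for a model API: records how many samples were requested."""
    calls.append(n)
    return ["42"] * n


with tempfile.TemporaryDirectory() as tmp:
    cache = ResponseCache(Path(tmp) / "responses.json")
    cache.get_or_call("cheap", "cot", "q1", 3, fake_call)  # fetches 3 samples
    cache.get_or_call("cheap", "cot", "q1", 5, fake_call)  # fetches 2 more
    cache.get_or_call("cheap", "cot", "q1", 2, fake_call)  # served from cache

print(calls)
```

Reusing the first samples when a larger sample size is evaluated means a search over sample sizes pays only for the largest size per key.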

Inference Optimization

  • adaptive sample sizing
  • prompt switching
  • early stopping via consistency thresholds

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Requires a validation set and pre-saved responses to configure pipelines.
  • Pipeline search needs many cached responses initially; configuration has nontrivial setup cost.
  • Multi-round solving can increase latency for single-request, low-latency use cases.

When Not To Use

  • When strict single-call low latency is required (real-time single-turn APIs).
  • If you cannot call multiple models or lack access to cheaper model variants.
  • When validation data is unavailable or too small to represent task difficulty.

Failure Modes

  • Mis-calibrated consistency thresholds cause premature stops or unnecessary fallbacks.
  • Pipeline overfits the validation set and underperforms on different test distributions.
  • Cheaper model(s) may systematically fail on categories unseen in validation, forcing frequent costly fallbacks.

Core Entities

Models

  • GPT-4
  • GPT-3.5-turbo

Metrics

  • Accuracy
  • API cost
  • inference time
  • average solving rounds
  • consistency (self-consistency metric)

Datasets

  • GSM8K
  • SVAMP
  • AQuA
  • AddSub
  • SingleEq
  • MultiArith
  • CSQA
  • LLC

Context Entities

Models

  • GPT-4
  • GPT-3.5-turbo

Metrics

  • Accuracy
  • relative API cost

Datasets

  • GSM8K
  • SVAMP