Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
Compare reasoning methods by token cost, not just accuracy; cheaper self-consistency often gives better accuracy-per-dollar and reduces deployment cost.
Summary TLDR
The paper argues that common comparisons of LLM reasoning strategies miss a key axis: compute budget. It proposes budget metrics (tokens, queries, monetary cost) and evaluates seven reasoning methods across five datasets and multiple models. Main findings: chain-of-thought with self-consistency (CoT SC) is highly token-efficient and often matches or beats more complex strategies (MAD, Reflexion, Plan-and-Solve) when budgets are equal; Tree-of-Thoughts (ToT) can outperform but needs a much larger token budget and a strong model; LLM self-evaluation helps but is inconsistent across datasets. Practical takeaway: compare methods at a matched token budget, and test self-evaluation quality before relying on it.
Problem Statement
Existing comparisons of LLM reasoning strategies report performance without accounting for compute spent. This can hide that gains come from extra tokens/queries rather than algorithmic improvements. The paper asks: how do reasoning strategies compare when evaluated under equal compute budgets, and when is self-evaluation actually helpful?
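To make "equal compute budget" concrete, the following minimal sketch counts how many calls a strategy can afford under a fixed token budget. The function name `calls_within_budget` and all token counts are hypothetical illustrations, not values or code from the paper.

```python
def calls_within_budget(tokens_per_call, token_budget):
    """Return how many consecutive calls fit inside token_budget."""
    spent = 0
    used = 0
    for tokens in tokens_per_call:
        if spent + tokens > token_budget:
            break
        spent += tokens
        used += 1
    return used


# Hypothetical per-call token counts (input + output), not from the paper.
short_samples = [250] * 8        # e.g. independent CoT samples
long_rounds = [900, 1100, 1300]  # e.g. multi-round calls whose context grows

budget = 2000
print(calls_within_budget(short_samples, budget))  # 8
print(calls_within_budget(long_rounds, budget))    # 2
```

Under this toy setup the same token budget buys eight short samples but only two long rounds, which is the kind of difference a per-query comparison would hide.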
Main Contribution
Proposes a budget-aware evaluation framework for reasoning strategies using queries, tokens, and monetary cost, and recommends tokens as the most holistic metric.
Empirically compares seven reasoning strategies across five datasets and multiple models (GPT-3.5, GPT-4, Mistral, LLaMA-2, Mixtral) under matched budgets.
Analyzes why self-consistency scales smoothly (independent sampling) and why dependent strategies like Multi-Agent Debate can plateau (loss of diversity).
Ablates Tree-of-Thoughts and Reflexion into proposer vs evaluator budgets and shows evaluator quality and token cost strongly shape returns.
Introduces Self-Confident Self-Consistency (SC2), which weights majority votes by a binary self-evaluation signal; it shows gains on some math tasks but is inconsistent overall.
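One plausible reading of the SC2 idea is a weighted vote in which each sampled answer counts more when the model's own yes/no self-evaluation endorses it. The sketch below is an illustration under that assumption; the function names, the default weights (1.0 and 0.0), and the fallback to a plain majority vote are hypothetical choices, not the paper's implementation.

```python
from collections import Counter


def majority_vote(answers):
    """Plain self-consistency: the most frequent answer wins (ties: first seen)."""
    return Counter(answers).most_common(1)[0][0]


def sc2_vote(samples, yes_weight=1.0, no_weight=0.0):
    """Weighted vote over (answer, self_eval_yes) pairs.

    yes_weight / no_weight are assumed values, not taken from the paper.
    """
    totals = {}
    for answer, self_eval_yes in samples:
        weight = yes_weight if self_eval_yes else no_weight
        totals[answer] = totals.get(answer, 0.0) + weight
    if max(totals.values()) == 0:
        # No answer was endorsed: fall back to an unweighted majority vote.
        return majority_vote([answer for answer, _ in samples])
    return max(totals, key=totals.get)


# Made-up samples: "12" is more frequent, but only "15" is self-endorsed.
samples = [("12", False), ("12", False), ("12", False), ("15", True), ("15", True)]
print(majority_vote([a for a, _ in samples]))  # 12
print(sc2_vote(samples))                       # 15
```

The example also shows the risk noted elsewhere in this summary: if the self-evaluation signal is wrong, the weighted vote overrides a correct majority.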
Key Findings
When token/query budgets are matched, chain-of-thought with self-consistency (CoT SC) often matches or outperforms more complex methods like Multi-Agent Debate and Reflexion.
LLM self-evaluation quality varies widely by dataset; GPT-4 binary self-eval total accuracy ranges from 0.547 to 0.937 across datasets.
Token count is a better single metric than number of queries for capturing compute differences between strategies whose responses vary in length.
Dependent sampling in multi-agent debate reduces output diversity across rounds, causing performance plateaus or decline.
Tree-of-Thoughts can beat baselines but requires a stronger model and much higher budget; evaluator quality matters for cost-effectiveness.
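The diversity finding above can be checked on your own runs by tracking how spread out the final answers are in each round. The sketch below uses Shannon entropy over answer strings as one possible diversity measure; that choice, the function name `answer_entropy`, and the made-up round data are assumptions for illustration, not the paper's analysis.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical answer distribution."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Made-up answers from four agents over three debate rounds.
rounds = [
    ["a", "b", "c", "d"],  # round 1: all agents disagree
    ["a", "a", "b", "b"],  # round 2: two camps
    ["a", "a", "a", "a"],  # round 3: full agreement
]
for i, answers in enumerate(rounds, start=1):
    print(i, answer_entropy(answers))
```

Entropy that falls toward zero while accuracy stays flat suggests the extra rounds are spending tokens on agreement rather than on new candidate answers.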
Results
Accuracy
Game of 24 ToT / CoT numeric example
Budget caps used in experiments
Who Should Care
What To Try In 7 Days
Benchmark your reasoning pipeline by tokens (input+output) and monetized cost, not only API calls.
Use CoT + self-consistency as a baseline and measure improvement per 1k tokens spent.
Run a small self-evaluation sanity check: measure binary self-eval accuracy on a validation subset before using it to gate outputs.
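The three steps above can be sketched as a small accounting helper. Everything here is a hypothetical illustration: the `Call` record, the function names, and the per-1k-token prices passed in are assumptions to be replaced with your own logging and your provider's actual pricing.

```python
from dataclasses import dataclass


@dataclass
class Call:
    input_tokens: int
    output_tokens: int


def total_tokens(calls):
    """Input plus output tokens across all calls."""
    return sum(c.input_tokens + c.output_tokens for c in calls)


def monetary_cost(calls, usd_per_1k_input, usd_per_1k_output):
    """Cost in USD, given caller-supplied (assumed) per-1k-token prices."""
    cost_in = sum(c.input_tokens for c in calls) / 1000 * usd_per_1k_input
    cost_out = sum(c.output_tokens for c in calls) / 1000 * usd_per_1k_output
    return cost_in + cost_out


def gain_per_1k_tokens(accuracy, baseline_accuracy, tokens, baseline_tokens):
    """Accuracy improvement over a baseline per 1k extra tokens spent."""
    extra = tokens - baseline_tokens
    if extra <= 0:
        raise ValueError("candidate must spend more tokens than the baseline")
    return (accuracy - baseline_accuracy) / (extra / 1000)


def self_eval_accuracy(self_eval_says_correct, answer_is_correct):
    """Fraction of examples where the yes/no self-evaluation matches the truth."""
    if not answer_is_correct or len(self_eval_says_correct) != len(answer_is_correct):
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(p == t for p, t in zip(self_eval_says_correct, answer_is_correct))
    return matches / len(answer_is_correct)


calls = [Call(100, 50), Call(200, 150)]
print(total_tokens(calls))  # 500
print(self_eval_accuracy([True, False, True, True], [True, False, False, True]))  # 0.75
```

A self-evaluation accuracy close to 0.5 on a balanced validation subset means the signal is near random and should not be used to gate outputs.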
Optimization Features
Token Efficiency
- token_budgeting
- cache_encoded_inputs
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Selected reasoning strategies and datasets are representative but not exhaustive; results may not generalize to all tasks.
- Monetary and time constraints limited experiments and budget sweeps.
- Self-evaluation quality depends heavily on dataset difficulty; GPT-4 can be near-random on some hard tasks.
When Not To Use
- Do not assume self-evaluation works as an oracle on very hard, theorem-level tasks.
- Avoid deploying ToT unless you have both a high-quality model and a budget that justifies the extra tokens.
Failure Modes
- Dependent multi-round methods (MAD, Reflexion) can reduce diversity and tunnel into wrong answers as budget increases.
- Self-evaluation can be miscalibrated and may reinforce wrong majority answers if model confidence is misplaced.
- Comparing by queries alone can mislead teams about real compute and cost savings.
Core Entities
Models
- GPT-3.5
- GPT-4
- Mistral-7B-Instruct-v0.2
- LLaMA-2-70b-chat
- Mixtral-8x7B-Instruct-v0.1
Metrics
- total_tokens
- number_of_queries
- monetary_cost
- performance@tokens
- Accuracy
Datasets
- GSM8K
- MATH
- TheoremQA
- CSQA
- HotpotQA
- Game of 24

