Match reasoning strategies by compute: token-budgeted evaluation shows simple self-consistency often beats complex methods

June 10, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

1

Authors

Junlin Wang, Siddhartha Jain, Dejiao Zhang, Baishakhi Ray, Varun Kumar, Ben Athiwaratkun

Links

Abstract / PDF

Why It Matters For Business

Compare reasoning methods by token cost, not just accuracy; cheaper self-consistency often gives better accuracy-per-dollar and reduces deployment cost.

Summary TLDR

The paper argues that common comparisons of LLM reasoning strategies miss a key axis: compute budget. It proposes budget metrics (queries, tokens, monetary cost) and evaluates seven reasoning methods across five datasets and multiple models. Main findings: chain-of-thought with self-consistency (CoT SC) is highly token-efficient and often matches or beats more complex strategies (MAD, Reflexion, Plan-and-Solve) when budgets are equal; Tree-of-Thoughts (ToT) can outperform but needs a much larger token budget and a strong model; LLM self-evaluation helps but is inconsistent across datasets. Practical takeaway: compare methods by token budget and test self-evaluation before using it.

Problem Statement

Existing comparisons of LLM reasoning strategies report performance without accounting for compute spent. This can hide that gains come from extra tokens/queries rather than algorithmic improvements. The paper asks: how do reasoning strategies compare when evaluated under equal compute budgets, and when is self-evaluation actually helpful?
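To make the budget question concrete, here is a minimal sketch of per-question budget accounting across the three axes the paper names (queries, tokens, monetary cost). The `Call` record, the `budget_summary` helper, the two strategies, and the prices are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One LLM call; token counts are whatever your provider reports (assumed available)."""
    input_tokens: int
    output_tokens: int


def budget_summary(calls, usd_per_1k_input, usd_per_1k_output):
    """Summarize queries, total tokens, and dollar cost spent on one question."""
    tokens_in = sum(c.input_tokens for c in calls)
    tokens_out = sum(c.output_tokens for c in calls)
    cost = tokens_in / 1000 * usd_per_1k_input + tokens_out / 1000 * usd_per_1k_output
    return {
        "queries": len(calls),
        "total_tokens": tokens_in + tokens_out,
        "cost_usd": cost,
    }


# Hypothetical strategies: A makes a few long calls, B makes many short ones.
strategy_a = [Call(1500, 900)] * 3
strategy_b = [Call(200, 150)] * 8

# Made-up prices per 1k tokens, for illustration only.
PRICE_IN, PRICE_OUT = 0.001, 0.002

a = budget_summary(strategy_a, PRICE_IN, PRICE_OUT)
b = budget_summary(strategy_b, PRICE_IN, PRICE_OUT)
print(a)
print(b)
```

In this toy case strategy A looks cheaper by query count but spends more tokens and dollars than B, which is the kind of mismatch a query-only comparison would hide.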

Main Contribution

Proposes a budget-aware evaluation framework for reasoning strategies using queries, tokens, and monetary cost, and recommends tokens as the most holistic metric.

Empirically compares seven reasoning strategies across five datasets and multiple models (GPT-3.5, GPT-4, Mistral, LLaMA-2, Mixtral) under matched budgets.

Analyzes why self-consistency scales smoothly (independent sampling) and why dependent strategies like Multi-Agent Debate can plateau (loss of diversity).

Ablates Tree-of-Thoughts and Reflexion into proposer vs evaluator budgets and shows evaluator quality and token cost strongly shape returns.

Introduces Self-Confident Self-Consistency (SC2): weight majority votes by a binary self-evaluation signal; shows gains on some math tasks but inconsistent overall.
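The SC2 idea of weighting votes by a binary self-evaluation signal can be sketched as follows. This is an illustrative reading, not the paper's exact scheme: the function name `weighted_vote` and the specific weights (1.0 for answers the model judged correct, 0.5 otherwise) are assumptions.

```python
from collections import defaultdict


def weighted_vote(samples, confident_weight=1.0, unconfident_weight=0.5):
    """Pick the answer with the highest total weight.

    samples: iterable of (answer, self_eval_ok) pairs, where self_eval_ok is the
    model's binary judgment of its own answer. The weights are assumed values.
    """
    scores = defaultdict(float)
    for answer, self_eval_ok in samples:
        scores[answer] += confident_weight if self_eval_ok else unconfident_weight
    # Ties resolve to the answer that was seen first (dicts keep insertion order).
    return max(scores, key=scores.get)


# Hypothetical samples: "42" wins a plain majority, but the model only
# judged its "41" answers as correct.
samples = [("42", False), ("42", False), ("42", False), ("41", True), ("41", True)]

sc2_answer = weighted_vote(samples)
plain_answer = weighted_vote(samples, confident_weight=1.0, unconfident_weight=1.0)
print(sc2_answer, plain_answer)
```

Setting both weights to the same value recovers plain majority voting, so the same helper can serve as the unweighted baseline. Whether the weighted variant helps depends on how reliable the self-evaluation signal is for the task.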

Key Findings

When token/query budgets are matched, chain-of-thought with self-consistency (CoT SC) often matches or outperforms more complex methods like Multi-Agent Debate and Reflexion.

Numbers: Experiments run up to 20 queries or 10k tokens; SC outperforms MAD/Reflexion across 5 datasets except HotpotQA

LLM self-evaluation quality varies widely by dataset; GPT-4 binary self-eval total accuracy ranges from 0.547 to 0.937 across datasets.

Numbers: GPT-4 self-eval total accuracy: GSM8K 0.937, MATH 0.707, TheoremQA 0.547, HotpotQA 0.675, CSQA 0.901

Token-count is a better single metric than number-of-queries for capturing compute differences between strategies that vary response length.

Numbers: Token budgets used (e.g., 10k tokens cap). Example: a custom strategy appeared better by queries but worse by tokens (cf

Dependent sampling in multi-agent debate reduces output diversity across rounds, causing performance plateaus or decline.

Numbers: Entropy of MAD answers declines each round (Figure 6); MAD overlaps SC up to 6 queries then lags as rounds continue

Tree-of-Thoughts can beat baselines but requires a stronger model and much higher budget; evaluator quality matters for cost-effectiveness.

Numbers: Game of 24: ToT (GPT-4 proposer, GPT-3.5 evaluator) accuracy 72% at $33.53; GPT-4 evaluator raises cost to $159.87 for 4
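The diversity finding for Multi-Agent Debate can be checked on your own runs by tracking the entropy of the answers produced in each round. The sketch below uses Shannon entropy in bits over the empirical answer distribution; the per-round answer lists are invented for illustration and are not data from the paper.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (bits) of the empirical distribution over final answers."""
    n = len(answers)
    counts = Counter(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical answers from four agents over three debate rounds.
rounds = [
    ["a", "b", "c", "d"],  # round 1: all agents disagree
    ["a", "a", "b", "c"],  # round 2: partial convergence
    ["a", "a", "a", "a"],  # round 3: full agreement, no diversity left
]

entropies = [answer_entropy(r) for r in rounds]
print(entropies)
```

A falling entropy curve means later rounds add tokens without adding new candidate answers. If the agents converge on a wrong answer, a majority vote over them cannot recover, whereas independent samples keep their diversity as the budget grows.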

Results

Accuracy (GPT-4 binary self-evaluation, total)

Value: GSM8K 0.937; MATH 0.707; TheoremQA 0.547; HotpotQA 0.675; CSQA 0.901

Accuracy

Value: GPT-4 proposer + GPT-3.5 evaluator: 72% acc at $33.53; GPT-4 evaluator: 76% acc at $159.87

Baseline: CoT SC

Game of 24 ToT / CoT numeric example

Value: ToT b=5 (GPT-4, GPT-4) Top1 0.74, Best 0.76, Total Acc 0.4; CoT 100 samples (GPT-4) Top1 0.17, Best 0.56, Total Acc 0.075

Budget caps used in experiments

Value: Max 20 queries or 10k tokens per question in core comparisons
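The Game of 24 figures reported above (72% at $33.53 with a GPT-3.5 evaluator, 76% at $159.87 with a GPT-4 evaluator) can be turned into a marginal cost per accuracy point. The helper name `marginal_cost_per_point` is hypothetical; only the four input numbers come from the results.

```python
def marginal_cost_per_point(cheap, strong):
    """Extra dollars paid per extra percentage point of accuracy."""
    extra_points = strong["accuracy_pct"] - cheap["accuracy_pct"]
    if extra_points <= 0:
        raise ValueError("the stronger configuration must be more accurate")
    return (strong["cost_usd"] - cheap["cost_usd"]) / extra_points


# Figures as reported for ToT on Game of 24 with a GPT-4 proposer.
gpt35_evaluator = {"accuracy_pct": 72, "cost_usd": 33.53}
gpt4_evaluator = {"accuracy_pct": 76, "cost_usd": 159.87}

marginal = marginal_cost_per_point(gpt35_evaluator, gpt4_evaluator)
cost_ratio = gpt4_evaluator["cost_usd"] / gpt35_evaluator["cost_usd"]
print(f"${marginal:.2f} per extra point, {cost_ratio:.2f}x total cost")
```

On these numbers the stronger evaluator buys 4 points for an extra $126.34, roughly $31.59 per point, at close to five times the total spend. Whether that trade is worthwhile depends on the value of each correct answer in your application.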

Who Should Care

What To Try In 7 Days

Benchmark your reasoning pipeline by tokens (input+output) and monetized cost, not only API calls.

Use CoT + self-consistency as a baseline and measure improvement per 1k tokens spent.

Run a small self-evaluation sanity check: measure binary self-eval accuracy on a validation subset before using it to gate outputs.
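A CoT + self-consistency baseline under a token cap, together with an improvement-per-1k-tokens measure, might look like the sketch below. Everything here is an assumption for illustration: `sample_fn` stands in for one independent chain-of-thought call that returns an answer and the tokens it used, the canned samples replace real model output, and the stopping rule lets the last sample overshoot the cap slightly.

```python
from collections import Counter


def self_consistency_under_budget(sample_fn, token_budget):
    """Draw independent samples until the token budget is used up, then majority-vote.

    sample_fn() must return (answer, tokens_used) with tokens_used > 0.
    The final sample may overshoot the budget; this is a simplification.
    """
    answers, spent = [], 0
    while spent < token_budget:
        answer, tokens_used = sample_fn()
        if tokens_used <= 0:
            raise ValueError("tokens_used must be positive")
        answers.append(answer)
        spent += tokens_used
    winner = Counter(answers).most_common(1)[0][0] if answers else None
    return winner, spent


def gain_per_1k_tokens(acc_base, acc_new, tokens_base, tokens_new):
    """Accuracy gained per extra 1k tokens spent relative to a baseline."""
    extra_tokens = tokens_new - tokens_base
    if extra_tokens <= 0:
        raise ValueError("the new method must spend more tokens than the baseline")
    return (acc_new - acc_base) / (extra_tokens / 1000)


# Canned (answer, tokens_used) pairs standing in for real model samples.
canned = iter([("18", 400), ("18", 350), ("20", 500), ("18", 450)])
winner, spent = self_consistency_under_budget(lambda: next(canned), token_budget=1500)
print(winner, spent)

# Hypothetical accuracies and average tokens per question for two methods.
gain = gain_per_1k_tokens(acc_base=0.70, acc_new=0.76, tokens_base=2000, tokens_new=5000)
print(gain)
```

Reporting the gain per 1k tokens next to raw accuracy makes it easier to see when a more elaborate method is paying for its improvement mostly with extra compute.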

Optimization Features

Token Efficiency

  • token_budgeting
  • cache_encoded_inputs

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Selected reasoning strategies and datasets are representative but not exhaustive; results may not generalize to all tasks.
  • Monetary and time constraints limited experiments and budget sweeps.
  • Self-evaluation quality depends heavily on dataset difficulty; GPT-4 can be near-random on some hard tasks.
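A quick way to see where self-evaluation stands on your own data is to compare the model's binary verdicts with ground truth on a labeled validation subset. The sketch below treats "accuracy" as the fraction of verdicts that agree with the true correctness of the answer, which is an assumed reading of the metric; the records are invented.

```python
def self_eval_accuracy(records):
    """Fraction of self-evaluation verdicts that match ground truth.

    records: list of (model_said_correct, actually_correct) boolean pairs.
    """
    if not records:
        raise ValueError("need at least one record")
    agree = sum(1 for said, truth in records if said == truth)
    return agree / len(records)


# Hypothetical validation records: (model_said_correct, actually_correct).
records = [
    (True, True),    # model approved a correct answer
    (True, False),   # model approved a wrong answer
    (False, False),  # model rejected a wrong answer
    (False, True),   # model rejected a correct answer
]

acc = self_eval_accuracy(records)
print(acc)
```

For a binary verdict, a score near 0.5 on a balanced set is close to chance, so a result in that range suggests the signal should not be used to gate or reweight outputs on that task.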

When Not To Use

  • Do not assume self-evaluation works as an oracle on very hard, theorem-level tasks.
  • Avoid deploying ToT unless you have both a high-quality model and a budget that justifies the extra tokens.

Failure Modes

  • Dependent multi-round methods (MAD, Reflexion) can reduce diversity and tunnel into wrong answers as budget increases.
  • Self-evaluation can be miscalibrated and may reinforce wrong majority answers if model confidence is misplaced.
  • Comparing by queries alone can mislead teams about real compute and cost savings.

Core Entities

Models

  • GPT-3.5
  • GPT-4
  • Mistral-7B-Instruct-v0.2
  • LLaMA-2-70b-chat
  • Mixtral-8x7B-Instruct-v0.1

Metrics

  • total_tokens
  • number_of_queries
  • monetary_cost
  • performance@tokens
  • Accuracy

Datasets

  • GSM8K
  • MATH
  • TheoremQA
  • CSQA
  • HotpotQA
  • Game of 24