Match reasoning strategies by compute: token-budgeted evaluation shows simple self-consistency often beats complex methods

June 10, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The evaluation framework and experiments are practically useful: token-based budgeting is ready to adopt; empirical evidence across models/datasets supports conclusions but is limited by chosen tasks and budgets.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Junlin Wang, Siddhartha Jain, Dejiao Zhang, Baishakhi Ray, Varun Kumar, Ben Athiwaratkun

Why It Matters For Business

Compare reasoning methods by token cost, not just accuracy; cheaper self-consistency often gives better accuracy-per-dollar and reduces deployment cost.

Summary TLDR

The paper argues that common comparisons of LLM reasoning strategies miss a key axis: compute budget. It proposes budget metrics (tokens, queries, monetary cost) and evaluates seven reasoning methods across five datasets and multiple models. Main findings: chain-of-thought with self-consistency (CoT SC) is highly token-efficient and often matches or beats more complex strategies (MAD, Reflexion, Plan-and-Solve) when budgets are equal; Tree-of-Thoughts (ToT) can outperform it but needs a much larger token budget and a strong model; LLM self-evaluation helps but is inconsistent across datasets. Practical takeaway: compare methods at equal token budgets, and test self-evaluation on your task before relying on it.

Problem Statement

Existing comparisons of LLM reasoning strategies report performance without accounting for compute spent. This can hide that gains come from extra tokens/queries rather than algorithmic improvements. The paper asks: how do reasoning strategies compare when evaluated under equal compute budgets, and when is self-evaluation actually helpful?

Main Contribution

Proposes a budget-aware evaluation framework for reasoning strategies using queries, tokens, and monetary cost, and recommends tokens as the most holistic metric.
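The three budget metrics named above can be tracked together with very little code. The following is a minimal sketch, not the paper's implementation: `BudgetTracker` and the two price constants are hypothetical names and values chosen for illustration, and real per-token prices depend on the provider and model.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1k tokens; replace with your provider's rates.
PRICE_PER_1K_INPUT = 0.0005
PRICE_PER_1K_OUTPUT = 0.0015


@dataclass
class BudgetTracker:
    """Accumulates queries, tokens, and monetary cost for one reasoning run."""
    queries: int = 0
    input_tokens: int = 0
    output_tokens: int = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        # One call to record() corresponds to one model query.
        self.queries += 1
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    @property
    def cost_usd(self) -> float:
        return (self.input_tokens / 1000 * PRICE_PER_1K_INPUT
                + self.output_tokens / 1000 * PRICE_PER_1K_OUTPUT)


tracker = BudgetTracker()
tracker.record(input_tokens=400, output_tokens=150)
tracker.record(input_tokens=400, output_tokens=200)
print(tracker.queries, tracker.total_tokens, tracker.cost_usd)
```

Two runs with the same query count can differ widely in `total_tokens`, which is one way to read the recommendation to budget by tokens rather than by queries.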

Empirically compares seven reasoning strategies across five datasets and multiple models (GPT-3.5, GPT-4, Mistral, LLaMA-2, Mixtral) under matched budgets.

Key Findings

When token/query budgets are matched, chain-of-thought with self-consistency (CoT SC) often matches or outperforms more complex methods like Multi-Agent Debate and Reflexion.

Numbers: Experiments run up to 20 queries or 10k tokens; CoT SC outperforms MAD and Reflexion across the 5 datasets, except on HotpotQA.

Practical Use: When testing or deploying reasoning pipelines, compare methods at equal token budgets; favor CoT + self-consistency as a strong, low-compute baseline.

Evidence Ref: Section 4.1; Figures 1, 3
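As an illustration of running self-consistency against a token budget, here is a minimal sketch. It assumes a hypothetical `sample_fn` that returns one chain-of-thought sample as a `(final_answer, tokens_used)` pair; the stub sampler below stands in for a real model call, and the stopping rule is an assumption rather than the paper's exact protocol.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Iterable, Optional, Tuple

Sample = Tuple[str, int]  # (final answer, tokens consumed by this sample)


def self_consistency(sample_fn: Callable[[], Sample],
                     token_budget: int) -> Tuple[Optional[str], int]:
    """Draw independent samples until the budget is used, then majority-vote.

    The last sample may overshoot the budget slightly, because its token
    count is only known after it has been generated.
    """
    votes: Counter = Counter()
    spent = 0
    while spent < token_budget:
        answer, tokens = sample_fn()
        if tokens <= 0:
            raise ValueError("each sample must consume a positive number of tokens")
        spent += tokens
        votes[answer] += 1
    if not votes:
        return None, spent
    # most_common(1) returns the answer with the highest vote count.
    return votes.most_common(1)[0][0], spent


def make_sampler(answers: Iterable[str], tokens_per_sample: int) -> Callable[[], Sample]:
    """Hypothetical stand-in for a model call: replays fixed answers in order."""
    stream = cycle(answers)
    return lambda: (next(stream), tokens_per_sample)


answer, spent = self_consistency(
    make_sampler(["18", "18", "20", "18", "17"], 100), token_budget=450)
print(answer, spent)  # prints: 18 500
```

Because each sample is drawn independently, the same loop can be rerun at any budget, which makes it straightforward to compare against a more complex method at a matched token count.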

LLM self-evaluation quality varies widely by dataset; GPT-4 binary self-eval total accuracy ranges from 0.547 to 0.937 across datasets.

Numbers: GPT-4 self-eval total accuracy: GSM8K 0.937, MATH 0.707, TheoremQA 0.547, HotpotQA 0.675, CSQA 0.901

Practical Use: Do not treat LLM self-evaluation as an oracle. Use it on easier tasks (math word problems, GSM8K) but validate on hard tasks (TheoremQA) before relying on it to control generations.

Evidence Ref: Section 5.4, Table 1
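A minimal sketch of how a binary self-evaluation accuracy could be computed on your own validation set. It assumes "total accuracy" means the fraction of answers where the model's yes/no verdict agrees with ground-truth correctness; that reading, the function name, and the toy data are assumptions, not taken from the paper.

```python
from typing import Sequence


def self_eval_accuracy(verdicts: Sequence[bool], truths: Sequence[bool]) -> float:
    """Fraction of items where the self-eval verdict matches ground truth.

    verdicts[i] is True if the model judged its own answer i to be correct;
    truths[i] is True if answer i actually was correct.
    """
    if len(verdicts) != len(truths) or not verdicts:
        raise ValueError("need two non-empty sequences of equal length")
    agreements = sum(v == t for v, t in zip(verdicts, truths))
    return agreements / len(verdicts)


# Toy data: the model wrongly approves one incorrect answer (second item).
verdicts = [True, True, False, True]
truths = [True, False, False, True]
print(self_eval_accuracy(verdicts, truths))  # prints: 0.75
```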

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | GSM8K 0.937; MATH 0.707; TheoremQA 0.547; HotpotQA 0.675; CSQA 0.901 | | | Table 1, GPT-4 | Section 5.4, Table 1 | Table 1
Accuracy | GPT-4 proposer + GPT-3.5 evaluator: 72% acc at $33.53; GPT-4 evaluator: 76% acc at $159.87 | CoT SC | ≈+4ppt accuracy for ≈5x cost | Game of 24 | Section 5.2, Figure 8 | Figure 8
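The "≈+4ppt accuracy for ≈5x cost" delta in the Game of 24 row follows directly from the reported figures. The short calculation below reproduces it; the dollars-per-point figure is a derived illustration, not a number reported in the source.

```python
# Reported Game of 24 figures: accuracy in percentage points, cost in USD.
cheap_acc, cheap_cost = 72, 33.53      # GPT-4 proposer + GPT-3.5 evaluator
strong_acc, strong_cost = 76, 159.87   # GPT-4 evaluator

acc_gain_points = strong_acc - cheap_acc          # 4 percentage points
cost_ratio = strong_cost / cheap_cost             # roughly 4.8x
extra_cost_per_point = (strong_cost - cheap_cost) / acc_gain_points

print(f"accuracy gain: {acc_gain_points} points")
print(f"cost ratio: {cost_ratio:.2f}x")
print(f"extra cost per accuracy point: ${extra_cost_per_point:.2f}")
```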

What To Try In 7 Days

Benchmark your reasoning pipeline by tokens (input+output) and monetized cost, not only API calls.

Use CoT + self-consistency as a baseline and measure improvement per 1k tokens spent.

Run a small self-evaluation sanity check: measure binary self-eval accuracy on a validation subset before using it to gate outputs.
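One way to make "improvement per 1k tokens" concrete is a marginal metric: the accuracy gained over the CoT + self-consistency baseline divided by the extra tokens spent. This is a sketch of one possible definition, not a metric defined in the paper, and the numbers in the example are hypothetical.

```python
def gain_per_1k_tokens(acc: float, tokens: int,
                       base_acc: float, base_tokens: int) -> float:
    """Accuracy gained over the baseline per extra 1k tokens spent.

    A negative value means the candidate spends more tokens for lower accuracy.
    """
    extra_tokens = tokens - base_tokens
    if extra_tokens <= 0:
        raise ValueError("candidate must spend more tokens than the baseline")
    return (acc - base_acc) / (extra_tokens / 1000)


# Hypothetical numbers: the baseline reaches 0.80 at 2,000 tokens per problem,
# the candidate method reaches 0.83 at 8,000 tokens per problem.
gain = gain_per_1k_tokens(acc=0.83, tokens=8000, base_acc=0.80, base_tokens=2000)
print(f"{gain:.4f} accuracy per extra 1k tokens")
```

If the candidate uses no more tokens than the baseline and is more accurate, it simply dominates, so the function raises instead of dividing by a non-positive number.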

Optimization Features

Token Efficiency
token_budgeting, cache_encoded_inputs

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Selected reasoning strategies and datasets are representative but not exhaustive; results may not generalize to all tasks.

Monetary and time constraints limited experiments and budget sweeps.

When Not To Use

Do not assume self-evaluation works as an oracle on very hard, theorem-level tasks.

Avoid deploying ToT unless you have both a high-quality model and a budget that justifies the extra tokens.

Failure Modes

Dependent multi-round methods (MAD, Reflexion) can reduce diversity and tunnel into wrong answers as budget increases.

Self-evaluation can be miscalibrated and may reinforce wrong majority answers if model confidence is misplaced.

Core Entities

Models

GPT-3.5, GPT-4, Mistral-7B-Instruct-v0.2, LLaMA-2-70b-chat, Mixtral-8x7B-Instruct-v0.1

Metrics

total_tokens, number_of_queries, monetary_cost, performance@tokens, Accuracy

Datasets

GSM8K, MATH, TheoremQA, CSQA, HotpotQA, Game of 24