Overview
This paper provides a practical, reproducible leaderboard for multi-step reasoning based on few-shot chain-of-thought prompts. Use it to compare models, but keep in mind that closed-source scores may reflect proprietary tuning rather than publicly available checkpoints.
Citations: 15
Evidence Strength: 0.80
Confidence: 0.86
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 0/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
Reasoning capability separates good conversational models from ones that can solve multi-step tasks; measuring it helps pick models for products that need math, code, or multi-step decisions.
Summary TLDR
The authors build Chain-of-Thought Hub (CoT Hub), an open evaluation suite that tracks LLM reasoning across six established benchmarks using few-shot chain-of-thought prompts. They run or collect results for ~19 major models (GPT, Claude, PaLM, LLaMA, Flan-T5 families) and report that reasoning performance scales with model size, that closed-source models (often tuned with RLHF) currently lead, and that LLaMA-65B is the strongest open model and a promising base for further alignment work. CoT Hub is positioned as a continuous, public benchmark for measuring multi-step reasoning.
Problem Statement
As LLMs rapidly evolve, practitioners need a focused, repeatable way to track multi-step reasoning abilities. Existing leaderboards mix many capabilities or use answer-only prompts; this work builds a continuously updated, chain-of-thought-focused evaluation suite to compare models on reasoning tasks.
Main Contribution
A public, continuously updated evaluation suite (CoT Hub) that aggregates reasoning benchmarks and model scores using few-shot chain-of-thought prompting.
A leaderboard comparing ~19 major LLM checkpoints (GPT, Claude, PaLM, LLaMA, Flan-T5) across six reasoning datasets and 100+ subtasks.
Key Findings
Reasoning performance scales with model size.
Top-performing models are usually RLHF-tuned.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | GPT-4 92.0; claude-v1.3 81.8*; PaLM-2 80.7; gpt-3.5-turbo 74.9*; code-davinci-002 66.6; LLaMA-65B 50.9 | — | — | GSM8k | Table 1 GSM8k column | Table 1 |
| Accuracy | GPT-4 42.5; PaLM-2 34.3; Minerva 33.6; code-davinci-002 19.1; LLaMA-65B 10.6 | — | — | MATH | Table 1 MATH column | Table 1 |
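
The table reports raw accuracies without a Delta column. As a rough illustration, the gaps between models can be derived from the listed numbers. The sketch below copies the scores from the two rows above and computes each model's gap to LLaMA-65B in percentage points. The function name `gaps_to_reference` and the choice of LLaMA-65B as the reference are assumptions made for this example, not part of CoT Hub.

```python
# Accuracy values (percent) copied from the Results table above.
GSM8K = {
    "GPT-4": 92.0,
    "claude-v1.3": 81.8,
    "PaLM-2": 80.7,
    "gpt-3.5-turbo": 74.9,
    "code-davinci-002": 66.6,
    "LLaMA-65B": 50.9,
}
MATH = {
    "GPT-4": 42.5,
    "PaLM-2": 34.3,
    "Minerva": 33.6,
    "code-davinci-002": 19.1,
    "LLaMA-65B": 10.6,
}


def gaps_to_reference(scores, reference):
    """Return each model's accuracy minus the reference model's, in points.

    Hypothetical helper for illustration; not a CoT Hub function.
    """
    ref = scores[reference]
    return {m: round(s - ref, 1) for m, s in scores.items() if m != reference}


if __name__ == "__main__":
    for name, table in (("GSM8k", GSM8K), ("MATH", MATH)):
        print(name)
        gaps = gaps_to_reference(table, "LLaMA-65B")
        # Largest gap first.
        for model, gap in sorted(gaps.items(), key=lambda kv: -kv[1]):
            print(f"  {model}: {gap:+.1f} points vs LLaMA-65B")
```

On these numbers, the gap between GPT-4 and LLaMA-65B is larger in absolute points on GSM8k than on MATH, although MATH scores are much lower overall.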
What To Try In 7 Days
Run CoT few-shot prompts on GSM8k and MMLU for candidate models to spot reasoning gaps.
If using LLaMA-65B, prototype an SFT+RLHF alignment pipeline to test practical gains.
Add final-answer accuracy for reasoning tasks to your model acceptance criteria.
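
For the first step above, a few-shot chain-of-thought prompt can be assembled roughly as follows. This is a minimal sketch: the `Exemplar` class, the `build_cot_prompt` function, the exact prompt wording, and the two worked examples are all hypothetical and are not taken from CoT Hub's prompt files. In practice you would substitute the exemplars shipped with the benchmark you run.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    """One worked example: question, step-by-step rationale, final answer."""
    question: str
    rationale: str
    answer: str


def build_cot_prompt(exemplars: List[Exemplar], question: str) -> str:
    """Format the worked examples, then the target question left open."""
    blocks = [
        f"Question: {ex.question}\n"
        f"Let's think step by step. {ex.rationale}\n"
        f"The answer is {ex.answer}."
        for ex in exemplars
    ]
    # The final block stops after the cue so the model writes the reasoning.
    blocks.append(f"Question: {question}\nLet's think step by step.")
    return "\n\n".join(blocks)


# Hypothetical exemplars written for this illustration only.
EXEMPLARS = [
    Exemplar(
        question="A shelf holds 3 boxes with 4 pens each. How many pens?",
        rationale="There are 3 boxes and each has 4 pens, so 3 * 4 = 12.",
        answer="12",
    ),
    Exemplar(
        question="Sam had 10 apples and gave away 3. How many are left?",
        rationale="Sam starts with 10 and gives away 3, so 10 - 3 = 7.",
        answer="7",
    ),
]

if __name__ == "__main__":
    print(build_cot_prompt(EXEMPLARS, "A bus has 5 rows of 6 seats. How many seats?"))
```

The resulting string is what you would send to each candidate model; the model's completion is then scored on its final answer.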
Risks & Boundaries
Limitations
Evaluation uses final-answer accuracy only; intermediate step correctness is not scored.
Some models are not fully tested, either because they lack a public API or because their parameter counts are undisclosed.
When Not To Use
As the sole benchmark for truthfulness or safety evaluations.
To measure intermediate-step faithfulness, since it reports final-answer accuracy only.
Failure Modes
Correct final answer with incorrect intermediate steps (not detected).
Benchmarks may be gamed by prompt engineering rather than genuine reasoning.
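
The first failure mode above can be illustrated with a small final-answer scorer. This is a sketch under stated assumptions: taking the last number in a completion as the final answer is a common heuristic chosen here for illustration and is not necessarily how CoT Hub extracts answers. The function names and the two sample completions are hypothetical.

```python
import re

# Matches integers and decimals, with an optional leading minus sign.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_final_answer(completion):
    """Return the last number in the completion as a string, or None."""
    matches = _NUMBER.findall(completion.replace(",", ""))
    return matches[-1] if matches else None


def is_correct(completion, gold):
    """Check only the final answer; intermediate steps are ignored."""
    predicted = extract_final_answer(completion)
    return predicted is not None and float(predicted) == float(gold)


def final_answer_accuracy(completions, golds):
    """Fraction of completions whose final answer matches the gold answer."""
    hits = sum(is_correct(c, g) for c, g in zip(completions, golds))
    return hits / len(golds)


# Hypothetical completions for "3 boxes with 4 pens each. How many pens?"
SOUND = "There are 3 boxes of 4 pens, so 3 * 4 = 12. The answer is 12."
FLAWED = "Add the numbers: 3 + 4 = 7, and 7 + 5 = 12. The answer is 12."

if __name__ == "__main__":
    # Both completions end in 12, so both are counted as correct.
    print(final_answer_accuracy([SOUND, FLAWED], ["12", "12"]))
```

Both completions receive full credit even though the second chain of steps is wrong, which is why a final-answer metric alone cannot be used to judge reasoning faithfulness.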

