An open, continuously updated leaderboard that measures LLM multi-step reasoning using chain-of-thought prompts

Overview

Decision Snapshot: Ready For Pilot

This paper provides a practical, reproducible leaderboard for multi-step reasoning using few-shot chain-of-thought prompts. Use it to compare models, but remember that scores for closed-source models may reflect proprietary tuning rather than publicly available checkpoints.

Citations: 15

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, Tushar Khot

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Reasoning capability separates good conversational models from ones that can solve multi-step tasks; measuring it helps pick models for products that need math, code, or multi-step decisions.

Who Should Care

ML Engineer, Data Scientist, CTO, Product Manager

Summary TLDR

The authors build Chain-of-Thought Hub (CoT Hub), an open evaluation suite that tracks LLM reasoning across six established benchmarks using few-shot chain-of-thought prompts. They run or collect results for ~19 major models (GPT, Claude, PaLM, LLaMA, Flan-T5 families) and report that reasoning performance scales with model size, that closed-source models (often tuned with RLHF) currently lead, and that LLaMA-65B is the strongest open model and a promising base for further alignment work. CoT Hub is positioned as a continuous, public benchmark for measuring multi-step reasoning.

Problem Statement

As LLMs rapidly evolve, practitioners need a focused, repeatable way to track multi-step reasoning abilities. Existing leaderboards mix many capabilities or use answer-only prompts; this work builds a continuously updated, chain-of-thought-focused evaluation suite to compare models on reasoning tasks.

Main Contribution

A public, continuously updated evaluation suite (CoT Hub) that aggregates reasoning benchmarks and model scores using few-shot chain-of-thought prompting.

A leaderboard comparing ~19 major LLM checkpoints (GPT, Claude, PaLM, LLaMA, Flan-T5) across six reasoning datasets and 100+ subtasks.

Key Findings

Reasoning performance scales with model size.

Numbers: GSM8k accuracy, GPT-4 92.0 vs LLaMA-65B 50.9

Practical Use: To improve multi-step reasoning, prioritize larger base models or scale-up trials; expect a roughly log-linear gain with size on these benchmarks.

Evidence Ref: Table 1
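The "roughly log-linear" trend mentioned above can be written as a simple relation. This is a sketch only: the symbols N (parameter count), alpha, and beta are assumptions introduced here for illustration, and no fitted values are taken from the paper.

```latex
% Assumed form: accuracy grows linearly in the logarithm of model size N.
\mathrm{acc}(N) \approx \alpha + \beta \log N
% Under this form, doubling the model size adds a constant number of points:
\mathrm{acc}(2N) - \mathrm{acc}(N) \approx \beta \log 2
```

Read this way, each doubling of size buys about the same absolute gain, which is why scale-up trials tend to show diminishing returns per added parameter.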

Top-performing models are usually RLHF-tuned.

Numbers: Top GSM8k ranks: GPT-4 (RLHF) 92.0, claude-v1.3 (RLHF) 81.8*, PaLM-2 (base) 80.7

Practical Use: Invest in supervised finetuning and reinforcement learning from human feedback (RLHF) to close gaps between open-source and leading models.

Evidence Ref: Table 1 and discussion

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	GPT-4 92.0; claude-v1.3 81.8; PaLM-2 80.7; gpt-3.5-turbo 74.9; code-davinci-002 66.6; LLaMA-65B 50.9	—	—	GSM8k	Table 1 GSM8k column	Table 1
Accuracy	GPT-4 42.5; PaLM-2 34.3; Minerva 33.6; code-davinci-002 19.1; LLaMA-65B 10.6	—	—	MATH	Table 1 MATH column	Table 1
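The table above leaves the Baseline and Delta columns empty. A minimal sketch of how a reader might fill them in is shown here, using only the accuracy values listed in the table. Choosing LLaMA-65B as the baseline is an assumption made for illustration (it is the open model the summary highlights), and `rank_with_delta` is a hypothetical helper, not part of CoT Hub.

```python
# Accuracy values copied from the Results table (Table 1 in the paper).
SCORES = {
    "GSM8k": {
        "GPT-4": 92.0, "claude-v1.3": 81.8, "PaLM-2": 80.7,
        "gpt-3.5-turbo": 74.9, "code-davinci-002": 66.6, "LLaMA-65B": 50.9,
    },
    "MATH": {
        "GPT-4": 42.5, "PaLM-2": 34.3, "Minerva": 33.6,
        "code-davinci-002": 19.1, "LLaMA-65B": 10.6,
    },
}


def rank_with_delta(dataset, baseline="LLaMA-65B"):
    """Return (model, accuracy, delta_vs_baseline) rows, best model first.

    The baseline choice is an illustrative assumption, not the paper's.
    """
    scores = SCORES[dataset]
    base = scores[baseline]
    rows = [(model, acc, round(acc - base, 1)) for model, acc in scores.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for dataset in SCORES:
        print(dataset)
        for model, acc, delta in rank_with_delta(dataset):
            print(f"  {model:<18} {acc:5.1f}  {delta:+5.1f}")
```

Any other model in the table can be passed as `baseline` to view the gaps from a different reference point.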

What To Try In 7 Days

Run CoT few-shot prompts on GSM8k and MMLU for candidate models to spot reasoning gaps.

If using LLaMA-65B, prototype an SFT+RLHF alignment pipeline to test practical gains.

Add final-answer accuracy for reasoning tasks to your model acceptance criteria.
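The steps above can be sketched as a small harness: build a few-shot chain-of-thought prompt, pull the final number out of each completion, and score final-answer accuracy. Everything here is an assumption for illustration: the exemplar is made up, the function names are hypothetical, and the "last number wins" extraction rule is a common heuristic rather than CoT Hub's confirmed implementation. The model call itself is left out, so plug in whichever client you use.

```python
import re

# Hypothetical worked exemplar; real runs would use the benchmark's own few-shot set.
EXEMPLARS = [
    ("Tom has 3 boxes with 4 pens each. How many pens does he have?",
     "Each box has 4 pens and there are 3 boxes, so 3 * 4 = 12. The answer is 12."),
]


def build_cot_prompt(question, exemplars=EXEMPLARS):
    """Concatenate worked exemplars, then the new question with an open answer slot."""
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in exemplars]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


def extract_final_answer(completion):
    """Treat the last number in the completion as the final answer (heuristic)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return float(numbers[-1]) if numbers else None


def final_answer_accuracy(completions, gold_answers):
    """Fraction of completions whose extracted final answer equals the gold answer."""
    correct = sum(
        extract_final_answer(c) == float(g)
        for c, g in zip(completions, gold_answers)
    )
    return correct / len(gold_answers)


if __name__ == "__main__":
    print(build_cot_prompt("A shelf holds 5 red and 7 blue books. How many books?"))
    completions = [
        "5 + 7 = 12. The answer is 12.",
        "She has 9 left. The answer is 9.",
    ]
    print(final_answer_accuracy(completions, [12, 10]))  # prints 0.5
```

A threshold on `final_answer_accuracy` could then serve as the acceptance criterion mentioned in the last step.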

Optimization Features

Training Optimization

SFT, RL

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/FranxYao/chain-of-thought-hub

Data URLs

GSM8k, MATH, MMLU, BigBench Hard, HumanEval, C-Eval

Risks & Boundaries

Limitations

Evaluation uses final-answer accuracy only; intermediate step correctness is not scored.

Some models are not fully tested due to lack of public API or undisclosed scales.

When Not To Use

As the sole benchmark for truthfulness or safety evaluations.

To measure intermediate-step faithfulness, since it reports final-answer accuracy only.

Failure Modes

Correct final answer with incorrect intermediate steps (not detected).

Benchmarks may be gamed by prompt engineering rather than genuine reasoning.
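The first failure mode can be made concrete with a toy example. The sketch below scores a reasoning chain by its final number only, then separately checks each "a op b = c" step. Both helpers and the example chain are hypothetical and written for this illustration; the step checker is not something the leaderboard provides.

```python
import re

# Matches simple integer arithmetic steps such as "3 * 4 = 12".
STEP = re.compile(r"(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)")
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def final_answer_correct(chain, gold):
    """Final-answer scoring: only the last number in the chain is compared."""
    numbers = re.findall(r"\d+", chain)
    return bool(numbers) and int(numbers[-1]) == gold


def bad_steps(chain):
    """Return every 'a op b = c' step whose arithmetic does not hold."""
    return [
        m.group(0)
        for m in STEP.finditer(chain)
        if OPS[m.group(2)](int(m.group(1)), int(m.group(3))) != int(m.group(4))
    ]


# Made-up chain for "3 boxes of 4 pens, 1 pen lost" (gold answer 11):
# two errors cancel out, so the final answer is still right.
EXAMPLE_CHAIN = "He has 3 * 4 = 13 pens. After the loss, 13 - 2 = 11. The answer is 11."

if __name__ == "__main__":
    print(final_answer_correct(EXAMPLE_CHAIN, 11))  # prints True
    print(bad_steps(EXAMPLE_CHAIN))                 # prints ['3 * 4 = 13']
```

The chain is counted as correct under final-answer accuracy even though its first step is wrong, which is exactly the case the leaderboard cannot detect.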

Core Entities

Models

GPT-4, gpt-3.5-turbo, text-davinci-003, text-davinci-002, code-davinci-002, claude-v1.3, claude-instant-v1.0, PaLM-2, PaLM, Flan-PaLM, Flan-U-PaLM, Minerva, LLaMA-65B, LLaMA-33B, LLaMA-13B, LLaMA-7B, Flan-T5-11B, Flan-T5-3B

Metrics

Accuracy

Datasets

GSM8k, MATH, MMLU, BigBench Hard, HumanEval, C-Eval

Benchmarks

GSM8k, MATH, MMLU, BBH, HumanEval, C-Eval
