An open, continuously updated leaderboard that measures LLM multi-step reasoning using chain-of-thought prompts

May 26, 2023 | 6 min

Overview

Decision Snapshot: Ready For Pilot

This paper provides a practical, reproducible leaderboard for multi-step reasoning using few-shot chain-of-thought prompts; use it to compare models, but remember closed-source scores may reflect proprietary tuning and not public checkpoints.

Citations: 15

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, Tushar Khot

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Reasoning capability separates good conversational models from ones that can solve multi-step tasks; measuring it helps pick models for products that need math, code, or multi-step decisions.

Who Should Care

Summary TLDR

The authors build Chain-of-Thought Hub (CoT Hub), an open evaluation suite that tracks LLM reasoning across six established benchmarks using few-shot chain-of-thought prompts. They run or collect results for ~19 major models (GPT, Claude, PaLM, LLaMA, Flan-T5 families) and report that reasoning performance scales with model size, that closed-source models (often tuned with RLHF) currently lead, and that LLaMA-65B is the strongest open model and a promising base for further alignment work. CoT Hub is positioned as a continuous, public benchmark for measuring multi-step reasoning.

Problem Statement

As LLMs rapidly evolve, practitioners need a focused, repeatable way to track multi-step reasoning abilities. Existing leaderboards mix many capabilities or use answer-only prompts; this work builds a continuously updated, chain-of-thought-focused evaluation suite to compare models on reasoning tasks.

Main Contribution

A public, continuously updated evaluation suite (CoT Hub) that aggregates reasoning benchmarks and model scores using few-shot chain-of-thought prompting.

A leaderboard comparing ~19 major LLM checkpoints (GPT, Claude, PaLM, LLaMA, Flan-T5) across six reasoning datasets and 100+ subtasks.

Key Findings

Reasoning performance scales with model size.

Numbers: GSM8k: GPT-4 92.0 vs LLaMA-65B 50.9

Practical Use: To improve multi-step reasoning, prioritize larger base models or scale-up trials; expect a roughly log-linear gain with size on these benchmarks.

Evidence Ref: Table 1
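The "roughly log-linear gain with size" noted above can be checked on your own measurements with a small least-squares fit. The sketch below is an illustration, not the authors' analysis: the function name `fit_log_linear` is hypothetical, and the three data points are synthetic placeholders chosen to lie exactly on a line. They are not scores from the paper.

```python
import math


def fit_log_linear(sizes_b, accuracies):
    """Least-squares fit of accuracy = a + b * log10(size).

    sizes_b: model sizes (e.g. in billions of parameters), all > 0.
    accuracies: matching accuracy values in percent.
    Returns the intercept a and the slope b (points gained per 10x size).
    """
    if len(sizes_b) != len(accuracies) or len(sizes_b) < 2:
        raise ValueError("need at least two (size, accuracy) pairs")
    xs = [math.log10(s) for s in sizes_b]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("sizes must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# SYNTHETIC placeholder points (not from the paper): replace with your own
# measured (size, accuracy) pairs for one model family.
a, b = fit_log_linear([1, 10, 100], [10.0, 30.0, 50.0])
print(f"intercept={a:.1f}, points per 10x size={b:.1f}")
```

A slope that stays stable as you add larger checkpoints is consistent with the log-linear reading; a flattening slope suggests the trend does not extrapolate for that family.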

Top-performing models are usually RLHF-tuned.

Numbers: Top GSM8k ranks: GPT-4 (RLHF) 92.0, claude-v1.3 (RLHF) 81.8*, PaLM-2 (base) 80.7

Practical Use: Invest in supervised finetuning and reinforcement learning from human feedback (RLHF) to close gaps between open-source and leading models.

Evidence Ref: Table 1 and discussion

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | GPT-4 92.0; claude-v1.3 81.8*; PaLM-2 80.7; gpt-3.5-turbo 74.9*; code-davinci-002 66.6; LLaMA-65B 50.9 | - | - | GSM8k | Table 1 GSM8k column | Table 1
Accuracy | GPT-4 42.5; Minerva 33.6; PaLM-2 34.3; code-davinci-002 19.1; LLaMA-65B 10.6 | - | - | MATH | Table 1 MATH column | Table 1
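The Results rows above list raw accuracies with no baseline or delta. As a rough sketch, the gaps can be derived directly from those numbers. Using LLaMA-65B as the baseline is an assumption made here for illustration (the summary calls it the strongest open model); the helper name `gaps_to_baseline` is hypothetical, and the asterisk markers from the table are dropped.

```python
# Accuracies copied from the Results rows above (Table 1 of the paper).
GSM8K = {
    "GPT-4": 92.0,
    "claude-v1.3": 81.8,
    "PaLM-2": 80.7,
    "gpt-3.5-turbo": 74.9,
    "code-davinci-002": 66.6,
    "LLaMA-65B": 50.9,
}
MATH = {
    "GPT-4": 42.5,
    "Minerva": 33.6,
    "PaLM-2": 34.3,
    "code-davinci-002": 19.1,
    "LLaMA-65B": 10.6,
}


def gaps_to_baseline(scores, baseline):
    """Return each model's accuracy minus the baseline's, in points."""
    base = scores[baseline]
    return {
        model: round(score - base, 1)
        for model, score in scores.items()
        if model != baseline
    }


# Assumption: LLaMA-65B is used as the reference open model.
gsm8k_gaps = gaps_to_baseline(GSM8K, "LLaMA-65B")
math_gaps = gaps_to_baseline(MATH, "LLaMA-65B")
print(gsm8k_gaps["GPT-4"])  # 41.1
print(math_gaps["GPT-4"])   # 31.9
```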

What To Try In 7 Days

Run CoT few-shot prompts on GSM8k and MMLU for candidate models to spot reasoning gaps.
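A few-shot chain-of-thought prompt pairs each worked example with its reasoning before posing the new question. The sketch below shows one plausible layout. The template wording, the `Exemplar` class, and the sample exemplar are all assumptions for illustration; they are not the prompt files shipped with CoT Hub, and the exemplar is made up rather than taken from GSM8k.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str
    rationale: str  # worked reasoning steps shown to the model
    answer: str     # final answer only


def build_cot_prompt(exemplars, question):
    """Join worked exemplars and the target question into one prompt."""
    blocks = []
    for ex in exemplars:
        blocks.append(
            f"Question: {ex.question}\n"
            "Let's think step by step.\n"
            f"{ex.rationale}\n"
            f"The answer is {ex.answer}."
        )
    # The target question ends with the reasoning cue so the model
    # continues with its own chain of thought.
    blocks.append(f"Question: {question}\nLet's think step by step.")
    return "\n\n".join(blocks)


# Hypothetical exemplar, written for this sketch.
demo = Exemplar(
    question="Tom has 3 apples and buys 2 more. How many apples does he have?",
    rationale="Tom starts with 3 apples. He buys 2 more, so 3 + 2 = 5.",
    answer="5",
)
prompt = build_cot_prompt([demo], "A box holds 4 pens. How many pens are in 6 boxes?")
print(prompt)
```

Send the resulting string to each candidate model with the same exemplars and decoding settings so that score differences reflect the model rather than the prompt.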

If using LLaMA-65B, prototype an SFT+RLHF alignment pipeline to test practical gains.

Add final-answer accuracy for reasoning tasks to your model acceptance criteria.
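Turning final-answer accuracy into an acceptance criterion can be as simple as a thresholded check in your evaluation script. The following is a minimal sketch: the function names, the exact-match comparison, and the 70.0 threshold are hypothetical choices, and the sample predictions are placeholders rather than real model output.

```python
def final_answer_accuracy(predictions, references):
    """Percentage of predictions that exactly match the reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        raise ValueError("no examples to score")
    correct = sum(
        pred.strip() == ref.strip()
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


def passes_reasoning_gate(accuracy, threshold):
    """True when accuracy meets or exceeds the acceptance threshold."""
    return accuracy >= threshold


# Placeholder answers for illustration only.
preds = ["5", "12", "7", "9"]
refs = ["5", "12", "8", "9"]
accuracy = final_answer_accuracy(preds, refs)
print(accuracy)  # 75.0
print(passes_reasoning_gate(accuracy, threshold=70.0))  # True
```

Pick the threshold from the score of the model you are replacing or from the leaderboard entry closest to your target, not from this example.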

Optimization Features

Training Optimization
SFT, RL

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Data URLs

GSM8k, MATH, MMLU, BigBench Hard, HumanEval, C-Eval

Risks & Boundaries

Limitations

Evaluation uses final-answer accuracy only; intermediate step correctness is not scored.

Some models are not fully tested due to lack of public API or undisclosed scales.
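The first limitation above is easy to see in code. A final-answer scorer typically pulls the last number out of the model's completion and ignores everything before it. The sketch below uses a "last number wins" rule, which is an assumption for illustration and not necessarily the extraction logic in the CoT Hub scripts; the two completions are made up.

```python
import re

# Optional minus sign, digits with optional thousands separators,
# optional decimal part.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(completion):
    """Return the last number in the completion as a string, or None."""
    matches = _NUMBER.findall(completion)
    if not matches:
        return None
    return matches[-1].replace(",", "")


sound = "Tom starts with 3 apples and buys 2 more. 3 + 2 = 5. The answer is 5."
flawed = "Tom starts with 3 apples and buys 2 more. 3 * 2 = 5. The answer is 5."

# Both completions receive the same credit even though the second one
# contains an incorrect intermediate step.
print(extract_final_number(sound))   # 5
print(extract_final_number(flawed))  # 5
```

If intermediate-step correctness matters for your product, pair this score with a separate check of the reasoning, such as manual review of a sample.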

When Not To Use

As the sole benchmark for truthfulness or safety evaluations.

To measure intermediate-step faithfulness, because it reports final-answer accuracy only.

Failure Modes

Correct final answer with incorrect intermediate steps (not detected).

Benchmarks may be gamed by prompt engineering rather than genuine reasoning.

Core Entities

Models

GPT-4, gpt-3.5-turbo, text-davinci-003, text-davinci-002, code-davinci-002, claude-v1.3, claude-instant-v1.0, PaLM-2, PaLM, Flan-PaLM, Flan-U-PaLM, Minerva, LLaMA-65B, LLaMA-33B, LLaMA-13B, LLaMA-7B, Flan-T5-11B, Flan-T5-3B

Metrics

Accuracy

Datasets

GSM8k, MATH, MMLU, BigBench Hard, HumanEval, C-Eval

Benchmarks

GSM8k, MATH, MMLU, BBH, HumanEval, C-Eval