An open, continuously updated leaderboard that measures LLM multi-step reasoning using chain-of-thought prompts

May 26, 2023 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

15

Authors

Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, Tushar Khot

Why It Matters For Business

Reasoning capability separates models that are merely good at conversation from models that can solve multi-step tasks. Measuring it helps teams pick models for products that need math, code, or multi-step decisions.

Summary TLDR

The authors build Chain-of-Thought Hub (CoT Hub), an open evaluation suite that tracks LLM reasoning across six established benchmarks using few-shot chain-of-thought prompts. They run or collect results for ~19 major models (GPT, Claude, PaLM, LLaMA, Flan-T5 families) and report that reasoning performance scales with model size, that closed-source models (often tuned with RLHF) currently lead, and that LLaMA-65B is the strongest open model and a promising base for further alignment work. CoT Hub is positioned as a continuous, public benchmark for measuring multi-step reasoning.

Problem Statement

As LLMs rapidly evolve, practitioners need a focused, repeatable way to track multi-step reasoning abilities. Existing leaderboards mix many capabilities or use answer-only prompts; this work builds a continuously updated, chain-of-thought-focused evaluation suite to compare models on reasoning tasks.

Main Contribution

A public, continuously updated evaluation suite (CoT Hub) that aggregates reasoning benchmarks and model scores using few-shot chain-of-thought prompting.

A leaderboard comparing ~19 major LLM checkpoints (GPT, Claude, PaLM, LLaMA, Flan-T5) across six reasoning datasets and 100+ subtasks.

Practical observations: reasoning performance correlates with model scale; closed-source RLHF models lead; LLaMA-65B is the top open checkpoint and a promising base for alignment.

Key Findings

Reasoning performance scales with model size.

Numbers (GSM8k): GPT-4 92.0 vs LLaMA-65B 50.9

Top-performing models are usually RLHF-tuned.

Numbers (top GSM8k ranks): GPT-4 (RLHF) 92.0, claude-v1.3 (RLHF) 81.8*, PaLM-2 (base) 80.7

Open-source models lag behind closed-source leaders but have potential.

Numbers: MMLU, code-davinci-002 64.5 vs LLaMA-65B 63.4; GSM8k, code-davinci-002 66.6 vs LLaMA-65B 50.9

CoT Hub evaluates with few-shot chain-of-thought prompts rather than answer-only prompts.

Numbers: none reported; the method section states that evaluation uses few-shot CoT across datasets.
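
To make the difference between few-shot chain-of-thought and answer-only prompting concrete, here is a minimal sketch. The `Exemplar` type, the `build_prompt` helper, the "Question:/Answer:" layout, and the demo questions are all illustrative assumptions; they are not taken from the CoT Hub prompt files.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str
    rationale: str  # worked reasoning steps; only shown in CoT mode
    answer: str


def build_prompt(exemplars, question, chain_of_thought=True):
    """Assemble a few-shot prompt. The text layout is an assumption."""
    parts = []
    for ex in exemplars:
        if chain_of_thought:
            # CoT exemplar: reasoning first, then the final answer.
            body = f"{ex.rationale} The answer is {ex.answer}."
        else:
            # Answer-only exemplar: the reasoning is omitted.
            body = f"The answer is {ex.answer}."
        parts.append(f"Question: {ex.question}\nAnswer: {body}")
    # The target question ends with an open "Answer:" for the model to complete.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical exemplar and target question, for illustration only.
demo = [
    Exemplar(
        question="Tom has 3 boxes with 4 pens each. How many pens does he have?",
        rationale="Each box has 4 pens and there are 3 boxes, so 3 * 4 = 12.",
        answer="12",
    )
]
target = "A train travels 60 km per hour for 2 hours. How far does it go?"
cot_prompt = build_prompt(demo, target)
answer_only_prompt = build_prompt(demo, target, chain_of_thought=False)
print(cot_prompt)
```

The only difference between the two prompts is whether the exemplars show their reasoning, which is the variable the suite holds fixed at "shown".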

Results

GSM8k accuracy

Value: GPT-4 92.0; claude-v1.3 81.8*; PaLM-2 80.7; gpt-3.5-turbo 74.9*; code-davinci-002 66.6; LLaMA-65B 50.9

MATH accuracy

Value: GPT-4 42.5; Minerva 33.6; PaLM-2 34.3; code-davinci-002 19.1; LLaMA-65B 10.6

MMLU accuracy

Value: GPT-4 86.4; PaLM-2 78.3; claude-v1.3 74.8*; code-davinci-002 64.5; LLaMA-65B 63.4

HumanEval pass@1 (approx)

Value: GPT-4 67.0; gpt-3.5-turbo 48.1; code-davinci-002 47.0; LLaMA-65B 23.7

C-Eval accuracy

Value: GPT-4 68.7*; gpt-3.5-turbo 54.4*; claude-v1.3 54.2*; LLaMA-65B 38.8*
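
The HumanEval entry above is a pass@1 score. For readers who want to reproduce that kind of number, the sketch below implements the unbiased pass@k estimator commonly used with HumanEval, 1 - C(n - c, k) / C(n, k), where n is the number of samples drawn per problem and c the number that pass the unit tests. This summary does not say how the approximate pass@1 values were obtained, so treat this as the standard formula and not as the authors' script; the function names and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples passing the tests, k: budget scored.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, which makes the estimate 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical (n, c) counts for three problems, not real benchmark data.
results = [(10, 5), (10, 0), (10, 10)]
print(mean_pass_at_k(results, 1))  # 0.5
```

With k = 1 the estimator reduces to c / n per problem, so pass@1 is simply the average fraction of samples that pass.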

Who Should Care

What To Try In 7 Days

Run CoT few-shot prompts on GSM8k and MMLU for candidate models to spot reasoning gaps.

If using LLaMA-65B, prototype an SFT+RLHF alignment pipeline to test practical gains.

Add final-answer accuracy for reasoning tasks to your model acceptance criteria.
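
The first and third steps above can be sketched as a small scoring script. This is a minimal illustration under stated assumptions: it treats the last number in a completion as the final answer (a common heuristic for GSM8k-style outputs, not necessarily what CoT Hub does), and the helper names, sample completions, and the 0.6 threshold are all hypothetical.

```python
import re

# Matches integers and decimals, optionally negative, with thousands commas.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_answer(completion):
    """Return the last number in the completion as a string, or None."""
    matches = NUMBER.findall(completion)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def is_correct(completion, gold):
    """Compare the extracted answer to the gold answer numerically."""
    pred = extract_final_answer(completion)
    if pred is None:
        return False
    return abs(float(pred) - float(gold)) < 1e-6


def final_answer_accuracy(completions, golds):
    """Fraction of completions whose final answer matches the gold answer."""
    if len(completions) != len(golds) or not golds:
        raise ValueError("need equally sized, non-empty inputs")
    hits = sum(is_correct(c, g) for c, g in zip(completions, golds))
    return hits / len(golds)


def passes_gate(accuracy, threshold):
    """Acceptance check: accuracy must reach the chosen threshold."""
    return accuracy >= threshold


# Hypothetical model outputs and gold answers, for illustration only.
completions = [
    "3 * 4 = 12. The answer is 12.",
    "60 * 2 = 120, so the answer is 120 km.",
    "I am not sure.",
]
golds = ["12", "120", "7"]
accuracy = final_answer_accuracy(completions, golds)
print(accuracy, passes_gate(accuracy, threshold=0.6))
```

In practice the completions would come from running few-shot CoT prompts against each candidate model, and the threshold would be set per task.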

Optimization Features

Training Optimization

  • SFT
  • RL

Reproducibility

Data URLs

  • GSM8k
  • MATH
  • MMLU
  • BIG-Bench Hard
  • HumanEval
  • C-Eval

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Evaluation uses final-answer accuracy only; intermediate step correctness is not scored.
  • Some models are not fully tested due to lack of public API or undisclosed scales.
  • Few-shot CoT setting favors models that respond well to chain-of-thought prompting and may not reflect zero-shot behavior.

When Not To Use

  • As the sole benchmark for truthfulness or safety evaluations.
  • To measure intermediate-step faithfulness; it reports final-answer accuracy only.
  • To compare models on non-reasoning tasks like pure generation quality.

Failure Modes

  • Correct final answer with incorrect intermediate steps (not detected).
  • Benchmarks may be gamed by prompt engineering rather than genuine reasoning.
  • Closed-source model scores may reflect undisclosed tuning, making apples-to-apples comparison hard.
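
The first failure mode above can be shown with a toy example. The sketch below scores a chain by its final number and, separately, re-checks every simple "a op b = c" step. The step checker is a hypothetical add-on for illustration, not part of CoT Hub; it only understands single binary arithmetic steps and would misread chained expressions such as "2 + 3 * 4 = 14".

```python
import operator
import re

NUM = r"-?\d+(?:\.\d+)?"
# One binary arithmetic step of the form "a op b = c".
STEP = re.compile(rf"({NUM})\s*([+\-*/])\s*({NUM})\s*=\s*({NUM})")
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def final_number(chain):
    """Last number in the chain, used here as the final answer."""
    numbers = re.findall(NUM, chain)
    return numbers[-1] if numbers else None


def find_bad_steps(chain):
    """Return the text of every 'a op b = c' step whose arithmetic is wrong."""
    bad = []
    for match in STEP.finditer(chain):
        a, op, b, c = match.groups()
        if op == "/" and float(b) == 0:
            bad.append(match.group(0))  # division by zero is never valid
            continue
        if abs(OPS[op](float(a), float(b)) - float(c)) > 1e-6:
            bad.append(match.group(0))
    return bad


# Hypothetical chain: both steps are wrong, yet the final answer (10) is right.
chain = (
    "There are 3 boxes with 4 pens each, so 3 * 4 = 14. "
    "Giving away 2 leaves 14 - 2 = 10. The answer is 10."
)
gold = "10"
print(final_number(chain) == gold)  # True: final-answer scoring accepts it
print(find_bad_steps(chain))        # ['3 * 4 = 14', '14 - 2 = 10']
```

Final-answer accuracy counts this chain as correct, which is exactly the case the leaderboard cannot detect.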

Core Entities

Models

  • GPT-4
  • gpt-3.5-turbo
  • text-davinci-003
  • text-davinci-002
  • code-davinci-002
  • claude-v1.3
  • claude-instant-v1.0
  • PaLM-2
  • PaLM
  • Flan-PaLM
  • Flan-U-PaLM
  • Minerva
  • LLaMA-65B
  • LLaMA-33B
  • LLaMA-13B
  • LLaMA-7B
  • Flan-T5-11B
  • Flan-T5-3B

Metrics

  • Accuracy

Datasets

  • GSM8k
  • MATH
  • MMLU
  • BIG-Bench Hard (BBH)
  • HumanEval
  • C-Eval

Benchmarks

  • GSM8k
  • MATH
  • MMLU
  • BBH
  • HumanEval
  • C-Eval