Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
15
Why It Matters For Business
Reasoning ability distinguishes models that merely converse well from those that can solve multi-step tasks; measuring it helps teams choose models for products that depend on math, code, or multi-step decisions.
Summary TLDR
The authors build Chain-of-Thought Hub (CoT Hub), an open evaluation suite that tracks LLM reasoning across six established benchmarks using few-shot chain-of-thought prompts. They run or collect results for ~19 major models (GPT, Claude, PaLM, LLaMA, Flan-T5 families) and report that reasoning performance scales with model size, that closed-source models (often tuned with RLHF) currently lead, and that LLaMA-65B is the strongest open model and a promising base for further alignment work. CoT Hub is positioned as a continuous, public benchmark for measuring multi-step reasoning.
Problem Statement
As LLMs rapidly evolve, practitioners need a focused, repeatable way to track multi-step reasoning abilities. Existing leaderboards mix many capabilities or use answer-only prompts; this work builds a continuously updated, chain-of-thought-focused evaluation suite to compare models on reasoning tasks.
Main Contribution
A public, continuously updated evaluation suite (CoT Hub) that aggregates reasoning benchmarks and model scores using few-shot chain-of-thought prompting.
A leaderboard comparing ~19 major LLM checkpoints (GPT, Claude, PaLM, LLaMA, Flan-T5) across six reasoning datasets and 100+ subtasks.
Practical observations: reasoning performance correlates with model scale; closed-source RLHF models lead; LLaMA-65B is the top open checkpoint and a promising base for alignment.
Key Findings
Reasoning performance scales with model size.
Top-performing models are usually RLHF-tuned.
Open-source models lag behind closed-source leaders, but LLaMA-65B is the strongest open checkpoint and a promising base for further alignment work.
CoT Hub evaluates with few-shot chain-of-thought prompts rather than answer-only prompts.
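The difference between few-shot chain-of-thought prompts and answer-only prompts can be sketched as follows. This is a minimal illustration, not CoT Hub's actual prompt format: the `Exemplar` class, the `build_prompt` function, the "Let's think step by step." cue, and the "The answer is" phrasing are all assumptions made for this sketch, and the exemplar question was written for it rather than taken from any benchmark.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One worked example shown to the model (hypothetical structure)."""
    question: str
    rationale: str  # worked reasoning steps, used only in CoT mode
    answer: str


def build_prompt(exemplars, question, chain_of_thought=True):
    """Assemble a few-shot prompt; rationales are included only in CoT mode."""
    blocks = []
    for ex in exemplars:
        if chain_of_thought:
            blocks.append(
                f"Question: {ex.question}\n"
                f"Let's think step by step. {ex.rationale}\n"
                f"The answer is {ex.answer}."
            )
        else:
            blocks.append(f"Question: {ex.question}\nThe answer is {ex.answer}.")
    # The test question ends with an open cue for the model to continue.
    cue = "Let's think step by step." if chain_of_thought else "The answer is"
    blocks.append(f"Question: {question}\n{cue}")
    return "\n\n".join(blocks)


# Illustrative exemplar written for this sketch, not drawn from GSM8k.
EXEMPLARS = [
    Exemplar(
        question="A box holds 3 red pens and 4 blue pens. How many pens are in 2 boxes?",
        rationale="One box holds 3 + 4 = 7 pens. Two boxes hold 2 * 7 = 14 pens.",
        answer="14",
    )
]

TEST_QUESTION = "Tom has 5 apples and buys 6 more. How many apples does he have now?"
cot_prompt = build_prompt(EXEMPLARS, TEST_QUESTION)
answer_only_prompt = build_prompt(EXEMPLARS, TEST_QUESTION, chain_of_thought=False)
```

Both prompts share the same exemplar questions and answers; only the CoT variant exposes the intermediate steps and invites the model to produce its own before answering.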
Results
Metrics: Accuracy; HumanEval pass@1 (approx)
Who Should Care
What To Try In 7 Days
Run CoT few-shot prompts on GSM8k and MMLU for candidate models to spot reasoning gaps.
If using LLaMA-65B, prototype an SFT+RLHF alignment pipeline to test practical gains.
Add final-answer accuracy for reasoning tasks to your model acceptance criteria.
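The last item above, adding final-answer accuracy to acceptance criteria, might look like the sketch below. It assumes numeric answers (as in GSM8k-style problems) and treats the last number in a completion as the model's final answer; the function names, the sample completions, and the threshold values are hypothetical and chosen only for illustration.

```python
import re

# Matches integers and decimals, with an optional leading minus sign.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_final_answer(completion):
    """Return the last number in the completion, or None if there is none."""
    # Drop thousands separators so "1,200" is read as "1200".
    matches = NUMBER.findall(completion.replace(",", ""))
    return matches[-1] if matches else None


def final_answer_accuracy(completions, references):
    """Fraction of completions whose final number equals the reference."""
    if len(completions) != len(references):
        raise ValueError("completions and references must have the same length")
    if not references:
        return 0.0
    correct = 0
    for completion, reference in zip(completions, references):
        predicted = extract_final_answer(completion)
        if predicted is not None and float(predicted) == float(reference):
            correct += 1
    return correct / len(references)


def passes_gate(accuracy, threshold):
    """Acceptance check: accuracy must meet or exceed the chosen threshold."""
    return accuracy >= threshold


# Made-up completions for illustration; not real model output.
completions = [
    "5 + 6 = 11. The answer is 11.",
    "Each crate has 12 eggs, so 3 crates have 36. The answer is 36.",
    "I am not sure.",
    "The total is 1,200 dollars.",
]
references = ["11", "35", "7", "1200"]
accuracy = final_answer_accuracy(completions, references)
```

Taking the last number is a common but crude heuristic; multiple-choice sets such as MMLU would need a different extractor, for example one that looks for an option letter.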
Optimization Features
Training Optimization
- SFT
- RL
Reproducibility
Data Urls
- GSM8k
- MATH
- MMLU
- BIG-Bench Hard (BBH)
- HumanEval
- C-Eval
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Evaluation uses final-answer accuracy only; intermediate step correctness is not scored.
- Some models are not fully tested due to lack of public API or undisclosed scales.
- Few-shot CoT setting favors models that respond well to chain-of-thought prompting and may not reflect zero-shot behavior.
When Not To Use
- As the sole benchmark for truthfulness or safety evaluations.
- To measure intermediate-step faithfulness; it reports final-answer accuracy only.
- To compare models on non-reasoning tasks like pure generation quality.
Failure Modes
- Correct final answer with incorrect intermediate steps (not detected).
- Benchmarks may be gamed by prompt engineering rather than genuine reasoning.
- Closed-source model scores may reflect undisclosed tuning, making apples-to-apples comparison hard.
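The first failure mode above can be made concrete with a small sketch. It assumes completions end with a "The answer is ..." sentence and that the scorer inspects only that sentence; the scorer function and both rationales are hypothetical and written for this illustration.

```python
def final_answer_correct(completion, reference):
    """Score only the text after the last 'The answer is' marker."""
    marker = "The answer is"
    if marker not in completion:
        return False
    predicted = completion.rsplit(marker, 1)[1].strip().rstrip(".")
    return predicted == reference


# Same final answer; the second rationale contains an invalid step (3 + 4 is not 12).
sound = "Each bag has 4 apples. 3 bags hold 3 * 4 = 12 apples. The answer is 12."
flawed = "Each bag has 4 apples. 3 bags hold 3 + 4 = 12 apples. The answer is 12."

# Both completions receive full credit because the steps are never inspected.
scores = [final_answer_correct(c, "12") for c in (sound, flawed)]
```

Catching the flawed rationale would require a separate step-level check, which a final-answer metric does not provide.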
Core Entities
Models
- GPT-4
- gpt-3.5-turbo
- text-davinci-003
- text-davinci-002
- code-davinci-002
- claude-v1.3
- claude-instant-v1.0
- PaLM-2
- PaLM
- Flan-PaLM
- Flan-U-PaLM
- Minerva
- LLaMA-65B
- LLaMA-33B
- LLaMA-13B
- LLaMA-7B
- Flan-T5-11B
- Flan-T5-3B
Metrics
- Accuracy
Datasets
- GSM8k
- MATH
- MMLU
- BIG-Bench Hard (BBH)
- HumanEval
- C-Eval
Benchmarks
- GSM8k
- MATH
- MMLU
- BBH
- HumanEval
- C-Eval

