Open-source, reproducible benchmark that compares 10+ LLMs on 20+ tasks and traces the path from GPT-3 to GPT-4

Overview

Decision Snapshot: Ready For Pilot

The suite is practical and reproducible for head-to-head comparisons. Results are solid for black-box evaluation but limited by answer parsing, reliance on some externally cited numbers, and the changing behavior of web-based models.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 80%

Novelty: 40%

Authors

Shen Zheng, Yuyu Zhang, Yijie Zhu, Chenguang Xi, Pengyang Gao, Xun Zhou, Kevin Chen-Chuan Chang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Standardized, reproducible evaluations reduce cherry-picking and expose real capability gaps and stability risks (prompt sensitivity and seesaw regressions), so teams can pick models and tuning strategies with measurable trade-offs.

Who Should Care

Product Manager, ML Engineer, Founder, Engineering Lead

Summary TLDR

GPT-Fathom is an open-source, reproducible evaluation suite (built on OpenAI Evals) that runs 10+ popular LLMs on 20+ public benchmarks under aligned settings. The suite uses black-box evaluation and studies the GPT lineage (GPT-3 → GPT-3.5 → GPT-4), prompt sensitivity, Chain-of-Thought (CoT) effects, in-context shot ablations, and impacts of code pretraining and SFT/RLHF. Key takeaways: GPT-4 shows large, broad gains; pretraining on code correlates with better reasoning and coding; SFT/RLHF mainly helps weaker bases but can incur an “alignment tax”; many models are highly prompt-sensitive; CoT markedly helps reasoning tasks like GSM8K.

Problem Statement

Existing leaderboards mix scores, settings and prompts, making comparisons unreliable. The field lacks a single, reproducible, aligned evaluation that (1) covers many capability dimensions, (2) compares legacy and modern models head-to-head, and (3) studies sensitivity to prompts, shots and decoding.

Main Contribution

An open-source, reproducible evaluation suite (GPT-Fathom), built on OpenAI Evals and released on GitHub.

Aligned, head-to-head evaluation of 10+ closed/open LLMs on 20+ benchmarks across 7 capability categories.

Key Findings

GPT-4 substantially outperforms GPT-3 on many benchmarks.

Numbers: GSM8K accuracy is 92.1% for GPT-4 vs 12.1% for davinci (GPT-3)

Practical Use: Prefer GPT-4 (or the latest generation) when accuracy on reasoning and math matters; older GPT-3 models will underperform on these tasks.

Evidence Ref: Table 1 (GSM8K row)

Pretraining on code correlates with broad capability gains, including reasoning.

Numbers: code-davinci-002 improves over davinci on multiple tasks (e.g., BBH, LAMBADA), as reported in Table 1

Practical Use: Including code in pretraining can lift coding and reasoning skills; consider mixed data if those capabilities matter.

Evidence Ref: Section 4.2; Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	92.1% (gpt-4-0314) vs 12.1% (davinci)	davinci (GPT-3)	≈+80 percentage points	GSM8K, 8-shot CoT or settings in Table 1	Table 1 GSM8K row	Table 1
HumanEval pass@1	66.3% (gpt-4-0314) vs 0% (davinci)	davinci (GPT-3)	≈+66 percentage points	HumanEval, 0-shot pass@1	Table 1 HumanEval row	Table 1
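
The Delta column is a plain difference in percentage points, not a relative change. A minimal sketch of that arithmetic, using only the two rows shown above (the helper name `delta_pp` is illustrative, not part of GPT-Fathom):

```python
def delta_pp(value_pct: float, baseline_pct: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(value_pct - baseline_pct, 1)


# (model score, baseline score) pairs copied from the Results table above.
rows = {
    "GSM8K accuracy": (92.1, 12.1),
    "HumanEval pass@1": (66.3, 0.0),
}

deltas = {name: delta_pp(value, base) for name, (value, base) in rows.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f} pp")
```

Reporting percentage points avoids the misleading relative figures that appear when the baseline is at or near 0%, as with davinci on HumanEval.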

What To Try In 7 Days

Clone GPT-Fathom repo and run the provided evaluation on 5 priority tasks to place your model on the same scale.

Run prompt-template robustness tests (2–3 variants) and report the worst-case score for key tasks.

Toggle CoT on reasoning tasks (GSM8K/BBH) and compare 1-shot vs few-shot to select production prompts.
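
The template, CoT, and shot-count checks above can be combined into one small sweep. The sketch below is an illustration under stated assumptions, not GPT-Fathom's own harness: the templates, the CoT suffix, and the `score_fn` callback (which would run a real evaluation and return a score) are all hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical prompt templates; "{question}" is filled in per example.
TEMPLATES: List[str] = [
    "Question: {question}\nAnswer:",
    "Q: {question}\nA:",
    "Solve the problem.\n{question}\nFinal answer:",
]
COT_SUFFIX = " Let's think step by step."  # assumed CoT trigger phrase

Demo = Tuple[str, str, str]  # (question, rationale, answer)
Setting = Tuple[int, int, bool]  # (template index, shots, CoT on/off)


def build_prompt(template: str, question: str, demos: Sequence[Demo],
                 n_shots: int, use_cot: bool) -> str:
    """Prepend n_shots demonstrations, then append the target question."""
    parts = []
    for demo_q, rationale, demo_a in demos[:n_shots]:
        if use_cot:
            completion = f" {rationale} The answer is {demo_a}."
        else:
            completion = f" {demo_a}"
        parts.append(template.format(question=demo_q) + completion)
    target = template.format(question=question)
    if use_cot:
        target += COT_SUFFIX
    parts.append(target)
    return "\n\n".join(parts)


def sweep(score_fn: Callable[[str, int, bool], float],
          shots: Sequence[int] = (1, 8)) -> Dict[Setting, float]:
    """Score every (template, shots, CoT) combination via the caller's score_fn."""
    results: Dict[Setting, float] = {}
    for (t_idx, template), n, cot in product(enumerate(TEMPLATES), shots, (False, True)):
        results[(t_idx, n, cot)] = score_fn(template, n, cot)
    return results


def worst_case(results: Dict[Setting, float]) -> Dict[Tuple[int, bool], float]:
    """For each (shots, CoT) setting, keep the lowest score across templates."""
    worst: Dict[Tuple[int, bool], float] = {}
    for (_, n, cot), score in results.items():
        key = (n, cot)
        worst[key] = min(score, worst.get(key, float("inf")))
    return worst
```

In use, `score_fn` would call `build_prompt` for each test item, query the model, and return accuracy. Picking the production prompt from `worst_case` rather than from the best template gives a more conservative estimate.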

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/GPT-Fathom/GPT-Fathom

https://github.com/openai/evals

Data URLs

Public benchmark datasets referenced in paper (e.g., MMLU, GSM8K, HumanEval, MBPP, TriviaQA)

Risks & Boundaries

Limitations

Answer extraction uses regular expressions and can miss valid model outputs.

Black-box evaluation does not use token-level likelihoods; white-box metrics are not available for closed models.
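
To illustrate the first limitation, here is a minimal sketch of regex-based answer extraction for math-style outputs. The patterns are assumptions chosen for this example and are not the expressions GPT-Fathom uses; the point is that a strict pattern silently drops a correct answer phrased differently, while a lenient fallback recovers it at the risk of picking up the wrong number.

```python
import re
from typing import Optional

# Assumed strict pattern: completions that end with "The answer is <number>".
STRICT = re.compile(r"The answer is\s*\$?(-?[\d,]*\.?\d+)")
# Assumed fallback: any number, of which the last one in the text is taken.
ANY_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_strict(completion: str) -> Optional[str]:
    """Return the number after 'The answer is', or None if the phrase is absent."""
    match = STRICT.search(completion)
    return match.group(1).replace(",", "") if match else None


def extract_lenient(completion: str) -> Optional[str]:
    """Try the strict pattern first, then fall back to the last number seen."""
    strict = extract_strict(completion)
    if strict is not None:
        return strict
    numbers = ANY_NUMBER.findall(completion)
    return numbers[-1].replace(",", "") if numbers else None


well_formed = "She sells 1,250 eggs in total. The answer is 1,250."
free_form = "After buying lunch, she has 18 dollars left."

print(extract_strict(well_formed))   # strict pattern matches
print(extract_strict(free_form))     # strict pattern misses a valid answer
print(extract_lenient(free_form))    # fallback recovers it
```

When auditing scores, counting how often the strict extractor returns `None` gives a quick estimate of how much accuracy is lost to parsing rather than to the model.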

When Not To Use

If you need white-box likelihood comparisons or per-token scoring (requires model internals).

If your use case requires exhaustive stability sweeps beyond the paper's ablations.

Failure Modes

Prompt-template sensitivity causing large score swings in practice.

Sampling variance at nonzero temperature undermining reproducibility for some tasks.
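
One way to gauge the sampling-variance failure mode is to repeat a run several times and report the spread alongside the mean. The sketch below is only an illustration: `simulated_accuracy` is a hypothetical stand-in that draws random outcomes, where a real check would re-run the model at the same nonzero temperature.

```python
import random
import statistics
from typing import Dict, List


def summarize_runs(scores: List[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, and range across repeated runs."""
    stdev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return {
        "mean": statistics.mean(scores),
        "stdev": stdev,
        "min": min(scores),
        "max": max(scores),
    }


def simulated_accuracy(seed: int, n_items: int = 200, p_correct: float = 0.7) -> float:
    """Hypothetical stand-in for one evaluation run; not a real model call."""
    rng = random.Random(seed)
    correct = sum(rng.random() < p_correct for _ in range(n_items))
    return correct / n_items


scores = [simulated_accuracy(seed) for seed in range(5)]
summary = summarize_runs(scores)
print({key: round(value, 3) for key, value in summary.items()})
```

If the gap between two models is smaller than the spread across repeated runs, the comparison is not conclusive at that temperature.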

Core Entities

Models

davinci (GPT-3), davinci-instruct-beta (InstructGPT), text-davinci-001, code-cushman-001 (Codex-12B), code-davinci-002, text-davinci-002, text-davinci-003, gpt-3.5-turbo-0301, gpt-3.5-turbo-0613, gpt-4-0314, gpt-4-0613, gpt-4 Web-version, gpt-4 Advanced Data Analysis, PaLM 2-L, Claude 2, LLaMA-65B, Llama 2-70B

Metrics

Exact Match (EM), Accuracy, pass@k, F1
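
For readers less familiar with these metrics, the sketch below gives common definitions. The text normalization (lowercasing and whitespace splitting) is an assumption and may differ from what each benchmark prescribes; `pass_at_k` follows the unbiased estimator popularized with HumanEval, 1 - C(n-c, k) / C(n, k), for n samples of which c pass.

```python
import math
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n generated samples with c passing."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


print(exact_match(" Paris ", "paris"))
print(round(token_f1("the cat sat", "the cat"), 3))
print(round(pass_at_k(n=10, c=3, k=1), 3))
```

With k = 1 the estimator reduces to the fraction of passing samples, which is the pass@1 figure quoted in the Results table.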

Datasets

MMLU, GSM8K, HumanEval, MBPP, TriviaQA, Natural Questions, WebQuestions, ARC-e, ARC-c, RACE, DROP, MATH, BBH, LAMBADA, HellaSwag, WinoGrande, AGIEval, C-Eval, MGSM, TyDi QA, TruthfulQA, RealToxicityPrompts

Benchmarks

Knowledge (TriviaQA/NQ/WebQuestions/MMLU/AGIEval/ARC); Reasoning (BBH/LAMBADA/HellaSwag/WinoGrande); Comprehension (RACE/DROP); Math (GSM8K/MATH); Coding (HumanEval/MBPP); Multilingual (AGIEval-ZH/C-Eval/MGSM/TyDi QA); Safety (TruthfulQA/RealToxicityPrompts)

