Overview
Production Readiness
0.8
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
2
Why It Matters For Business
Standardized, reproducible evaluations reduce cherry-picking and reveal real capability gaps and stability risks (prompt sensitivity and seesaw regressions), so teams can choose models and tuning strategies based on measurable trade-offs.
Summary TLDR
GPT-Fathom is an open-source, reproducible evaluation suite (built on OpenAI Evals) that runs 10+ popular LLMs on 20+ public benchmarks under aligned settings. The suite uses black-box evaluation and studies the GPT lineage (GPT-3 → GPT-3.5 → GPT-4), prompt sensitivity, Chain-of-Thought (CoT) effects, in-context shot ablations, and impacts of code pretraining and SFT/RLHF. Key takeaways: GPT-4 shows large, broad gains; pretraining on code correlates with better reasoning and coding; SFT/RLHF mainly helps weaker bases but can incur an “alignment tax”; many models are highly prompt-sensitive; CoT markedly helps reasoning tasks like GSM8K.
Problem Statement
Existing leaderboards mix scores obtained under different settings and prompts, making comparisons unreliable. The field lacks a single, reproducible, aligned evaluation that (1) covers many capability dimensions, (2) compares legacy and modern models head-to-head, and (3) studies sensitivity to prompts, shots and decoding.
Main Contribution
An open-source, reproducible evaluation suite (GPT-Fathom), built on OpenAI Evals and released on GitHub.
Aligned, head-to-head evaluation of 10+ closed/open LLMs on 20+ benchmarks across 7 capability categories.
Retrospective analysis of OpenAI's model evolution from GPT-3 to GPT-4 and empirical tests on code pretraining, SFT/RLHF, CoT, shots and prompt sensitivity.
Identification of practical issues: seesaw capability regressions, prompt sensitivity and alignment tax from tuning.
Key Findings
GPT-4 substantially outperforms GPT-3 on many benchmarks.
Pretraining on code correlates with broad capability gains, including reasoning.
SFT and RLHF mainly help weaker base models and can reduce some raw benchmark scores for stronger bases (alignment tax).
Chain-of-Thought (CoT) prompting strongly helps reasoning tasks but can harm certain knowledge tasks.
Changing the prompt template, or making even small prompt edits, can drastically change scores, especially for open-source models.
Some capabilities show a seesaw: model updates can improve some tasks and regress others.
1-shot in-context examples usually provide most of the benefit; extra shots yield rapidly diminishing returns for strong models.
Results
Accuracy
HumanEval pass@1
Accuracy
Who Should Care
What To Try In 7 Days
Clone GPT-Fathom repo and run the provided evaluation on 5 priority tasks to place your model on the same scale.
Run prompt-template robustness tests (2–3 variants) and report the worst-case score for key tasks.
Toggle CoT on reasoning tasks (GSM8K/BBH) and compare 1-shot vs few-shot to select production prompts.
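The template-robustness step above can be sketched as follows. This is a minimal illustration, not code from the GPT-Fathom repo: the template strings, the `model` callable (any function mapping a prompt to a text completion), and the exact-match scoring are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (question, gold answer)

# Hypothetical prompt templates; replace with the variants you actually use.
TEMPLATES: Dict[str, str] = {
    "plain": "Q: {question}\nA:",
    "instruct": "Answer the following question.\nQuestion: {question}\nAnswer:",
    "terse": "{question}\n",
}


def accuracy(model: Callable[[str], str], template: str,
             examples: List[Example]) -> float:
    """Exact-match accuracy of `model` under a single prompt template."""
    hits = sum(
        model(template.format(question=q)).strip() == gold
        for q, gold in examples
    )
    return hits / len(examples)


def template_robustness(model: Callable[[str], str],
                        templates: Dict[str, str],
                        examples: List[Example]) -> Dict[str, object]:
    """Score every template and report the worst case and the spread."""
    scores = {name: accuracy(model, tpl, examples)
              for name, tpl in templates.items()}
    worst_name = min(scores, key=scores.get)
    return {
        "scores": scores,
        "worst_template": worst_name,
        "worst_score": scores[worst_name],
        "spread": max(scores.values()) - scores[worst_name],
    }
```

Reporting `worst_score` alongside the best score makes prompt sensitivity visible instead of hiding it behind a single tuned template. The same loop can compare a CoT template against a direct-answer template on GSM8K or BBH.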
Reproducibility
Data Urls
- Public benchmark datasets referenced in paper (e.g., MMLU, GSM8K, HumanEval, MBPP, TriviaQA)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Answer extraction uses regular expressions and can miss valid model outputs.
- Black-box evaluation does not use token-level likelihoods; white-box metrics are not available for closed models.
- Some reported numbers are cited from other papers or vendor reports rather than from the authors' own runs.
- Evaluation is finite: not every prompt variant, hyperparameter, or data split is covered.
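To illustrate the first limitation, here is a small sketch of regex-based answer extraction for a GSM8K-style numeric answer. The patterns are assumptions made for this example and are not the expressions used in GPT-Fathom; the point is only that a strict pattern returns nothing when the model phrases a correct answer differently.

```python
import re
from typing import Optional

# Assumed convention: the model ends with "The answer is <number>".
ANSWER_RE = re.compile(r"[Tt]he answer is\s*\$?(-?[\d,]*\.?\d+)")
# Fallback: any number in the text (thousands separators allowed).
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_strict(output: str) -> Optional[str]:
    """Return the number after 'The answer is', or None if the phrase is absent."""
    match = ANSWER_RE.search(output)
    return match.group(1).replace(",", "") if match else None


def extract_with_fallback(output: str) -> Optional[str]:
    """Try the strict pattern first, then fall back to the last number seen."""
    strict = extract_strict(output)
    if strict is not None:
        return strict
    numbers = NUMBER_RE.findall(output)
    return numbers[-1].replace(",", "") if numbers else None
```

The fallback recovers answers such as "So she makes $18 per day." that the strict pattern misses, but it can also pick up an unrelated trailing number, so it trades missed answers for possible false matches. Logging the outputs where the two extractors disagree is one way to estimate how much parsing affects a reported score.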
When Not To Use
- If you need white-box likelihood comparisons or per-token scoring (requires model internals).
- If your use case requires exhaustive stability sweeps beyond the paper's ablations.
Failure Modes
- Prompt-template sensitivity causing large score swings in practice.
- Sampling variance at nonzero temperature undermining reproducibility for some tasks.
- Alignment tuning reducing raw capability (alignment tax) for some downstream tasks.
- Answer parsing misses when models deviate from expected output formats.
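One way to make the sampling-variance failure mode measurable is to repeat an evaluation several times and report the spread rather than a single number. The sketch below assumes a hypothetical `run_eval(seed)` function that returns one benchmark score per run; it is not part of GPT-Fathom.

```python
import statistics
from typing import Callable, Dict, Sequence


def repeated_eval(run_eval: Callable[[int], float],
                  seeds: Sequence[int]) -> Dict[str, float]:
    """Run the same evaluation once per seed and summarize the scores."""
    scores = [run_eval(seed) for seed in seeds]
    return {
        "mean": statistics.mean(scores),
        # Sample standard deviation needs at least two runs.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }
```

If the standard deviation across runs is comparable to the gap between two models, that gap should not be treated as a real difference. Hosted APIs may not expose a seed at all, in which case `seed` is just a run index.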
Core Entities
Models
- davinci (GPT-3)
- davinci-instruct-beta (InstructGPT)
- text-davinci-001
- code-cushman-001 (Codex-12B)
- code-davinci-002
- text-davinci-002
- text-davinci-003
- gpt-3.5-turbo-0301
- gpt-3.5-turbo-0613
- gpt-4-0314
- gpt-4-0613
- gpt-4 Web-version
- gpt-4 Advanced Data Analysis
- PaLM 2-L
- Claude 2
- LLaMA-65B
- Llama 2-70B
Metrics
- Exact Match (EM)
- Accuracy
- pass@k
- F1
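For reference, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c samples that pass the unit tests, and estimate the probability that at least one of k randomly drawn samples passes. The sketch below implements that estimator for a single problem; whether a given GPT-Fathom run uses n > 1 samples is not stated here, so treat the sampling setup as an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: budget being scored.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level score is the mean of this value over all problems. With n = k = 1 it reduces to the fraction of problems solved by a single sample, which is what a pass@1 number from one greedy completion per problem reports.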
Datasets
- MMLU
- GSM8K
- HumanEval
- MBPP
- TriviaQA
- Natural Questions
- WebQuestions
- ARC-e
- ARC-c
- RACE
- DROP
- MATH
- BBH
- LAMBADA
- HellaSwag
- WinoGrande
- AGIEval
- C-Eval
- MGSM
- TyDi QA
- TruthfulQA
- RealToxicityPrompts
Benchmarks
- Knowledge (TriviaQA/NQ/WebQuestions/MMLU/AGIEval/ARC)
- Reasoning (BBH/LAMBADA/HellaSwag/WinoGrande)
- Comprehension (RACE/DROP)
- Math (GSM8K/MATH)
- Coding (HumanEval/MBPP)
- Multilingual (AGIEval-ZH/C-Eval/MGSM/TyDi QA)
- Safety (TruthfulQA/RealToxicityPrompts)

