Overview
The suite is practical and reproducible for head-to-head comparisons. Results are solid for black-box evaluation but limited by prompt parsing, reliance on some numbers cited from external sources, and the changing behavior of web-based models.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 7/7
Findings with evidence refs: 7/7
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 80%
Novelty: 40%
Why It Matters For Business
Standardized, reproducible evaluations reduce cherry-picking and reveal real capability gaps and stability risks (prompt sensitivity and seesaw regressions), so teams can pick models and tuning strategies with measurable trade-offs.
Summary TLDR
GPT-Fathom is an open-source, reproducible evaluation suite (built on OpenAI Evals) that runs 10+ popular LLMs on 20+ public benchmarks under aligned settings. The suite uses black-box evaluation and studies the GPT lineage (GPT-3 → GPT-3.5 → GPT-4), prompt sensitivity, Chain-of-Thought (CoT) effects, in-context shot ablations, and impacts of code pretraining and SFT/RLHF. Key takeaways: GPT-4 shows large, broad gains; pretraining on code correlates with better reasoning and coding; SFT/RLHF mainly helps weaker bases but can incur an “alignment tax”; many models are highly prompt-sensitive; CoT markedly helps reasoning tasks like GSM8K.
Problem Statement
Existing leaderboards mix scores, settings and prompts, making comparisons unreliable. The field lacks a single, reproducible, aligned evaluation that (1) covers many capability dimensions, (2) compares legacy and modern models head-to-head, and (3) studies sensitivity to prompts, shots and decoding.
Main Contribution
An open-source, reproducible evaluation suite (GPT-Fathom) built on OpenAI Evals and released on GitHub.
Aligned, head-to-head evaluation of 10+ closed/open LLMs on 20+ benchmarks across 7 capability categories.
Key Findings
GPT-4 substantially outperforms GPT-3 on many benchmarks.
Pretraining on code correlates with broad capability gains, including reasoning.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 92.1% (gpt-4-0314) vs 12.1% (davinci) | davinci (GPT-3) | +80.0 percentage points | GSM8K, 8-shot CoT or settings in Table 1 | Table 1 GSM8K row | Table 1 |
| HumanEval pass@1 | 66.3% (gpt-4-0314) vs 0% (davinci) | davinci (GPT-3) | +66.3 percentage points | HumanEval, 0-shot pass@1 | Table 1 HumanEval row | Table 1 |
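
The deltas in the table are absolute differences in percentage points, not relative improvements. A minimal sketch of that arithmetic, using only the scores quoted in the rows above (the helper name `delta_pp` and the dictionary layout are illustrative assumptions, not part of GPT-Fathom):

```python
def delta_pp(score: float, baseline: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(score - baseline, 1)


# (model score, baseline score) pairs copied from the Results table.
results = {
    "GSM8K accuracy": (92.1, 12.1),      # gpt-4-0314 vs davinci
    "HumanEval pass@1": (66.3, 0.0),     # gpt-4-0314 vs davinci
}

deltas = {name: delta_pp(new, base) for name, (new, base) in results.items()}

for name, value in deltas.items():
    print(f"{name}: +{value} percentage points")
```

A relative improvement would be undefined for the HumanEval row because the baseline is 0%, which is one reason to report the absolute gap.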
What To Try In 7 Days
Clone the GPT-Fathom repo and run the provided evaluation on 5 priority tasks to place your model on the same scale.
Run prompt-template robustness tests (2–3 variants) and report the worst-case score for key tasks.
Toggle CoT on reasoning tasks (GSM8K/BBH) and compare 1-shot vs few-shot to select production prompts.
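
The prompt-robustness step above can be sketched as a small reporting helper. This is only an illustration: the function name, the template labels, and the scores are hypothetical placeholders, and the scores would in practice come from running the same task once per prompt variant.

```python
from typing import Dict


def summarize_prompt_robustness(scores_by_template: Dict[str, float]) -> Dict[str, object]:
    """Report worst-case, best-case, and spread across prompt templates."""
    if not scores_by_template:
        raise ValueError("need at least one prompt template score")
    worst_name = min(scores_by_template, key=scores_by_template.get)
    best_name = max(scores_by_template, key=scores_by_template.get)
    worst = scores_by_template[worst_name]
    best = scores_by_template[best_name]
    return {
        "worst_template": worst_name,
        "worst_score": worst,
        "best_template": best_name,
        "best_score": best,
        "spread": round(best - worst, 1),
    }


# Hypothetical accuracy (%) for three prompt variants of one task.
example_scores = {"template_a": 71.0, "template_b": 64.5, "template_c": 69.0}
report = summarize_prompt_robustness(example_scores)
print(report)
```

Reporting the worst-case score next to the spread makes prompt sensitivity visible instead of hiding it behind the best template.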
Risks & Boundaries
Limitations
Answer extraction uses regular expressions and can miss valid model outputs.
Black-box evaluation does not use token-level likelihoods; white-box metrics are not available for closed models.
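
To illustrate the first limitation, here is a minimal sketch of regex-based answer extraction and how it can miss a valid answer. The pattern below is a hypothetical example written for this note, not the expression GPT-Fathom actually uses.

```python
import re
from typing import Optional

# Hypothetical pattern: only matches completions that use the exact
# phrase "The answer is <number>".
ANSWER_RE = re.compile(r"The answer is\s*\$?(-?\d[\d,]*(?:\.\d+)?)")


def extract_answer(completion: str) -> Optional[str]:
    """Return the extracted numeric string, or None if the pattern misses."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return None
    return match.group(1).replace(",", "")


# Matches: the completion follows the expected phrasing.
parsed = extract_answer("6 * 3 = 18. The answer is 18.")

# Misses: the answer is correct but phrased differently, so it would be
# scored as wrong even though the model output is valid.
missed = extract_answer("6 * 3 = 18, so the total is 18.")

print(parsed, missed)
```

Any fixed pattern trades recall for precision, so it may be worth spot-checking completions where extraction returns nothing.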
When Not To Use
If you need white-box likelihood comparisons or per-token scoring (requires model internals).
If your use case requires exhaustive stability sweeps beyond the paper's ablations.
Failure Modes
Prompt-template sensitivity causing large score swings in practice.
Sampling variance at nonzero temperature undermining reproducibility for some tasks.
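
One way to expose the sampling-variance failure mode is to repeat a run several times at the same nonzero temperature and report the spread, not just a single number. A minimal sketch follows; the function name and the example scores are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def summarize_runs(scores: Sequence[float]) -> Dict[str, float]:
    """Summarize repeated runs of the same task under identical settings."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return {
        "mean": mean(scores),
        "stdev": stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }


# Hypothetical accuracy (%) from three repeated runs of one task.
run_scores = [60.0, 62.0, 64.0]
summary = summarize_runs(run_scores)
print(summary)
```

If the standard deviation is comparable to the gap between two models, a single-run comparison between them is not conclusive.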

