Open-source, reproducible benchmark that compares 10+ LLMs on 20+ tasks and traces the path from GPT-3 to GPT-4

September 28, 2023

8 min

Overview

Production Readiness

0.8

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

2

Authors

Shen Zheng, Yuyu Zhang, Yijie Zhu, Chenguang Xi, Pengyang Gao, Xun Zhou, Kevin Chen-Chuan Chang

Why It Matters For Business

Standardized, reproducible evaluations reduce cherry-picking and reveal real capability gaps and stability risks (prompt sensitivity and seesaw regressions), so teams can pick models and tuning strategies with measurable trade-offs.

Summary TLDR

GPT-Fathom is an open-source, reproducible evaluation suite (built on OpenAI Evals) that runs 10+ popular LLMs on 20+ public benchmarks under aligned settings. The suite uses black-box evaluation and studies the GPT lineage (GPT-3 → GPT-3.5 → GPT-4), prompt sensitivity, Chain-of-Thought (CoT) effects, in-context shot ablations, and the impact of code pretraining and of supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). Key takeaways: GPT-4 shows large, broad gains over earlier models; pretraining on code correlates with better reasoning and coding; SFT/RLHF mainly help weaker base models but can incur an “alignment tax” on stronger ones; many models are highly prompt-sensitive; and CoT markedly helps reasoning tasks such as GSM8K.

Problem Statement

Existing leaderboards mix scores, settings and prompts, making comparisons unreliable. The field lacks a single, reproducible, aligned evaluation that (1) covers many capability dimensions, (2) compares legacy and modern models head-to-head, and (3) studies sensitivity to prompts, shots and decoding.

Main Contribution

An open-source, reproducible evaluation suite (GPT-Fathom), built on OpenAI Evals and released on GitHub.

Aligned, head-to-head evaluation of 10+ closed/open LLMs on 20+ benchmarks across 7 capability categories.

Retrospective analysis of OpenAI's model evolution from GPT-3 to GPT-4 and empirical tests on code pretraining, SFT/RLHF, CoT, shots and prompt sensitivity.

Identification of practical issues: seesaw capability regressions, prompt sensitivity and alignment tax from tuning.

Key Findings

GPT-4 substantially outperforms GPT-3 on many benchmarks.

Numbers: GPT-4 92.1% vs davinci (GPT-3) 12.1% on GSM8K

Pretraining on code correlates with broad capability gains, including reasoning.

Numbers: code-davinci-002 improves over davinci on multiple tasks (e.g., BBH, LAMBADA), as reported in Table 1

SFT and RLHF mainly help weaker base models and can reduce some raw benchmark scores for stronger bases (alignment tax).

Numbers: text-davinci-002/003 underperform code-davinci-002 on many benchmarks; SFT/RLHF boost pass@1 but can lower pass@100
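
The pass@1 versus pass@100 contrast above is easier to read with the metric written out. A minimal sketch, assuming the widely used unbiased pass@k estimator introduced alongside HumanEval (generate n samples per problem, count the c that pass the tests); whether GPT-Fathom computes pass@k in exactly this way is an assumption, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass the unit tests,
    k: budget of samples the user is allowed to draw.
    Returns 1 - C(n - c, k) / C(n, k), the probability that at
    least one of k samples drawn without replacement is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every k-subset contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this value over all problems gives the benchmark score. With k = 1 it reduces to the plain fraction of correct samples, which rewards a model that is reliably right on its first try; large k rewards diversity, since one correct sample among many is enough.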

Chain-of-Thought (CoT) prompting strongly helps reasoning tasks but can harm certain knowledge tasks.

Numbers: gpt-4 on GSM8K (8-shot), 92.1% with CoT vs 45.7% without CoT

Prompt template and small prompt changes can drastically change scores, especially for open-source models.

Numbers: Llama 2-70B on TriviaQA dropped from 74.0 to 55.5 after a slight template change

Some capabilities show a seesaw: model updates can improve some tasks and regress others.

Numbers: gpt-3.5-turbo-0613 improved on coding, but its MATH score dropped from 32.0 to 15.0; GPT-4 variants showed similar regressions on MGSM

A single in-context example (1-shot) usually provides most of the benefit; extra shots yield rapidly diminishing returns for strong models.

Numbers: gpt-4 on ARC-c, 1-shot 94.9 vs 25-shot 95.6 (marginal gain)
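
The shot-count and CoT ablations in the findings above come down to how the prompt is assembled. The sketch below shows one way to build a k-shot prompt with an optional reasoning trace. The `Example` class, the "Q:/A:" layout, and the "The answer is" suffix are illustrative assumptions, not the templates used by the suite.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Example:
    question: str
    rationale: str  # worked reasoning, included only when CoT is enabled
    answer: str


def build_prompt(
    examples: Sequence[Example], question: str, num_shots: int, use_cot: bool
) -> str:
    """Assemble a few-shot prompt from the first `num_shots` examples."""
    if not 0 <= num_shots <= len(examples):
        raise ValueError("num_shots must be between 0 and len(examples)")
    parts = []
    for ex in examples[:num_shots]:
        if use_cot:
            # CoT demonstration: show the reasoning before the final answer.
            demo = f"Q: {ex.question}\nA: {ex.rationale} The answer is {ex.answer}."
        else:
            # Answer-only demonstration.
            demo = f"Q: {ex.question}\nA: The answer is {ex.answer}."
        parts.append(demo)
    # The target question is left open for the model to complete.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```

Sweeping `num_shots` over a few values such as 0, 1, and 8 with `use_cot` on and off yields a small grid of prompts that can be scored against the same test items.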

Results

Accuracy (GSM8K)

Value: 92.1% (gpt-4-0314) vs 12.1% (davinci)

Baseline: davinci (GPT-3)

HumanEval pass@1

Value: 66.3% (gpt-4-0314) vs 0% (davinci)

Baseline: davinci (GPT-3)

Accuracy

Value: 83.7% (gpt-4-0314) vs 67.8% (Llama 2-70B)

Baseline: Llama 2-70B

Who Should Care

What To Try In 7 Days

Clone GPT-Fathom repo and run the provided evaluation on 5 priority tasks to place your model on the same scale.

Run prompt-template robustness tests (2–3 variants) and report the worst-case score for key tasks.

Toggle CoT on reasoning tasks (GSM8K/BBH) and compare 1-shot vs few-shot to select production prompts.
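
The worst-case reporting suggested in the steps above could look roughly like this. It is a sketch under stated assumptions: `model` is any callable that maps a prompt string to an output string, scoring is plain exact match after stripping whitespace, and `toy_model` and `TEMPLATES` are hypothetical stand-ins rather than anything shipped with GPT-Fathom.

```python
from typing import Callable, Dict, List, Tuple

Dataset = List[Tuple[str, str]]  # (question, gold answer) pairs


def accuracy(model: Callable[[str], str], template: str, dataset: Dataset) -> float:
    """Fraction of items whose stripped model output equals the gold answer."""
    correct = sum(
        model(template.format(question=q)).strip() == gold for q, gold in dataset
    )
    return correct / len(dataset)


def robustness_report(
    model: Callable[[str], str], templates: Dict[str, str], dataset: Dataset
) -> dict:
    """Score every template variant and surface the worst case."""
    scores = {name: accuracy(model, t, dataset) for name, t in templates.items()}
    worst = min(scores, key=scores.get)
    return {
        "scores": scores,
        "worst_template": worst,
        "worst_score": scores[worst],
        "spread": max(scores.values()) - scores[worst],
    }


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call: only handles prompts ending in 'Answer:'."""
    return "4" if prompt.endswith("Answer:") else "I think it is four"


# Two hypothetical template variants that differ only in their labels.
TEMPLATES = {
    "answer_suffix": "Question: {question}\nAnswer:",
    "a_suffix": "Q: {question}\nA:",
}
```

Calling `robustness_report(toy_model, TEMPLATES, [("What is 2 + 2?", "4")])` shows the pattern: the headline number to report is `worst_score`, with `spread` indicating how much the choice of template matters.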

Reproducibility

Data Urls

  • Public benchmark datasets referenced in paper (e.g., MMLU, GSM8K, HumanEval, MBPP, TriviaQA)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Answer extraction uses regular expressions and can miss valid model outputs.
  • Black-box evaluation does not use token-level likelihoods; white-box metrics are not available for closed models.
  • Some reported numbers are cited from other papers or vendor reports rather than produced by the authors' own runs.
  • Evaluation is finite: not every prompt variant, hyperparameter, or data split is covered.
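
The first limitation is easy to reproduce. In the sketch below, the pattern is a hypothetical example of a regex-based extractor, not the expression used by the suite: it expects the phrase "The answer is" followed by a number, so a correct answer phrased differently is scored as missing.

```python
import re
from typing import Optional

# Hypothetical extraction pattern: a number right after "The answer is".
ANSWER_RE = re.compile(r"The answer is (-?\d+(?:\.\d+)?)")


def extract_answer(output: str) -> Optional[str]:
    """Return the captured number, or None when the format does not match."""
    match = ANSWER_RE.search(output)
    return match.group(1) if match else None


matched = extract_answer("She has 3 + 4 = 7 apples. The answer is 7.")
missed = extract_answer("So the total is 7 apples.")  # correct, but unparsed
```

Here `matched` is `"7"` while `missed` is `None`, even though both outputs contain the right value. Logging unparsed outputs separately from wrong answers helps tell a parsing gap from a genuine model error.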

When Not To Use

  • If you need white-box likelihood comparisons or per-token scoring (requires model internals).
  • If your use case requires exhaustive stability sweeps beyond the paper's ablations.

Failure Modes

  • Prompt-template sensitivity causing large score swings in practice.
  • Sampling variance at nonzero temperature undermining reproducibility for some tasks.
  • Alignment tuning reducing raw capability (alignment tax) for some downstream tasks.
  • Answer parsing misses when models deviate from expected output formats.
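
For the sampling-variance failure mode, one simple mitigation is to repeat each evaluation several times and report the spread alongside the mean. A minimal sketch, assuming you already have one accuracy value per repeated run; the function name is illustrative.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def summarize_runs(scores: Sequence[float]) -> Dict[str, float]:
    """Summarize repeated runs of the same evaluation at nonzero temperature."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return {
        "mean": mean(scores),
        "stdev": stdev(scores),  # sample standard deviation across runs
        "min": min(scores),
        "max": max(scores),
    }
```

If the standard deviation across runs is comparable to the gap between two models, that gap should not be treated as a meaningful difference.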

Core Entities

Models

  • davinci (GPT-3)
  • davinci-instruct-beta (InstructGPT)
  • text-davinci-001
  • code-cushman-001 (Codex-12B)
  • code-davinci-002
  • text-davinci-002
  • text-davinci-003
  • gpt-3.5-turbo-0301
  • gpt-3.5-turbo-0613
  • gpt-4-0314
  • gpt-4-0613
  • gpt-4 Web-version
  • gpt-4 Advanced Data Analysis
  • PaLM 2-L
  • Claude 2
  • LLaMA-65B
  • Llama 2-70B

Metrics

  • Exact Match (EM)
  • Accuracy
  • pass@k
  • F1

Datasets

  • MMLU
  • GSM8K
  • HumanEval
  • MBPP
  • TriviaQA
  • Natural Questions
  • WebQuestions
  • ARC-e
  • ARC-c
  • RACE
  • DROP
  • MATH
  • BBH
  • LAMBADA
  • HellaSwag
  • WinoGrande
  • AGIEval
  • C-Eval
  • MGSM
  • TyDi QA
  • TruthfulQA
  • RealToxicityPrompts

Benchmarks

  • Knowledge (TriviaQA/NQ/WebQuestions/MMLU/AGIEval/ARC)
  • Reasoning (BBH/LAMBADA/HellaSwag/WinoGrande)
  • Comprehension (RACE/DROP)
  • Math (GSM8K/MATH)
  • Coding (HumanEval/MBPP)
  • Multilingual (AGIEval-ZH/C-Eval/MGSM/TyDi QA)
  • Safety (TruthfulQA/RealToxicityPrompts)