Benchmark Leakage Papers — Parsed & Scored for Practitioners

GPT-4 exceeds USMLE pass threshold and outperforms prior models on medical benchmarks

0.30

0.60

0.70

497

GPT-4 can reliably answer medical multiple-choice questions and gives better-calibrated confidence scores than earlier models, making it useful for education, drafting clinical notes, and decision-support prototypes, provided its outputs are validated under human oversight.

Key finding

GPT-4 strongly outperforms GPT-3.5 on USMLE-style multiple-choice tests.

Numbers: USMLE Self Assessment overall: GPT-4 83.76% (zero-shot) vs GPT-3.5 49.1%

Survey of 23 LLM benchmarks finds widespread blind spots; recommends behavioral profiling and audits

0.40

0.60

42

Benchmark scores can mislead product decisions if they reflect memorization, prompt sensitivity, or English-only tests; firms should test models under realistic prompts, languages, and safety scenarios.

Key finding

Response variability breaks standardized tests

Numbers: 22/23 benchmarks showed sensitivity

ChatGPT/GPT‑4 can directly rank search passages with simple prompts; distilled small models inherit that power.

0.65

0.55

0.60

23

LLMs can directly re-rank search results zero‑shot and produce supervisory labels to train small, cheaper re‑rankers; this can cut inference cost and maintenance versus training large supervised re‑rankers on noisy labels.

Key finding

GPT‑4 outperforms strong supervised re‑rankers on standard benchmarks when using permutation prompts.

Numbers: nDCG@10: GPT‑4 53.68 vs monoT5 (3B) 51.36 on BEIR (avg), delta +2.32
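
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how nDCG@10 can be computed for a single query. It assumes linear gain (some toolkits use 2^rel - 1 instead) and takes the ideal ranking from the same list of labels, whereas a full evaluation would use every judged passage for the query. The function names and the toy labels are illustrative, not taken from the paper.

```python
import math
from typing import Sequence

def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k ranked items (linear gain, assumed)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the best possible ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance labels, in the order a re-ranker returned the passages.
ranked_labels = [3, 2, 0, 1]
score = ndcg_at_k(ranked_labels)
```

A benchmark-level figure such as the BEIR average is then a mean of per-query scores within each dataset, averaged again across datasets.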

Live, contamination-aware benchmark for code LLMs that tests generation, repair, execution, and test-output prediction

0.70

0.65

0.55

22

LiveCodeBench reveals real gaps between closed and open models and the presence of training-set leakage; use it to benchmark models on realistic, recent contest problems and avoid inflated performance claims from contaminated or small benchmarks.

Key finding

Some models show clear contamination: DeepSeek and GPT-4-O performance drops on problems released after their stated cutoff dates.

Numbers: DS-Base-33B: Pass@1 ~60 (May) → ~0 (Sep) on LeetCode
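
The contamination signal described above comes from comparing scores on problems released before and after a model's cutoff date. A rough sketch of that comparison follows; the `Result` record, the cutoff date, and the toy data are all hypothetical and are not LiveCodeBench's own code or results.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple

@dataclass
class Result:
    released: date  # date the problem was published
    passed: bool    # whether the model's first sample passed all tests

def pass_at_1(results: List[Result]) -> Optional[float]:
    """Fraction of problems solved on the first attempt; None for an empty list."""
    return sum(r.passed for r in results) / len(results) if results else None

def split_by_cutoff(results: List[Result], cutoff: date) -> Tuple[Optional[float], Optional[float]]:
    """Pass@1 on problems released up to the cutoff vs. strictly after it."""
    before = [r for r in results if r.released <= cutoff]
    after = [r for r in results if r.released > cutoff]
    return pass_at_1(before), pass_at_1(after)

# Made-up toy records for illustration only.
results = [
    Result(date(2023, 5, 10), True),
    Result(date(2023, 6, 2), True),
    Result(date(2023, 9, 14), False),
    Result(date(2023, 10, 1), True),
]
before, after = split_by_cutoff(results, cutoff=date(2023, 8, 31))
```

A large gap between the two numbers is only suggestive: later problems may also simply be harder, so the comparison is most informative when several models are evaluated on the same time windows.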

A low-cost, practical method that finds whether LLMs memorized evaluation datasets

0.80

0.60

0.70

22

If an LLM already saw your test data, reported performance is not a real measure of capability; this cheap detection method helps teams vet benchmarks and avoid overclaiming model quality.

Key finding

Guided instruction + GPT-4 few-shot classifier (Algorithm 2) matches human labels nearly perfectly.

Numbers: GPT-4 14/14 (100%); GPT-3.5 13/14 (92.86%) on 14 partitions

LawBench: a 20-task Chinese legal benchmark measuring memorization, understanding, and application across 51 LLMs

0.30

0.35

0.40

19

LawBench shows that even top LLMs are unreliable for legal judgments; businesses should treat model outputs as draft assistance, not legal advice, and validate with experts.

Key finding

GPT-4 is the best model on LawBench but far from perfect

Numbers: GPT-4 average zero-shot 52.35 (Table 26)

LiveBench: a live, ground-truth-scored benchmark that resists test-set contamination

0.70

0.60

0.70

18

A frequently-updated, ground-truth-scored benchmark prevents inflated claims from contaminated test data and shows real capability gaps—use it to validate model improvements and guard against overfitting to public test sets.

Key finding

Even the top models remain well below saturation on LiveBench.

Numbers: Top LiveBench score 64.7% (o1-preview-2024-09-12).

GAOKAO-Bench: using China’s college exam (2010–2022) to test LLMs on real exam questions

0.60

0.50

0.60

18

GAOKAO-Bench exposes realistic task gaps: LLMs are good at knowledge and language tasks but weaker at multi-step math and physics. Use this to choose models, design human-in-the-loop checks, and pilot automated grading.

Key finding

GPT-4 attains strong exam performance but below full marks.

Numbers: Converted totals: sciences 434, humanities 480 (GPT-4-0613).

Benchmark leakage can make small LLMs look much stronger — avoid training on test or prompt data

0.50

0.40

0.60

16

Contaminated training data can make models look better on paper but worse in real tasks; check overlap and report contamination to avoid bad product decisions.

Key finding

Leaking training & test data greatly inflates benchmark scores.

Numbers: phi-1.5 MMLU: 42.87 → 75.05 after full leak (Table 1)
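
One common way to "check overlap", as the summary above recommends, is to measure how many word n-grams of a test example also appear in the training text. The sketch below is a simple version of that idea, not the paper's procedure; whitespace tokenization, lowercasing, and the default window of 8 tokens are all assumptions.

```python
from typing import Set, Tuple

def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams; empty if the text has fewer than n tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(test_example: str, training_text: str, n: int = 8) -> float:
    """Share of the test example's n-grams that also occur in the training text."""
    test_grams = ngrams(test_example, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(training_text, n)) / len(test_grams)

# Toy strings for illustration; real checks run against a corpus index.
ratio = overlap_ratio("the quick brown fox jumps", "a the quick brown fox sat", n=3)
```

Exact n-gram matching misses paraphrased or translated leaks, so a low ratio does not rule out contamination.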

Systematic review shows GPT-3.5/GPT-4 were exposed to ~4.7M benchmark examples and many evaluations are unfair or unreproducible

0.40

0.30

0.60

16

Benchmark contamination can make closed-source LLMs appear artificially better. Buyers and product teams should not trust out-of-the-box leaderboard claims for closed models without checking data provenance and evaluation parity.

Key finding

Many published evaluations leaked data to OpenAI via the web interface.

Numbers: 90 papers (≈42% of relevant papers) used browser access that could be used to improve models

An open math-specialized LLM (7B & 34B) that improves math problem solving and formal proving

0.60

0.70

0.60

13

LLEMMA gives stronger math and formal-proving ability than other open base models while being fully open-source, enabling companies to build reproducible math tools and lower development cost for math-heavy applications.

Key finding

Continued pretraining on Proof-Pile-2 improves few-shot math reasoning.

Numbers: GSM8k few-shot: LLEMMA-34B 51.5% vs Code Llama-34B 29.6% (+21.9 pp)

Measure LLM behavior without labels by testing how outputs change under simple text edits

0.60

0.50

12

You can monitor key model behaviors on your own live or private data without building labeled test sets, enabling faster, cheaper, and continuously updated audits of knowledge, toxicity, and robustness.

Key finding

A negation-based "Sensitivity Score" closely tracks TriviaQA accuracy across many models.

Numbers: 1000-example sensitivity, std error < 0.002; plotted sqrt-like fit vs TriviaQA

A fast finetuning recipe that makes a large LLM 'forget' Harry Potter while keeping general skills

0.40

0.70

12

You can remove copyrighted or sensitive text from a large LLM with a short, targeted finetune instead of full retraining, cutting compute from hundreds of thousands of GPU-hours to minutes–hours for the targeted edit.

Key finding

The method dramatically reduces model 'familiarity' with Harry Potter as measured by completion-based tests.

Numbers: Familiarity (completion): 0.29 → 0.007 after ~120 finetuning steps

Membership inference mostly fails on pretrained LLMs; apparent successes often come from dataset shifts

0.40

0.60

0.40

10

Most standard membership inference tests will not show large privacy leakage for models pre-trained at scale, but careless benchmark choices (e.g., temporally shifted non-members) can falsely signal leakage.

Key finding

Existing MIAs mostly fail against pre-trained LLMs.

Numbers: Most AUC ROC < 0.6 across domains (Table 1).
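
To put the AUC figure above in context: AUC ROC is the probability that a randomly chosen training member receives a higher attack score than a randomly chosen non-member, so 0.5 means the attack is no better than chance. A small pairwise sketch, assuming both score lists are non-empty and that higher scores mean "more likely a member" (the scores shown are made up):

```python
from typing import Sequence

def auc_roc(member_scores: Sequence[float], nonmember_scores: Sequence[float]) -> float:
    """Probability that a member outscores a non-member; ties count as half a win."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

# Hypothetical attack scores for illustration only.
auc = auc_roc([0.6, 0.4], [0.5, 0.3])
```

The pairwise loop is quadratic and is meant only to make the definition concrete; a library routine is the better choice for large evaluations.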

LLMs are powerful text engines but lack the grounded action and world models needed for true AGI

0.40

0.45

0.40

10

LLMs are strong language tools but not reliable autonomous reasoners; businesses should treat them as assistants, validate critical outputs, and invest in grounded data and robust evaluation before automating decisions.

Key finding

LLMs score highly on many language-style standardized exams but fall behind on reasoning-heavy subjects.

Numbers: GPT-4: GRE Verbal ~169/170 (~99th), SAT Math ~700/800 (~89th); poor on Gaokao/JEE (see AGIEval/JEEBench)

GPT-4 is promising for dementia screening but does not yet beat the best traditional models

0.30

0.50

0.40

9

LLMs like GPT-4 can speed prototype building (no training set needed) and give readable explanations, but they don't yet match tuned clinical models; deploy cautiously and use hybrids.

Key finding

GPT-4 does not beat the best supervised model (RRL) on evaluated datasets.

Numbers: ADNI: GPT-4 0.820 vs RRL 0.852; PUMCH-T few-shot: GPT-4 0.632 vs RRL 0.763

Xiezhi: 249k-question, auto-updating benchmark across 516 disciplines with a 50-option evaluation protocol

0.70

0.60

0.30

9

Xiezhi gives a broad, hard-to-game way to measure domain knowledge across many fields; it helps product and engineering teams spot domain blind spots and track small improvements in LLMs over time.

Key finding

Xiezhi is very large and multi-disciplinary.

Numbers: 249,587 questions; 516 disciplines; 13 categories

Mask-and-retrieve tests show many benchmarks can leak into LLM training

0.50

0.60

8

If test examples leak into model training, reported model gains may be inflated; flagging and removing leaked examples preserves honest evaluation and prevents bad product decisions.

Key finding

Closed-source LLMs often reproduce masked wrong options in MMLU.

Numbers: ChatGPT EM 52%, GPT-4 EM 57% on MMLU (Table 3)
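
The masked-option test behind these exact-match (EM) figures can be sketched roughly as follows: hide one wrong answer option, ask the model to fill it in, and count verbatim reproductions. This is an illustrative reconstruction rather than the paper's code; the prompt wording, the item layout, and the `complete` callable standing in for a model API are all assumptions.

```python
from typing import Callable, Dict, List

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing strings."""
    return " ".join(text.lower().split())

def build_prompt(question: str, options: List[str], masked_index: int) -> str:
    """Render the question with one option replaced by [MASK]."""
    lines = [question]
    for i, option in enumerate(options):
        label = chr(ord("A") + i)
        lines.append(f"{label}. {'[MASK]' if i == masked_index else option}")
    lines.append("Fill in the [MASK] option:")
    return "\n".join(lines)

def exact_match_rate(items: List[Dict], complete: Callable[[str], str]) -> float:
    """Share of items where the model reproduces the hidden option exactly."""
    hits = 0
    for item in items:
        prompt = build_prompt(item["question"], item["options"], item["masked_index"])
        target = item["options"][item["masked_index"]]
        hits += normalize(complete(prompt)) == normalize(target)
    return hits / len(items)

# Toy items and a stub "model" that always answers "Paris"; both are hypothetical.
items = [
    {"question": "Capital of Germany?", "options": ["Berlin", "Paris", "Rome", "Madrid"], "masked_index": 1},
    {"question": "Capital of Italy?", "options": ["Rome", "Lyon", "Bonn", "Oslo"], "masked_index": 1},
]
rate = exact_match_rate(items, lambda prompt: "Paris ")
```

A wrong option is hard to guess from the question alone, which is why a high exact-match rate on masked distractors points toward memorization.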

Measure and report when LLMs have seen benchmark data to avoid invalid NLP claims

0.50

0.60

0.70

6

If model evaluation is contaminated, product decisions and vendor comparisons can be wrong; verify exposure to benchmarks before basing choices on published scores.

Key finding

Contamination inflates evaluated model performance and can lead to wrong scientific conclusions.

SecQA: a compact multiple-choice benchmark to test LLM knowledge of computer security

0.40

0.20

6

SecQA gives a quick, domain-specific check of LLM security knowledge. Use it to benchmark models before deploying them on security tasks and to spot when open models need domain tuning or retrieval augmentation.

Key finding

GPT-3.5-Turbo and GPT-4 achieve near-perfect accuracy on SecQA v1 and very high on v2.

Numbers: SecQA v1: GPT-3.5 99.1% (0-shot and 5-shot); GPT-4 99.1% (0-shot) / 100% (5-shot). SecQA v2: GPT-3.5 98.0%, GPT-4 98.0%

Survey of how benchmark leaks (data contamination) distort LLM evaluations and practical fixes

0.60

0.50

0.60

6

Contaminated benchmarks can make models look better than they are, misleading product decisions and inflating R&D ROI claims.

Key finding

BDC has four severity levels: semantic, information, data, and label exposure.

Practical survey: why training/test overlap (data contamination) breaks LLM evaluations

0.70

0.50

0.60

5

Contaminated evaluations can create false confidence about model quality and lead to bad product choices; verifying contamination protects model selection and user trust.

Key finding

Data contamination is common and, at scale, effectively inevitable.

Biomedical LLMs often underperform general models on unseen clinical data

0.40

0.35

0.30

5

Fine-tuning on public biomedical text does not reliably boost performance on new clinical tasks and can reduce reliability; use large general models or retrieval systems for production clinical features.

Key finding

Generalist models often outperform biomedical fine-tuned models on unseen clinical case vignettes.

Numbers: JAMA: OpenBioLLM-70B 66.4% vs Llama-3-70B-Instruct 65%

A realistic, evolving benchmark for repository-level code generation drawn from recent GitHub projects

0.30

0.60

0.50

5

EvoCodeBench reveals that state-of-the-art LLMs often fail on real repository tasks; test on repo-aligned data and include local contexts to avoid bad deployment surprises.

Key finding

EvoCodeBench-2403 size and distribution match recent repositories.

Numbers: 275 samples, 25 repos; standalone 27% / non-standalone 73%; avg dependencies 3.46