Benchmark Robustness Papers — Parsed & Scored for Practitioners

GPT-4 exceeds USMLE pass threshold and outperforms prior models on medical benchmarks

0.30

0.60

0.70

497

GPT-4 can reliably answer medical multiple-choice questions and give better-calibrated confidence scores than earlier models, making it useful for education, drafting clinical notes, and decision-support prototypes, provided its outputs receive human oversight and validation.

Key finding

GPT-4 strongly outperforms GPT-3.5 on USMLE-style multiple-choice tests.

Numbers: USMLE Self Assessment overall: GPT-4 83.76% (zero-shot) vs GPT-3.5 49.1%

ChatGPT can match commercial translators for well-resourced languages; GPT-4 and 'pivot prompting' fix many weaknesses.

0.60

0.20

0.60

313

Large LMs like ChatGPT can replace or augment translation stacks for many high-resource language needs. Using a stronger engine (GPT-4) or pivoting through a major language improves coverage for low-resource and distant pairs. This lowers integration time for prototyping and can cut reliance on commercial APIs for some

Key finding

Prompt wording matters but has only a modest effect.

Numbers: Best prompt (TP3) BLEU=24.73 vs TP1=23.25 (Table 3).

EvalPlus: auto-generated tests reveal up to ~29% lower pass rates and 11% bad 'ground-truth' in HumanEval

0.70

0.60

171

Small test suites give falsely high confidence in AI-generated code; automated, larger testing exposes real failure rates and helps select safer models for production.

Key finding

Automated augmentation increases tests per task from single-digit to hundreds.

Numbers: HumanEval avg tests 9.6 → HumanEval+ avg 764.1
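The idea that a larger, automatically generated test set exposes bugs a small hand-written suite misses can be sketched with plain differential testing. This is a minimal illustration, not the EvalPlus pipeline: `reference_median`, `candidate_median`, and the random input generator are hypothetical stand-ins chosen for the example.

```python
import random


def reference_median(xs):
    """Trusted reference solution (plays the role of the ground truth)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2


def candidate_median(xs):
    """Hypothetical model-written solution: wrong for even-length lists."""
    return sorted(xs)[len(xs) // 2]


def failing_inputs(candidate, reference, inputs):
    """Return every input on which candidate and reference disagree."""
    return [x for x in inputs if candidate(x) != reference(x)]


# A tiny hand-written suite, similar in spirit to a small original test set.
base_inputs = [[1], [3, 1, 2], [5, 4, 3, 2, 1]]

# Randomly generated extra inputs (an assumed stand-in for automated augmentation).
rng = random.Random(0)
extra_inputs = [
    [rng.randint(-50, 50) for _ in range(rng.randint(1, 8))]
    for _ in range(200)
]

base_failures = failing_inputs(candidate_median, reference_median, base_inputs)
extra_failures = failing_inputs(candidate_median, reference_median, extra_inputs)

print(f"base suite failures:      {len(base_failures)} / {len(base_inputs)}")
print(f"augmented suite failures: {len(extra_failures)} / {len(extra_inputs)}")
```

The buggy candidate passes every base test because all three base inputs have odd length; only the generated inputs reach the even-length case.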

A systematic benchmark showing where GPT-style LLMs help — and where they fail — on practical chemistry tasks

0.40

0.35

0.50

91

LLMs can speed up human-in-the-loop chemistry tasks (text descriptions, candidate generation, reagent ranking) with few-shot prompts, but they are not yet reliable drop-in replacements for specialized models or automation pipelines where exact SMILES or reaction outcomes are needed.

Key finding

GPT-4 ranks best across the eight chemistry tasks.

Numbers: Average rank: GPT-4 = 1.25 (Table 2).

ChatGPT often matches fine-tuned models on query/aspect summarization using zero-shot prompts

0.60

0.30

0.70

89

You can often skip costly fine-tuning and get usable aspect/query summaries by prompting ChatGPT zero-shot, but expect issues with very short target summaries and long documents unless you add retrieval or truncation.

Key finding

Zero-shot ChatGPT achieves comparable ROUGE scores to fine-tuned models on several aspect/query datasets.

Numbers: NEWTS R-1: ChatGPT 32.54 vs FT 31.78 (Table 2)

DeepSeek: scaling recipes and a 2T‑token bilingual pretraining run that yields 7B and 67B models competitive on code, math, and chat

0.70

0.60

0.70

82

The paper gives practical scaling recipes and hyperparameter fits so teams can plan compute, model size, and data investments more predictably; it shows a 67B open model can match or beat larger baselines on code/math when paired with curated bilingual data and alignment.

Key finding

Optimal batch size grows and optimal learning rate falls with compute; fitted power‑law relations give near‑optimal hyperparameters across budgets.

Numbers: near‑optimal region defined as ≤0.25% above min loss; fitted across 1e17–2e19 FLOPs
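A power-law relation of the form value ≈ coeff · C^exponent becomes a straight line in log-log space, so it can be fitted with ordinary least squares. The sketch below assumes that form and uses made-up measurements; the coefficients and exponents are illustrative only and are not the paper's fitted values.

```python
import numpy as np


def fit_power_law(compute, value):
    """Fit value ~ coeff * compute**exponent by least squares in log-log space."""
    exponent, log_coeff = np.polyfit(np.log(compute), np.log(value), deg=1)
    return float(np.exp(log_coeff)), float(exponent)


def predict(coeff, exponent, compute):
    """Evaluate the fitted power law at a new compute budget."""
    return coeff * compute ** exponent


# Made-up "best hyperparameter" measurements at three compute budgets (FLOPs).
# These numbers are hypothetical and exist only to exercise the fit.
compute = np.array([1e17, 1e18, 1e19])
best_batch = 2.0 * compute ** 0.25   # grows with compute
best_lr = 0.01 * compute ** -0.1     # falls with compute

batch_coeff, batch_exp = fit_power_law(compute, best_batch)
lr_coeff, lr_exp = fit_power_law(compute, best_lr)

print(f"batch size exponent:    {batch_exp:.3f}")
print(f"learning rate exponent: {lr_exp:.3f}")
print(f"predicted batch size at 2e19 FLOPs: {predict(batch_coeff, batch_exp, 2e19):.0f}")
```

A positive fitted exponent for batch size and a negative one for learning rate correspond to the trend stated in the key finding.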

Small, irrelevant changes to Theory-of-Mind vignettes make GPT-3.5 fail

1.00

79

Relying on LLMs' apparent commonsense reasoning can be risky: models may fail on small, realistic changes and produce misleading outputs in user-facing scenarios.

Key finding

Making an opaque container transparent causes GPT-3.5 to predict the agent believes the wrong content.

Numbers: Variation 1A: P(chocolate)=95% vs P(popcorn)=0%

ChatGPT is weak at standard supervised IE but surprisingly strong at open extraction, explains itself well, yet is overconfident

0.50

0.40

59

ChatGPT can propose high-quality extraction candidates and readable explanations without labels, but it is not a drop-in replacement for supervised IE when you need precise, well-calibrated automated extraction.

Key finding

ChatGPT underperforms supervised baselines on Standard-IE tasks.

Numbers: Standard-IE full-test Micro-F1: ChatGPT often far below SOTA (e.g., EE trigger/arg 16.6/7.8 vs SOTA ~72/56).

Appending short emotional phrases to prompts measurably improves LLM outputs

0.60

0.50

0.40

57

A very low-cost prompt change (add one short emotional sentence) can raise automated and human-perceived output quality, reduce hallucinations, and improve responsibility—useful for chat assistants, content generation, and QA systems where marginal gains matter.

Key finding

EmotionPrompt raised average deterministic benchmark scores on Instruction Induction.

Numbers: 8.00% relative improvement on Instruction Induction (Table 1)

LLMs excel at simple sentiment tasks but struggle with fine-grained, structured sentiment extraction

0.60

0.40

0.60

55

Off-the-shelf LLMs can replace expensive labeling for basic sentiment tasks and speed up pilot projects, but structured extraction and safety-sensitive detection still need specialist models or human review.

Key finding

LLMs match fine-tuned small models on simple sentiment classification in the zero-shot setting.

Numbers: ChatGPT ≈97% of T5 performance on SC tasks (paper text).

Large multilingual evaluation shows ChatGPT is strong at grammar but weak at multilingual semantic tasks

0.30

0.40

0.60

51

ChatGPT zero-shot is good for quick grammar-level tasks (like POS tagging) but not reliable for production semantic tasks across many languages; invest in task- and language-specific models for higher accuracy and lower operational risk.

Key finding

ChatGPT generally underperforms supervised task-specific models on semantic multilingual tasks.

Numbers: XNLI avg acc: ChatGPT (en) 57.0% vs mT5-XXL 87.1%

Imitating ChatGPT copies style, not capabilities

0.40

0.50

0.60

50

Imitation can cheaply copy a proprietary model's tone and safety but does not replicate its core reasoning or factual knowledge, so relying on imitation to match competitors is risky.

Key finding

Human raters often rate imitation outputs as equal to or better than ChatGPT's.

Numbers: ≈70% of imitation outputs rated equal/better vs ChatGPT

Survey of 23 LLM benchmarks finds widespread blind spots; recommends behavioral profiling and audits

0.40

0.60

42

Benchmark scores can mislead product decisions if they reflect memorization, prompt sensitivity, or English-only tests; firms should test models under realistic prompts, languages, and safety scenarios.

Key finding

Response variability breaks standardized tests.

Numbers: 22/23 benchmarks showed sensitivity

Small prompt formatting changes can swing LLM accuracy by tens of points

0.60

40

Small, innocuous prompt formatting choices can produce large and unpredictable swings in LLM performance, which can mislead model selection, harm user experience, or produce fragile products unless you test multiple formats.

Key finding

Formatting can change accuracy by very large amounts.

Numbers: Max spread 76 accuracy points (LLaMA-2-13B)
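One way to act on the advice to test multiple formats is to render each prompt in several superficially different layouts and report the gap between the best and worst accuracy. The sketch below is an assumption-laden illustration: the separators, casings, and the `acc_by_format` values are hypothetical, and the model call that would produce the accuracies is left out.

```python
from itertools import product


def prompt_variants(question):
    """Render one question under every (separator, casing) combination."""
    separators = [": ", " - ", ":\n"]
    casings = [str.lower, str.upper, str.title]
    variants = []
    for sep, case in product(separators, casings):
        variants.append(
            f"{case('question')}{sep}{question}\n{case('answer')}{sep}"
        )
    return variants


def accuracy_spread(acc_by_format):
    """Gap between the best and worst accuracy over all formats."""
    values = list(acc_by_format.values())
    return max(values) - min(values)


variants = prompt_variants("What is 2 + 2?")
print(f"{len(variants)} formats generated")

# Hypothetical per-format accuracies, as would come from running an evaluation.
acc_by_format = {"format_0": 0.5, "format_1": 0.75, "format_2": 0.25}
print(f"spread: {accuracy_spread(acc_by_format):.2f}")
```

Reporting the spread alongside the mean makes it harder for one lucky format to drive a model selection decision.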

LLMs show some social reasoning but fail adversarial and robust tests

0.25

0.45

0.20

36

Don't assume LLMs understand people just because they give human-like answers; test models with adversarial and diverse benchmarks before using them for social judgments.

Key finding

Some models excel on narrow ToM-style tasks but not across the board.

Numbers: TriangleCOPA: flan-t5-xxl 96% vs MFC 52%

MGTBench: a modular benchmark that measures how well detectors spot and attribute text from modern LLMs and how brittle they are to attacks

0.60

0.40

0.50

31

Automated detection helps flag AI-written content that affects trust, compliance, or fraud; MGTBench identifies which detectors work, how much labelled data they need, and where they fail under attacks.

Key finding

Fine-tuned LM Detector gives the highest detection accuracy across datasets.

Numbers: F1=0.993 (Essay, human vs ChatGPT-turbo)

How LLMs are reshaping healthcare: capabilities, data needs, risks, and where to start

0.40

0.30

0.70

28

LLMs now match or approach clinician-level performance on some exam-style tasks and can speed documentation, triage, and literature review—but risk, privacy, and integration costs mean businesses must plan governance and hybrid human+AI workflows.

Key finding

Top LLMs approach human performance on exam-style medical questions.

Numbers: USMLE: GPT-4 86.7%, Med‑PaLM 2 86.5%, Human 87.0%

DAIL-SQL: prompt+example selection that sets a new Spider Text-to-SQL high (86.6% EX)

0.70

0.60

23

DAIL-SQL gives a practical recipe to improve Text-to-SQL accuracy while cutting token cost; that reduces API spend and speeds up production query interfaces.

Key finding

DAIL-SQL sets a new top score on the Spider leaderboard with GPT-4 and self-consistency.

Numbers: 86.6% execution accuracy (leaderboard, with self-consistency)
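Self-consistency for Text-to-SQL is commonly implemented as a vote over execution results: sample several candidate queries, run each one, and keep a query whose result is the most frequent. The sketch below shows that generic idea with the standard-library `sqlite3` module. It is not taken from DAIL-SQL; the `singer` table and the candidate queries are hypothetical.

```python
import sqlite3
from collections import Counter


def vote_by_execution(conn, candidates):
    """Return a candidate query whose execution result is the most common."""
    executed = []
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # candidates that fail to execute get no vote
        # Sort rows so that row order does not split the vote.
        executed.append((sql, tuple(sorted(rows, key=repr))))
    if not executed:
        return None
    winner, _ = Counter(key for _, key in executed).most_common(1)[0]
    return next(sql for sql, key in executed if key == winner)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?)",
    [("Ann", 25), ("Bo", 33), ("Cy", 41)],
)

# Hypothetical samples for "How many singers are there?"
candidates = [
    "SELECT COUNT(*) FROM singer",
    "SELECT COUNT(name) FROM singer",
    "SELECT COUNT(*) FROM singers",   # wrong table name, fails to execute
    "SELECT MAX(age) FROM singer",    # executes but answers another question
]

chosen = vote_by_execution(conn, candidates)
print(chosen)
```

Voting on results rather than on query text lets two differently written but equivalent queries support each other.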

A low-cost, practical method that finds whether LLMs memorized evaluation datasets

0.80

0.60

0.70

22

If an LLM already saw your test data, reported performance is not a real measure of capability; this cheap detection method helps teams vet benchmarks and avoid overclaiming model quality.

Key finding

Guided instruction + GPT-4 few-shot classifier (Algorithm 2) matches human labels nearly perfectly.

Numbers: GPT-4 14/14 (100%); GPT-3.5 13/14 (92.86%) on 14 partitions

LLMs favor certain option IDs, making multiple-choice evaluation brittle

0.60

0.50

0.70

22

MCQ-format evaluation and automated grading can be unstable: models may pick "A" or "C" by habit, producing misleading scores. Fixing this improves model reliability with minimal compute.

Key finding

Simple answer-moving changes cause large accuracy swings.

Numbers: gpt-3.5-turbo MMLU: 67.2 → 60.9 (−6.3) when the gold answer is moved to D; llama-30B: 53.1 → 68.2 (+15.2) when moved to A
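The answer-moving probe can be sketched as follows: place the gold option at a fixed position for every item, re-render the prompt, and compare accuracy across positions. Everything named here is hypothetical, including the toy `always_a` model, which stands in for a model with an extreme preference for option "A".

```python
LABELS = "ABCD"


def move_gold(options, gold_index, target_index):
    """Place the gold option at target_index, keeping the other options in order."""
    opts = list(options)
    gold = opts.pop(gold_index)
    opts.insert(target_index, gold)
    return opts, target_index


def render(question, options):
    """Format a question and its options as a multiple-choice prompt."""
    lines = [question] + [f"{LABELS[i]}. {o}" for i, o in enumerate(options)]
    return "\n".join(lines) + "\nAnswer:"


def always_a(prompt):
    """Toy model that ignores the prompt and always answers 'A'."""
    return "A"


def accuracy_with_gold_at(items, target_index, model):
    """Accuracy when every item's gold answer is moved to target_index."""
    hits = 0
    for question, options, gold_index in items:
        new_options, new_gold = move_gold(options, gold_index, target_index)
        hits += model(render(question, new_options)) == LABELS[new_gold]
    return hits / len(items)


items = [
    ("2 + 2 = ?", ["3", "4", "5", "6"], 1),
    ("Capital of France?", ["Paris", "Rome", "Oslo", "Bern"], 0),
]

for position in range(4):
    acc = accuracy_with_gold_at(items, position, always_a)
    print(f"gold at {LABELS[position]}: accuracy {acc:.2f}")
```

A model with no position preference would score the same at every position, so any gap between positions measures the bias directly.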

A practical survey of using LLMs as automated evaluators, covering methods, apps, benchmarks, and risks

0.70

0.40

0.80

21

LLM judges let teams scale evaluation and feedback in minutes, reduce human labeling cost, and produce human-readable explanations that speed iteration.

Key finding

LLMs can match or exceed crowd annotators on some annotation tasks.

Numbers: GPT-4 83.6% vs MTurk 81.5% (annotation accuracy)

ChatGPT can track multi-turn dialogue states zero-shot, but struggles with slot-filling and long conversations

0.40

0.35

0.30

21

ChatGPT can be used zero-shot to prototype multi-turn dialogue state tracking with near research-level JGA, but is unreliable for precise slot extraction without careful prompt design and output checks.

Key finding

ChatGPT achieves competitive multi-turn DST but lags fine-tuned SOTA.

Numbers: MultiWOZ2.1 JGA 60.28% vs fine-tuned SOTA 61.02% (Table 3)
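Joint goal accuracy (JGA) counts a turn as correct only when the entire predicted dialogue state matches the gold state, which is why a single wrong slot costs the whole turn. A minimal sketch, assuming states are represented as slot-to-value dictionaries; the example dialogue is hypothetical.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose full predicted state equals the gold state."""
    if len(predicted_states) != len(gold_states):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold_states:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted_states, gold_states))
    return correct / len(gold_states)


# Hypothetical three-turn dialogue; the state accumulates slots turn by turn.
gold = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"},
]
predicted = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "no"},  # one slot wrong
]

jga = joint_goal_accuracy(predicted, gold)
print(f"JGA: {jga:.3f}")
```

In the example, eight of the nine slot values are right, yet JGA is two out of three turns, which illustrates how slot-filling errors in long conversations pull the metric down.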

Turn decoder-only LLMs into strong text encoders with three cheap steps

0.70

0.60

0.70

20

You can convert existing decoder-only LLMs into high-quality embedder models cheaply and fast (hours on one GPU) without labeled data, unlocking better retrieval and tagging with fewer resources than full retraining.

Key finding

LLM2Vec applied to Mistral-7B yields the top unsupervised MTEB score reported in the paper.

Numbers: 56.80 (MTEB avg-56, unsupervised, Mistral-7B)

A tiny common-sense math prompt exposes dramatic, inconsistent reasoning in many SOTA LLMs

0.20

0.40

0.20

20

High benchmark scores can hide brittle model behavior. Simple checks with structure-preserving variations catch failures that matter for reliability, safety, and customer trust.

Key finding

Most SOTA models fail or perform inconsistently on a simple common-sense problem.

Numbers: Majority of models p_correct < 0.2; GPT‑4o p=0.649, Claude 3 Opus p=0.431, many models p≈0