Reasoning Benchmarks Papers — Parsed & Scored for Practitioners

GPT-4 exceeds USMLE pass threshold and outperforms prior models on medical benchmarks

0.30

0.60

0.70

497

GPT-4 can reliably answer medical multiple-choice questions and give better-calibrated confidence scores than earlier models, making it useful for education, drafting clinical notes, and decision support prototypes, provided there is human oversight and validation.

Key finding

GPT-4 strongly outperforms GPT-3.5 on USMLE-style multiple-choice tests.

Numbers: USMLE Self Assessment overall: GPT-4 83.76% (zero-shot) vs GPT-3.5 49.1%
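For readers who want to check the calibration claim on their own model outputs, one common metric is expected calibration error (ECE). The sketch below is a minimal illustration under assumptions: the summary above does not say which calibration measure the paper used, and the function name, bin count, and sample data are all hypothetical.

```python
from typing import List

def expected_calibration_error(
    confidences: List[float], correct: List[bool], n_bins: int = 10
) -> float:
    """Weighted average gap between stated confidence and observed accuracy.

    Assumes equal-width confidence bins; this is one common choice, not
    necessarily the measure used in the paper.
    """
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(avg_conf - accuracy)
    return ece

# Made-up data: the model says 90% each time but is right 3 times out of 4.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(round(ece, 3))
```

A lower value means stated confidence tracks accuracy more closely; a well-calibrated model that answers at 90% confidence should be right about 90% of the time.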

A broad third-party benchmark shows ChatGPT is a strong zero-shot performer but an unreliable reasoner and prone to hallucination

0.70

0.25

0.75

352

ChatGPT is a practical zero-shot workhorse: it saves time on many tasks and can replace some fine-tuned models for quick proofs of concept, but its factual and reasoning errors mean you must validate outputs before customer-facing or safety-critical use.

Key finding

ChatGPT often outperforms prior zero-shot LLMs.

Numbers: 9/13 evaluated datasets (zero-shot comparisons)

Two-stage multimodal chain-of-thought lets sub‑1B models reason with images and text

0.60

0.45

96

You can get near-state-of-the-art multimodal reasoning with lightweight models by fine-tuning in two stages and fusing image features—this reduces hallucination and lowers compute cost versus running large multimodal LLMs.

Key finding

Predicting a rationale before the answer hurts small-model accuracy in one-stage text-only training.

Numbers: No-CoT 81.63% vs Reasoning 69.32% (↓12.31pp)

C-EVAL: 13.9k Chinese multiple-choice exam questions across 52 subjects, plus a HARD subset for advanced reasoning

0.70

0.50

0.60

90

C-EVAL exposes where LLMs built for Chinese-language users fail on domain knowledge and hard reasoning; testing with it reveals gaps you need to fix before product launch.

Key finding

Only GPT-4 exceeds 60% average accuracy on C-EVAL.

Numbers: GPT-4 average accuracy 66.4% (zero-shot AO, Table 3)

Let LLMs translate problems and a classical planner find correct, often optimal, plans

0.70

0.60

0.70

84

LLM+P turns LLMs into reliable natural-language front ends for proven symbolic planners. That reduces execution risk and often lowers real-world costs (e.g., fewer extra robot trips). It avoids expensive LLM fine-tuning by delegating correctness to existing planners.

Key finding

LLM+P produced correct or optimal plans in most evaluated domains while LLM-only methods usually failed.

Numbers: BLOCKSWORLD 90% (LLM 15–20%); GRIPPERS 95% (LLM 35%); STORAGE 85% (LLM 0%)
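The translate-then-plan split described in this entry can be sketched roughly as follows. This is an illustration of the general idea, not the LLM+P authors' code: `translate_to_pddl`, `run_planner`, and `describe_plan` are hypothetical placeholders for an LLM call, a classical planner, and a plan-to-text step, and the stubs return canned values so the sketch runs without either.

```python
from typing import Callable, List

def llm_plus_planner(
    task_description: str,
    translate_to_pddl: Callable[[str], str],
    run_planner: Callable[[str], List[str]],
    describe_plan: Callable[[List[str]], str],
) -> str:
    """Route a natural-language task through a symbolic planner.

    All three callables are hypothetical stand-ins supplied by the caller.
    """
    problem_pddl = translate_to_pddl(task_description)  # LLM: text -> planning problem
    plan = run_planner(problem_pddl)                    # planner: responsible for correctness
    return describe_plan(plan)                          # plan -> readable text

# Canned stubs so the sketch is runnable without an LLM or a planner.
def fake_translate(text: str) -> str:
    return "(define (problem demo) (:goal (on a b)))"

def fake_planner(pddl: str) -> List[str]:
    return ["(unstack b a)", "(put-down b)", "(pick-up a)", "(stack a b)"]

def fake_describe(plan: List[str]) -> str:
    return " then ".join(plan)

result = llm_plus_planner(
    "Put block a on block b.", fake_translate, fake_planner, fake_describe
)
print(result)
```

The point of the design is that the LLM never produces the action sequence itself; it only translates, so plan correctness rests on the planner.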

GPT-4 can pass Japan's medical licensing exam but shows costly localization and safety gaps

0.50

0.40

0.70

50

LLMs can reach exam-level MCQ performance in specialized non-English domains, but they need localization, safety filters, and a higher budget due to tokenization and legal differences.

Key finding

GPT-4 passes all six years of the Japanese medical licensing exam (2018–2023) in closed-book multiple-choice format.

Numbers: 2018: required 161, general 221 (passing 160/208); Table 1

RoG: Ground LLM plans on knowledge‑graph relation paths for faithful, interpretable KGQA

0.60

0.50

38

RoG reduces hallucinations by grounding LLM reasoning in KG facts and provides traceable, human-readable paths—this improves accuracy and trust on KG-backed QA without retraining every LLM.

Key finding

RoG sets new best scores on standard KGQA benchmarks.

Numbers: WebQSP Hits@1 85.7; F1 70.8. CWQ Hits@1 62.6; F1 56.2.
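Hits@1 and F1 measure different things, which is why the two columns above can diverge. The sketch below uses commonly seen per-question definitions; whether the RoG paper computes them exactly this way is an assumption, and the function names and example answers are hypothetical.

```python
from typing import List, Set

def hits_at_1(ranked_predictions: List[str], gold_answers: Set[str]) -> float:
    """1.0 if the top-ranked prediction is a gold answer, else 0.0."""
    if not ranked_predictions:
        return 0.0
    return 1.0 if ranked_predictions[0] in gold_answers else 0.0

def answer_f1(predictions: List[str], gold_answers: Set[str]) -> float:
    """Set-level F1 between predicted and gold answers for one question."""
    predicted = set(predictions)
    true_positives = len(predicted & gold_answers)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold_answers)
    return 2 * precision * recall / (precision + recall)

# Made-up question with two gold answers; the model gets one right and adds a wrong one.
gold = {"Paris", "Lyon"}
preds = ["Paris", "Berlin"]
h1 = hits_at_1(preds, gold)
f1 = answer_f1(preds, gold)
print(h1, f1)
```

Here the top answer is correct, so Hits@1 is 1.0, while the missed and spurious answers pull F1 down to 0.5.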

CMExam: 60K+ Chinese medical multiple-choice questions with explanations and fine-grained annotations

0.50

0.45

0.40

32

CMExam gives a reliable, large-scale way to measure clinical QA performance for Chinese medical LLMs so teams can identify domain gaps and cost-effectively fine-tune small models.

Key finding

GPT-4 is the top zero-shot answer predictor on CMExam.

Numbers: 61.6% accuracy (GPT-4) vs 71.6% (human)

LLMs fail at autonomous planning (~3% success) but their plans can be repaired and slightly help humans

1.00

0.60

0.40

31

If you plan to use LLMs for automated action sequencing or workflows, don't run them unsupervised — they rarely produce correct plans; use them as idea generators and pair with a certified planner or human review.

Key finding

LLMs rarely produce correct executable plans when used alone.

Numbers: GPT-3: 6/600 (1%); Instruct-GPT3: 41/600 (6.8%); BLOOM: 4/250 (1.6%); paper cites ≈3% average
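The per-model rates above follow directly from the counts. The summary does not say how the ≈3% average was formed, so the sketch below computes both an unweighted mean of the three rates and a pooled rate over all instances; both land close to the quoted figure.

```python
# Success counts as listed above: (correct plans, total instances).
results = {
    "GPT-3": (6, 600),
    "Instruct-GPT3": (41, 600),
    "BLOOM": (4, 250),
}

# Per-model success rate in percent.
rates = {name: 100 * ok / total for name, (ok, total) in results.items()}

# Two ways to average; which one the paper used is not stated in this summary.
unweighted_mean = sum(rates.values()) / len(rates)
pooled = 100 * sum(ok for ok, _ in results.values()) / sum(
    total for _, total in results.values()
)

for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
print(f"unweighted mean: {unweighted_mean:.1f}%, pooled: {pooled:.1f}%")
```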

M3Exam: 12k official exam questions in 9 languages (23% with images) to stress-test LLMs' multilingual and multimodal skills

0.70

0.60

0.40

31

M3Exam reveals real-world gaps in multilingual and multimodal LLMs: expect failures on low-resource languages and complex images, so validate models on representative data before deployment.

Key finding

M3Exam totals 12,317 multiple-choice questions across 9 languages.

Numbers: 12,317 total questions; 9 languages

LegalBench: 162 lawyer-crafted tasks to test LLM legal reasoning

0.40

0.60

0.50

28

LegalBench gives legal teams and ML practitioners a practical suite to test LLMs on many lawyer-defined tasks before deployment, exposing brittle cases, prompt sensitivity, and task-by-task risk.

Key finding

GPT-4 is the strongest model across most legal reasoning categories in this evaluation.

Numbers: Issue 82.9, Rule-recall 59.2, Conclusion 89.9, Interpretation 75.2, Rhetorical 79.4 (balanced-accuracy, Table 2)
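The scores above are reported as balanced accuracy, which averages per-class recall so that a model cannot score well by always guessing the majority label. A minimal sketch of the standard definition follows; the function and the toy labels are illustrative and not taken from the benchmark.

```python
from collections import defaultdict
from typing import List

def balanced_accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    if not y_true:
        raise ValueError("need at least one example")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)

# Toy imbalanced task: 8 "yes" and 2 "no" examples; the model always answers "yes".
y_true = ["yes"] * 8 + ["no"] * 2
y_pred = ["yes"] * 10
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(plain, balanced)
```

Plain accuracy rewards the always-"yes" model with 0.8, while balanced accuracy reports 0.5, the same as chance on a two-class task.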

CogEval: systematic tests show LLMs fail at cognitive maps and multi‑step planning

0.30

0.60

0.20

22

Do not assume LLMs can plan multi‑step tasks from text alone; failures scale with graph complexity and can cause incorrect or looping actions in planning applications.

Key finding

LLM, graph, domain, and condition strongly predict performance.

Numbers: LLM χ2=2357.87; graph χ2=3431.53; condition χ2=2080.04; domain χ2=458.74 (all p<.001)

A tiny common-sense math prompt exposes dramatic, inconsistent reasoning in many SOTA LLMs

0.20

0.40

0.20

20

High benchmark scores can hide brittle model behavior. Simple checks with structure-preserving variations catch failures that matter for reliability, safety, and customer trust.

Key finding

Most SOTA models fail or perform inconsistently on a simple common-sense problem.

Numbers: Majority of models p_correct < 0.2; GPT‑4o p=0.649, Claude 3 Opus p=0.431, many models p≈0
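A structure-preserving variation check of the kind this entry recommends can be sketched as below. Everything here is illustrative: the template is a made-up arithmetic question rather than the problem used in the paper, `ask_model` is a hypothetical callable standing in for a real model call, and the stub is deliberately brittle to show what an inconsistent result looks like.

```python
from typing import Callable, List, Tuple

# Hypothetical template; only the numbers change between variations.
TEMPLATE = "A shelf holds {a} red books and {b} blue books. How many books are on the shelf?"

def make_variations(pairs: List[Tuple[int, int]]) -> List[Tuple[str, int]]:
    """Return (prompt, expected answer) pairs that share one structure."""
    return [(TEMPLATE.format(a=a, b=b), a + b) for a, b in pairs]

def p_correct(ask_model: Callable[[str], int], prompt: str, expected: int, trials: int) -> float:
    """Fraction of repeated trials in which the model returns the expected answer."""
    hits = sum(1 for _ in range(trials) if ask_model(prompt) == expected)
    return hits / trials

def brittle_stub(prompt: str) -> int:
    """Toy stand-in for a model: off by one whenever a number has two digits."""
    numbers = [int(tok) for tok in prompt.split() if tok.isdigit()]
    total = sum(numbers)
    return total if all(n < 10 for n in numbers) else total - 1

variations = make_variations([(3, 4), (12, 5), (2, 6)])
scores = {
    prompt: p_correct(brittle_stub, prompt, expected, trials=5)
    for prompt, expected in variations
}
for prompt, score in scores.items():
    print(f"{score:.2f}  {prompt}")
```

A robust model should score about the same on every variation; a large spread across prompts that differ only in their numbers is the brittleness signal.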

Survey: where multimodal LLMs stand on reasoning, benchmarks, training recipes, and gaps

0.40

0.30

0.40

19

If your product needs reliable multimodal reasoning (e.g., visual QA, robotics planning), current models vary widely; instruction tuning and careful training stages materially improve results, but proprietary models still lead.

Key finding

Proprietary multimodal models outperform open-source models on reasoning-focused benchmarks.

Numbers: InfiMM-Eval overall: GPT-4V 74.44 vs InfiMM-LLaMA-13B 40.7

ControlBench: evaluate GPT-4, Claude 3 Opus, Gemini on 147 undergraduate control problems

0.40

0.60

0.50

19

Text LLMs can help generate control designs and explanations quickly, but they commonly make calculation and plot-reading errors, so use them for drafts and human-in-the-loop workflows, not final safety-critical designs.

Key finding

Claude 3 Opus outperforms GPT-4 and Gemini on ControlBench.

Numbers: Claude 3 Opus ACC 58.5% (86/147), ACC-s 68.7% (101/147)

Long-context LLMs fail to learn reliably from very long in‑context demonstrations

0.40

0.50

0.40

17

If you depend on LLM few‑shot prompts for fine‑grained classification in long documents, current long‑context LLMs are unreliable; plan to fine‑tune or add retrieval/structured classifiers instead.

Key finding

On the hardest task (Discovery, 174 labels), almost all evaluated LLMs score ~0% accuracy; Gemini‑1.5‑Pro achieves 14% while a fine‑tuned BERT reaches 87%.

Numbers: Discovery: most models 0%; Gemini 14%; BERT fine-tuned 87%

CMMLU: an 11.5k-question Chinese multitask benchmark exposing limits of current LLMs

0.80

0.60

0.50

16

CMMLU shows current LLMs still miss large swaths of Chinese factual and reasoning knowledge. If your product targets Chinese users or policies, evaluate models on Chinese-specific data before deployment.

Key finding

Most evaluated LLMs score below a 60% pass mark on CMMLU (Chinese-exam pass = 60%).

Numbers: GPT4 70.95% (5-shot); ChatGPT 55.51%; many models 30–62%

An open, continuously updated leaderboard that measures LLM multi-step reasoning using chain-of-thought prompts

0.60

0.40

0.60

15

Reasoning capability separates good conversational models from ones that can solve multi-step tasks; measuring it helps pick models for products that need math, code, or multi-step decisions.

Key finding

Reasoning performance scales with model size.

Numbers: GSM8k: GPT-4 92.0 vs LLaMA-65B 50.9

Comprehensive eval finds Gemini close to GPT‑3.5 on language commonsense, behind GPT‑4 and GPT‑4V on multimodal tasks

0.65

0.25

0.40

14

Gemini Pro is close to GPT‑3.5 for language commonsense but behind GPT‑4; pick models based on accuracy needs and multimodal complexity.

Key finding

Gemini Pro's language-only accuracy is similar to GPT‑3.5 Turbo.

Numbers: Avg acc Gemini Pro 79.2% vs GPT‑3.5 78.2% on 11 language datasets

A practical benchmark showing prompt design and graph encoding matter: LLMs can help on graph tasks but still trail graph models

0.30

0.45

0.35

14

Feeding graph text and simple prompt strategies to LLMs is a cheap way to build KG question answering and automated query generation prototypes, but specialized graph models still give higher accuracy for production.

Key finding

Adding the graph text to LLM inputs dramatically improves KGQA on Wiki.

Numbers: zero-shot 9.23 → zero-shot+graph 56.38
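"Adding the graph text" amounts to serializing the relevant edges into the prompt ahead of the question. The sketch below shows one simple way to do that; the arrow-style encoding, the prompt wording, and the function names are assumptions, since this summary does not give the benchmark's actual format.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

def encode_graph(triples: List[Triple]) -> str:
    """Serialize edges one per line in a hypothetical arrow notation."""
    return "\n".join(f"{head} --{relation}--> {tail}" for head, relation, tail in triples)

def build_prompt(question: str, triples: List[Triple]) -> str:
    """Place the serialized graph before the question in a zero-shot prompt."""
    return (
        "Use only the facts below to answer.\n"
        f"Facts:\n{encode_graph(triples)}\n"
        f"Question: {question}\nAnswer:"
    )

triples = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
]
prompt = build_prompt("In which country was Marie Curie born?", triples)
print(prompt)
```

Because the entry's headline is that encoding choices matter, it is worth trying more than one serialization (for example, sentences instead of arrows) on a held-out set before settling on one.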

A tough, multi-domain benchmark (math, physics, biology, chemistry, law) that reveals large LLM gaps and tests rubric-based self-evaluation

1.00

0.60

0.40

14

ARB exposes gaps in LLM symbolic and proof reasoning; companies should benchmark high-stakes systems on ARB-like items before relying on automation.

Key finding

Top LLMs score very low on symbolic quantitative tasks.

Numbers: GPT-4: math-symbolic 18%, physics-symbolic 28% (Table 2)

An open math-specialized LLM (7B & 34B) that improves math problem solving and formal proving

0.60

0.70

0.60

13

LLEMMA gives stronger math and formal-proving ability than other open base models while being fully open-source, enabling companies to build reproducible math tools and lower development cost for math-heavy applications.

Key finding

Continued pretraining on Proof-Pile-2 improves few-shot math reasoning.

Numbers: GSM8k few-shot: LLEMMA-34B 51.5% vs Code Llama-34B 29.6% (+21.9 pp)

Use simple logic checks to make zero‑shot chain-of-thought answers more reliable

0.60

0.40

0.30

11

LoT is a low‑effort prompting add‑on that raises reasoning accuracy on strong LLMs; use it when correctness matters and you can afford extra API calls.

Key finding

Adpt‑LoT improves zero‑shot CoT accuracy on math and reasoning tasks for strong models.

Numbers: GSM8K: 78.75 → 80.15 (+1.40 pp); AQuA: 57.09 → 60.63 (+3.54 pp)

MATHVISTA: a 6k multimodal benchmark showing GPT-4V is strongest but still ~10% behind humans

0.60

11

MATHVISTA highlights where vision+math systems fail (OCR, shape detection, hallucination). Use it to benchmark assistants that read charts, analyze reports, or grade math-in-image tasks before deployment.

Key finding

GPT-4V is the best model but still below humans.

Numbers: GPT-4V 49.9% vs human 60.3% (gap 10.4 pp)