A tough, multi-domain benchmark (math, physics, biology, chemistry, law) that reveals large LLM gaps and tests rubric-based self-evaluation

Overview

Decision Snapshot: Ready For Pilot

ARB is production-ready as a diagnostic benchmark. The rubric self-eval is a promising cost-saver but not yet reliable enough to replace humans for high-stakes grading.

Citations: 14

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Partial

License: CC BY 4.0 (dataset); helper code MIT

At A Glance

Cost impact: 40%

Production readiness: 100%

Novelty: 60%

Authors

Tomohiro Sawada, Daniel Paleka, Alexander Havrilla, Pranav Tadepalli, Paula Vidas, Alexander Kranias, John J. Nay, Kshitij Gupta, Aran Komatsuzaki

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ARB exposes gaps in LLM symbolic and proof reasoning; companies should benchmark high-stakes systems on ARB-like items before relying on automation.

Who Should Care

ML Engineer, Product Manager, CTO

Summary TLDR

ARB is a new benchmark of graduate- and professional-level problems across mathematics, physics, MCAT science/reading, and U.S. law. It focuses on short-answer and open-response problems that are hard to grade automatically. Evaluations show state-of-the-art LLMs (GPT-4, Claude, GPT-3.5) do well on multiple-choice law/MCAT items but perform poorly on symbolic and proof-like quantitative tasks (e.g., GPT-4: 18% math-symbolic, 28% physics-symbolic). The paper also proposes LLM-generated rubrics and self-evaluation; GPT-4 rubric scores correlate well with human graders (correlations 0.78–0.91) but still over/under-assign partial credit in many cases.
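
The agreement between rubric self-evaluation and human grading mentioned above can be checked on your own graded sample with a plain Pearson correlation. The following is a minimal sketch, not the paper's evaluation code: the function names, the `max_gap` threshold, and the scores in the demo are hypothetical placeholders, and the paper may use a different correlation measure.

```python
import math
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


def flag_disagreements(
    model_scores: Sequence[float],
    human_scores: Sequence[float],
    max_gap: float = 2.0,  # hypothetical threshold, tune to your rubric scale
) -> List[int]:
    """Indices where the model score differs from the human score by more than max_gap."""
    return [
        i
        for i, (m, h) in enumerate(zip(model_scores, human_scores))
        if abs(m - h) > max_gap
    ]


if __name__ == "__main__":
    # Placeholder scores for illustration only; these are NOT results from the paper.
    human = [0, 3, 5, 8, 10]
    model = [1, 3, 9, 7, 10]
    print("correlation:", round(pearson(model, human), 3))
    print("items to re-check:", flag_disagreements(model, human))
```

A high correlation can coexist with large per-item gaps, so listing the flagged items alongside the aggregate number helps surface the over- and under-assigned partial credit.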

Problem Statement

Existing benchmarks are getting too easy for large models. ARB aims to test expert-level reasoning by collecting graduate and professional problems in math, physics, MCAT science/reading, and law. The dataset emphasizes short-answer and open-response formats that require true symbolic reasoning or manual grading.

Main Contribution

ARB benchmark: hundreds of graduate/professional problems across math, physics, MCAT, and U.S. law, focused on short-answer and open-response items.

Evaluation of multiple public LLMs (GPT-4, gpt-3.5-turbo, text-davinci-003, claude-v1.3-100k) on ARB with standardized prompts and parsing rules.

Key Findings

Top LLMs score very low on symbolic quantitative tasks.

Numbers: GPT-4: math-symbolic 18%, physics-symbolic 28% (Table 2)

Practical Use: Do not assume LLMs can solve graduate-level symbolic math/physics; use ARB to stress-test symbolic reasoning before deployment.

Evidence Ref: Table 2

Multiple-choice performance is high while short-answer numeric/symbolic performance is low.

Numbers: GPT-3.5 failed to produce parsable answers on ~25% of Law items; other models failed less than 5% of the time, and GPT-4 produced parsable answers more than 99% of the time (Section 4)

Practical Use: Prefer closed-format multiple-choice items for automated scoring; open responses need careful parsing and human review.

Evidence Ref: Section 4 and Figure 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	GPT-4 18%	—	—	ARB math symbolic	Table 2: manually parsed symbolic scores	Table 2
Accuracy	GPT-4 28%	—	—	ARB physics symbolic	Table 2: manually parsed symbolic scores	Table 2

What To Try In 7 Days

Run your key LLM on a small ARB subset (math/physics symbolic) to measure real-world symbolic weakness.

Add a rubric-based self-eval pass (GPT-4 rubric generation) to flag likely partial-credit cases for human review.

Switch high-stakes checks from free-form to structured multiple-choice or tightly parsed numeric formats where possible.
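
A tightly parsed numeric format, as suggested in the last step, could be graded along these lines. This is a sketch under stated assumptions: the helper names, the 1% relative tolerance, and the three-way outcome are illustrative choices, not rules taken from the ARB paper.

```python
import math
from typing import Optional


def parse_number(text: str) -> Optional[float]:
    """Parse a plain numeric answer such as '3.2e-4'; return None on failure."""
    try:
        value = float(text.strip().replace(",", ""))
    except ValueError:
        return None
    # Reject 'nan' and 'inf', which float() accepts but which are not real answers.
    return value if math.isfinite(value) else None


def grade_numeric(predicted: str, reference: float, rel_tol: float = 0.01) -> str:
    """Return 'correct', 'incorrect', or 'needs_review' for one numeric answer.

    rel_tol=0.01 (1%) is an assumed tolerance. With a reference of exactly 0
    a relative tolerance only accepts an exact match, so such items may need
    an absolute tolerance instead.
    """
    value = parse_number(predicted)
    if value is None:
        return "needs_review"  # unparsable output goes to a human, not to 'incorrect'
    if math.isclose(value, reference, rel_tol=rel_tol):
        return "correct"
    return "incorrect"
```

Routing unparsable answers to `needs_review` rather than marking them wrong keeps formatting failures separate from reasoning failures.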

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: CC BY 4.0 (dataset); helper code MIT

Code URLs

https://arb.duckai.org/api/lib

https://app.swaggerhub.com/apis-docs/arb-dataset/arb-api/1.0.5

https://arxiv.org/abs/2307.13692v2

Data URLs

https://arb.duckai.org/api/lib

https://app.swaggerhub.com/apis-docs/arb-dataset/arb-api/1.0.5

Risks & Boundaries

Limitations

Possible data contamination: some source material may be in model training data.

Automated grading is limited for many symbolic and proof-like answers.

When Not To Use

As the sole grader for high-stakes symbolic or proof-like answers.

For multimodal evaluation where image-based problems are central (not fully covered).

Failure Modes

Parsing failures: models fail to output the expected ANSWER: delimiter and get marked incorrect.

Rubric hallucination or extra credit: GPT-4 sometimes assigns points for steps that are not in the rubric (Table 4).
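
The parsing failure mode above can be measured before any accuracy scoring. The sketch below assumes the model is asked to end its response with a line such as `ANSWER: C`; the regular expression and function names are hypothetical, and the paper's actual parsing rules may differ.

```python
import re
from typing import List, Optional

# Assumed format: the final answer follows an "ANSWER:" delimiter on one line.
ANSWER_PATTERN = re.compile(r"ANSWER:\s*(.+)", re.IGNORECASE)


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last 'ANSWER:' delimiter, or None if absent."""
    matches = ANSWER_PATTERN.findall(response)
    if not matches:
        return None  # no delimiter found: treat as a parsing failure
    return matches[-1].strip()


def parse_rate(responses: List[str]) -> float:
    """Fraction of responses that contain a parsable final answer."""
    if not responses:
        return 0.0
    parsed = sum(extract_final_answer(r) is not None for r in responses)
    return parsed / len(responses)
```

Reporting `parse_rate` next to accuracy makes it visible when a low score comes from missing delimiters rather than from wrong answers.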

Core Entities

Models

gpt-4-0314, gpt-3.5-turbo-0301, text-davinci-003, claude-v1.3-100k

Metrics

Accuracy, symbolic_equivalence, rubric_score, correlation

Datasets

ARB

Benchmarks

GSM8K, MATH, MMLU
