Overview
ARB is production-ready as a diagnostic benchmark. The rubric self-eval is a promising cost-saver but not yet reliable enough to replace humans for high-stakes grading.
Citations: 14
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 0/4
Reproducibility
Status: Code + data available
Open source: Partial
License: CC BY 4.0 (dataset); helper code MIT
At A Glance
Cost impact: 40%
Production readiness: 100%
Novelty: 60%
Why It Matters For Business
ARB exposes gaps in LLM symbolic and proof reasoning; companies should benchmark high-stakes systems on ARB-like items before relying on automation.
Who Should Care
Summary TLDR
ARB is a new benchmark of graduate- and professional-level problems across mathematics, physics, MCAT science/reading, and U.S. law. It focuses on short-answer and open-response problems that are hard to grade automatically. Evaluations show that state-of-the-art LLMs (GPT-4, Claude, GPT-3.5) do well on multiple-choice law and MCAT items but perform poorly on symbolic and proof-like quantitative tasks (e.g., GPT-4: 18% on math symbolic, 28% on physics symbolic). The paper also proposes LLM-generated rubrics and self-evaluation. GPT-4 rubric scores correlate well with human graders (correlations 0.78–0.91), but GPT-4 still over- or under-assigns partial credit in many cases.
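To make the rubric-versus-human comparison concrete, here is a minimal sketch of how agreement between two sets of scores could be measured. It assumes a Pearson correlation; the summary does not say which correlation statistic the paper reports, and the function name `pearson` and the sample scores are hypothetical.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists.

    Assumption: Pearson is used here only for illustration; the source
    does not specify the statistic behind its 0.78-0.91 range.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores out of 10 for the same five responses.
model_scores = [2, 5, 7, 9, 10]
human_scores = [1, 6, 6, 8, 10]
agreement = pearson(model_scores, human_scores)
```

A high correlation only says the two graders rank responses similarly; it does not rule out systematic over- or under-crediting on individual items.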
Problem Statement
Existing benchmarks are getting too easy for large models. ARB aims to test expert-level reasoning by collecting graduate and professional problems in math, physics, MCAT science/reading, and law. The dataset emphasizes short-answer and open-response formats that require true symbolic reasoning or manual grading.
Main Contribution
ARB benchmark: hundreds of graduate/professional problems across math, physics, MCAT, and U.S. law, focused on short-answer and open-response items.
Evaluation of multiple public LLMs (GPT-4, gpt-3.5-turbo, text-davinci-003, claude-v1.3-100k) on ARB with standardized prompts and parsing rules.
Key Findings
Top LLMs score very low on symbolic quantitative tasks (GPT-4: 18% on math symbolic, 28% on physics symbolic).
Multiple-choice performance is high while short-answer numeric/symbolic performance is low.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | GPT-4 18% | — | — | ARB math symbolic | Table 2: manually parsed symbolic scores | Table 2 |
| Accuracy | GPT-4 28% | — | — | ARB physics symbolic | Table 2: manually parsed symbolic scores | Table 2 |
What To Try In 7 Days
Run your key LLM on a small ARB subset (math/physics symbolic) to measure how often it gets symbolic answers right.
Add a rubric-based self-eval pass (GPT-4 rubric generation) to flag likely partial-credit cases for human review.
Switch high-stakes checks from free-form to structured multiple-choice or tightly parsed numeric formats where possible.
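The rubric-based triage step above could be sketched as follows. This is an illustration under stated assumptions, not the paper's pipeline: the `RubricResult` type, its field names, and the rule "anything other than zero or full marks goes to a human" are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class RubricResult:
    """One rubric-graded response (hypothetical structure)."""
    item_id: str
    points_awarded: float
    points_possible: float


def needs_human_review(result: RubricResult) -> bool:
    """Flag results that are neither clean zeros nor clean full marks.

    Partial credit is where the summary says rubric self-evaluation is
    least reliable, so those cases are routed to a person. Scores above
    the maximum (possible extra credit) and malformed rubrics are
    flagged as well.
    """
    if result.points_possible <= 0:
        return True  # malformed rubric: cannot compute a ratio
    ratio = result.points_awarded / result.points_possible
    return ratio not in (0.0, 1.0)


results = [
    RubricResult("q1", 10, 10),  # full marks: not flagged
    RubricResult("q2", 4, 10),   # partial credit: flagged
    RubricResult("q3", 0, 10),   # zero: not flagged
    RubricResult("q4", 12, 10),  # exceeds maximum: flagged
]
review_queue = [r.item_id for r in results if needs_human_review(r)]
```

Flagging every partial-credit case is deliberately conservative; a team could narrow the rule once it has measured how often the rubric grader and human graders disagree on its own data.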
Risks & Boundaries
Limitations
Possible data contamination: some source material may be in model training data.
Automated grading is limited for many symbolic and proof-like answers.
When Not To Use
As the sole grader for high-stakes symbolic or proof-like answers.
For multimodal evaluation where image-based problems are central (not fully covered).
Failure Modes
Parsing failures: models fail to output the expected "ANSWER:" delimiter and get marked incorrect.
Rubric hallucination or extra credit: GPT-4 sometimes assigns points for steps not in the rubric (Table 4).
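The parsing failure mode can be illustrated with a small extraction helper. This is a sketch under assumptions: it takes the delimiter to be the literal text "ANSWER:" followed by the answer on the same line, and the function name `extract_answer` is hypothetical. The paper's actual parsing rules may differ.

```python
import re
from typing import Optional

# Assumed format: the final answer follows "ANSWER:" on the same line.
# Matching is case-insensitive to tolerate minor formatting drift.
ANSWER_RE = re.compile(r"ANSWER:\s*(.+)", re.IGNORECASE)


def extract_answer(response: str) -> Optional[str]:
    """Return the text after the last ANSWER: delimiter, or None if absent."""
    matches = ANSWER_RE.findall(response)
    if not matches:
        # No delimiter found: report a parse failure rather than
        # silently treating the response as a wrong answer.
        return None
    return matches[-1].strip()


parsed = extract_answer("Integrate by parts twice.\nANSWER: 42\n")
missing = extract_answer("The result is 42.")
```

Returning `None` keeps "the model was wrong" separate from "the model did not follow the format", so the two error rates can be reported independently.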

