Overview
Production Readiness
1
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
14
Why It Matters For Business
ARB exposes gaps in LLM symbolic and proof reasoning; companies should benchmark high-stakes systems on ARB-like items before relying on automation.
Summary TLDR
ARB is a new benchmark of graduate- and professional-level problems across mathematics, physics, MCAT science/reading, and U.S. law. It focuses on short-answer and open-response problems that are hard to grade automatically. Evaluations show that state-of-the-art LLMs (GPT-4, Claude, GPT-3.5) do well on multiple-choice law/MCAT items but perform poorly on symbolic and proof-like quantitative tasks (e.g., GPT-4: 18% math-symbolic, 28% physics-symbolic). The paper also proposes LLM-generated rubrics and self-evaluation; GPT-4 rubric scores correlate well with human graders (correlations 0.78–0.91), but GPT-4 still over- or under-assigns partial credit in many cases.
Problem Statement
Existing benchmarks are getting too easy for large models. ARB aims to test expert-level reasoning by collecting graduate and professional problems in math, physics, MCAT science/reading, and law. The dataset emphasizes short-answer and open-response formats that require true symbolic reasoning or manual grading.
Main Contribution
ARB benchmark: hundreds of graduate/professional problems across math, physics, MCAT, and U.S. law, focused on short-answer and open-response items.
Evaluation of multiple public LLMs (GPT-4, gpt-3.5-turbo, text-davinci-003, claude-v1.3-100k) on ARB with standardized prompts and parsing rules.
A rubric-based, LLM-generated self-evaluation pipeline where GPT-4 generates rubrics from reference solutions and grades model reasoning; human comparison shows moderate-to-high agreement.
Key Findings
Top LLMs score very low on symbolic quantitative tasks (e.g., GPT-4: 18% math-symbolic, 28% physics-symbolic).
Multiple-choice performance is high while short-answer numeric/symbolic performance is low.
Rubric-based self-evaluation correlates strongly with human grading (correlations 0.78–0.91).
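To make the rubric-versus-human comparison concrete, here is a minimal sketch of a Pearson correlation between two graders' scores. It assumes the reported correlations are Pearson coefficients, which this summary does not state. The function name `pearson` and the score lists are hypothetical and are not data from the paper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)

# Hypothetical scores (out of 10) for five responses; illustrative only.
human_scores = [2, 5, 7, 9, 10]
rubric_scores = [3, 5, 6, 9, 10]
r = pearson(human_scores, rubric_scores)
print(f"rubric vs human correlation: {r:.2f}")
```

A high coefficient only says the two graders rank responses similarly. It can coexist with systematic over- or under-crediting on individual items, so per-item audits remain useful.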
Results
Accuracy
Rubric vs human correlation
Who Should Care
What To Try In 7 Days
Run your key LLM on a small ARB subset (math/physics symbolic) to measure how often it gets symbolic answers right.
Add a rubric-based self-eval pass (GPT-4 rubric generation) to flag likely partial-credit cases for human review.
Switch high-stakes checks from free-form to structured multiple-choice or tightly parsed numeric formats where possible.
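The rubric pass in the second step could feed a simple triage rule. The sketch below routes any response that received partial credit from the LLM grader, or whose score looks malformed, to a human reviewer. The `GradedResponse` type, its field names, and the rule itself are assumptions for illustration; the paper's pipeline may differ.

```python
from dataclasses import dataclass

@dataclass
class GradedResponse:
    problem_id: str
    rubric_points: float  # points awarded by the LLM grader
    max_points: float     # total points available in the rubric

def needs_human_review(item: GradedResponse) -> bool:
    """Return True for partial-credit or malformed scores."""
    if item.max_points <= 0:
        return True  # malformed rubric: a human should look at it
    fraction = item.rubric_points / item.max_points
    if fraction < 0 or fraction > 1:
        return True  # grader awarded points outside the rubric's range
    # Full credit and zero credit pass through; anything in between is flagged.
    return 0 < fraction < 1

# Hypothetical graded batch.
batch = [
    GradedResponse("math-01", 10, 10),
    GradedResponse("math-02", 6, 10),
    GradedResponse("phys-01", 0, 10),
    GradedResponse("phys-02", 12, 10),
]
review_queue = [g.problem_id for g in batch if needs_human_review(g)]
print(review_queue)
```

Flagging only partial credit keeps the review load small, but it will not catch cases where the grader wrongly awards full or zero credit. Sampling a few of those for audit is a reasonable complement.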
Reproducibility
License
- CC BY 4.0 (dataset); helper code MIT
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Possible data contamination: some source material may be in model training data.
- Automated grading is limited for many symbolic and proof-like answers.
- Rubric-based self-evaluation can assign extra or reduced credit; needs human audits.
- Benchmark omits multimodal questions for most evaluations and keeps some data behind an API.
When Not To Use
- As the sole grader for high-stakes symbolic or proof-like answers.
- For multimodal evaluation where image-based problems are central (not fully covered).
- To claim comprehensive human-like competence; ARB is narrow and expert-focused.
Failure Modes
- Parsing failures: models fail to output the expected ANSWER: delimiter and get marked incorrect.
- Rubric hallucination or extra credit: GPT-4 sometimes assigns points for steps that are not in the rubric (Table 4).
- False negatives in symbolic equivalence checks by cheaper models (GPT-3.5 misses equivalent expressions).
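The parsing failure mode is easier to spot when a grader reports a missing delimiter separately from a wrong answer. The sketch below does that for numeric answers. The regular expression, the 1% relative tolerance, and the names `extract_answer`, `numeric_match`, and `grade` are assumptions for illustration, not the paper's parsing rules.

```python
import math
import re

# Captures the rest of the line following an "ANSWER:" marker.
ANSWER_RE = re.compile(r"ANSWER:\s*(.+)", re.IGNORECASE)

def extract_answer(output):
    """Return the text after the last 'ANSWER:' marker, or None if absent."""
    matches = ANSWER_RE.findall(output)
    if not matches:
        return None
    return matches[-1].strip()

def numeric_match(predicted, reference, rel_tol=1e-2):
    """Compare two numeric strings within a relative tolerance (assumed 1%)."""
    try:
        return math.isclose(float(predicted), float(reference), rel_tol=rel_tol)
    except (TypeError, ValueError):
        return False  # non-numeric text cannot match a numeric reference

def grade(output, reference):
    """Return 'correct', 'incorrect', or 'unparsed' for one model output."""
    answer = extract_answer(output)
    if answer is None:
        return "unparsed"  # delimiter missing: a format failure, not a wrong answer
    return "correct" if numeric_match(answer, reference) else "incorrect"

print(grade("Integrating gives pi.\nANSWER: 3.14", "3.1416"))
print(grade("The result is 3.14", "3.14"))
```

Tracking the `unparsed` rate alongside accuracy shows how much of a low score comes from formatting rather than reasoning. Symbolic answers need a real equivalence check (for example, a computer algebra system), which this numeric sketch does not attempt.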
Core Entities
Models
- gpt-4-0314
- gpt-3.5-turbo-0301
- text-davinci-003
- claude-v1.3-100k
Metrics
- Accuracy
- symbolic_equivalence
- rubric_score
- correlation
Datasets
- ARB
Benchmarks
- GSM8K
- MATH
- MMLU

