A tough, multi-domain benchmark (math, physics, biology, chemistry, law) that reveals large LLM gaps and tests rubric-based self-evaluation

July 25, 2023 · 6 min

Overview

Production Readiness

1

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

14

Authors

Tomohiro Sawada, Daniel Paleka, Alexander Havrilla, Pranav Tadepalli, Paula Vidas, Alexander Kranias, John J. Nay, Kshitij Gupta, Aran Komatsuzaki

Why It Matters For Business

ARB exposes gaps in LLM symbolic and proof reasoning; companies should benchmark high-stakes systems on ARB-like items before relying on automation.

Summary TLDR

ARB is a new benchmark of graduate- and professional-level problems across mathematics, physics, MCAT science/reading, and U.S. law. It focuses on short-answer and open-response problems that are hard to grade automatically. Evaluations show state-of-the-art LLMs (GPT-4, Claude, GPT-3.5) do well on multiple-choice law/MCAT items but perform poorly on symbolic and proof-like quantitative tasks (e.g., GPT-4: 18% math-symbolic, 28% physics-symbolic). The paper also proposes LLM-generated rubrics and self-evaluation; GPT-4 rubric scores correlate well with human graders (correlations 0.78–0.91) but still over/under-assign partial credit in many cases.

Problem Statement

Existing benchmarks are getting too easy for large models. ARB aims to test expert-level reasoning by collecting graduate and professional problems in math, physics, MCAT science/reading, and law. The dataset emphasizes short-answer and open-response formats that require true symbolic reasoning or manual grading.

Main Contribution

ARB benchmark: hundreds of graduate/professional problems across math, physics, MCAT, and U.S. law, focused on short-answer and open-response items.

Evaluation of multiple public LLMs (GPT-4, gpt-3.5-turbo, text-davinci-003, claude-v1.3-100k) on ARB with standardized prompts and parsing rules.

A rubric-based, LLM-generated self-evaluation pipeline where GPT-4 generates rubrics from reference solutions and grades model reasoning; human comparison shows moderate-to-high agreement.
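The rubric pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the prompt templates, the `SCORE:` output convention, the `grade_with_rubric` name, and the `llm` callable (any function mapping a prompt string to a reply string) are all assumptions made for the example.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt templates; the paper's actual prompts are not reproduced here.
RUBRIC_PROMPT = (
    "Write a grading rubric worth {max_points} points for this problem.\n"
    "Problem: {problem}\nReference solution: {reference}\n"
)
GRADE_PROMPT = (
    "Grade the attempt against the rubric. End with 'SCORE: <integer>'.\n"
    "Rubric: {rubric}\nAttempt: {attempt}\n"
)


def grade_with_rubric(
    llm: Callable[[str], str],
    problem: str,
    reference: str,
    attempt: str,
    max_points: int = 10,
) -> Optional[int]:
    """Generate a rubric from the reference solution, then grade the attempt.

    Returns the parsed score, or None if the grader reply has no SCORE line.
    """
    # Step 1: ask the model for a rubric derived from the reference solution.
    rubric = llm(
        RUBRIC_PROMPT.format(
            max_points=max_points, problem=problem, reference=reference
        )
    )
    # Step 2: ask the model to grade the attempt against that rubric.
    reply = llm(GRADE_PROMPT.format(rubric=rubric, attempt=attempt))
    match = re.search(r"SCORE:\s*(\d+)", reply)
    if match is None:
        return None
    # Clamp to the rubric maximum as a guard against extra credit.
    return min(int(match.group(1)), max_points)
```

Passing the model as a plain callable keeps the sketch independent of any particular API client; clamping the score is one simple guard against a grader awarding more points than the rubric allows.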

Key Findings

Top LLMs score very low on symbolic quantitative tasks.

Numbers: GPT-4: math-symbolic 18%, physics-symbolic 28% (Table 2)

Multiple-choice performance is high while short-answer numeric/symbolic performance is low.

Numbers: GPT-3.5 failed to output parsable answers on ~25% of Law items; other models failed on <5%, and GPT-4 outputs parsed >99% of the time (Section 4)

Rubric-based self-evaluation correlates strongly with human grading.

Numbers: Correlation: physics 0.91, math 0.78, proof-like 0.82 (Table 5)
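To run the same kind of agreement check on your own grading data, a correlation between rubric scores and human scores can be computed as below. This sketch assumes the Pearson coefficient; the summary does not say which coefficient the paper reports. The `pearson` helper is written for illustration.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)
```

Call it with one list of model-assigned rubric scores and one list of human scores for the same items, in the same order.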

Results

Accuracy (math-symbolic): GPT-4 18%

Accuracy (physics-symbolic): GPT-4 28%

Rubric vs human correlation: physics 0.91, math 0.78, proof-like 0.82

Accuracy: 0.67 physics, 0.76 math

What To Try In 7 Days

Run your key LLM on a small ARB subset (math/physics symbolic) to measure real-world symbolic weakness.

Add a rubric-based self-eval pass (GPT-4 rubric generation) to flag likely partial-credit cases for human review.

Switch high-stakes checks from free-form to structured multiple-choice or tightly parsed numeric formats where possible.
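The second step above, flagging likely partial-credit cases for human review, could look something like this. The `GradedItem` record and the rule "flag anything strictly between zero and full marks" are assumptions made for illustration; they only catch strictly partial scores, so full-credit and zero-credit grades still need periodic spot checks.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GradedItem:
    """Hypothetical record for one rubric-graded answer."""
    item_id: str
    rubric_score: int
    max_points: int


def needs_human_review(item: GradedItem) -> bool:
    # Partial credit is where rubric grading is least reliable,
    # so route anything strictly between 0 and full marks to a person.
    return 0 < item.rubric_score < item.max_points


def triage(items: List[GradedItem]) -> Tuple[List[GradedItem], List[GradedItem]]:
    """Split items into (flagged for review, accepted automatically)."""
    flagged = [i for i in items if needs_human_review(i)]
    accepted = [i for i in items if not needs_human_review(i)]
    return flagged, accepted
```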

Reproducibility

License

  • CC BY 4.0 (dataset); helper code MIT

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Possible data contamination: some source material may be in model training data.
  • Automated grading is limited for many symbolic and proof-like answers.
  • Rubric-based self-evaluation can assign extra or reduced credit; needs human audits.
  • Benchmark omits multimodal questions for most evaluations and keeps some data behind an API.

When Not To Use

  • As the sole grader for high-stakes symbolic or proof-like answers.
  • For multimodal evaluation where image-based problems are central (not fully covered).
  • To claim comprehensive human-like competence — ARB is narrow and expert-focused.

Failure Modes

  • Parsing failures: models fail to output the expected "ANSWER:" delimiter and are marked incorrect.
  • Rubric hallucination or extra credit: GPT-4 sometimes assigns points for steps not in rubric (Table 4).
  • False negatives in symbolic equivalence checks by cheaper models (GPT-3.5 misses equivalent expressions).
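The parsing failure mode is easier to see with a small sketch that keeps "unparsable" separate from "incorrect". Only the ANSWER: delimiter comes from the source; the regular expression, the function names, and the 1% relative tolerance are assumptions chosen for this example, and the check handles plain numeric answers only, not symbolic expressions.

```python
import math
import re
from typing import Optional

# Captures the rest of the line after the delimiter.
ANSWER_RE = re.compile(r"ANSWER:\s*(.+)")


def parse_answer(output: str) -> Optional[str]:
    """Return the text after the last 'ANSWER:' delimiter, or None if absent."""
    matches = ANSWER_RE.findall(output)
    if not matches:
        return None
    return matches[-1].strip()


def numeric_match(predicted: str, expected: float, rel_tol: float = 1e-2) -> bool:
    """Compare a parsed answer to the expected number within a relative tolerance."""
    try:
        return math.isclose(float(predicted), expected, rel_tol=rel_tol)
    except ValueError:
        # The answer was not a bare number (for example, it carried units).
        return False


def score(output: str, expected: float) -> str:
    """Classify a model output as 'correct', 'incorrect', or 'unparsable'."""
    answer = parse_answer(output)
    if answer is None:
        return "unparsable"
    return "correct" if numeric_match(answer, expected) else "incorrect"
```

Reporting the unparsable count on its own makes it possible to tell formatting failures apart from reasoning failures, a distinction that is lost when both are simply marked incorrect.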

Core Entities

Models

  • gpt-4-0314
  • gpt-3.5-turbo-0301
  • text-davinci-003
  • claude-v1.3-100k

Metrics

  • Accuracy
  • symbolic_equivalence
  • rubric_score
  • correlation

Datasets

  • ARB

Benchmarks

  • GSM8K
  • MATH
  • MMLU