LEXAM — 340 real law exams, 4.9k questions, and an expert-validated LLM judge for legal reasoning

May 19, 2025 · 8 min

Overview

Production Readiness

0.35

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

0

Authors

Yu Fan, Jingwei Ni, Jakob Merane, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, Etienne Salimbeni, Florian Geering, Oliver Dreyer, Daniel Brunner, Markus Leippold, Mrinmaya Sachan, Alexander Stremitzer, Christoph Engel, Elliott Ash, Joel Niklaus

Why It Matters For Business

LEXAM exposes where LLMs still fail on long-form, high-stakes legal tasks and provides a validated, scalable judge for grading open answers. Both are useful for vendor selection, risk assessments, and controlled pilots of legal AI tools.

Summary TLDR

LEXAM is a multilingual benchmark built from 340 real law school exams to test long-form legal reasoning. The authors assemble thousands of questions (paper reports 4,886 total; detailed breakdown shows 2,841 open questions and a converted set of 1,660 MCQs), provide professor-written reference answers and stepwise guidance, and validate an ensemble “LLM-as-a-judge” that matches human experts. Results show top reasoning models reach ~70% on judged open answers and ~63% on MCQs but struggle on multi-step, structured legal reasoning and on perturbed MCQs with many distractors. Code and data are public.

Problem Statement

Current LLM benchmarks emphasize final-answer accuracy and STEM-style checks. Legal reasoning needs long-form, stepwise evaluation and reliable judges. The paper builds a dataset and an evaluation pipeline to test process-based legal reasoning and to validate whether LLM judges can substitute human experts.

Main Contribution

LEXAM dataset: 340 law-school exams with professor-written solutions, in English and German, with fine-grained metadata.

Evaluation pipelines for open-ended and MCQ formats, including an expert-tuned LLM judge ensemble and human validation.

Baselines across 36 LLMs showing SOTA models still struggle on multi-step legal reasoning and perturbed MCQs.

Open release of code and data (project page, GitHub, Hugging Face) for reproducible evaluation.

Key Findings

Top reasoning models score substantially higher than others on long-form questions.

Numbers: GPT-5 judge score 70.20 (±0.41); Gemini-2.5-Pro 67.40 (Table 1)

MCQ accuracy is lower than judged open-answer scores and drops with more distractors.

Numbers: GPT-5 MCQ acc. 62.65% (±1.17); Gemini drops 68.61%→35.62% from 4→32 choices (Table 11, Table 2)

An ensemble LLM judge matches or exceeds human experts in consistency tests.

Numbers: Alt-test winning rate ω = 1.00; expert Pearson r = 0.70, κ = 0.49, MAE = 1.95 (Section 5, Table 3)

Models perform worse on German and Switzerland-specific questions.

Numbers: English > German performance gap across model groups (Figure 4, Section 4.1)

Results

Open question judged score

Value: GPT-5: 70.20 (±0.41) by ensemble judge

MCQ accuracy

Value: GPT-5: 62.65% (±1.17)

Baseline: random ≈25%
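
The ± figures reported with these scores are bootstrap standard errors (see Metrics). As a minimal sketch of how such an error bar can be obtained for an accuracy number, the following resamples per-question outcomes with replacement. The function name, the number of resamples, and the outcome list are illustrative assumptions, not the paper's code or data.

```python
import random
import statistics


def bootstrap_standard_error(correct, n_resamples=1000, seed=0):
    """Estimate the standard error of mean accuracy by resampling items with replacement."""
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Draw n items with replacement and record the accuracy of the resample.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    # The spread of the resampled accuracies approximates the standard error.
    return statistics.stdev(means)


# Hypothetical per-question outcomes (1 = correct, 0 = wrong); not LEXAM data.
outcomes = [1] * 63 + [0] * 37
accuracy = sum(outcomes) / len(outcomes)
se = bootstrap_standard_error(outcomes)
print(f"accuracy = {accuracy:.2%}, bootstrap SE = {se:.2%}")
```

With only 100 hypothetical items the error bar is several points wide; larger question sets shrink it roughly with the square root of the item count.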

Judge validation (Alt-test)

Value: Ensemble winning rate ω = 1.00; advantage prob. ρ up to 0.76

Baseline: human experts (3 annotators)
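
The Alt-test compares the LLM judge with each human annotator in turn, asking which of the two agrees better with the remaining annotators. This summary does not spell out the procedure, so the following is a simplified sketch under stated assumptions: agreement is measured as mean absolute distance to the other annotators, a "win" is decided by a point-estimate comparison with a hypothetical tolerance `epsilon` (no significance test), and all scores are made up. The function name `alt_test_sketch` is illustrative.

```python
def alt_test_sketch(human_scores, llm_scores, epsilon=0.1):
    """Simplified leave-one-annotator-out comparison (assumption: no hypothesis test).

    human_scores: one list of item scores per annotator.
    llm_scores:   the judge's scores for the same items.
    Returns (winning_rate, advantage_probs), one advantage probability per annotator.
    """
    n_annotators = len(human_scores)
    n_items = len(llm_scores)
    advantage_probs = []
    wins = 0
    for j in range(n_annotators):
        others = [h for k, h in enumerate(human_scores) if k != j]
        llm_better = 0
        human_better = 0
        for i in range(n_items):
            # Distance of the judge and of annotator j to the remaining annotators.
            llm_dist = sum(abs(llm_scores[i] - h[i]) for h in others) / len(others)
            human_dist = sum(abs(human_scores[j][i] - h[i]) for h in others) / len(others)
            llm_better += llm_dist <= human_dist
            human_better += human_dist <= llm_dist
        rho_llm = llm_better / n_items      # how often the judge is at least as close
        rho_human = human_better / n_items  # how often annotator j is at least as close
        advantage_probs.append(rho_llm)
        wins += rho_llm > rho_human - epsilon
    return wins / n_annotators, advantage_probs


# Hypothetical grades from three annotators and one judge on five answers.
humans = [[6, 4, 8, 3, 7], [5, 4, 9, 2, 7], [7, 5, 8, 3, 6]]
judge = [6, 4, 9, 3, 7]
winning_rate, advantage_probs = alt_test_sketch(humans, judge)
print(winning_rate, advantage_probs)
```

A winning rate of 1.00, as reported above, would mean the judge wins this comparison against every annotator; a real validation should use a proper statistical test rather than the point comparison shown here.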

Robustness to distractors

Value: Gemini-2.5-Pro: 68.61% → 35.62% accuracy (4→32 choices)
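
When the number of choices grows from 4 to 32, the random-guess baseline also falls, from 1/4 to 1/32. One way to compare the two settings on a common footing is a chance-corrected score, (accuracy − 1/k) / (1 − 1/k). This normalization is an illustration added here, not a metric from the paper, and it assumes exactly one correct option per question.

```python
def chance_corrected(accuracy, n_choices):
    """Rescale accuracy so that random guessing maps to 0 and a perfect score to 1."""
    chance = 1.0 / n_choices
    return (accuracy - chance) / (1.0 - chance)


# Accuracies quoted above for Gemini-2.5-Pro at 4 and 32 choices.
reported = {4: 0.6861, 32: 0.3562}
corrected = {k: chance_corrected(acc, k) for k, acc in reported.items()}
for k in reported:
    print(f"{k:>2} choices: chance = {1 / k:.4f}, corrected = {corrected[k]:.4f}")
```

Under this rescaling the drop remains large, which suggests the decline is not explained by the lower chance baseline alone.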

What To Try In 7 Days

Run your top models on a LEXAM subset to spot gaps in process reasoning (open answers).

Validate an LLM-judge ensemble on 50 expert-annotated items using the Alt-test before automating grading.

Stress-test MCQ-based workflows by adding distractor-heavy versions (8–32 choices) to detect guessing strategies.
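
The distractor stress test in the last item can be sketched as follows: keep the correct answer, sample extra wrong options from a pool, and shuffle. How the paper builds its perturbed MCQs is not described in this summary, so the function `expand_choices`, the distractor pool, and the placeholder texts are all hypothetical.

```python
import random


def expand_choices(question, correct, distractor_pool, n_choices, seed=0):
    """Build an n_choices version of an MCQ: one correct answer plus sampled distractors."""
    if n_choices - 1 > len(distractor_pool):
        raise ValueError("not enough distractors for the requested number of choices")
    rng = random.Random(seed)
    # Sample distinct distractors, add the correct answer, then shuffle the order.
    options = rng.sample(distractor_pool, n_choices - 1) + [correct]
    rng.shuffle(options)
    return {
        "question": question,
        "options": options,
        "answer_index": options.index(correct),
    }


# Placeholder content; real distractors should be plausible but wrong legal answers.
pool = [f"Distractor {i}" for i in range(1, 32)]
variants = {
    k: expand_choices("Placeholder question?", "Correct answer", pool, k, seed=k)
    for k in (4, 8, 16, 32)
}
for k, item in variants.items():
    print(k, "choices, correct option at index", item["answer_index"])
```

Running the same model on the 4-, 8-, 16- and 32-choice variants and plotting accuracy against the number of choices makes guessing behaviour visible: a model that truly knows the answer should degrade far less than the chance baseline does.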

Reproducibility

License

  • CC BY 4.0

Code Available

  • yes

Data Available

  • yes

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Dataset is Swiss-heavy and not a broad cross-jurisdiction benchmark (authors note expansion planned).
  • English and German items are not parallel translations, so language and legal differences are confounded.
  • No large-scale human performance baseline due to institutional limits; human data limited to small expert sample.
  • Inconsistencies in reported MCQ counts across sections (paper reports 2,045 MCQs in places and 1,660 elsewhere).

When Not To Use

  • Do not use LEXAM scores as proof of legal correctness for real cases or to replace lawyers.
  • Avoid deploying models scored on LEXAM directly in production without expert oversight.
  • Don't use MCQ-only evaluations as sole evidence of legal competence.

Failure Modes

  • Hallucinated or incorrect statutory citations that look plausible.
  • Poor multilingual (German) performance in smaller models, causing incoherent outputs.
  • High sensitivity to MCQ formatting and the number/order of distractors, enabling guessing.
  • Overconfident but shallow doctrinal reasoning that omits intermediate legal steps.

Core Entities

Models

  • GPT-5
  • Gemini-2.5-Pro
  • GPT-4.1
  • GPT-4o
  • Claude-4.5-Sonnet
  • DeepSeek-R1
  • Qwen3-32B
  • Llama-4-Maverick

Metrics

  • Judge score (reported on a 0–100 scale; the judge prompts use 0.0–1.0)
  • Accuracy
  • Bootstrap standard error
  • Pearson r, quadratic weighted κ, MAE for human agreement
  • Alt-test winning rate ω and advantage prob. ρ

Datasets

  • LEXAM (this paper)
  • COLIEE
  • LegalBench
  • LawBench
  • LexGLUE
  • MMLU (legal subset)

Benchmarks

  • LEXAM open questions (judge score)
  • LEXAM MCQs (accuracy)
  • MCQ perturbation (4/8/16/32 choices)