LEXAM — 340 real law exams, 4.9k questions, and an expert-validated LLM judge for legal reasoning

May 19, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

The benchmark and judge are well-documented and validated against experts, making them useful for research and controlled evaluation; they do not make LLMs production-safe for legal work without human oversight.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Yes

License: CC BY 4.0

At A Glance

Cost impact: 40%

Production readiness: 35%

Novelty: 60%

Authors

Yu Fan, Jingwei Ni, Jakob Merane, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, Etienne Salimbeni, Florian Geering, Oliver Dreyer, Daniel Brunner, Markus Leippold, Mrinmaya Sachan, Alexander Stremitzer, Christoph Engel, Elliott Ash, Joel Niklaus

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LEXAM exposes where LLMs still fail on long-form, high-stakes legal tasks and gives a validated, scalable judge to grade open answers—useful for vendor selection, risk assessments, and controlled piloting of legal AI tools.

Summary TLDR

LEXAM is a multilingual benchmark built from 340 real law school exams to test long-form legal reasoning. The authors assemble thousands of questions (paper reports 4,886 total; detailed breakdown shows 2,841 open questions and a converted set of 1,660 MCQs), provide professor-written reference answers and stepwise guidance, and validate an ensemble “LLM-as-a-judge” that matches human experts. Results show top reasoning models reach ~70% on judged open answers and ~63% on MCQs but struggle on multi-step, structured legal reasoning and on perturbed MCQs with many distractors. Code and data are public.

Problem Statement

Current LLM benchmarks emphasize final-answer accuracy and STEM-style checks. Legal reasoning needs long-form, stepwise evaluation and reliable judges. The paper builds a dataset and an evaluation pipeline to test process-based legal reasoning and to validate whether LLM judges can substitute human experts.

Main Contribution

LEXAM dataset: law-school exams (340 exams) with professor solutions, multilingual (English/German) and fine-grained metadata.

Evaluation pipelines for open-ended and MCQ formats, including an expert-tuned LLM judge ensemble and human validation.

Key Findings

Top reasoning models score substantially higher than others on long-form questions.

Numbers: GPT-5 judge score 70.20 (±0.41); Gemini-2.5-Pro 67.40 (Table 1)

Practical Use: Use reasoning-optimized models (GPT-5, Gemini-2.5-Pro) when you need deeper legal analysis; expect non-reasoning models to lag by 10–30 points on judged open answers.

Evidence Ref: Table 1

MCQ accuracy is lower than judged open-answer scores and drops with more distractors.

Numbers: GPT-5 MCQ acc. 62.65% (±1.17); Gemini drops from 68.61% to 35.62% as the number of choices grows from 4 to 32 (Table 11, Table 2)

Practical Use: Accuracy on standard MCQs with few choices can overestimate model understanding; test models under distractor-heavy setups before trusting MCQ-based performance.

Evidence Ref: Table 11; Table 2
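One way to read the distractor result is to separate the drop caused by a lower guessing floor from the drop in actual discrimination. The sketch below applies a standard chance correction, which is a swapped-in illustration and not a metric from the paper. It assumes exactly one correct option per question, uniform random guessing at 1/k for k choices, and reads the Gemini figures as 68.61% at 4 choices and 35.62% at 32 choices. The function name `chance_corrected` is hypothetical.

```python
def chance_corrected(accuracy: float, num_choices: int) -> float:
    """Rescale accuracy so random guessing maps to 0 and a perfect score to 1.

    Assumes one correct option and uniform guessing (chance = 1 / num_choices).
    """
    if num_choices < 2:
        raise ValueError("need at least two answer choices")
    chance = 1.0 / num_choices
    return (accuracy - chance) / (1.0 - chance)


# Accuracy values as read from the summary; everything else is illustrative.
for acc, k in [(0.6861, 4), (0.3562, 32)]:
    print(
        f"{k:>2} choices: raw={acc:.4f} chance={1 / k:.4f} "
        f"corrected={chance_corrected(acc, k):.4f}"
    )
```

Under these assumptions the corrected score still falls, from roughly 0.58 to roughly 0.34, so the decline is not explained by the lower guessing floor alone.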

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Open question judged score | GPT-5: 70.20 (±0.41) by ensemble judge | - | - | LEXAM open test set (2,541 test + 300 dev) | Table 1 reports ensemble-judged scores for open questions | Table 1
Accuracy | GPT-5: 62.65% (±1.17) | random ≈25% | ≈+37.6 pp vs random | LEXAM MCQs (1,660) | Table 11 | Table 11
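The values in parentheses next to each score are presumably bootstrap standard errors, since that quantity appears in the Metrics list; this is an assumption, not something the summary states. A minimal sketch of how such an error bar could be computed from per-question scores follows. The scores are synthetic, and the function name `bootstrap_se`, the resample count, and the seeds are illustrative choices rather than the authors' settings.

```python
import random
import statistics


def bootstrap_se(scores, n_resamples=2000, seed=0):
    """Estimate the standard error of the mean by resampling with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    resampled_means = [
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    ]
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(resampled_means)


# Synthetic per-question judge scores on a 0-100 scale (NOT LEXAM data).
data_rng = random.Random(42)
scores = [min(100.0, max(0.0, data_rng.gauss(70, 20))) for _ in range(500)]

print(f"mean={statistics.fmean(scores):.2f} se={bootstrap_se(scores):.2f}")
```

With a fixed seed the estimate is reproducible, and for a simple mean it should land close to the analytic value of the sample standard deviation divided by the square root of the sample size.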

What To Try In 7 Days

Run your top models on a LEXAM subset to spot gaps in process reasoning (open answers).

Validate an LLM-judge ensemble on 50 expert-annotated items using the Alt-test before automating grading.

Stress-test MCQ-based workflows by adding distractor-heavy versions (8–32 choices) to detect guessing strategies.
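For the judge-validation step, a first pass can compare judge scores with expert grades using the simpler agreement metrics named under Metrics: Pearson r, mean absolute error, and quadratic weighted κ. The sketch below is not the Alt-test and does not reproduce the authors' pipeline. The grades are made-up integers on a hypothetical 0 to 10 scale, and all function names are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def mean_abs_error(x, y):
    """Average absolute gap between paired scores."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def quadratic_weighted_kappa(x, y, num_levels):
    """Cohen's kappa with quadratic weights for integer ratings in [0, num_levels)."""
    n = len(x)
    observed = [[0.0] * num_levels for _ in range(num_levels)]
    for a, b in zip(x, y):
        observed[a][b] += 1
    hist_x = [sum(row) for row in observed]
    hist_y = [sum(observed[i][j] for i in range(num_levels)) for j in range(num_levels)]
    disagreement = expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            disagreement += weight * observed[i][j]
            expected += weight * hist_x[i] * hist_y[j] / n
    return 1.0 - disagreement / expected


# Hypothetical expert and judge grades on a 0-10 integer scale (not LEXAM data).
expert = [8, 6, 9, 3, 7, 5, 10, 2, 6, 4]
judge = [7, 6, 9, 4, 8, 5, 9, 3, 5, 4]

print(f"pearson_r={pearson_r(expert, judge):.3f}")
print(f"mae={mean_abs_error(expert, judge):.2f}")
print(f"qwk={quadratic_weighted_kappa(expert, judge, num_levels=11):.3f}")
```

If judge scores are continuous, they would need to be binned to integer levels before computing κ; the bin width is a design choice that affects the result.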

Risks & Boundaries

Limitations

Dataset is Swiss-heavy and not a broad cross-jurisdiction benchmark (authors note expansion planned).

English and German items are not parallel translations, so language and legal differences are confounded.

When Not To Use

Do not use LEXAM scores as proof of legal correctness for real cases or to replace lawyers.

Avoid deploying models scored on LEXAM directly in production without expert oversight.

Failure Modes

Hallucinated or incorrect statutory citations that look plausible.

Poor multilingual (German) performance in smaller models, causing incoherent outputs.

Core Entities

Models

GPT-5, Gemini-2.5-Pro, GPT-4.1, GPT-4o, Claude-4.5-Sonnet, DeepSeek-R1, Qwen3-32B, Llama-4-Maverick

Metrics

Judge score (0–100 scaled 0.0–1.0 in prompts)

Accuracy

Bootstrap standard error

Pearson r, quadratic weighted κ, MAE for human agreement

Alt-test winning rate ω and advantage prob. ρ

Datasets

LEXAM (this paper), COLIEE, LegalBench, LawBench, LexGLUE, MMLU (legal subset)

Benchmarks

LEXAM open questions (judge score)

Accuracy

MCQ perturbation (4/8/16/32 choices)