LEXAM — 340 real law exams, 4.9k questions, and an expert-validated LLM judge for legal reasoning

Overview

Decision Snapshot: Needs Validation

The benchmark and judge are well-documented and validated against experts, making them useful for research and controlled evaluation; they do not make LLMs production-safe for legal work without human oversight.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Yes

License: CC BY 4.0

At A Glance

Cost impact: 40%

Production readiness: 35%

Novelty: 60%

Authors

Yu Fan, Jingwei Ni, Jakob Merane, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, Etienne Salimbeni, Florian Geering, Oliver Dreyer, Daniel Brunner, Markus Leippold, Mrinmaya Sachan, Alexander Stremitzer, Christoph Engel, Elliott Ash, Joel Niklaus

Why It Matters For Business

LEXAM exposes where LLMs still fail on long-form, high-stakes legal tasks and provides a validated, scalable judge for grading open answers. Both are useful for vendor selection, risk assessments, and controlled piloting of legal AI tools.

Who Should Care

Product Manager, ML Engineer, CTO, Founder

Summary TLDR

LEXAM is a multilingual benchmark built from 340 real law school exams to test long-form legal reasoning. The authors assemble thousands of questions (paper reports 4,886 total; detailed breakdown shows 2,841 open questions and a converted set of 1,660 MCQs), provide professor-written reference answers and stepwise guidance, and validate an ensemble “LLM-as-a-judge” that matches human experts. Results show top reasoning models reach ~70% on judged open answers and ~63% on MCQs but struggle on multi-step, structured legal reasoning and on perturbed MCQs with many distractors. Code and data are public.

Problem Statement

Current LLM benchmarks emphasize final-answer accuracy and STEM-style checks. Legal reasoning needs long-form, stepwise evaluation and reliable judges. The paper builds a dataset and an evaluation pipeline to test process-based legal reasoning and to validate whether LLM judges can substitute for human experts.

Main Contribution

LEXAM dataset: 340 law-school exams with professor-written solutions, in English and German, with fine-grained metadata.

Evaluation pipelines for open-ended and MCQ formats, including an expert-tuned LLM judge ensemble and human validation.

Key Findings

Top reasoning models score substantially higher than others on long-form questions.

Numbers: GPT-5 judge score 70.20 (±0.41); Gemini-2.5-Pro 67.40 (Table 1)

Practical Use: Use reasoning-optimized models (GPT-5, Gemini-2.5-Pro) when you need deeper legal analysis; expect non-reasoning models to lag by 10–30 points on judged open answers.

Evidence Ref: Table 1

MCQ accuracy is lower than judged open-answer scores and drops with more distractors.

Numbers: GPT-5 MCQ acc. 62.65% (±1.17); Gemini drops 68.61%→35.62% from 4→32 choices (Table 11, Table 2)

Practical Use: MCQ accuracy overestimates model understanding; test models under distractor-heavy setups before trusting MCQ-based performance.

Evidence Ref: Table 11; Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Open question judged score	GPT-5: 70.20 (±0.41) by ensemble judge	—	—	LEXAM open test set (2,541 test + 300 dev)	Table 1 reports ensemble-judged scores for open questions	Table 1
Accuracy	GPT-5: 62.65% (±1.17)	random ≈25%	≈+37.6 pp vs random	LEXAM MCQs (1,660)	Table 11	Table 11
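The "Delta" column above is a margin over uniform random guessing. The following is a minimal sketch of that arithmetic, assuming each MCQ has exactly one correct option (an assumption, not something stated in this summary); the function names are hypothetical and the accuracy figures are the ones quoted in the Key Findings.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy in percent of uniform random guessing.

    Assumes exactly one correct option per question.
    """
    if num_choices < 2:
        raise ValueError("an MCQ needs at least two choices")
    return 100.0 / num_choices


def delta_vs_chance(accuracy_pct: float, num_choices: int) -> float:
    """Margin over random guessing, in percentage points."""
    return accuracy_pct - chance_accuracy(num_choices)


# Accuracy figures quoted in this summary (Table 11, Table 2).
gpt5_margin = delta_vs_chance(62.65, 4)        # 4-choice MCQ set
gemini_margin_4 = delta_vs_chance(68.61, 4)    # Gemini, 4 choices
gemini_margin_32 = delta_vs_chance(35.62, 32)  # Gemini, 32 choices

for label, value in [
    ("GPT-5, 4 choices", gpt5_margin),
    ("Gemini, 4 choices", gemini_margin_4),
    ("Gemini, 32 choices", gemini_margin_32),
]:
    print(f"{label}: {value:+.2f} pp over chance")
```

Reporting the margin next to the raw accuracy keeps comparisons honest when the number of choices changes, since the chance level falls from 25% at 4 choices to about 3.1% at 32.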

What To Try In 7 Days

Run your top models on a LEXAM subset to spot gaps in process reasoning (open answers).

Validate an LLM-judge ensemble on 50 expert-annotated items using the Alt-test before automating grading.

Stress-test MCQ-based workflows by adding distractor-heavy versions (8–32 choices) to detect guessing strategies.
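The distractor stress test in the last item could be prototyped as follows. This is only a sketch under stated assumptions: the `MCQ` type, the `expand_choices` helper, and the idea of drawing extra distractors from a pool of wrong options are all hypothetical, and the paper's own perturbation procedure may differ.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MCQ:
    question: str
    choices: tuple[str, ...]
    answer_index: int  # position of the single correct choice


def expand_choices(item: MCQ, distractor_pool: list[str],
                   num_choices: int, rng: random.Random) -> MCQ:
    """Return a copy of `item` padded with extra distractors and reshuffled.

    Assumes the pool contains only wrong answers for this question,
    which the caller must guarantee.
    """
    needed = num_choices - len(item.choices)
    if needed < 0:
        raise ValueError("num_choices is smaller than the existing choice count")
    existing = set(item.choices)
    # Deduplicate the pool and drop anything already offered as a choice.
    candidates = [d for d in dict.fromkeys(distractor_pool) if d not in existing]
    if needed > len(candidates):
        raise ValueError("not enough unique distractors in the pool")
    correct = item.choices[item.answer_index]
    new_choices = list(item.choices) + rng.sample(candidates, needed)
    rng.shuffle(new_choices)
    return MCQ(item.question, tuple(new_choices), new_choices.index(correct))


# Toy item and pool, invented purely for illustration.
base = MCQ("Which option is correct?", ("A", "B", "C", "D"), answer_index=2)
pool = [f"distractor-{i}" for i in range(40)]
rng = random.Random(0)
variants = {k: expand_choices(base, pool, k, rng) for k in (8, 16, 32)}
```

Scoring the same model on the 4-, 8-, 16-, and 32-choice variants of each item then shows how quickly accuracy degrades as guessing becomes less likely to succeed.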

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: CC BY 4.0

Code URLs

https://lexam-benchmark.github.io/

https://huggingface.co/ (dataset release referenced in paper)

Data URLs

https://lexam-benchmark.github.io/

https://huggingface.co/ (dataset release referenced in paper)

Risks & Boundaries

Limitations

Dataset is Swiss-heavy and not a broad cross-jurisdiction benchmark (authors note expansion planned).

English and German items are not parallel translations, so language and legal differences are confounded.

When Not To Use

Do not use LEXAM scores as proof of legal correctness for real cases or to replace lawyers.

Avoid deploying models scored on LEXAM directly in production without expert oversight.

Failure Modes

Hallucinated or incorrect statutory citations that look plausible.

Poor multilingual (German) performance in smaller models, causing incoherent outputs.

Core Entities

Models

GPT-5, Gemini-2.5-Pro, GPT-4.1, GPT-4o, Claude-4.5-Sonnet, DeepSeek-R1, Qwen3-32B, Llama-4-Maverick

Metrics

Judge score (0–100 scaled 0.0–1.0 in prompts)

Accuracy

Bootstrap standard error

Pearson r, quadratic weighted κ, MAE for human agreement

Alt-test winning rate ω and advantage prob. ρ
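For readers who want to reproduce the human-agreement metrics on their own annotations, here is a minimal standard-library sketch of Pearson r, MAE, quadratic weighted κ, and a bootstrap standard error of the mean. It assumes integer grades on a small fixed scale; the function names and the toy grades are made up for illustration and are not taken from the paper. The Alt-test is not covered here.

```python
import math
import random
from collections import Counter


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def mean_absolute_error(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def quadratic_weighted_kappa(a, b, num_levels):
    """QWK for integer grades in 0..num_levels-1 (assumed grading scale)."""
    n = len(a)
    observed = Counter(zip(a, b))
    hist_a, hist_b = Counter(a), Counter(b)
    num = den = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            w = (i - j) ** 2 / (num_levels - 1) ** 2
            num += w * observed[(i, j)] / n              # observed disagreement
            den += w * hist_a[i] * hist_b[j] / (n * n)   # expected by chance
    return 1.0 - num / den


def bootstrap_standard_error(scores, num_resamples=1000, seed=0):
    """Standard deviation of the mean over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    means = [sum(rng.choices(scores, k=n)) / n for _ in range(num_resamples)]
    mu = sum(means) / num_resamples
    return math.sqrt(sum((m - mu) ** 2 for m in means) / (num_resamples - 1))


# Toy grades on a 0-4 scale, invented for illustration only.
human = [0, 1, 2, 3, 4, 2, 3, 1]
judge = [0, 1, 2, 4, 4, 2, 2, 1]
print("Pearson r:", round(pearson_r(human, judge), 3))
print("MAE:", mean_absolute_error(human, judge))
print("QWK:", round(quadratic_weighted_kappa(human, judge, 5), 3))
print("Bootstrap SE of judge mean:", round(bootstrap_standard_error(judge), 3))
```

Quadratic weighting penalizes a two-grade disagreement four times as much as a one-grade disagreement, which suits ordinal grading scales better than unweighted κ.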

Datasets

LEXAM (this paper), COLIEE, LegalBench, LawBench, LexGLUE, MMLU (legal subset)

Benchmarks

LEXAM open questions (judge score)

Accuracy

MCQ perturbation (4/8/16/32 choices)
