Overview
Production Readiness
0.35
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
0
Why It Matters For Business
LEXAM exposes where LLMs still fail on long-form, high-stakes legal tasks and provides a validated, scalable judge for grading open answers. Both are useful for vendor selection, risk assessments, and controlled piloting of legal AI tools.
Summary TLDR
LEXAM is a multilingual benchmark built from 340 real law school exams to test long-form legal reasoning. The authors assemble thousands of questions (paper reports 4,886 total; detailed breakdown shows 2,841 open questions and a converted set of 1,660 MCQs), provide professor-written reference answers and stepwise guidance, and validate an ensemble “LLM-as-a-judge” that matches human experts. Results show top reasoning models reach ~70% on judged open answers and ~63% on MCQs but struggle on multi-step, structured legal reasoning and on perturbed MCQs with many distractors. Code and data are public.
Problem Statement
Current LLM benchmarks emphasize final-answer accuracy and STEM-style checks. Legal reasoning needs long-form, stepwise evaluation and reliable judges. The paper builds a dataset and an evaluation pipeline to test process-based legal reasoning and to validate whether LLM judges can substitute human experts.
Main Contribution
LEXAM dataset: law-school exams (340 exams) with professor solutions, multilingual (English/German) and fine-grained metadata.
Evaluation pipelines for open-ended and MCQ formats, including an expert-tuned LLM judge ensemble and human validation.
Baselines across 36 LLMs showing SOTA models still struggle on multi-step legal reasoning and perturbed MCQs.
Open release of code and data (project page, GitHub, Hugging Face) for reproducible evaluation.
Key Findings
Top reasoning models score substantially higher than others on long-form questions.
MCQ accuracy is lower than judged open-answer scores (~63% vs. ~70% for top reasoning models) and drops as more distractors are added.
An ensemble LLM judge matches or exceeds human experts in consistency tests.
Models perform worse on German and Switzerland-specific questions.
Results
Open question judged score
Accuracy
Judge validation (Alt-test)
Robustness to distractors
Who Should Care
What To Try In 7 Days
Run your top models on a LEXAM subset to spot gaps in process reasoning (open answers).
Validate an LLM-judge ensemble on 50 expert-annotated items using the Alt-test before automating grading.
Stress-test MCQ-based workflows by adding distractor-heavy versions (8–32 choices) to detect guessing strategies.
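The distractor stress test in the last item can be sketched roughly as follows. This is a minimal illustration, not the paper's perturbation pipeline: the question dictionary layout (`stem`, `choices`, `answer_index`), the function name `expand_choices`, and the distractor pool are all hypothetical, and it assumes every option in a question is distinct.

```python
import random


def expand_choices(question, distractor_pool, n_choices, seed=0):
    """Return a copy of an MCQ padded to n_choices options and reshuffled.

    `question` is a hypothetical dict with keys "stem", "choices" and
    "answer_index"; this layout is an assumption, not LEXAM's schema.
    """
    rng = random.Random(seed)
    existing = list(question["choices"])
    correct = existing[question["answer_index"]]

    # Keep only pool entries that are distinct and not already an option.
    candidates = [d for d in dict.fromkeys(distractor_pool) if d not in existing]

    needed = n_choices - len(existing)
    if needed < 0:
        raise ValueError("n_choices is smaller than the current option count")
    if needed > len(candidates):
        raise ValueError("not enough distinct distractors in the pool")

    choices = existing + rng.sample(candidates, needed)
    rng.shuffle(choices)  # avoid leaking the answer through its position
    return {
        "stem": question["stem"],
        "choices": choices,
        "answer_index": choices.index(correct),
    }


# Illustrative usage with made-up content.
base = {
    "stem": "Hypothetical question stem",
    "choices": ["option A", "option B", "option C", "option D"],
    "answer_index": 2,
}
pool = [f"distractor {i}" for i in range(40)]
variants = {k: expand_choices(base, pool, k, seed=k) for k in (8, 16, 32)}
```

Scoring the same model on each variant and comparing accuracy across 4, 8, 16, and 32 options gives a quick view of how much of its MCQ performance depends on a small answer space. In practice the distractors should be plausible for the question's legal domain, since random filler is easy to eliminate.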
Reproducibility
License
- CC BY 4.0
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Dataset is Swiss-heavy and not a broad cross-jurisdiction benchmark (authors note expansion planned).
- English and German items are not parallel translations, so language and legal differences are confounded.
- No large-scale human performance baseline due to institutional limits; human data limited to small expert sample.
- Inconsistencies in reported MCQ counts across sections (paper reports 2,045 MCQs in places and 1,660 elsewhere; only 2,045 adds up with the 2,841 open questions to the reported total of 4,886).
When Not To Use
- Do not use LEXAM scores as proof of legal correctness for real cases or to replace lawyers.
- Avoid deploying models scored on LEXAM directly in production without expert oversight.
- Don't use MCQ-only evaluations as sole evidence of legal competence.
Failure Modes
- Hallucinated or incorrect statutory citations that look plausible.
- Poor multilingual (German) performance in smaller models, causing incoherent outputs.
- High sensitivity to MCQ formatting and the number/order of distractors, enabling guessing.
- Overconfident but shallow doctrinal reasoning that omits intermediate legal steps.
Core Entities
Models
- GPT-5
- Gemini-2.5-Pro
- GPT-4.1
- GPT-4o
- Claude-4.5-Sonnet
- DeepSeek-R1
- Qwen3-32B
- Llama-4-Maverick
Metrics
- Judge score (0–100; scaled to 0.0–1.0 in prompts)
- Accuracy
- Bootstrap standard error
- Pearson r, quadratic weighted κ, MAE for human agreement
- Alt-test winning rate ω and advantage probability ρ
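For readers who want to reproduce the agreement and uncertainty metrics listed above on their own judge outputs, the following is a minimal pure-Python sketch using the standard definitions. It is not the authors' evaluation code: all function names are illustrative, the quadratic weighted κ assumes scores have already been binned into integer levels `0..n_levels-1`, and the Alt-test statistics are not covered.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def mean_absolute_error(x, y):
    """Average absolute difference between judge and human scores."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def quadratic_weighted_kappa(a, b, n_levels):
    """Cohen's kappa with quadratic weights for integer ratings 0..n_levels-1."""
    n = len(a)
    observed = [[0] * n_levels for _ in range(n_levels)]
    for i, j in zip(a, b):
        observed[i][j] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]
    num = den = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            num += weight * observed[i][j]               # observed disagreement
            den += weight * hist_a[i] * hist_b[j] / n    # disagreement expected by chance
    return 1.0 - num / den


def bootstrap_standard_error(scores, n_resamples=1000, seed=0):
    """Standard deviation of the mean over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    grand = sum(means) / n_resamples
    return math.sqrt(sum((m - grand) ** 2 for m in means) / (n_resamples - 1))
```

Pearson r and MAE work directly on continuous scores, while the κ function needs discrete levels, so how the scores are binned is a design choice that should be reported alongside the result.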
Datasets
- LEXAM (this paper)
- COLIEE
- LegalBench
- LawBench
- LexGLUE
- MMLU (legal subset)
Benchmarks
- LEXAM open questions (judge score)
- LEXAM MCQs (accuracy)
- MCQ perturbation (4/8/16/32 choices)
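When comparing accuracy across the 4/8/16/32-choice settings, it can help to account for the fact that random guessing succeeds less often as options are added. The sketch below uses the standard correction-for-guessing formula; it is not a metric taken from the paper, the function names are illustrative, and it assumes exactly one correct option per question and uniform guessing.

```python
def chance_accuracy(n_choices):
    """Expected accuracy of uniform random guessing with one correct option."""
    return 1.0 / n_choices


def chance_corrected_accuracy(accuracy, n_choices):
    """Rescale accuracy so that guessing maps to 0.0 and perfect maps to 1.0."""
    chance = chance_accuracy(n_choices)
    return (accuracy - chance) / (1.0 - chance)


# Guessing baselines for the perturbation settings.
baselines = {k: chance_accuracy(k) for k in (4, 8, 16, 32)}
```

If raw accuracy falls with more distractors but the chance-corrected value stays roughly flat, the drop is largely explained by the lower guessing baseline; a falling corrected value points to the model being misled by the added options.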

