Overview
The benchmark and judge are well-documented and validated against experts, making them useful for research and controlled evaluation; they do not make LLMs production-safe for legal work without human oversight.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 2/4
Reproducibility
Status: Code + data available
Open source: Yes
License: CC BY 4.0
At A Glance
Cost impact: 40%
Production readiness: 35%
Novelty: 60%
Why It Matters For Business
LEXAM exposes where LLMs still fail on long-form, high-stakes legal tasks and provides a validated, scalable judge for grading open answers. Both are useful for vendor selection, risk assessments, and controlled piloting of legal AI tools.
Summary TLDR
LEXAM is a multilingual benchmark built from 340 real law school exams to test long-form legal reasoning. The authors assemble thousands of questions (the paper reports 4,886 in total; the detailed breakdown shows 2,841 open questions and a converted set of 1,660 MCQs), provide professor-written reference answers and stepwise guidance, and validate an ensemble “LLM-as-a-judge” against human experts. Top reasoning models score about 70 on judged open answers and about 63% on MCQs, but they struggle with multi-step, structured legal reasoning and with perturbed MCQs that contain many distractors. Code and data are public.
Problem Statement
Current LLM benchmarks emphasize final-answer accuracy and STEM-style checks. Legal reasoning needs long-form, stepwise evaluation and reliable judges. The paper builds a dataset and an evaluation pipeline to test process-based legal reasoning and to validate whether LLM judges can substitute human experts.
Main Contribution
LEXAM dataset: law-school exams (340 exams) with professor solutions, multilingual (English/German) and fine-grained metadata.
Evaluation pipelines for open-ended and MCQ formats, including an expert-tuned LLM judge ensemble and human validation.
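To make the idea of a judge ensemble concrete, here is a minimal sketch in which several judges each score one answer against the reference and the scores are averaged. The summary does not state how the paper aggregates its judges, so the mean, the `Judge` signature, the [0, 1] score range, and both stand-in judges are assumptions for illustration only.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical judge signature: (question, reference, answer) -> score in [0, 1].
Judge = Callable[[str, str, str], float]


def ensemble_score(judges: Sequence[Judge], question: str,
                   reference: str, answer: str) -> float:
    """Average the scores of several judges (assumed aggregation rule)."""
    if not judges:
        raise ValueError("at least one judge is required")
    scores = [judge(question, reference, answer) for judge in judges]
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"judge score out of range: {score}")
    return mean(scores)


# Stand-in judges; a real judge would prompt an LLM with the reference answer.
def strict_judge(question: str, reference: str, answer: str) -> float:
    """1.0 only when the answer equals the reference, ignoring case and padding."""
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0


def overlap_judge(question: str, reference: str, answer: str) -> float:
    """Fraction of distinct reference tokens that also appear in the answer."""
    ref_tokens = set(reference.lower().split())
    ans_tokens = set(answer.lower().split())
    return len(ref_tokens & ans_tokens) / len(ref_tokens) if ref_tokens else 0.0


judges = [strict_judge, overlap_judge]
score = ensemble_score(judges, "Is the contract void?",
                       "the contract is void", "contract is valid")
```

Keeping each judge behind one callable signature lets individual judges be swapped or re-weighted without touching the aggregation step.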
Key Findings
Top reasoning models score substantially higher than others on long-form questions.
MCQ accuracy is lower than judged open-answer scores and drops with more distractors.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Open question judged score | GPT-5: 70.20 (±0.41) by ensemble judge | — | — | LEXAM open test set (2,541 test + 300 dev) | Table 1 reports ensemble-judged scores for open questions | Table 1 |
| MCQ accuracy | GPT-5: 62.65% (±1.17) | random ≈25% | ≈+37.65 pp vs random | LEXAM MCQs (1,660) | Table 11 | Table 11 |
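The Delta entry in the MCQ row is accuracy minus a chance baseline. The short sketch below reproduces that arithmetic, assuming four answer choices (inferred from the ≈25% random baseline, not stated in the table) and uniform random guessing. The function names are illustrative.

```python
def chance_accuracy(n_choices: int) -> float:
    """Expected accuracy (in percent) of uniform random guessing."""
    if n_choices < 2:
        raise ValueError("an MCQ needs at least two choices")
    return 100.0 / n_choices


def delta_vs_chance(accuracy_pct: float, n_choices: int) -> float:
    """Percentage-point gap between observed accuracy and chance."""
    return accuracy_pct - chance_accuracy(n_choices)


# Assumption: 4 choices per question, matching the ~25% baseline in the table.
print(round(delta_vs_chance(62.65, 4), 2))  # 37.65
```

Because the chance level is 100/k percent, it falls as choices are added (3.125% at 32 choices), so the same raw accuracy implies a larger gap over guessing on distractor-heavy variants.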
What To Try In 7 Days
Run your top models on a LEXAM subset to spot gaps in process reasoning (open answers).
Validate an LLM-judge ensemble on 50 expert-annotated items using the Alt-test before automating grading.
Stress-test MCQ-based workflows by adding distractor-heavy versions (8–32 choices) to detect guessing strategies.
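The distractor stress test in the last item can be sketched as follows: take the correct answer, sample extra wrong options from a pool, and shuffle. How the paper builds its perturbed sets is not described in this summary, so the distractor pool, the function name `build_perturbed_mcq`, and the fixed seed are hypothetical.

```python
import random
from typing import List, Tuple


def build_perturbed_mcq(correct: str, distractor_pool: List[str],
                        n_choices: int, seed: int = 0) -> Tuple[List[str], int]:
    """Return (choices, index_of_correct) with n_choices shuffled options."""
    if n_choices < 2:
        raise ValueError("an MCQ needs at least two choices")
    # Drop duplicates and any distractor identical to the correct answer.
    unique = [d for d in dict.fromkeys(distractor_pool) if d != correct]
    if len(unique) < n_choices - 1:
        raise ValueError("not enough distinct distractors in the pool")
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    choices = rng.sample(unique, n_choices - 1) + [correct]
    rng.shuffle(choices)
    return choices, choices.index(correct)


# Placeholder pool; in practice these would be plausible wrong legal answers.
pool = [f"distractor {i}" for i in range(40)]
variants = {k: build_perturbed_mcq("the contract is void", pool, k)
            for k in (8, 16, 32)}
```

Running the same model on the 8-, 16-, and 32-choice variants of each item shows whether accuracy degrades gracefully or collapses toward chance as options are added.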
Risks & Boundaries
Limitations
Dataset is Swiss-heavy and not a broad cross-jurisdiction benchmark (authors note expansion planned).
English and German items are not parallel translations, so language and legal differences are confounded.
When Not To Use
Do not use LEXAM scores as proof of legal correctness for real cases or to replace lawyers.
Avoid deploying models scored on LEXAM directly in production without expert oversight.
Failure Modes
Hallucinated or incorrect statutory citations that look plausible.
Poor multilingual (German) performance in smaller models, causing incoherent outputs.

