LegalBench: 162 lawyer-crafted tasks to test LLM legal reasoning

August 20, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

LEGALBENCH is a practical, lawyer-curated few-shot benchmark that surfaces per-task strengths and brittle failure modes; use it to triage tasks before deployment.

Citations: 28

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N. Rockmore, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory M. Dickinson, Haggai Porat, Jason Hegland, Jessica Wu, Joe Nudell, Joel Niklaus, John Nay, Jonathan H. Choi, Kevin Tobia, Margaret Hagan, Megan Ma, Michael Livermore, Nikon Rasumov-Rahe, Nils Holzenberger, Noam Kolt, Peter Henderson, Sean Rehaag, Sharad Goel, Shang Gao, Spencer Williams, Sunny Gandhi, Tom Zur, Varun Iyer, Zehua Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LEGALBENCH gives legal teams and ML practitioners a practical suite to test LLMs on many lawyer-defined tasks before deployment, exposing brittle cases, prompt sensitivity, and task-by-task risk.

Who Should Care

Summary TLDR

LEGALBENCH is an open, lawyer-driven benchmark of 162 short English legal tasks (from 36 sources) organized into six legal reasoning types: issue-spotting, rule-recall, rule-application, rule-conclusion, interpretation, and rhetorical-understanding. The authors evaluate 20 LLMs, both open-source and commercial. Key takeaways: GPT-4 leads across most categories; open-source models can match commercial models on some tasks; performance varies strongly with task type, prompt wording, and the choice of in-context examples; and many legal tasks remain brittle (hallucination, jurisdictional errors, long-document limits). The benchmark, prompts, and grading guides are on GitHub to enable replication and extension.

Problem Statement

Existing legal benchmarks either focus on finetuning settings or blur distinct forms of legal reasoning. Lawyers and model builders need a task suite that: (a) breaks legal reasoning into lawyer-friendly categories, (b) works in few-shot prompting settings used with modern LLMs, and (c) is crowdsourced and validated by legal experts.

Main Contribution

LEGALBENCH: 162 few-shot tasks covering six lawyer-centered reasoning types, collected from 36 sources and hand-curated by legal professionals.

A typology mapping common legal tasks to six reasoning categories (issue-spotting, rule-recall, rule-application, rule-conclusion, interpretation, rhetorical-understanding).

Key Findings

GPT-4 is the strongest model across most legal reasoning categories in this evaluation.

Numbers: Issue-spotting 82.9, Rule-recall 59.2, Rule-conclusion 89.9, Interpretation 75.2, Rhetorical-understanding 79.4 (balanced accuracy, Table 2)

Practical Use: If you need off-the-shelf few-shot legal performance today, favor top API models like GPT-4 and validate on the exact task you plan to deploy.

Evidence Ref: Table 2; Section 5.2
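The scores above are reported as balanced accuracy. The sketch below assumes the standard definition (the mean of per-class recall), which may differ in detail from the benchmark's own evaluation code; the labels are hypothetical and only illustrate why the metric matters on imbalanced tasks.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally regardless of size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical binary task with a 4:1 class imbalance.
gold = ["Yes", "Yes", "Yes", "Yes", "No"]
always_yes = ["Yes"] * 5

# Plain accuracy would be 0.8 here; balanced accuracy exposes the degenerate model.
print(balanced_accuracy(gold, always_yes))  # 0.5
```

A model that always answers "Yes" looks strong under plain accuracy on this toy set but scores no better than chance under balanced accuracy.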

Rule-application explanations require separate evaluation for correctness and for useful analysis; GPT-4 substantially outperformed other APIs on both.

Numbers: Rule-application correctness 82.2%, analysis 79.7% (GPT-4) vs. correctness 58.5%/61.4% and analysis 44.2%/59% (GPT-3.5/Claude)

Practical Use: For applications that must explain legal reasoning, test both factual correctness and whether the model cites the needed inferences; GPT-4 gives stronger explanations in this benchmark.

Evidence Ref: Table 3; Section 5.2/5.3.3
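One way to keep the two judgments separate when grading explanations by hand is to record them as independent fields and report two rates. This is a minimal sketch, not the authors' grading tool: the `GradedExplanation` record, its field names, and the sample grades are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GradedExplanation:
    correct: bool   # grader found no errors in the explanation
    analysis: bool  # grader found the needed inferences in the explanation


def score(grades):
    """Return correctness and analysis rates as separate percentages."""
    n = len(grades)
    return {
        "correctness": 100 * sum(g.correct for g in grades) / n,
        "analysis": 100 * sum(g.analysis for g in grades) / n,
    }


# Hypothetical manual grades for four model explanations.
grades = [
    GradedExplanation(correct=True, analysis=True),
    GradedExplanation(correct=True, analysis=False),   # right answer, thin reasoning
    GradedExplanation(correct=False, analysis=False),
    GradedExplanation(correct=True, analysis=True),
]

print(score(grades))  # {'correctness': 75.0, 'analysis': 50.0}
```

Reporting the two rates side by side makes it visible when a model reaches correct conclusions without showing the reasoning a reviewer would need.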

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | Issue 82.9; Rule-recall 59.2; Rule-conclusion 89.9; Interpretation 75.2; Rhetorical 79.4 | - | - | LEGALBENCH aggregated (Table 2) | - | Table 2; Section 5.2
Rule-application (manual), GPT-4 | Correctness 82.2%, Analysis 79.7% | GPT-3.5 correctness 58.5%, analysis 44.2% | Correctness +23.7 pts over GPT-3.5 | rule-application tasks (Table 3) | - | Table 3; Section 5.1.3

What To Try In 7 Days

Run LEGALBENCH's interpretation and rule-conclusion tasks on your candidate models to find where they fail (use the provided prompts).

Do a 1–2 day prompt sweep: compare plain vs technical wording and 5 random in-context demo sets to find stable prompts.

If using an open-source model, test Flan-T5-XXL and a 13B Instruct model on your most common tasks before buying API capacity.
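The prompt sweep suggested above can be sketched as a small harness that crosses instruction wordings with several randomly drawn in-context demo sets and reports the mean score and spread per wording. Everything here is an illustrative assumption: `toy_model` is a stand-in for a real LLM call, the clauses and instructions are made up, and plain accuracy is used for brevity where balanced accuracy may be more appropriate.

```python
import random
from statistics import mean, pstdev


def build_prompt(instruction, demos, query):
    """Assemble a few-shot prompt: instruction, worked examples, then the query."""
    shots = "\n\n".join(f"Text: {text}\nAnswer: {answer}" for text, answer in demos)
    return f"{instruction}\n\n{shots}\n\nText: {query}\nAnswer:"


def sweep(model, instructions, demo_pool, eval_set, n_demo_sets=5, k=2, seed=0):
    """Score every instruction wording against the same random demo sets."""
    rng = random.Random(seed)
    demo_sets = [rng.sample(demo_pool, k) for _ in range(n_demo_sets)]
    report = {}
    for name, instruction in instructions.items():
        scores = []
        for demos in demo_sets:
            hits = sum(
                model(build_prompt(instruction, demos, text)).strip() == gold
                for text, gold in eval_set
            )
            scores.append(hits / len(eval_set))
        # A large spread means the wording is sensitive to which demos it gets.
        report[name] = {"mean": mean(scores), "spread": pstdev(scores)}
    return report


def toy_model(prompt):
    """Hypothetical stand-in for an LLM: says "Yes" if the query contains "shall"."""
    query = prompt.rsplit("Text: ", 1)[1]
    return "Yes" if "shall" in query else "No"


instructions = {
    "plain": "Does the clause create an obligation? Answer Yes or No.",
    "technical": "Determine whether the clause imposes a binding duty. Answer Yes or No.",
}
demo_pool = [
    ("The tenant shall pay rent monthly.", "Yes"),
    ("The parties met in Boston.", "No"),
    ("The seller shall deliver the goods.", "Yes"),
    ("This agreement is dated May 1.", "No"),
]
eval_set = [
    ("The buyer shall inspect the goods.", "Yes"),
    ("The office is on Main Street.", "No"),
]

report = sweep(toy_model, instructions, demo_pool, eval_set)
for name, stats in report.items():
    print(name, stats)
```

With a real model in place of `toy_model`, a wording whose spread stays small across demo sets is the more stable candidate, even if another wording has a slightly higher mean.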

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Skewed to English, U.S. law, and contract-style interpretation tasks; poor coverage of international/multilingual law.

Focuses on short inputs; does not evaluate long-document reasoning or full multi-hop IRAC answers.

When Not To Use

To decide deployment safety for long, multi-document legal work without in-domain tests.

To evaluate subjective legal advice or tasks where reasonable minds may differ.

Failure Modes

Hallucinating statutes or case holdings (rule-recall errors).

Mis-applying jurisdiction-specific rules or missing jurisdiction anchor.

Core Entities

Models

GPT-4, GPT-3.5 (text-davinci-003), Claude-1, Flan-T5-XXL, LLaMA-2-13B, Incite-Instruct-7B, OPT-13B, Vicuna-13B-16k

Metrics

Accuracy, exact-match, F1, manual correctness, manual analysis

Datasets

LEGALBENCH, CUAD, ContractNLI, MAUD, OPP-115, SARA, SCALR, SSLA

Benchmarks

LEGALBENCH

Context Entities

Models

Flan-T5-XL, LLaMA-2-7B, Falcon-7B-Instruct, Incite-3B-Instruct

Metrics

Accuracy, correctness & analysis (rule-application)

Datasets

Learned Hands, APP-350 (privacy clauses), GlobalCit (international citizenship)

Benchmarks

BigBench, HELM