Overview
LEGALBENCH is a practical, lawyer-curated few-shot benchmark that surfaces per-task strengths and brittle failure modes; use it to triage tasks before deployment.
Citations: 28
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 40%
Novelty: 60%
Why It Matters For Business
LEGALBENCH gives legal teams and ML practitioners a practical suite to test LLMs on many lawyer-defined tasks before deployment, exposing brittle cases, prompt sensitivity, and task-by-task risk.
Summary TLDR
LEGALBENCH is an open, lawyer-driven benchmark of 162 short English legal tasks (from 36 sources) organized into six legal reasoning types: issue-spotting, rule-recall, rule-application, rule-conclusion, interpretation, and rhetorical-understanding. The authors evaluate 20 LLMs (open-source and commercial). Key takeaways: GPT-4 leads across most categories; open-source models can match commercial models on some tasks; performance varies strongly by task type, prompt wording, and in-context examples; many legal tasks remain brittle (hallucination, jurisdictional errors, long-document limits). The benchmark, prompts, and grading guides are on GitHub to enable replication and extension.
Problem Statement
Existing legal benchmarks either focus on finetuning settings or blur distinct forms of legal reasoning. Lawyers and model builders need a task suite that: (a) breaks legal reasoning into lawyer-friendly categories, (b) works in few-shot prompting settings used with modern LLMs, and (c) is crowdsourced and validated by legal experts.
Main Contribution
LEGALBENCH: 162 few-shot tasks covering six lawyer-centered reasoning types, collected from 36 sources and hand-curated by legal professionals.
A typology mapping common legal tasks to six reasoning categories (issue-spotting, rule-recall, rule-application, rule-conclusion, interpretation, rhetorical-understanding).
Key Findings
GPT-4 is the strongest model across most legal reasoning categories in this evaluation.
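Rule-application explanations require separate evaluation for correctness and for useful analysis; GPT-4 substantially outperformed the other API-served models on both (for example, 82.2% vs. 58.5% correctness and 79.7% vs. 44.2% analysis against GPT-3.5).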
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | Issue 82.9; Rule-recall 59.2; Rule-conclusion 89.9; Interpretation 75.2; Rhetorical 79.4 | — | — | LEGALBENCH aggregated (Table 2) | Table 2; Section 5.2 | — |
| Rule-application (manual) GPT-4 | Correctness 82.2%, Analysis 79.7% | GPT-3.5 correctness 58.5%, analysis 44.2% | Correctness +23.7 pts, Analysis +35.5 pts over GPT-3.5 | rule-application tasks (Table 3) | Table 3; Section 5.1.3 | — |
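The deltas in the rule-application row are plain percentage-point differences, and they can be checked with a few lines of Python. This is only an arithmetic sketch: the scores are copied from the row above, and the `deltas` helper and dictionary names are illustrative, not part of the LEGALBENCH code.

```python
# Scores (percent) copied from the rule-application row above.
gpt4 = {"correctness": 82.2, "analysis": 79.7}
gpt35 = {"correctness": 58.5, "analysis": 44.2}


def deltas(model, baseline):
    """Return percentage-point gaps (model minus baseline), rounded to one decimal."""
    return {metric: round(model[metric] - baseline[metric], 1) for metric in model}


rule_application_delta = deltas(gpt4, gpt35)
print(rule_application_delta)
```

The gap on analysis is larger than the gap on correctness, which is why the two criteria are worth reporting separately.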
What To Try In 7 Days
Run LEGALBENCH's interpretation and rule-conclusion tasks on your candidate models to find where they fail (use the provided prompts).
Do a 1–2 day prompt sweep: compare plain vs technical wording and 5 random in-context demo sets to find stable prompts.
If using an open-source model, test Flan-T5-XXL and a 13B Instruct model on your most common tasks before buying API capacity.
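The prompt sweep suggested above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark's own harness: `model_fn` stands for whatever function sends a prompt to your model and returns its text answer, `stub_model` is a keyword-matching placeholder so the example runs offline, and the instructions, demonstrations, and evaluation items are invented toy data rather than LEGALBENCH tasks.

```python
import random
import statistics


def build_prompt(instruction, demos, query):
    """Assemble a few-shot prompt: instruction, demonstrations, then the query."""
    shots = "\n\n".join(f"Text: {text}\nAnswer: {answer}" for text, answer in demos)
    return f"{instruction}\n\n{shots}\n\nText: {query}\nAnswer:"


def sweep(model_fn, instructions, demo_pool, eval_set, n_demo_sets=5, k=2, seed=0):
    """Score every instruction wording against the same random demo sets."""
    rng = random.Random(seed)  # fixed seed so the demo sets are repeatable
    demo_sets = [rng.sample(demo_pool, k) for _ in range(n_demo_sets)]
    report = {}
    for name, instruction in instructions.items():
        accuracies = []
        for demos in demo_sets:
            correct = sum(
                model_fn(build_prompt(instruction, demos, text)).strip() == label
                for text, label in eval_set
            )
            accuracies.append(correct / len(eval_set))
        report[name] = {
            "mean": statistics.mean(accuracies),
            "spread": max(accuracies) - min(accuracies),  # sensitivity to demo choice
            "runs": accuracies,
        }
    return report


def stub_model(prompt):
    """Placeholder for a real model call: answers from a keyword in the query."""
    query = prompt.rsplit("Text:", 1)[-1]
    return "Yes" if "shall" in query else "No"


# Hypothetical toy data: does the clause impose an obligation?
instructions = {
    "plain": "Does this sentence require someone to do something? Answer Yes or No.",
    "technical": "Does the clause impose an affirmative obligation? Answer Yes or No.",
}
demo_pool = [
    ("The buyer shall inspect the goods.", "Yes"),
    ("The parties met in March.", "No"),
    ("The licensee shall keep records.", "Yes"),
    ("This section is a summary.", "No"),
    ("The seller shall deliver on time.", "Yes"),
    ("Headings are for convenience.", "No"),
]
eval_set = [
    ("The tenant shall pay rent monthly.", "Yes"),
    ("The weather was pleasant.", "No"),
]

report = sweep(stub_model, instructions, demo_pool, eval_set)
for name, stats in report.items():
    print(name, stats["mean"], stats["spread"])
```

A wording whose `spread` stays small across the five demonstration sets is a reasonable candidate for a stable prompt; with a real model, replace `stub_model` with your own call and use held-out examples from the task as `eval_set`.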
Risks & Boundaries
Limitations
Skewed to English, U.S. law, and contract-style interpretation tasks; poor coverage of international/multilingual law.
Focuses on short inputs; does not evaluate long-document reasoning or full multi-hop IRAC answers.
When Not To Use
To decide deployment safety for long, multi-document legal work without in-domain tests.
To evaluate subjective legal advice or tasks where reasonable minds may differ.
Failure Modes
Hallucinating statutes or case holdings (rule-recall errors).
Misapplying jurisdiction-specific rules, or answering without anchoring to a specific jurisdiction.

