Overview
The benchmark and its results are credible and actionable for comparing models, but real-world legal deployment needs extra validation because of possible test-data leakage and the limits of the automatic evaluation.
Citations: 19
Evidence Strength: 0.80
Confidence: 0.86
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 1/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 30%
Novelty: 35%
Why It Matters For Business
LawBench shows that even top LLMs are unreliable for legal judgments; businesses should treat model outputs as draft assistance, not legal advice, and validate with experts.
Summary TLDR
LawBench is a focused benchmark that tests large language models on 20 Chinese legal tasks across three skills: memorizing statutes, understanding legal text, and applying law to cases. The authors evaluate 51 LLMs (general, Chinese-oriented, and legal-specific). GPT-4 leads, but scores remain far from reliable for unsupervised use: GPT-4 averages ~52% (52.35) zero-shot. Fine-tuning on legal data helps, scaling improves one-shot performance, and simply appending law text (retrieval) often hurts model answers. All data, predictions, and code are released on GitHub.
Problem Statement
We lack a systematic, Chinese-law-specific benchmark that measures whether LLMs actually store legal rules, read legal text accurately, and apply law to real cases. Existing tests (bar exams, English datasets) do not cover the needs of China's civil-law system or realistic legal tasks.
Main Contribution
Design and release of LawBench: 20 tasks spanning five task types (single-label classification, SLC; multi-label classification, MLC; regression; extraction; generation), mapped to three cognitive levels: memorization, understanding, and applying.
Large-scale evaluation of 51 LLMs (multilingual, Chinese-oriented, legal-specific) in zero-shot and one-shot settings using task-specific answer-extraction rules.
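To make "task-specific answer-extraction rules" concrete, here is a minimal sketch of one such rule for a multiple-choice style task. It is not the LawBench implementation: the option letters, the regular expression, and the function names `extract_choice` and `accuracy` are all assumptions chosen for illustration.

```python
import re
from typing import List, Optional

# Hypothetical rule: an option letter A-D that is not part of a longer
# ASCII word. Lookarounds are used instead of \b so the rule also works
# when the letter is adjacent to Chinese characters.
OPTION_PATTERN = re.compile(r"(?<![A-Za-z])([ABCD])(?![A-Za-z])")


def extract_choice(response: str) -> Optional[str]:
    """Return the first standalone option letter in a free-form response."""
    match = OPTION_PATTERN.search(response)
    return match.group(1) if match else None


def accuracy(responses: List[str], gold: List[str]) -> float:
    """Share of responses whose extracted choice equals the gold label."""
    if not gold:
        return 0.0
    hits = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return hits / len(gold)


if __name__ == "__main__":
    outputs = ["答案是B。", "I think the answer is C because ...", "unsure"]
    labels = ["B", "A", "D"]
    print([extract_choice(o) for o in outputs])
    print(accuracy(outputs, labels))
```

A rule like this is brittle by design: a response that begins with the English article "A" would be read as option A, which is one reason extraction-based scores should be spot-checked by hand.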
Key Findings
GPT-4 is the best model on LawBench but far from perfect
General Chinese-oriented LLMs often beat small legal-specific LLMs
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| GPT-4 average (zero-shot) | 52.35 | — | — | LawBench overall | Table 26 | Table 26 |
| ChatGPT average (zero-shot) | 42.15 | — | — | LawBench overall | Table 26 | Table 26 |
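The Delta column above is empty, so the gap between the two rows has to be worked out by the reader. The short sketch below does that arithmetic using only the two values from the table; treating ChatGPT as the baseline is an assumption made here for illustration, not something the source states.

```python
# Zero-shot averages copied from the results table.
scores = {"GPT-4": 52.35, "ChatGPT": 42.15}

# Assumption: ChatGPT is used as the baseline for the comparison.
baseline = scores["ChatGPT"]

# Absolute gap in points, and the same gap relative to the baseline.
delta_points = round(scores["GPT-4"] - baseline, 2)
delta_relative_pct = round(100 * delta_points / baseline, 1)

print(f"GPT-4 vs ChatGPT: +{delta_points} points ({delta_relative_pct}% relative)")
```

Under that assumption GPT-4 leads by about 10.2 points, roughly 24% above the ChatGPT average.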
What To Try In 7 Days
Run the models you are considering on LawBench, or on the sample tasks closest to your use case, to estimate model gaps quickly.
If using open models, try supervised fine-tuning (SFT) on a small curated legal instruction set and re-evaluate.
Test retrieval formats (not just appending laws) and measure whether retrieval helps or harms your outputs.
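The retrieval experiment in the last step can be sketched as a small harness that scores the same questions under several prompt formats. Everything here is hypothetical: the prompt layout, the `build_prompt` and `compare_formats` helpers, the item fields, and especially `toy_model`, which is a stub standing in for a real model call.

```python
from typing import Callable, Dict, List


def build_prompt(question: str, law_text: str = "", law_position: str = "none") -> str:
    """Assemble a prompt with the retrieved law text before, after, or omitted."""
    parts = []
    if law_position == "before" and law_text:
        parts.append("Relevant law:\n" + law_text)
    parts.append("Question:\n" + question)
    if law_position == "after" and law_text:
        parts.append("Relevant law:\n" + law_text)
    return "\n\n".join(parts)


def compare_formats(
    model: Callable[[str], str],
    items: List[Dict[str, str]],
    formats: Dict[str, str],
) -> Dict[str, float]:
    """Return exact-match accuracy for each named prompt format."""
    results = {}
    for name, position in formats.items():
        correct = 0
        for item in items:
            prompt = build_prompt(item["question"], item.get("law", ""), position)
            if model(prompt).strip() == item["answer"]:
                correct += 1
        results[name] = correct / len(items)
    return results


def toy_model(prompt: str) -> str:
    """Stub: pretends to be distracted when the prompt opens with law text."""
    return "B" if prompt.startswith("Relevant law") else "A"


items = [
    {"question": "Which option applies?", "law": "Article text ...", "answer": "A"},
    {"question": "Which clause governs?", "law": "Article text ...", "answer": "A"},
]
formats = {"no_retrieval": "none", "law_before": "before", "law_after": "after"}
report = compare_formats(toy_model, items, formats)
print(report)
```

Replacing `toy_model` with a function that calls your model, and `items` with your own labelled questions, gives a per-format accuracy table that shows directly whether retrieval helps or harms.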
Risks & Boundaries
Limitations
Possible test data leakage: models may have seen the test examples, or near duplicates of them, during training.
Evaluation for generative tasks relies on Rouge-L and hand-crafted extraction rules, which miss legal nuance.
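The second limitation is easy to see with a small worked example. The sketch below implements a simplified Rouge-L (F1 over the longest common subsequence of whitespace tokens); it is not the benchmark's scoring code, and the whitespace tokenization, the equal weighting of precision and recall, and the example sentences are assumptions made for illustration.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Simplified Rouge-L: F1 of LCS-based precision and recall."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the defendant is liable for damages"
candidate = "the defendant is not liable for damages"
score = rouge_l_f1(reference, candidate)
print(round(score, 3))  # 12/13, about 0.923
```

The candidate reverses the legal conclusion by adding a single word, yet it keeps every reference token in order and scores about 0.92. High overlap scores on generation tasks therefore do not guarantee legally correct output.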
When Not To Use
Do not use these models for unattended legal advice or final judgments.
Avoid trusting raw model outputs where legal liability exists without expert review.
Failure Modes
Hallucination: plausible but incorrect legal citations or reasoning.
Misuse of retrieved context: appended law text can confuse models and lower accuracy.

