Overview
Production Readiness
0.3
Novelty Score
0.35
Cost Impact Score
0.4
Citation Count
19
Why It Matters For Business
LawBench shows that even top LLMs are unreliable for legal judgments; businesses should treat model outputs as draft assistance, not legal advice, and validate with experts.
Summary TLDR
LawBench is a focused benchmark that tests large language models on 20 Chinese legal tasks across three skills: memorizing statutes, understanding legal text, and applying law to cases. The authors evaluate 51 LLMs (general, Chinese-oriented, and legal-specific). GPT-4 leads but scores remain far from human-ready: GPT-4 averages ~52% zero-shot. Fine-tuning on legal data helps, scaling improves one-shot performance, and simply appending law text (retrieval) often hurts model answers. All data, predictions and code are released on GitHub.
Problem Statement
We lack a systematic, Chinese-law-specific benchmark that measures whether LLMs actually store legal rules, read legal text accurately, and apply law to real cases. Existing tests (bar exams, English-language datasets) do not cover the needs of China's civil-law system or realistic legal tasks.
Main Contribution
Design and release of LawBench: 20 tasks of five types (single-label classification (SLC), multi-label classification (MLC), regression, extraction, generation) mapped to three cognitive levels: memorization, understanding, applying.
Large-scale evaluation of 51 LLMs (multilingual, Chinese-oriented, legal-specific) in zero-shot and one-shot settings using task-specific answer-extraction rules.
Analyses of scaling, supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), legal fine-tuning, retrieval augmentation, and data contamination risks.
Open release of data, model predictions and evaluation code via OpenCompass for reproducibility.
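To illustrate what a task-specific answer-extraction rule can look like, here is a minimal sketch. The function names and regular expressions are hypothetical and are not taken from the LawBench code; they only show the general idea of turning free-form model output into a value that can be scored.

```python
import re
from typing import Optional


def extract_prison_term_months(output: str) -> Optional[int]:
    """Return the first month count found in a model response, or None.

    Hypothetical rule: accept "<number>个月" or "<number> month(s)".
    """
    match = re.search(r"(\d+)\s*(?:个月|months?)", output)
    return int(match.group(1)) if match else None


def extract_choice(output: str) -> Optional[str]:
    """Return the first option letter A-D that is not part of a Latin word."""
    # Lookarounds are used instead of \b because CJK characters count as
    # word characters in Python's re module, so \b fails on "答案是B".
    match = re.search(r"(?<![A-Za-z])([A-D])(?![A-Za-z])", output)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_prison_term_months("判处有期徒刑36个月"))
    print(extract_choice("答案是B。"))
```

A response from which no value can be extracted returns `None`, so a scorer can count it as wrong or as an abstention rather than crash.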
Key Findings
GPT-4 is the best model on LawBench but far from perfect
General Chinese-oriented LLMs often beat small legal-specific LLMs
Simple retrieval (adding article text) usually degrades performance
Supervised fine-tuning (SFT) helps; RLHF can raise abstentions and hurt task scores
Scaling helps one-shot performance more consistently than zero-shot
Data contamination is a real risk
Results
GPT-4 average (zero-shot)
ChatGPT average (zero-shot)
Top legal-specific model average (zero-shot)
Task: Prison term prediction without article (3-4)
Task: Article recitation (1-1)
Who Should Care
What To Try In 7 Days
Run your use-case tasks on LawBench or sample tasks to estimate model gaps quickly.
If using open models, try supervised fine-tuning (SFT) on a small curated legal instruction set and re-evaluate.
Test retrieval formats (not just appending laws) and measure whether retrieval helps or harms your outputs.
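The retrieval check in the last step can be sketched as follows. The prompt styles and helper names are illustrative assumptions, not formats defined by LawBench; the model call is left out, so the predictions are whatever your own pipeline produces for each prompt style.

```python
from typing import Sequence


def build_prompt(question: str, article: str, style: str) -> str:
    """Format a question with optional law-article context (hypothetical styles)."""
    if style == "none":
        return question
    if style == "append":
        return f"{question}\n{article}"
    if style == "labeled":
        return f"Relevant article:\n{article}\n\nQuestion:\n{question}"
    raise ValueError(f"unknown style: {style}")


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their reference."""
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def retrieval_delta(baseline: Sequence[str], with_retrieval: Sequence[str],
                    references: Sequence[str]) -> float:
    """Accuracy change from adding retrieval; negative means retrieval hurt."""
    return accuracy(with_retrieval, references) - accuracy(baseline, references)
```

Running the same questions through each style and comparing `retrieval_delta` per style gives a direct answer to whether a given retrieval format helps or harms on your data.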
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Possible test data leakage: models may have seen training data or near-duplicate examples.
- Evaluation for generative tasks relies on Rouge-L and hand-crafted extraction rules, which miss legal nuance.
- Long-document tasks (legal retrieval) are not included due to token limits.
When Not To Use
- Do not use these models for unattended legal advice or final judgments.
- Avoid trusting raw model outputs where legal liability exists without expert review.
- Do not assume retrieval-by-appending law text will improve answers without testing.
Failure Modes
- Hallucination: plausible but incorrect legal citations or reasoning.
- Misuse of retrieved context: appended law text can confuse models and lower accuracy.
- Abstention shifts: RLHF variants may refuse questions and reduce coverage.
- Data contamination: inflated scores when test data leaked into training.
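Abstention shifts can be tracked with a simple refusal counter. This is a minimal sketch under stated assumptions: the marker phrases below are hypothetical examples, and the summary does not say how the authors detect abstentions.

```python
from typing import Iterable

# Hypothetical refusal markers; extend these for your own models and languages.
REFUSAL_MARKERS = ("无法回答", "i cannot", "i can't", "unable to")


def is_abstention(output: str) -> bool:
    """True if the response is empty or contains a known refusal phrase."""
    text = output.strip().lower()
    return not text or any(marker in text for marker in REFUSAL_MARKERS)


def abstention_rate(outputs: Iterable[str]) -> float:
    """Fraction of responses flagged as abstentions (0.0 for empty input)."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    return sum(is_abstention(o) for o in outputs) / len(outputs)
```

Comparing this rate between a base or SFT model and its RLHF variant on the same prompts shows whether a score drop comes from refusals rather than wrong answers.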
Core Entities
Models
- GPT-4
- ChatGPT
- Qwen-Chat
- InternLM-Chat
- Baichuan-13B-Chat
- LLaMA-2
- Ziya-LLaMA-13B
- Fuzi-Mingcha
- ChatLaw-13B
- LexiLaw
- HanFei
- LaWGPT
Metrics
- Accuracy
- F1
- rc-F1
- soft-F1
- nLog-distance
- F0.5
- Rouge-L
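Two of the metrics above have standard textbook definitions, sketched here for reference. This is not the LawBench scoring code: the tokenization (one token per character), the balanced F-measure used for Rouge-L, and the function names are all assumptions.

```python
from typing import Sequence


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


def lcs_length(a: Sequence, b: Sequence) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: Sequence, reference: Sequence) -> float:
    """Rouge-L as a balanced F-measure over LCS precision and recall."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    return f_beta(lcs / len(candidate), lcs / len(reference), beta=1.0)
```

For example, `rouge_l_f1(list("ace"), list("abcde"))` has an LCS of 3, precision 1.0, and recall 0.6.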
Datasets
- CAIL2018
- CAIL2019
- CAIL2021
- CAIL2022
- JEC-QA
- LEVEN
- LAIC2021
- FLK (national law database)
- CrimeKgAssistant
Benchmarks
- LawBench
- LegalBench
- LexGLUE
Context Entities
Models
- LLaMA
- Vicuna
- StableBeluga2
- MPT
- ChatGLM2
- Baichuan-7B
- BELLE-LLaMA-2
Metrics
- abstention rate
Datasets
- AiStudio marriage dataset
- CrimeKgAssitant
- public scenario Q&A generated via ChatGPT (then manually filtered)
Benchmarks
- MMLU
- HELM
- OpenCompass

