LawBench: a 20-task Chinese legal benchmark measuring memorization, understanding, and application by 51 LLMs

September 28, 2023 · 8 min read

Overview

Decision Snapshot: Needs Validation

The benchmark and results are credible and actionable for model comparisons, but real-world legal deployment needs extra validation due to leakage and evaluation limits.

Citations: 19

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 35%

Authors

Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Songyang Zhang, Kai Chen, Zongwen Shen, Jidong Ge

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LawBench shows that even top LLMs are unreliable for legal judgments; businesses should treat model outputs as draft assistance, not legal advice, and validate with experts.

Who Should Care

Summary TLDR

LawBench is a focused benchmark that tests large language models on 20 Chinese legal tasks across three skills: memorizing statutes, understanding legal text, and applying law to cases. The authors evaluate 51 LLMs (general, Chinese-oriented, and legal-specific). GPT-4 leads but scores remain far from human-ready: GPT-4 averages ~52% zero-shot. Fine-tuning on legal data helps, scaling improves one-shot performance, and simply appending law text (retrieval) often hurts model answers. All data, predictions and code are released on GitHub.

Problem Statement

We lack a systematic benchmark specific to Chinese law that measures whether LLMs actually store legal rules, read legal text accurately, and apply law to real cases. Existing tests (bar exams, English-language datasets) cover neither the needs of China's civil-law system nor realistic legal tasks.

Main Contribution

Design and release of LawBench: 20 tasks of five types: single-label classification (SLC), multi-label classification (MLC), regression, extraction, and generation. Each task maps to one of three cognitive levels: memorization, understanding, and application.

Large-scale evaluation of 51 LLMs (multilingual, Chinese-oriented, legal-specific) in zero-shot and one-shot settings using task-specific answer-extraction rules.
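The zero-shot and one-shot settings, together with rule-based answer extraction, can be illustrated with a minimal sketch. The prompt template, the function names `build_prompt` and `extract_choice`, and the regular expression are assumptions made for illustration; they are not the extraction rules shipped with LawBench.

```python
import re
from typing import Optional, Tuple


def build_prompt(instruction: str, question: str,
                 example: Optional[Tuple[str, str]] = None) -> str:
    """Build a zero-shot prompt, or a one-shot prompt when `example` is given.

    The "Question:/Answer:" template is a hypothetical layout, not LawBench's own.
    """
    parts = [instruction]
    if example is not None:
        ex_question, ex_answer = example
        # One-shot: show a single solved example before the real question.
        parts.append(f"Question: {ex_question}\nAnswer: {ex_answer}")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


def extract_choice(response: str, options: str = "ABCD") -> Optional[str]:
    """Hypothetical extraction rule for a single-label task.

    Returns the first option letter that is not part of a longer Latin word,
    or None when the response contains no such letter.
    """
    pattern = rf"(?<![A-Za-z])([{options}])(?![A-Za-z])"
    match = re.search(pattern, response)
    return match.group(1) if match else None
```

A response with no extractable answer returns `None`, which makes it possible to count abstentions separately from wrong answers.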

Key Findings

GPT-4 is the best model on LawBench but far from perfect

Numbers: GPT-4 average zero-shot 52.35 (Table 26)

Practical Use: Use GPT-4 as the stronger baseline for legal assistance, but expect many errors; do not treat outputs as reliable legal judgments.

Evidence Ref: Table 26, Figure 3

General Chinese-oriented LLMs often beat small legal-specific LLMs

Numbers: Top Chinese chat models (Qwen-Chat, InternLM-Chat) outperform many legal LLMs (Table 26)

Practical Use: Fine-tuning a high-quality base model is more effective than legal fine-tuning of a weak base; start with a strong foundation model.

Evidence Ref: Table 26, Section 4.3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
GPT-4 average (zero-shot) | 52.35 | - | - | LawBench overall | Table 26 | Table 26
ChatGPT average (zero-shot) | 42.15 | - | - | LawBench overall | Table 26 | Table 26

What To Try In 7 Days

Run your use-case tasks on LawBench or sample tasks to estimate model gaps quickly.

If using open models, try supervised fine-tuning (SFT) on a small curated legal instruction set and re-evaluate.

Test retrieval formats (not just appending laws) and measure whether retrieval helps or harms your outputs.
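The retrieval check in the last item can be run as a simple A/B comparison: score the same questions with and without the retrieved law text and look at the difference. The sketch below assumes exact-match scoring and a `model` callable that maps a prompt string to an answer string; `Example`, `accuracy`, `retrieval_delta`, and the "Relevant law:" prompt layout are hypothetical names and formats, not part of LawBench.

```python
from typing import Callable, Dict, List, NamedTuple


class Example(NamedTuple):
    question: str
    law_text: str  # retrieved article text, assumed to come from your own retriever
    answer: str    # gold answer used for exact-match scoring


def accuracy(model: Callable[[str], str], prompts: List[str], answers: List[str]) -> float:
    """Exact-match accuracy of `model` over paired prompts and gold answers."""
    if not prompts:
        return 0.0
    correct = sum(model(p).strip() == a for p, a in zip(prompts, answers))
    return correct / len(prompts)


def retrieval_delta(model: Callable[[str], str], examples: List[Example]) -> Dict[str, float]:
    """Compare accuracy with and without the law text prepended to the question."""
    answers = [ex.answer for ex in examples]
    plain = [ex.question for ex in examples]
    augmented = [
        f"Relevant law:\n{ex.law_text}\n\nQuestion: {ex.question}" for ex in examples
    ]
    base = accuracy(model, plain, answers)
    with_law = accuracy(model, augmented, answers)
    # A negative delta means the retrieved text lowered accuracy on this sample.
    return {"no_retrieval": base, "with_retrieval": with_law, "delta": with_law - base}
```

Swapping the `augmented` template for another format (for example, law text after the question, or a summary instead of the full article) lets you compare retrieval formats on the same examples.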

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Possible test data leakage: models may have seen the test examples, or near-duplicates of them, during training.

Evaluation for generative tasks relies on Rouge-L and hand-crafted extraction rules, which miss legal nuance.
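To see why a Rouge-L score can miss legal nuance, it helps to look at how the metric is computed: it rewards the longest common subsequence (LCS) between candidate and reference, regardless of meaning. The sketch below is a generic Rouge-L F-measure, not the benchmark's own scoring code; character-level tokenization (a common choice for Chinese text) and the default `beta = 1.0` are assumptions.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """Rouge-L F-measure over characters (assumed tokenization)."""
    cand, ref = list(candidate), list(reference)
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

For example, `rouge_l("the defendant is liable", "the defendant is not liable")` scores above 0.9 even though the two sentences state opposite legal conclusions.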

When Not To Use

Do not use these models for unattended legal advice or final judgments.

Avoid trusting raw model outputs where legal liability exists without expert review.

Failure Modes

Hallucination: plausible but incorrect legal citations or reasoning.

Misuse of retrieved context: appended law text can confuse models and lower accuracy.

Core Entities

Models

GPT-4, ChatGPT, Qwen-Chat, InternLM-Chat, Baichuan-13B-Chat, LLaMA-2, Ziya-LLaMA-13B, Fuzi-Mingcha, ChatLaw-13B, LexiLaw, HanFei, LaWGPT

Metrics

Accuracy, F1, rc-F1, soft-F1, nLog-distance, F0.5, Rouge-L

Datasets

CAIL2018, CAIL2019, CAIL2021, CAIL2022, JEC-QA, LEVEN, LAIC2021, FLK (national law database), CrimeKgAssistant

Benchmarks

LawBench, LEGALBENCH, lexglue

Context Entities

Models

LLaMA, Vicuna, StableBeluga2, MPT, ChatGLM2, Baichuan-7B, BELLE-LLaMA-2

Metrics

abstention rate

Datasets

AiStudio marriage dataset, CrimeKgAssitant, public scenario Q&A generated via ChatGPT (then manually filtered)

Benchmarks

MMLU, HELM, OpenCompass