LawBench: a 20-task Chinese legal benchmark testing 51 LLMs on memorization, understanding, and application

Overview

Decision Snapshot: Needs Validation

The benchmark and results are credible and actionable for model comparisons, but real-world legal deployment needs extra validation due to leakage and evaluation limits.

Citations: 19

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 35%

Authors

Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Songyang Zhang, Kai Chen, Zongwen Shen, Jidong Ge

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LawBench shows that even top LLMs are unreliable for legal judgments; businesses should treat model outputs as draft assistance, not legal advice, and validate with experts.

Who Should Care

Product Manager, ML Engineer, Founder, CTO, Data Scientist

Summary TLDR

LawBench is a focused benchmark that tests large language models on 20 Chinese legal tasks across three skills: memorizing statutes, understanding legal text, and applying law to cases. The authors evaluate 51 LLMs (general, Chinese-oriented, and legal-specific). GPT-4 leads, but its scores remain far from human-ready: it averages only 52.35 in the zero-shot setting (Table 26). Fine-tuning on legal data helps, scaling improves one-shot performance, and simply appending retrieved law text to the prompt often hurts model answers. All data, predictions, and code are released on GitHub.

Problem Statement

We lack a systematic, Chinese-law-specific benchmark that measures whether LLMs actually store legal rules, read legal text accurately, and apply law to real cases. Existing tests (bar exams, English-language datasets) cover neither the needs of China's civil-law system nor realistic legal tasks.

Main Contribution

Design and release of LawBench: 20 tasks spanning five task types (single-label classification (SLC), multi-label classification (MLC), regression, extraction, and generation), mapped to three cognitive levels: memorization, understanding, and applying.

Large-scale evaluation of 51 LLMs (multilingual, Chinese-oriented, legal-specific) in zero-shot and one-shot settings using task-specific answer-extraction rules.
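The evaluation setup described above (zero-shot vs. one-shot prompts, then a rule that pulls a scoreable answer out of free-form model text) can be sketched roughly as follows. This is an illustrative sketch only: the function names, the prompt template, and the option-letter rule are assumptions made for this example, not the rules shipped in the LawBench repository, which differ per task.

```python
import re
from typing import List, Optional, Tuple


def build_prompt(instruction: str, question: str,
                 example: Optional[Tuple[str, str]] = None) -> str:
    """Build a zero-shot prompt, or a one-shot prompt if an example pair is given.

    The "Question:/Answer:" template is a hypothetical format for illustration.
    """
    parts = [instruction]
    if example is not None:
        ex_question, ex_answer = example
        parts.append(f"Question: {ex_question}\nAnswer: {ex_answer}")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


def extract_choice(response: str, options: str = "ABCD") -> Optional[str]:
    """Return the first standalone option letter, or None if none is found.

    Hypothetical rule for a single-label multiple-choice task: a letter counts
    only when it is not part of a longer Latin-letter word.
    """
    match = re.search(rf"(?<![A-Za-z])([{options}])(?![A-Za-z])", response)
    return match.group(1) if match else None


def abstention_rate(responses: List[str]) -> float:
    """Fraction of responses from which no answer could be extracted."""
    if not responses:
        return 0.0
    missing = sum(extract_choice(r) is None for r in responses)
    return missing / len(responses)
```

A rule this simple is brittle (for example, a response starting with the English article "A" would be read as option A), which is one reason hand-crafted extraction rules are themselves a source of evaluation error.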

Key Findings

GPT-4 is the best model on LawBench but far from perfect

Numbers: GPT-4 average zero-shot 52.35 (Table 26)

Practical Use: Use GPT-4 as the strongest available baseline for legal assistance, but expect many errors; do not treat outputs as reliable legal judgments.

Evidence Ref: Table 26, Figure 3

General Chinese-oriented LLMs often beat small legal-specific LLMs

Numbers: Top Chinese chat models (Qwen-Chat, InternLM-Chat) outperform many legal LLMs (Table 26)

Practical Use: Fine-tuning a high-quality base model is more effective than legal fine-tuning of a weak base; start with a strong foundation model.

Evidence Ref: Table 26, Section 4.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
GPT-4 average (zero-shot)	52.35	—	—	LawBench overall	Table 26	Table 26
ChatGPT average (zero-shot)	42.15	—	—	LawBench overall	Table 26	Table 26
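
The table leaves the Delta column empty. As a small worked example, the gap between the two reported averages can be computed directly; treating ChatGPT as the baseline is an assumption made here for illustration, since the table names no baseline.

```python
# Averages copied from the Results table above (Table 26).
gpt4_avg = 52.35
chatgpt_avg = 42.15

# Assumption: ChatGPT serves as the baseline; the source table names none.
absolute_delta = round(gpt4_avg - chatgpt_avg, 2)
relative_delta = round(absolute_delta / chatgpt_avg * 100, 1)

print(f"absolute delta: {absolute_delta:.2f} points")  # 10.20 points
print(f"relative delta: {relative_delta:.1f}%")        # 24.2%
```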

What To Try In 7 Days

Run your use-case tasks on LawBench or sample tasks to estimate model gaps quickly.

If using open models, try supervised fine-tuning (SFT) on a small curated legal instruction set and re-evaluate.

Test retrieval formats (not just appending laws) and measure whether retrieval helps or harms your outputs.
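
For the retrieval check in the last step, one simple approach is to score the same questions with and without retrieved law text and count which answers flipped. The sketch below assumes you already have per-question correctness flags for both runs; the function name and the returned keys are hypothetical.

```python
from typing import Dict, List, Union


def retrieval_effect(correct_plain: List[bool],
                     correct_retrieval: List[bool]) -> Dict[str, Union[int, float]]:
    """Compare per-question correctness with and without retrieved context.

    Both lists must be aligned: index i refers to the same question in each run.
    """
    if len(correct_plain) != len(correct_retrieval):
        raise ValueError("Both runs must cover the same questions.")
    if not correct_plain:
        raise ValueError("At least one question is required.")

    n = len(correct_plain)
    pairs = list(zip(correct_plain, correct_retrieval))
    return {
        "accuracy_plain": sum(correct_plain) / n,
        "accuracy_retrieval": sum(correct_retrieval) / n,
        # Wrong without retrieval, right with it.
        "helped": sum(1 for plain, retr in pairs if not plain and retr),
        # Right without retrieval, wrong with it.
        "hurt": sum(1 for plain, retr in pairs if plain and not retr),
    }


report = retrieval_effect(
    correct_plain=[True, True, True, False],
    correct_retrieval=[True, False, False, True],
)
print(report)
```

Looking at the helped and hurt counts separately, rather than only the net accuracy change, shows whether a retrieval format is fixing some answers while breaking others.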

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/open-compass/LawBench/

Data URLs

https://github.com/open-compass/LawBench/

Risks & Boundaries

Limitations

Possible test data leakage: models may have seen training data or near-duplicate examples.

Evaluation for generative tasks relies on Rouge-L and hand-crafted extraction rules, which miss legal nuance.
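
To illustrate the Rouge-L limitation, here is a minimal sketch of the metric as an F1 score over the longest common subsequence (LCS) of tokens. The whitespace tokenization and the equal weighting of precision and recall are assumptions for this example; the benchmark's own scoring of Chinese text may tokenize and weight differently.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Rouge-L as the F1 of LCS-based precision and recall (whitespace tokens)."""
    ref_tokens: List[str] = reference.split()
    cand_tokens: List[str] = candidate.split()
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A legally opposite answer still scores high on surface overlap.
score = rouge_l_f1("the defendant is guilty", "the defendant is not guilty")
print(f"{score:.3f}")  # 0.889
```

The two sentences reach opposite legal conclusions, yet the score is close to 0.89, which is the kind of nuance an overlap metric cannot capture.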

When Not To Use

Do not use these models for unattended legal advice or final judgments.

Avoid trusting raw model outputs where legal liability exists without expert review.

Failure Modes

Hallucination: plausible but incorrect legal citations or reasoning.

Misuse of retrieved context: appended law text can confuse models and lower accuracy.

Core Entities

Models

GPT-4, ChatGPT, Qwen-Chat, InternLM-Chat, Baichuan-13B-Chat, LLaMA-2, Ziya-LLaMA-13B, Fuzi-Mingcha, ChatLaw-13B, LexiLaw, HanFei, LaWGPT

Metrics

Accuracy, F1, rc-F1, soft-F1, nLog-distance, F0.5, Rouge-L

Datasets

CAIL2018, CAIL2019, CAIL2021, CAIL2022, JEC-QA, LEVEN, LAIC2021, FLK (national law database), CrimeKgAssistant

Benchmarks

LawBench, LEGALBENCH, lexglue

Context Entities

Models

LLaMA, Vicuna, StableBeluga2, MPT, ChatGLM2, Baichuan-7B, BELLE-LLaMA-2

Metrics

abstention rate

Datasets

AiStudio marriage dataset, CrimeKgAssitant, public scenario Q&A generated via chatGPT (used then manually filtered)

Benchmarks

MMLU, HELM, OpenCompass

