LawBench: a 20-task Chinese legal benchmark measuring memorization, understanding, and application across 51 LLMs

September 28, 2023 | 8 min

Overview

Production Readiness

0.3

Novelty Score

0.35

Cost Impact Score

0.4

Citation Count

19

Authors

Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Songyang Zhang, Kai Chen, Zongwen Shen, Jidong Ge

Why It Matters For Business

LawBench shows that even top LLMs are unreliable for legal judgments; businesses should treat model outputs as draft assistance, not legal advice, and validate with experts.

Summary TLDR

LawBench is a focused benchmark that tests large language models on 20 Chinese legal tasks across three skills: memorizing statutes, understanding legal text, and applying law to cases. The authors evaluate 51 LLMs (general, Chinese-oriented, and legal-specific). GPT-4 leads but scores remain far from human-ready: GPT-4 averages ~52% zero-shot. Fine-tuning on legal data helps, scaling improves one-shot performance, and simply appending law text (retrieval) often hurts model answers. All data, predictions and code are released on GitHub.

Problem Statement

We lack a systematic, Chinese-law-specific benchmark that measures whether LLMs actually store legal rules, read legal text accurately, and apply law to real cases. Existing tests (bar exams, English datasets) do not cover the needs of China's civil-law system or realistic legal tasks.

Main Contribution

Design and release of LawBench: 20 tasks (single-label classification (SLC), multi-label classification (MLC), regression, extraction, generation) mapped to three cognitive levels: memorization, understanding, and application.

Large-scale evaluation of 51 LLMs (multilingual, Chinese-oriented, legal-specific) in zero-shot and one-shot settings using task-specific answer-extraction rules.

Analyses of scaling, supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), legal fine-tuning, retrieval augmentation, and data contamination risks.

Open release of data, model predictions and evaluation code via OpenCompass for reproducibility.

Key Findings

GPT-4 is the best model on LawBench but far from perfect

Numbers: GPT-4 average zero-shot 52.35 (Table 26)

General Chinese-oriented LLMs often beat small legal-specific LLMs

Numbers: Top Chinese chat models (Qwen-Chat, InternLM-Chat) outperform many legal LLMs (Table 26)

Simple retrieval (adding article text) usually degrades performance

Numbers: Most models score lower on the prison-term task when article content is appended (Figure 5)

Supervised fine-tuning (SFT) helps; RLHF can raise abstentions and hurt task scores

Numbers: SFT models improve scores; RLHF variants show higher abstention and score dips (Figure 6)

Scaling helps one-shot performance more consistently than zero-shot

Numbers: Larger models improve more in one-shot than in zero-shot (Figure 4)

Data contamination is a real risk

Numbers: Fuzi-Mingcha scored 97.59 on task 2-5 while other models scored below 65, suggesting leakage (Table 3)
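The contamination signal described here (one model far ahead of every other on a single task) can be screened for mechanically. The sketch below is a minimal illustration, not the authors' procedure: the function name, the 25-point margin, and the placeholder scores for `model_a` to `model_c` are assumptions; only the 97.59 figure comes from the text.

```python
from statistics import median


def flag_suspicious_scores(scores, margin=25.0):
    """Return models whose score exceeds the median of all other models by more than `margin`.

    `margin` is an arbitrary illustrative threshold, not a value from the paper.
    """
    flagged = []
    for name, score in scores.items():
        others = [s for n, s in scores.items() if n != name]
        if others and score - median(others) > margin:
            flagged.append(name)
    return flagged


# Only Fuzi-Mingcha's 97.59 is taken from the text; the other entries are
# hypothetical placeholders chosen to sit below 65.
task_2_5 = {"Fuzi-Mingcha": 97.59, "model_a": 64.0, "model_b": 58.5, "model_c": 41.2}
print(flag_suspicious_scores(task_2_5))
```

A flag from a check like this is only a prompt for manual inspection of training data overlap; a large gap can also reflect genuine task-specific fine-tuning.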

Results

GPT-4 average (zero-shot)

Value: 52.35

ChatGPT average (zero-shot)

Value: 42.15

Top legal-specific model average (zero-shot)

Value: 33.05

Baseline: compared to GPT-4 52.35

Task: Prison term prediction w/o article (3-4)

Value: High performance across models (example: GPT-4 82.62)

Task: Article recitation (1-1)

Value: Very low scores across models (GPT-4 15.38)

What To Try In 7 Days

Run your use-case tasks on LawBench or sample tasks to estimate model gaps quickly.

If using open models, try supervised fine-tuning (SFT) on a small curated legal instruction set and re-evaluate.

Test retrieval formats (not just appending laws) and measure whether retrieval helps or harms your outputs.
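The retrieval test in the last item can be run as a small A/B harness. The following is a sketch under stated assumptions: `ask_model` is a hypothetical callable wrapping whatever model you use, the example dictionaries use assumed keys (`question`, `article`, `answer`), the three prompt styles are illustrative, and exact-match scoring stands in for a task-specific metric.

```python
from typing import Callable, Dict, List, Sequence


def build_prompt(question: str, article: str = "", style: str = "none") -> str:
    """Format a prompt with no retrieval, naive appending, or a labeled context block."""
    if style == "append":
        return f"{question}\n\n{article}"
    if style == "prefixed":
        return f"Relevant article:\n{article}\n\nQuestion:\n{question}"
    return question


def accuracy_by_style(
    examples: List[Dict[str, str]],
    ask_model: Callable[[str], str],  # hypothetical: takes a prompt, returns the model's answer
    styles: Sequence[str] = ("none", "append", "prefixed"),
) -> Dict[str, float]:
    """Exact-match accuracy for each prompt style over the same examples."""
    results = {}
    for style in styles:
        correct = 0
        for ex in examples:
            prompt = build_prompt(ex["question"], ex.get("article", ""), style)
            if ask_model(prompt).strip() == ex["answer"]:
                correct += 1
        results[style] = correct / len(examples) if examples else 0.0
    return results
```

Comparing the `none` column against the others on your own data shows directly whether a given retrieval format helps or harms, rather than assuming it helps.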

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Possible test data leakage: models may have seen training data or near-duplicate examples.
  • Evaluation for generative tasks relies on Rouge-L and hand-crafted extraction rules, which miss legal nuance.
  • Long-document tasks (legal retrieval) are not included due to token limits.
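To see why an overlap metric such as Rouge-L can miss legal nuance, here is a minimal character-level sketch based on the longest common subsequence (LCS). It is an illustration only: the benchmark's actual tokenization and F-measure weighting may differ, and the balanced F1 used here is a simplifying assumption.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Character-level Rouge-L with a balanced F1 (an assumed simplification)."""
    ref, cand = list(reference), list(candidate)
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# "The defendant is guilty" vs "The defendant is not guilty": one character
# differs, so the overlap score stays high although the meaning is reversed.
print(round(rouge_l_f1("被告人有罪", "被告人无罪"), 2))
```

The two strings share four of five characters in order, so the score is about 0.8 even though the conclusions are opposite.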

When Not To Use

  • Do not use these models for unattended legal advice or final judgments.
  • Avoid trusting raw model outputs where legal liability exists without expert review.
  • Do not assume retrieval-by-appending law text will improve answers without testing.

Failure Modes

  • Hallucination: plausible but incorrect legal citations or reasoning.
  • Misuse of retrieved context: appended law text can confuse models and lower accuracy.
  • Abstention shifts: RLHF variants may refuse questions and reduce coverage.
  • Data contamination: inflated scores when test data leaked into training.
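Abstention shifts are easy to track alongside task scores. The sketch below assumes a simple marker-based definition of abstention; the marker list and the function name are hypothetical, and a real evaluation would need markers tuned to each model's refusal phrasing.

```python
# Hypothetical refusal phrases; extend for the models you evaluate.
REFUSAL_MARKERS = ("无法回答", "不能提供", "I cannot", "I can't")


def abstention_rate(responses):
    """Fraction of responses that are empty or contain a refusal marker."""
    if not responses:
        return 0.0
    abstained = sum(
        1
        for r in responses
        if not r.strip() or any(marker in r for marker in REFUSAL_MARKERS)
    )
    return abstained / len(responses)


# Two of these four illustrative responses count as abstentions.
print(abstention_rate(["三年", "", "抱歉，我无法回答这个问题", "A"]))
```

Reporting this rate next to accuracy helps separate "the model answered wrongly" from "the model declined to answer", which matters when comparing RLHF and non-RLHF variants.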

Core Entities

Models

  • GPT-4
  • ChatGPT
  • Qwen-Chat
  • InternLM-Chat
  • Baichuan-13B-Chat
  • LLaMA-2
  • Ziya-LLaMA-13B
  • Fuzi-Mingcha
  • ChatLaw-13B
  • LexiLaw
  • HanFei
  • LaWGPT

Metrics

  • Accuracy
  • F1
  • rc-F1
  • soft-F1
  • nLog-distance
  • F0.5
  • Rouge-L

Datasets

  • CAIL2018
  • CAIL2019
  • CAIL2021
  • CAIL2022
  • JEC-QA
  • LEVEN
  • LAIC2021
  • FLK (national law database)
  • CrimeKgAssistant

Benchmarks

  • LawBench
  • LegalBench
  • LexGLUE

Context Entities

Models

  • LLaMA
  • Vicuna
  • StableBeluga2
  • MPT
  • ChatGLM2
  • Baichuan-7B
  • BELLE-LLaMA-2

Metrics

  • abstention rate

Datasets

  • AiStudio marriage dataset
  • CrimeKgAssitant
  • Public scenario Q&A generated via ChatGPT, then manually filtered

Benchmarks

  • MMLU
  • HELM
  • OpenCompass