Fine-tune a Chinese 13B LLM with legal syllogism data plus retrieval to build a practical legal assistant and benchmark

September 20, 2023 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.5

Cost Impact Score

0.5

Citation Count

24

Authors

Shengbin Yue, Wei Chen, Siyuan Wang, Bingxuan Li, Chenchen Shen, Shujun Liu, Yuxuan Zhou, Yao Xiao, Song Yun, Xuanjing Huang, Zhongyu Wei

Why It Matters For Business

Fine-tuning a mid-size (13B) Chinese LLM on focused legal instruction data and pairing it with a small retrieval knowledge base yields measurable gains in legal QA and advice. That can reduce manual review effort and make legal tools more practical, though outputs still need expert oversight.

Summary TLDR

The authors build DISC-LawLLM by supervised fine-tuning a Baichuan-13B base model on DISC-Law-SFT, a 403K-sample legal dataset constructed with legal syllogism prompting. They add a document retriever over a local knowledge base of statutes and cases (Top-K retrieval) so the model can cite legal texts. They also release DISC-Law-Eval, a benchmark that mixes objective multiple-choice tests with subjective free-answer tests. On this benchmark DISC-LawLLM (13B) beats several open legal LLMs, outperforms GPT-3.5-turbo on average objective accuracy (37.10% vs 34.10%), and reaches an average subjective referee score of 3.39/5 (vs 2.87/5 for ChatGLM). Code, data, and weights are released on GitHub.

Problem Statement

Generic LLMs lack the specialized legal reasoning and up-to-date statute access needed for trustworthy legal services. The gap: teach an LLM legal syllogism (laws + facts -> conclusion) and add retrieval so answers cite current statutes and reduce hallucination.
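The syllogism structure (law as major premise, facts as minor premise, conclusion as the judgment) can be sketched as a training-pair builder. This is a minimal illustration only: the `SyllogismSample` class, the `build_sft_pair` helper, the field names, and the output layout are all hypothetical and are not taken from the DISC-LawLLM release.

```python
from dataclasses import dataclass


@dataclass
class SyllogismSample:
    """One training example; field names are illustrative, not from the paper."""
    law: str         # major premise: the applicable statute text
    facts: str       # minor premise: the facts of the case
    conclusion: str  # the judgment that follows from law + facts


def build_sft_pair(question: str, sample: SyllogismSample) -> dict:
    """Return an instruction/output pair whose answer walks through the syllogism."""
    response = (
        f"Applicable law: {sample.law}\n"
        f"Facts: {sample.facts}\n"
        f"Conclusion: {sample.conclusion}"
    )
    return {"instruction": question, "output": response}


# Placeholder strings only; no real statute or case is quoted here.
example = SyllogismSample(
    law="<statute text goes here>",
    facts="<case facts go here>",
    conclusion="<judgment goes here>",
)
pair = build_sft_pair("What is the likely outcome of this dispute?", example)
print(pair["output"])
```

Keeping the three parts as separate fields makes it easy to check that every training answer actually states a law and a fact pattern before it states a conclusion.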

Main Contribution

DISC-Law-SFT: a 403K supervised fine-tuning dataset for Chinese legal tasks built from public legal datasets, crawled legal text, and open instruction corpora using legal-syllogism prompting and LCoT.

DISC-LawLLM: a Baichuan-13B-based model fine-tuned on DISC-Law-SFT and adapted to incorporate retrieved statute/case references at inference.

DISC-Law-Eval: a benchmark combining objective multiple-choice exams with subjective free-answer cases judged by GPT-3.5, measuring legal knowledge, reasoning, completeness, and clarity.

Open release: datasets, model weights and retrieval code published to GitHub.

Key Findings

Large, law-specific SFT dataset built for training.

Numbers: DISC-Law-SFT total size 403K samples

Fine-tuned 13B model improves objective benchmark accuracy over GPT-3.5-turbo on evaluated tests.

Numbers: Average accuracy: DISC-LawLLM 37.10% vs GPT-3.5 34.10% (Δ +3.00 points)

Largest gains on hard multi-answer exam questions such as NJE.

Numbers: NJE multi-answer: DISC-LawLLM 19.87% vs GPT-3.5 10.58% (Δ +9.29 points)

Subjective quality improved per LLM referee scores.

Numbers: DISC-LawLLM avg subjective score 3.39/5 vs ChatGLM 2.87/5 (Δ +0.52)

Retrieval added to reduce hallucinations and reference statutes.

Numbers: Knowledge base covers >50 law categories; model uses Top-K retrieved docs
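As a rough sketch of Top-K retrieval feeding a prompt, the snippet below ranks knowledge-base passages against a query and prepends the best matches. It is a toy under stated assumptions: the bag-of-words cosine scorer is a stand-in (the summary does not say which retriever the authors use), the helper names and prompt layout are hypothetical, and the three passages are made-up placeholders, not real statutes.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Whitespace split is enough for this English toy; Chinese text would
    # need a proper word segmenter instead.
    return text.lower().split()


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the query, best first."""
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Prepend the retrieved passages as numbered references (hypothetical layout)."""
    refs = top_k(query, docs, k)
    ref_block = "\n".join(f"[{i + 1}] {r}" for i, r in enumerate(refs))
    return f"References:\n{ref_block}\n\nQuestion: {query}\nAnswer:"


# Made-up placeholder passages, not quotations from any real law.
kb = [
    "labor contract termination requires written notice",
    "traffic accident liability is decided by fault",
    "a lease contract longer than six months should be in writing",
]
prompt = build_prompt("terminate labor contract notice", kb, k=2)
print(prompt)
```

Numbering the references lets the model cite passages by index, which makes it simpler to check afterwards whether a cited source was actually retrieved.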

Results

Accuracy (average, objective tasks)

Value: 37.10%

Baseline: GPT-3.5-turbo 34.10%

Accuracy

Value: 42.09%

Baseline: GPT-3.5-turbo 36.5%

Accuracy (NJE multi-answer)

Value: 19.87%

Baseline: GPT-3.5-turbo 10.58%

Average subjective score

Value: 3.39 / 5

Baseline: ChatGLM 2.87 / 5
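Objective accuracy depends on how option letters are parsed from free-form model output and how multi-answer questions are scored. The sketch below assumes an "Answer:" marker and strict exact-set matching for multi-answer items; both are illustrative assumptions, and the benchmark's actual parsing and scoring rules are not given in this summary.

```python
import re


def extract_choices(model_output: str) -> frozenset[str]:
    """Collect option letters A-D that follow the last 'Answer:' marker.

    The marker and the A-D option range are assumptions for this sketch.
    """
    tail = model_output.rsplit("Answer:", 1)[-1]
    return frozenset(re.findall(r"\b[A-D]\b", tail))


def exact_match_accuracy(outputs: list[str], gold: list[set[str]]) -> float:
    """A multi-answer item counts only if the predicted set equals the gold set."""
    hits = sum(extract_choices(o) == frozenset(g) for o, g in zip(outputs, gold))
    return hits / len(gold)


# Toy outputs, not real model generations.
outputs = ["Reasoning... Answer: A, C", "Answer: B", "Answer: A, B, D"]
gold = [{"A", "C"}, {"B"}, {"A", "D"}]
acc = exact_match_accuracy(outputs, gold)
print(f"{acc:.2%}")
```

Under exact-set matching, the third toy item scores zero even though two of its three letters are right, which is one reason multi-answer accuracy tends to sit well below single-answer accuracy.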

What To Try In 7 Days

Assemble 10k domain-specific Q&A and case snippets and run SFT on a 7–13B base model.

Add a small local statutes index and a Top-K retriever; prepend retrieved passages to prompts.

Build a short subjective test set and use an LLM judge (e.g., GPT-3.5) to get quick quality scores on accuracy/completeness/clarity.
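The LLM-judge step can be prototyped without committing to any particular API by passing the judge in as a plain function. The sketch below is hypothetical throughout: the prompt wording, the `key=value` reply format, and the `fake_judge` stub are assumptions for illustration, and in practice the stub would be replaced with a call to whichever model serves as referee.

```python
import re
from statistics import mean
from typing import Callable

CRITERIA = ("accuracy", "completeness", "clarity")


def judge_prompt(question: str, reference: str, answer: str) -> str:
    """Build the grading request (wording is illustrative, not from the paper)."""
    return (
        "Rate the answer from 1 to 5 on each criterion.\n"
        f"Question: {question}\nReference: {reference}\nAnswer: {answer}\n"
        "Reply as: accuracy=<n> completeness=<n> clarity=<n>"
    )


def parse_scores(reply: str) -> dict[str, int]:
    """Extract one 1-5 score per criterion; fail loudly if any is missing."""
    found = dict(
        re.findall(r"(accuracy|completeness|clarity)\s*=\s*([1-5])", reply.lower())
    )
    missing = [c for c in CRITERIA if c not in found]
    if missing:
        raise ValueError(f"judge reply lacks scores for: {missing}")
    return {c: int(found[c]) for c in CRITERIA}


def score_answer(
    judge: Callable[[str], str], question: str, reference: str, answer: str
) -> float:
    """Average the three criterion scores for one answer."""
    scores = parse_scores(judge(judge_prompt(question, reference, answer)))
    return mean(scores.values())


def fake_judge(prompt: str) -> str:
    # Stand-in for a real model call; always returns the same grades.
    return "accuracy=4 completeness=3 clarity=5"


avg = score_answer(fake_judge, "sample question", "sample reference", "sample answer")
print(avg)
```

Raising on a malformed judge reply, rather than silently defaulting to a score, keeps parsing failures from skewing the averages.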

Agent Features

Tool Use

  • external retriever (Top-K)

Architectures

  • decoder-only transformer

Optimization Features

Infra Optimization

  • training on 8×A800 GPUs

System Optimization

  • DeepSpeed used to reduce training cost

Training Optimization

  • SFT
  • 2 epochs, lr=5e-5, global batch size=256
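The listed settings imply a rough step budget, worked out below. The per-GPU micro-batch size is an assumption (the summary does not state it), and 403K is treated as exactly 403,000 samples, so the step counts are approximate. The config dict uses standard DeepSpeed batch-size keys but is only a fragment, not the authors' configuration.

```python
import math

num_samples = 403_000   # DISC-Law-SFT size as reported (403K, rounded)
global_batch = 256      # reported global batch size
epochs = 2              # reported
num_gpus = 8            # reported: 8 x A800
per_device_batch = 4    # ASSUMPTION: not stated in the summary

# Global batch = GPUs x per-GPU micro-batch x gradient accumulation steps.
grad_accum = global_batch // (num_gpus * per_device_batch)
assert grad_accum * num_gpus * per_device_batch == global_batch

steps_per_epoch = math.ceil(num_samples / global_batch)
total_steps = steps_per_epoch * epochs

# Fragment of a DeepSpeed-style config carrying the same three numbers.
ds_config = {
    "train_batch_size": global_batch,
    "train_micro_batch_size_per_gpu": per_device_batch,
    "gradient_accumulation_steps": grad_accum,
}

print(f"grad_accum={grad_accum}, steps/epoch={steps_per_epoch}, total={total_steps}")
```

With these assumptions the run comes to roughly 3,150 optimizer steps in total; changing the assumed micro-batch only shifts the gradient-accumulation count, not the number of optimizer steps.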

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • The benchmark and evaluation rely on in-house collections and a GPT-3.5 referee, which may introduce dataset bias and judge bias.
  • Objective accuracy numbers remain modest in absolute terms (average accuracy ~37%), so outputs need expert review in practice.
  • Evaluation focused on Chinese legal exams and consultations; results may not generalize to other legal systems or languages.

When Not To Use

  • For final legal advice or binding decisions without lawyer review.
  • In jurisdictions or languages not covered by the knowledge base.
  • When up-to-the-minute statute changes are critical and KB updates lag behind them or retrieval latency is too high.

Failure Modes

  • Hallucinated legal citations when retrieval fails or returns weak matches.
  • Overconfident but incorrect legal reasoning on edge or ambiguous fact patterns.
  • Bias or gaps from training data (exam-focused samples) leading to blind spots.

Core Entities

Models

  • DISC-LawLLM
  • Baichuan-13B-Base
  • Baichuan-13B-Chat
  • ChatGLM-6B
  • GPT-3.5-turbo
  • Chinese-alpaca2
  • LawGPT
  • ChatLaw
  • Lawyer-LLaMA
  • LexiLaw

Metrics

  • Accuracy
  • subjective ACC (accuracy)
  • subjective CPL (completeness)
  • subjective CLR (clarity)
  • average score

Datasets

  • DISC-Law-SFT
  • DISC-Law-Eval
  • CAIL2018
  • JEC-QA
  • CJRC
  • LEVEN
  • CAIL2020-sfzy
  • CAIL2022-yqzy
  • Alpaca-GPT4
  • Firefly

Benchmarks

  • DISC-Law-Eval

Context Entities

Models

  • GPT-4 (referenced)
  • ChatGPT (referenced as judge capability)

Metrics

  • few-shot matching extraction (answer parsing)

Datasets

  • Lawyer-LLaMA datasets
  • LawGPT-zh data
  • COIG-PC