Fine-tune a Chinese 13B LLM with legal syllogism data plus retrieval to build a practical legal assistant and benchmark

September 20, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The system shows concrete improvements on an in-house benchmark and releases code/data; real-world deployment requires legal validation, retrieval maintenance, and human oversight.

Citations: 24

Evidence Strength: 0.70

Confidence: 0.75

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 50%

Novelty: 50%

Authors

Shengbin Yue, Wei Chen, Siyuan Wang, Bingxuan Li, Chenchen Shen, Shujun Liu, Yuxuan Zhou, Yao Xiao, Song Yun, Xuanjing Huang, Zhongyu Wei


Why It Matters For Business

Fine-tuning a mid-size Chinese LLM with focused legal instruction data and a small retrieval KB yields measurable gains in legal QA and advice; this reduces manual review and makes legal tools more practical.

Summary TLDR

The authors build DISC-LawLLM by supervised fine-tuning of a Baichuan-13B base model on DISC-Law-SFT, a 403K-sample law-focused dataset constructed with legal syllogism prompting. They add a document retriever (Top-K) over a local knowledge base of statutes and cases so that the model can cite legal texts. They also release DISC-Law-Eval, a benchmark that mixes objective multiple-choice questions with subjective free-answer questions. On this benchmark, DISC-LawLLM (13B) beats several open legal LLMs and outperforms GPT-3.5-turbo on average objective accuracy (37.10% vs 34.10%); on the subjective tasks it reaches an average referee score of 3.39/5. Code, data, and weights are released on GitHub.

Problem Statement

Generic LLMs lack the specialized legal reasoning and up-to-date statute access needed for trustworthy legal services. The proposed fix has two parts: teach the LLM legal syllogism (law + facts -> conclusion), and add retrieval so that answers cite current statutes and hallucinate less.
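The syllogism structure (law as major premise, facts as minor premise, then a conclusion) can be illustrated with a small formatting helper. This is a minimal sketch only: the class name, field names, and the output template are hypothetical and are not taken from the DISC-Law-SFT data format, and the sample content is a placeholder rather than a real statute.

```python
from dataclasses import dataclass


@dataclass
class SyllogismExample:
    """One answer organized as major premise, minor premise, conclusion."""
    law: str         # major premise: the applicable legal provision
    facts: str       # minor premise: the facts of the case
    conclusion: str  # the outcome of applying the law to the facts


def format_response(example: SyllogismExample) -> str:
    """Render the answer in syllogism order (hypothetical template)."""
    return (
        f"Applicable law: {example.law}\n"
        f"Facts: {example.facts}\n"
        f"Conclusion: {example.conclusion}"
    )


# Placeholder content, not a real provision or case.
sample = SyllogismExample(
    law="<text of the relevant provision>",
    facts="<summary of the user's situation>",
    conclusion="<what follows when the provision is applied to the facts>",
)
print(format_response(sample))
```

Keeping the three parts as separate fields makes it easy to check that every training target states the law before the conclusion.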

Main Contribution

DISC-Law-SFT: a 403K supervised fine-tuning dataset for Chinese legal tasks built from public legal datasets, crawled legal text, and open instruction corpora using legal-syllogism prompting and LCoT.

DISC-LawLLM: a Baichuan-13B-based model fine-tuned on DISC-Law-SFT and adapted to incorporate retrieved statute/case references at inference.
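The retrieval step can be sketched as follows: score each knowledge-base passage against the question, keep the Top-K, and place them in front of the question. This is an assumption-heavy toy, not the authors' retriever. The bag-of-words cosine scorer, the function names, the prompt layout, and the `KB` entries (placeholders, not real statutes) are all hypothetical, and whitespace tokenization would need replacing for Chinese text.

```python
import math
from collections import Counter

# Placeholder passages standing in for a statutes/cases knowledge base.
KB = [
    "placeholder statute on lease deposit refund by landlord",
    "placeholder statute on traffic accident liability",
    "placeholder case on lease termination notice",
]


def tokenize(text: str) -> list[str]:
    # Whitespace split; fine for this English toy, not for Chinese text.
    return text.lower().split()


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_top_k(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return up to k passages with non-zero similarity, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]


def build_prompt(question: str, passages: list[str], k: int = 2) -> str:
    """Prepend the retrieved passages to the question (hypothetical layout)."""
    refs = retrieve_top_k(question, passages, k)
    ref_block = "\n".join(f"[{i + 1}] {r}" for i, r in enumerate(refs))
    return f"References:\n{ref_block}\n\nQuestion: {question}\nAnswer:"


print(build_prompt("landlord refuses deposit refund after lease ends", KB))
```

Dropping zero-score passages means an unrelated question gets an empty reference block instead of irrelevant text, which is one simple guard against weak matches being cited.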

Key Findings

Large, law-specific SFT dataset built for training.

Numbers: DISC-Law-SFT total size 403K samples

Practical Use: If you need a legal-tuned model, start with a diverse 100k+ SFT mix from annotated tasks, raw law text, and curated instructions.

Evidence Ref: Table 1; Sec.3.4

Fine-tuned 13B model improves objective benchmark accuracy over GPT-3.5-turbo on evaluated tests.

Numbers: Average accuracy: DISC-LawLLM 37.10% vs GPT-3.5-turbo 34.10% (+3.0 percentage points)

Practical Use: Domain SFT can yield modest absolute accuracy gains on legal exam style questions; expect small-to-moderate improvements over instruction-tuned large models.

Evidence Ref: Table 2; Sec.6.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 37.10% | GPT-3.5-turbo 34.10% | +3.00 pp | DISC-Law-Eval (objective multiple-choice) | Table 2 reports average accuracies across subjects | Table 2; Sec.6.1
Accuracy | 42.09% | GPT-3.5-turbo 36.5% | +5.59 pp | NJE (Hard, single-answer) | Table 2 NJE S column | Table 2; Sec.6.1
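The deltas in the table are absolute differences in percentage points. The short check below recomputes them from the reported values and also shows the relative gain over the baseline, which is not reported in the source and is added here only for illustration. The function names are hypothetical.

```python
def delta_points(model_pct: float, baseline_pct: float) -> float:
    """Absolute difference in percentage points."""
    return round(model_pct - baseline_pct, 2)


def relative_gain(model_pct: float, baseline_pct: float) -> float:
    """Improvement as a percentage of the baseline score."""
    return round(100 * (model_pct - baseline_pct) / baseline_pct, 1)


rows = [
    ("DISC-Law-Eval objective average", 37.10, 34.10),
    ("NJE (Hard, single-answer)", 42.09, 36.5),
]
for name, model, baseline in rows:
    print(name, delta_points(model, baseline), relative_gain(model, baseline))
```

Reading the two numbers together helps: a 3-point absolute gain at this accuracy level is roughly a 9% relative improvement over the baseline.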

What To Try In 7 Days

Assemble about 10k domain-specific Q&A pairs and case snippets, then run SFT on a 7–13B base model.

Add a small local statutes index and a Top-K retriever; prepend retrieved passages to prompts.

Build a short subjective test set and use an LLM judge (e.g., GPT-3.5) to get quick quality scores on accuracy/completeness/clarity.
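For the LLM-judge step, the bookkeeping can be sketched without any API calls: parse each judge reply into scores for the three criteria and average them. The reply format ("accuracy: 4" and so on), the 1 to 5 scale per criterion, and the function names are assumptions made for this sketch, and the replies below are made-up strings, not real judge outputs.

```python
import re
from statistics import mean

CRITERIA = ("accuracy", "completeness", "clarity")


def parse_scores(reply: str) -> dict[str, int]:
    """Extract one 1-5 score per criterion from a judge reply."""
    scores = {}
    for name in CRITERIA:
        match = re.search(rf"{name}\s*[:=]\s*([1-5])\b", reply, flags=re.IGNORECASE)
        if match is None:
            raise ValueError(f"no valid {name} score in reply: {reply!r}")
        scores[name] = int(match.group(1))
    return scores


def summarize(replies: list[str]) -> dict[str, float]:
    """Average each criterion over all replies, plus an overall average."""
    parsed = [parse_scores(r) for r in replies]
    summary = {name: mean(p[name] for p in parsed) for name in CRITERIA}
    summary["average"] = mean(summary[name] for name in CRITERIA)
    return summary


# Made-up judge replies for illustration only.
example_replies = [
    "Accuracy: 4\nCompleteness: 3\nClarity: 5",
    "accuracy=2, completeness=3, clarity=4",
]
print(summarize(example_replies))
```

Raising an error on unparseable replies, rather than silently skipping them, keeps a malformed judge output from inflating or deflating the averages unnoticed.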

Agent Features

Tool Use: external retriever (Top-K)
Architectures: decoder-only transformer

Optimization Features

Infra Optimization: training on 8×A800 GPUs
System Optimization: DeepSpeed used to reduce training cost
Training Optimization: SFT for 2 epochs, lr=5e-5, global batch size=256
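The reported settings (8 GPUs, global batch size 256, 2 epochs, about 403K samples) fix the relationship between per-device batch size and gradient accumulation. The arithmetic below is a sketch under stated assumptions: the per-device batch size of 4 is hypothetical (the source does not give it), 403,000 is a rounded sample count, and the last partial batch is assumed to be kept.

```python
import math


def grad_accum_steps(global_batch: int, num_gpus: int, per_device_batch: int) -> int:
    """Accumulation steps needed so that
    num_gpus * per_device_batch * steps == global_batch."""
    per_step = num_gpus * per_device_batch
    if global_batch % per_step != 0:
        raise ValueError("global batch must be a multiple of num_gpus * per_device_batch")
    return global_batch // per_step


def optimizer_steps(num_samples: int, global_batch: int, epochs: int) -> int:
    """Total optimizer updates, keeping the last partial batch of each epoch."""
    return math.ceil(num_samples / global_batch) * epochs


# per_device_batch=4 is an assumed value, not taken from the source.
print(grad_accum_steps(global_batch=256, num_gpus=8, per_device_batch=4))
# 403_000 approximates the reported "403K" dataset size.
print(optimizer_steps(num_samples=403_000, global_batch=256, epochs=2))
```

A smaller per-device batch lowers memory use per GPU at the cost of more accumulation steps per update; the global batch size, and therefore the optimization behavior, stays the same.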

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

The benchmark and evaluation rely on in-house collections and a GPT-3.5 referee, which may introduce dataset bias and judge bias.

Objective accuracy numbers remain modest in absolute terms (average accuracy ~37%), so outputs need expert review in practice.

When Not To Use

For final legal advice or binding decisions without lawyer review.

In jurisdictions or languages not covered by the knowledge base.

Failure Modes

Hallucinated legal citations when retrieval fails or returns weak matches.

Overconfident but incorrect legal reasoning on edge or ambiguous fact patterns.

Core Entities

Models

DISC-LawLLM, Baichuan-13B-Base, Baichuan-13B-Chat, ChatGLM-6B, GPT-3.5-turbo, Chinese-alpaca2, LawGPT, ChatLaw, Lawyer-LLaMA, LexiLaw

Metrics

Accuracy, subjective ACC, subjective CPL, subjective CLR, average score

Datasets

SFT, DISC-Law-Eval, CAIL2018, JEC-QA, CJRC, LEVEN, CAIL2020-sfzy, CAIL2022-yqzy, Alpaca-GPT4, Firefly

Benchmarks

DISC-Law-Eval

Context Entities

Models

GPT-4 (referenced), ChatGPT (referenced as judge capability)

Metrics

few-shot matching extraction (answer parsing)

Datasets

Lawyer-LLaMa datasets, LawGPT-zh data, COIG-PC