Fine-tune a Chinese 13B LLM with legal syllogism data plus retrieval to build a practical legal assistant and benchmark

Overview

Decision Snapshot: Needs Validation

The system shows concrete improvements on an in-house benchmark and releases code/data; real-world deployment requires legal validation, retrieval maintenance, and human oversight.

Citations: 24

Evidence Strength: 0.70

Confidence: 0.75

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 50%

Novelty: 50%

Authors

Shengbin Yue, Wei Chen, Siyuan Wang, Bingxuan Li, Chenchen Shen, Shujun Liu, Yuxuan Zhou, Yao Xiao, Song Yun, Xuanjing Huang, Zhongyu Wei

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Fine-tuning a mid-size Chinese LLM on focused legal instruction data and pairing it with a small retrieval knowledge base yields measurable gains in legal QA and advice. This can reduce manual review effort and make legal tools more practical, although outputs still need expert review.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

The authors build DISC-LawLLM by supervised fine-tuning of a Baichuan-13B base model on DISC-Law-SFT, a 403K-sample, law-focused SFT dataset constructed with legal syllogism prompting. They add a document retriever over a local knowledge base of statutes and cases (Top-K retrieval) so that the model cites legal texts. They also release DISC-Law-Eval, a mixed objective/subjective benchmark covering multiple-choice and free-answer legal tests. On this benchmark, DISC-LawLLM (13B) beats several open legal LLMs and outperforms GPT-3.5-turbo on average accuracy (37.10% vs 34.10% on objective tasks) and on subjective referee scores (average 3.39/5). Code, data, and weights are released on GitHub.

Problem Statement

Generic LLMs lack the specialized legal reasoning and up-to-date statute access needed for trustworthy legal services. The approach: teach an LLM legal syllogism (laws + facts -> conclusion) and add retrieval so that answers cite current statutes and hallucinate less.
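
The syllogism structure described above (laws + facts -> conclusion) can be illustrated with a small prompt builder. This is only a sketch: the function name, the section labels, and the placeholder strings are assumptions made for illustration and are not taken from the DISC-Law-SFT construction pipeline.

```python
def build_syllogism_prompt(statute: str, facts: str, question: str) -> str:
    """Lay out a legal question as major premise, minor premise, conclusion.

    Hypothetical template: the labels below are illustrative, not the
    wording used by the paper's authors.
    """
    return "\n".join([
        "Major premise (applicable law):",
        statute.strip(),
        "",
        "Minor premise (facts of the case):",
        facts.strip(),
        "",
        f"Question: {question.strip()}",
        "Conclusion: apply the law to the facts and state the outcome.",
    ])


# Placeholders stand in for real statute text and case facts.
prompt = build_syllogism_prompt(
    statute="<text of the relevant statute article>",
    facts="<facts of the case>",
    question="<user question>",
)
print(prompt)
```

Keeping the law and the facts in separate, labeled blocks makes it easy to check that the conclusion is derived from both premises rather than from the question alone.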

Main Contribution

DISC-Law-SFT: a 403K supervised fine-tuning dataset for Chinese legal tasks built from public legal datasets, crawled legal text, and open instruction corpora using legal-syllogism prompting and LCoT.

DISC-LawLLM: a Baichuan-13B-based model fine-tuned on DISC-Law-SFT and adapted to incorporate retrieved statute/case references at inference.

Key Findings

Large, law-specific SFT dataset built for training.

Numbers: DISC-Law-SFT total size 403K samples

Practical Use: If you need a legal-tuned model, start with a diverse 100k+ SFT mix from annotated tasks, raw law text, and curated instructions.

Evidence Ref: Table 1; Sec. 3.4

Fine-tuned 13B model improves objective benchmark accuracy over GPT-3.5-turbo on evaluated tests.

Numbers: Average accuracy: DISC-LawLLM 37.10% vs GPT-3.5-turbo 34.10% (Δ +3.0 percentage points)

Practical Use: Domain SFT can yield modest absolute accuracy gains on legal exam-style questions; expect small-to-moderate improvements over instruction-tuned large models.

Evidence Ref: Table 2; Sec. 6.1

Results

Metric	Value	Baseline	Delta (pp)	Split / Dataset	Evidence	Evidence Ref
Accuracy	37.10%	GPT-3.5-turbo 34.10%	+3.00	DISC-Law-Eval (objective multiple-choice)	Table 2 reports average accuracies across subjects	Table 2; Sec.6.1
Accuracy	42.09%	GPT-3.5-turbo 36.5%	+5.59	NJE (Hard, single-answer)	Table 2 NJE S column	Table 2; Sec.6.1
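
The deltas in the table are absolute differences in percentage points, which is easy to confuse with relative improvement. A short check, using only the values reported in the two rows above (the helper names are illustrative):

```python
# (label, reported accuracy %, baseline accuracy %) copied from the table rows.
rows = [
    ("DISC-Law-Eval objective average", 37.10, 34.10),
    ("NJE (Hard, single-answer)", 42.09, 36.50),
]


def delta_pp(value: float, baseline: float) -> float:
    """Absolute difference in percentage points."""
    return round(value - baseline, 2)


def relative_gain(value: float, baseline: float) -> float:
    """Improvement relative to the baseline, in percent."""
    return round(100 * (value - baseline) / baseline, 1)


for label, value, baseline in rows:
    print(f"{label}: {delta_pp(value, baseline):+.2f} pp absolute, "
          f"{relative_gain(value, baseline):+.1f}% relative")
```

The absolute figures reproduce the Delta column; the relative figures are derived here and do not appear in the source.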

What To Try In 7 Days

Assemble 10k domain-specific Q&A and case snippets and run SFT on a 7–13B base model.

Add a small local statutes index and a Top-K retriever; prepend retrieved passages to prompts.

Build a short subjective test set and use an LLM judge (e.g., GPT-3.5) to get quick quality scores on accuracy/completeness/clarity.
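
For the last step, the judge's per-answer scores need to be aggregated into something comparable across models. The sketch below assumes the judge output has already been parsed into 1-5 scores per criterion; the field names and the sample scores are made up for illustration, and taking the overall score as the mean of the three criteria is an assumption, not a detail confirmed by the source.

```python
from statistics import mean

CRITERIA = ("accuracy", "completeness", "clarity")


def aggregate(judgments: list[dict]) -> dict:
    """Average 1-5 judge scores per criterion, then across criteria."""
    for j in judgments:
        for c in CRITERIA:
            if not 1 <= j[c] <= 5:
                raise ValueError(f"{c} score out of range: {j[c]}")
    summary = {c: mean(float(j[c]) for j in judgments) for c in CRITERIA}
    # Assumption: overall score is the unweighted mean of the criteria.
    summary["average"] = mean(summary[c] for c in CRITERIA)
    return summary


# Hypothetical parsed judge outputs for two answers.
sample = [
    {"accuracy": 4, "completeness": 3, "clarity": 5},
    {"accuracy": 2, "completeness": 3, "clarity": 4},
]
scores = aggregate(sample)
print(scores)
```

Rejecting out-of-range scores early helps catch judge replies that were parsed incorrectly before they distort the averages.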

Agent Features

Tool Use

external retriever (Top-K)
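
A minimal sketch of Top-K retrieval with retrieved passages prepended to the prompt. Character-bigram cosine similarity is used here purely as a dependency-free stand-in; it is not the retriever described in the paper, and the toy passages are invented strings, not real statutes.

```python
import math
from collections import Counter


def bigrams(text: str) -> Counter:
    """Character bigram counts; whitespace is removed before counting."""
    text = "".join(text.lower().split())
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def top_k(query: str, passages: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Return the k highest-scoring (score, passage) pairs."""
    q = bigrams(query)
    scored = [(cosine(q, bigrams(p)), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(query: str, passages: list[str], k: int = 2) -> str:
    """Prepend the top-k passages, numbered, ahead of the question."""
    hits = top_k(query, passages, k)
    context = "\n".join(f"[{i + 1}] {p}" for i, (_, p) in enumerate(hits))
    return f"Reference passages:\n{context}\n\nQuestion: {query}"


# Toy knowledge base (invented text, for illustration only).
KB = [
    "contract breach damages compensation rule",
    "traffic accident liability insurance rule",
    "labor contract termination notice period",
]
QUERY = "notice period for terminating a labor contract"
print(build_prompt(QUERY, KB, k=2))
```

In a real system the scoring function would be replaced by a proper sparse or dense retriever, while the surrounding top-k selection and prompt assembly could stay the same.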

Architectures

decoder-only transformer

Optimization Features

Infra Optimization

training on 8×A800 GPUs

System Optimization

DeepSpeed used to reduce training cost

Training Optimization

SFT: 2 epochs, lr=5e-5, global batch size=256
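
A global batch size of 256 on 8 GPUs can be reached with different combinations of per-device batch size and gradient accumulation. The source reports only the global batch size and the GPU count; the per-device values below are assumptions used to show the arithmetic.

```python
def grad_accum_steps(global_batch: int, num_gpus: int, per_device_batch: int) -> int:
    """Gradient accumulation steps needed to reach the global batch size.

    global_batch = num_gpus * per_device_batch * accumulation_steps
    """
    per_step = num_gpus * per_device_batch
    if global_batch % per_step != 0:
        raise ValueError("global batch is not divisible by num_gpus * per_device_batch")
    return global_batch // per_step


GLOBAL_BATCH = 256  # from the source
NUM_GPUS = 8        # from the source (8 x A800)

# Per-device batch sizes are hypothetical; pick what fits in GPU memory.
for per_device in (1, 2, 4, 8, 16, 32):
    steps = grad_accum_steps(GLOBAL_BATCH, NUM_GPUS, per_device)
    print(f"per-device batch {per_device:>2} -> {steps:>2} accumulation steps")
```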

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/FudanDISC/DISC-LawLLM

Data URLs

https://github.com/FudanDISC/DISC-LawLLM

Risks & Boundaries

Limitations

The benchmark and evaluation rely on in-house collections and a GPT-3.5 referee, which may introduce dataset bias and judge bias.

Objective accuracy numbers remain modest in absolute terms (average accuracy ~37%), so outputs need expert review in practice.

When Not To Use

For final legal advice or binding decisions without lawyer review.

In jurisdictions or languages not covered by the knowledge base.

Failure Modes

Hallucinated legal citations when retrieval fails or returns weak matches.

Overconfident but incorrect legal reasoning on edge or ambiguous fact patterns.
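
One possible mitigation for the first failure mode is to gate citations on retrieval score, so that weak matches are never offered to the model as citable text. This is a sketch of that idea, not something the source describes; the threshold value, function names, and prompt wording are all assumptions.

```python
def select_citable(hits: list[tuple[float, str]], min_score: float = 0.3) -> list[tuple[float, str]]:
    """Keep only (score, passage) pairs at or above the threshold."""
    return [(score, passage) for score, passage in hits if score >= min_score]


def build_guarded_prompt(question: str, hits: list[tuple[float, str]],
                         min_score: float = 0.3) -> str:
    """Offer passages for citation only when retrieval looks reliable."""
    usable = select_citable(hits, min_score)
    if not usable:
        # Hypothetical fallback wording: discourage invented citations.
        return ("No sufficiently relevant legal text was retrieved. "
                "Answer without citing specific articles and state that "
                "the answer needs expert review.\n\n"
                f"Question: {question}")
    refs = "\n".join(f"[{i + 1}] {p}" for i, (_, p) in enumerate(usable))
    return f"Cite only the passages below.\n{refs}\n\nQuestion: {question}"


# Hypothetical retrieval results as (similarity score, passage) pairs.
weak_hits = [(0.05, "passage A"), (0.10, "passage B")]
mixed_hits = [(0.60, "passage A"), (0.10, "passage B")]

fallback_prompt = build_guarded_prompt("<user question>", weak_hits)
cited_prompt = build_guarded_prompt("<user question>", mixed_hits)
print(fallback_prompt)
print(cited_prompt)
```

The right threshold depends on the retriever's score distribution and would need tuning on held-out queries.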

Core Entities

Models

DISC-LawLLM, Baichuan-13B-Base, Baichuan-13B-Chat, ChatGLM-6B, GPT-3.5-turbo, Chinese-alpaca2, LawGPT, ChatLaw, Lawyer-LLaMA, LexiLaw

Metrics

Accuracy, subjective ACC, subjective CPL, subjective CLR, average score

Datasets

SFT, DISC-Law-Eval, CAIL2018, JEC-QA, CJRC, LEVEN, CAIL2020-sfzy, CAIL2022-yqzy, Alpaca-GPT4, Firefly

Benchmarks

DISC-Law-Eval

Context Entities

Models

GPT-4 (referenced), ChatGPT (referenced as judge capability)

Metrics

few-shot matching extraction (answer parsing)

Datasets

Lawyer-LLaMa datasets, LawGPT-zh data, COIG-PC

