Overview
Production Readiness
0.5
Novelty Score
0.5
Cost Impact Score
0.5
Citation Count
24
Why It Matters For Business
Fine-tuning a mid-size Chinese LLM on focused legal instruction data and pairing it with a small retrieval knowledge base yields measurable gains in legal QA and advice quality, which can reduce manual review and make legal tools more practical.
Summary TLDR
The authors build DISC-LawLLM by supervised fine-tuning of a Baichuan-13B base model on DISC-Law-SFT, a 403K-sample, law-focused SFT dataset constructed with legal syllogism prompting. They add a document retriever (Top-K retrieval) over a local knowledge base of statutes and cases so that the model can cite legal texts. They also release DISC-Law-Eval, a mixed objective/subjective benchmark covering multiple-choice and free-answer legal tests. On this benchmark, DISC-LawLLM (13B) beats several open legal LLMs and outperforms GPT-3.5-turbo both in average accuracy on the objective tasks (37.10% vs 34.10%) and in subjective referee scores (average 3.39/5). Code, data, and weights are released on GitHub.
Problem Statement
Generic LLMs lack the specialized legal reasoning and up-to-date statute access needed for trustworthy legal services. The paper addresses this gap by teaching an LLM legal syllogism (laws + facts -> conclusion) and adding retrieval so that answers cite current statutes and hallucinate less.
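The syllogism structure (laws + facts -> conclusion) can be sketched as a prompt template. This is a minimal illustration only: the function name `build_syllogism_prompt`, the field labels, and the wording are assumptions, not the authors' actual prompts, and the placeholder strings stand in for real statute and case text.

```python
def build_syllogism_prompt(statutes, facts, question):
    """Format a prompt that asks for law -> facts -> conclusion reasoning.

    The layout and labels are hypothetical; the paper's real prompt
    text is not reproduced here.
    """
    law_block = "\n".join(f"- {s}" for s in statutes)
    return (
        "Major premise (applicable law):\n"
        f"{law_block}\n\n"
        "Minor premise (facts of the case):\n"
        f"{facts}\n\n"
        f"Question: {question}\n"
        "Conclusion: apply the law to the facts and state the result."
    )


# Placeholder inputs, not real legal text.
prompt = build_syllogism_prompt(
    statutes=["<statute text 1>", "<statute text 2>"],
    facts="<case facts>",
    question="<legal question>",
)
print(prompt)
```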
Main Contribution
DISC-Law-SFT: a 403K supervised fine-tuning dataset for Chinese legal tasks built from public legal datasets, crawled legal text, and open instruction corpora using legal-syllogism prompting and LCoT.
DISC-LawLLM: a Baichuan-13B-based model fine-tuned on DISC-Law-SFT and adapted to incorporate retrieved statute/case references at inference.
DISC-Law-Eval: a benchmark combining objective multi-choice exams and subjective free-answer cases judged by GPT-3.5 to measure legal knowledge, reasoning, completeness and clarity.
Open release: datasets, model weights and retrieval code published to GitHub.
Key Findings
A 403K-sample, law-specific SFT dataset (DISC-Law-SFT) was built for training.
The fine-tuned 13B model exceeds GPT-3.5-turbo in average accuracy on the objective benchmark (37.10% vs 34.10%).
Gains are largest on hard multi-answer exams such as NJE.
Subjective answer quality improved according to the LLM referee (average score 3.39/5).
Retrieval over a local knowledge base was added to reduce hallucinations and let answers cite statutes.
Results
Accuracy
Who Should Care
What To Try In 7 Days
Assemble 10k domain-specific Q&A and case snippets and run SFT on a 7–13B base model.
Add a small local statutes index and a Top-K retriever; prepend retrieved passages to prompts.
Build a short subjective test set and use an LLM judge (e.g., GPT-3.5) to get quick quality scores on accuracy/completeness/clarity.
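The retrieval step in the list above (index a few statutes, fetch the Top-K passages, prepend them to the prompt) can be sketched as follows. A toy bag-of-words cosine similarity stands in for a real retriever, and the whitespace tokenizer, function names, and sample passages are all assumptions for illustration. Chinese text would need a proper tokenizer or an embedding model instead.

```python
import math
from collections import Counter


def _vec(text):
    # Toy tokenizer: lowercase whitespace split (an illustrative assumption).
    return Counter(text.lower().split())


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query, passages, k=3):
    """Return the k passages most similar to the query."""
    q = _vec(query)
    ranked = sorted(passages, key=lambda p: _cosine(q, _vec(p)), reverse=True)
    return ranked[:k]


def build_prompt(query, passages, k=3):
    """Prepend the retrieved passages to the user question."""
    hits = top_k(query, passages, k)
    refs = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(hits))
    return f"References:\n{refs}\n\nQuestion: {query}"


# Made-up passages, not real statutes.
kb = [
    "theft of property is punished by a fine or detention",
    "a lease contract must state the rent and the term",
    "traffic violations are recorded by the local authority",
]
query = "what is the punishment for theft of property"
print(build_prompt(query, kb, k=2))
```

Swapping `_cosine` for a dense-embedding similarity keeps the rest of the flow unchanged, which is the main reason to keep ranking and prompt assembly in separate functions.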
Agent Features
Tool Use
- external retriever (Top-K)
Architectures
- decoder-only transformer
Optimization Features
Infra Optimization
- training on 8×A800 GPUs
System Optimization
- DeepSpeed used to reduce training cost
Training Optimization
- SFT
- 2 epochs, lr=5e-5, global batch size=256
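The hyperparameters above imply a rough optimizer-step budget, worked out below. The sample count uses "403K" as a round figure, and the per-GPU micro-batch size is a hypothetical value that the source does not report. The relation global batch = micro-batch × GPUs × accumulation steps is the standard data-parallel one.

```python
import math

# Values reported in this summary.
num_samples = 403_000        # "403K" samples, treated as a round figure
global_batch_size = 256
num_epochs = 2
num_gpus = 8

# Hypothetical, NOT reported in the source: per-GPU micro-batch size.
micro_batch_per_gpu = 4

steps_per_epoch = math.ceil(num_samples / global_batch_size)
total_steps = steps_per_epoch * num_epochs
grad_accum_steps = global_batch_size // (micro_batch_per_gpu * num_gpus)

print(steps_per_epoch, total_steps, grad_accum_steps)  # prints: 1575 3150 8
```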
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- The benchmark and evaluation rely on in-house collections and a GPT-3.5 referee, which may introduce dataset bias and judge bias.
- Objective accuracy numbers remain modest in absolute terms (average accuracy ~37%), so outputs need expert review in practice.
- Evaluation focused on Chinese legal exams and consultations; results may not generalize to other legal systems or languages.
When Not To Use
- For final legal advice or binding decisions without lawyer review.
- In jurisdictions or languages not covered by the knowledge base.
- When up-to-the-minute statute changes are critical but knowledge-base updates lag or retrieval latency is too high.
Failure Modes
- Hallucinated legal citations when retrieval fails or returns weak matches.
- Overconfident but incorrect legal reasoning on edge or ambiguous fact patterns.
- Bias or gaps from training data (exam-focused samples) leading to blind spots.
Core Entities
Models
- DISC-LawLLM
- Baichuan-13B-Base
- Baichuan-13B-Chat
- ChatGLM-6B
- GPT-3.5-turbo
- Chinese-alpaca2
- LawGPT
- ChatLaw
- Lawyer-LLaMA
- LexiLaw
Metrics
- Accuracy
- subjective ACC
- subjective CPL
- subjective CLR
- average score
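One way the subjective metrics listed above could be combined is sketched below: each answer gets a judge score per criterion (ACC, CPL, CLR), the scores are averaged per criterion, and the three means are averaged again. The 1-5 scale, the unweighted mean, and the function name `aggregate` are assumptions for illustration, and the sample scores are made up rather than taken from the paper.

```python
from statistics import mean

CRITERIA = ("ACC", "CPL", "CLR")  # accuracy, completeness, clarity


def aggregate(judgements):
    """Average per-criterion judge scores over a set of answers.

    `judgements` is a list of dicts such as {"ACC": 4, "CPL": 3, "CLR": 5}.
    The unweighted mean is an illustrative assumption.
    """
    result = {c: mean(j[c] for j in judgements) for c in CRITERIA}
    result["average"] = mean(result[c] for c in CRITERIA)
    return result


# Made-up scores, not results from the paper.
scores = aggregate([
    {"ACC": 4, "CPL": 3, "CLR": 5},
    {"ACC": 2, "CPL": 3, "CLR": 4},
])
print(scores)
```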
Datasets
- DISC-Law-SFT
- DISC-Law-Eval
- CAIL2018
- JEC-QA
- CJRC
- LEVEN
- CAIL2020-sfzy
- CAIL2022-yqzy
- Alpaca-GPT4
- Firefly
Benchmarks
- DISC-Law-Eval
Context Entities
Models
- GPT-4 (referenced)
- ChatGPT (referenced as judge capability)
Metrics
- few-shot matching extraction (answer parsing)
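Answer parsing for multiple-choice questions can be illustrated with a simple regex heuristic. This is a stand-in, not the few-shot matching extraction named above. It assumes the model ends with a line like "Answer: A, C", that options are labelled A-D, and that a multi-answer question counts as correct only on an exact set match. All three are assumptions for this sketch.

```python
import re

# Letters after an "Answer:" marker; the lookahead avoids swallowing the
# first letter of a following word such as "Contract".
_ANSWER_RE = re.compile(r"Answer\s*[:：]\s*([A-D](?:\s*[,、]?\s*[A-D])*)(?![a-z])")


def extract_choices(model_output):
    """Return the set of option letters found after 'Answer:' (empty if none)."""
    match = _ANSWER_RE.search(model_output)
    if not match:
        return frozenset()
    return frozenset(re.findall(r"[A-D]", match.group(1)))


def exact_match_accuracy(outputs, gold):
    """Fraction of outputs whose extracted option set equals the gold set."""
    hits = sum(extract_choices(o) == frozenset(g) for o, g in zip(outputs, gold))
    return hits / len(gold)


# Made-up outputs and gold labels.
outputs = ["The rule applies. Answer: A, C", "Answer: B", "no idea"]
gold = ["AC", "BD", "A"]
print(exact_match_accuracy(outputs, gold))
```

Under exact set matching, the second output scores zero even though B is one of the two correct options, which is part of why multi-answer exams are harder than single-answer ones.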
Datasets
- Lawyer-LLaMA datasets
- LawGPT-zh data
- COIG-PC

