Overview
Production Readiness
0.6
Novelty Score
0.55
Cost Impact Score
0.65
Citation Count
0
Why It Matters For Business
Legal applications need accurate citations and factual advice; this training recipe reduces fabricated statutes and raises answer usefulness, lowering legal risk and manual review costs.
Summary TLDR
The authors introduce LegalHalBench, three automatic legal-factuality metrics, and a two-stage fine-tuning pipeline: supervised fine-tuning (SFT) on an automated statute-aware QA corpus, followed by Hard sample-aware Iterative Direct Preference Optimization (HIPO). On their benchmark, SFT followed by HIPO on GLM4-Chat-9B raises the Non-Hallucinated Statute Rate (NHSR) to 38.35% and improves relevance and truthfulness metrics versus base models. The automatic metrics align closely with lawyer judgments. The authors release the code and the benchmark.
Problem Statement
General LLMs and existing legal LLMs sometimes fabricate law names, article numbers, or provisions, or give advice that contradicts regulations. There is no dedicated benchmark for legal hallucination in question answering, and there are few targeted fine-tuning recipes to reduce it.
Main Contribution
LegalHalBench: a 1,988-sample benchmark for legal QA with reference answers and statutes, covering civil and criminal law.
Three automatic hallucination metrics: Non-Hallucinated Statute Rate (NHSR), Statute Relevance Rate (Rel), and Legal Claim Truthfulness (T_LC).
A two-stage training pipeline: supervised fine-tuning (SFT) on an automated statute-aware dataset, then HIPO (Hard sample-aware Iterative Direct Preference Optimization).
An automated dataset curation pipeline combining LLM generation, semantic retrieval of real statutes, and human review; ~12k+3.9k training samples.
Extensive experiments and ablations showing HIPO improves factuality and usefulness metrics versus SFT, DPO, SimPO, and some retrieval baselines.
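To make the statute metric concrete, here is a minimal sketch of an NHSR-style score. It assumes NHSR is the fraction of cited statutes that exist in a reference statute database; the paper's exact definition and its citation-extraction step may differ. The `Citation` tuple, the function name, and the toy database are hypothetical and used only for illustration.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation of a cited statute: (law name, article number).
Citation = Tuple[str, str]


def non_hallucinated_statute_rate(
    cited: Iterable[Citation], known_statutes: Set[Citation]
) -> float:
    """Fraction of cited statutes that appear in the reference database.

    Assumption: an answer that cites nothing scores 0.0. The source does
    not say how empty answers are handled.
    """
    cited = list(cited)
    if not cited:
        return 0.0
    hits = sum(1 for c in cited if c in known_statutes)
    return hits / len(cited)


# Toy reference database (illustrative entries only).
known = {("Civil Code", "577"), ("Criminal Law", "264")}
answer_citations = [("Civil Code", "577"), ("Civil Code", "9999")]
rate = non_hallucinated_statute_rate(answer_citations, known)
```

In this toy case one of the two citations is found in the database, so `rate` is 0.5. A real pipeline would also need to normalize law names and article numbers before lookup.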
Key Findings
SFT followed by HIPO substantially increases the rate of correct statute citations on GLM4-Chat-9B (NHSR reaches 38.35%).
Statute relevance and claim truthfulness improved after HIPO.
Helpfulness metrics (METEOR, BERTScore, ROUGE-L) improved substantially with SFT+HIPO.
Automated hallucination metrics align well with human experts.
Iterating HIPO improves results but gains plateau by the 3rd round.
Results
Non-Hallucinated Statute Rate (NHSR)
Statute Relevance Rate (Rel)
Legal Claim Truthfulness (T_LC)
METEOR / BERTScore / ROUGE-L
Who Should Care
What To Try In 7 Days
Run LegalHalBench on your model to measure NHSR, Rel, and T_LC.
Create a small statute-aware SFT set: pair questions with correct statute contexts.
Apply LoRA SFT, then one HIPO iteration using automatic NHSR-based filtering; compare NHSR and BERTScore.
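The filtering step in the last item could look roughly like the sketch below. It assumes that a "hard" sample is one where the current model's answer has an NHSR below a threshold, and that the reference answer serves as the preferred response. Both are assumptions for illustration, not the authors' documented selection rule. The `Sample` and `PreferencePair` classes and the `threshold` default are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    question: str
    reference_answer: str
    model_answer: str
    nhsr: float  # NHSR of model_answer, computed beforehand


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def build_hard_pairs(
    samples: List[Sample], threshold: float = 0.5
) -> List[PreferencePair]:
    """Keep only samples whose model answer scores below the NHSR threshold.

    Assumption: the reference answer is 'chosen' and the low-NHSR model
    answer is 'rejected'.
    """
    return [
        PreferencePair(s.question, s.reference_answer, s.model_answer)
        for s in samples
        if s.nhsr < threshold
    ]


samples = [
    Sample("Q1", "ref answer 1", "model answer 1", nhsr=0.0),
    Sample("Q2", "ref answer 2", "model answer 2", nhsr=1.0),
]
pairs = build_hard_pairs(samples)
```

Here only `Q1` becomes a preference pair. The resulting list can be fed to any DPO-style trainer, and rerunning the model on the benchmark after training gives the NHSR and BERTScore comparison.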
Optimization Features
Training Optimization
- SFT
- Hard sample-aware iterative Direct Preference Optimization (HIPO)
- LoRA
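For orientation, the standard DPO objective that HIPO builds on can be sketched for a single preference pair as follows. This shows only the vanilla DPO loss; the hard-sample weighting and iteration schedule specific to HIPO are not reproduced here. The function name and the `beta` default are illustrative choices.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Vanilla DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    where each term is a sequence log-probability.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals softplus(-m); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# When the policy favors the chosen answer more than the reference does,
# the loss falls below log(2).
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In practice the log-probabilities come from summing token log-probs of each response under the policy (here, the LoRA-adapted model) and under the frozen reference model.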
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Improvements plateau after ~3 HIPO iterations; more iterations give diminishing or negative returns.
- Training relies on LLM-generated data plus retrieval-based statute correction; coverage may miss rare statutes.
- Both the metrics and the training data depend on GPT-4-turbo for generation and judging; evaluator bias may influence results.
- Constructing the high-quality dataset required more than 200 hours of legal expert time despite automation.
When Not To Use
- When you need real-time retrieval-only solutions rather than model-internal knowledge.
- When you lack access to a reliable statute database for semantic replacement.
- When you cannot afford the compute for iterative fine-tuning or the cost of human legal review.
Failure Modes
- Model still fabricates statutes outside the training distribution.
- Conflicts between retrieved statutes and internal model knowledge can reduce answer quality.
- Overfitting to statute phrasing, producing verbatim citations without correct application.
- Metric or GPT-4 evaluator preferences may not generalize to end-user legal standards.
Core Entities
Models
- GLM4-Chat-9B
- Qwen2-Instruct-7B
- Llama3.1-70B
- Llama3.1-405B
- GPT4o
- GPT4o-mini
- wisdomInterrogatory
- LexiLaw
Metrics
- NHSR
- Statute Relevance Rate (Rel)
- Legal Claim Truthfulness (T_LC)
- METEOR
- BERTScore
- ROUGE-L
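As a reference point for the helpfulness metrics, ROUGE-L can be sketched as an F1 score over the longest common subsequence (LCS) of candidate and reference tokens. This is a simplified version that assumes whitespace tokenization and an equal weighting of precision and recall; the paper's exact tokenizer and ROUGE configuration are not stated here, and Chinese legal text would need a different tokenizer.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using O(len(b)) memory."""
    prev: List[int] = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 with whitespace tokenization (a simplifying assumption)."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision = lcs / len(c)
    recall = lcs / len(r)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("a b c d", "a c")
```

In this toy case the LCS is `a c`, so precision is 0.5, recall is 1.0, and `score` is 2/3.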
Datasets
- LegalHalBench
- CAIL2018
- Lawbench
Benchmarks
- LegalHalBench
- Lawbench

