A two-stage fine-tuning recipe (SFT + HIPO) and a new LegalHalBench to cut legal hallucinations in LLMs

Overview

Decision Snapshot: Needs Validation

The method shows solid gains on a 1,988-sample benchmark with human validation and ablations. It requires compute for iterative finetuning and lawyer review for dataset curation, so readiness is medium; evidence is strong for the evaluated settings.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.87

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 60%

Novelty: 55%

Authors

Yinghao Hu, Leilei Gan, Wenyi Xiao, Kun Kuang, Fei Wu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Legal applications need accurate citations and factual advice; this training recipe reduces fabricated statutes and raises answer usefulness, lowering legal risk and manual review costs.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist

Summary TLDR

The authors introduce LegalHalBench, three automatic legal-factuality metrics, and a two-stage fine-tuning pipeline: supervised fine-tuning (SFT) on an automatically built, statute-aware QA corpus, followed by Hard sample-aware Iterative Direct Preference Optimization (HIPO). On their benchmark, GLM4-Chat-9B trained with SFT+HIPO reaches a Non-Hallucinated Statute Rate (NHSR) of 38.35%, up from 6.54% for the vanilla model, and improves on the relevance and truthfulness metrics relative to the base models. The automatic metrics align closely with lawyer judgments. The authors release code and the benchmark.

Problem Statement

General LLMs and existing legal LLMs sometimes fabricate law names, numbers, provisions, or give advice that contradicts regulations. There is no dedicated benchmark and few targeted finetuning recipes to reduce legal hallucinations in question answering.

Main Contribution

LegalHalBench: a 1,988-sample benchmark for legal QA with reference answers and statutes, covering civil and criminal law.

Three automatic hallucination metrics: Non-Hallucinated Statute Rate (NHSR), Statute Relevance Rate (Rel), and Legal Claim Truthfulness (T_LC).
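As a rough illustration of what a statute-level check might look like, the sketch below computes an NHSR-style score. It assumes NHSR is the share of cited statutes that can be matched in a trusted statute database; the paper's exact definition may differ (for example, it could be computed per answer). The `KNOWN_STATUTES` set, the citation pattern, and the function names are all hypothetical and are not taken from the released code.

```python
import re

# Illustrative entries only; a real check would query a full statute database.
KNOWN_STATUTES = {
    ("Civil Code", 1079),
    ("Criminal Law", 264),
}

# Assumed citation format for this sketch: "Article <n> of the <Law Name>".
CITATION_RE = re.compile(r"Article (\d+) of the ([A-Z][A-Za-z ]+?)(?=[,.;]|$)")


def extract_citations(answer):
    """Return (law name, article number) pairs cited in one answer."""
    return [(name.strip(), int(num)) for num, name in CITATION_RE.findall(answer)]


def nhsr(answers):
    """Percentage of cited statutes that exist in KNOWN_STATUTES."""
    cited = [c for answer in answers for c in extract_citations(answer)]
    if not cited:
        return 0.0
    verified = sum(c in KNOWN_STATUTES for c in cited)
    return 100.0 * verified / len(cited)


answers = [
    "Under Article 1079 of the Civil Code, a spouse may file for divorce.",
    "Theft is punished under Article 264 of the Criminal Law.",
    "See Article 9999 of the Civil Code.",  # made-up article number for the demo
]
print(f"NHSR: {nhsr(answers):.2f}%")
```

A regex extractor like this only works for one rigid citation format; real legal answers would likely need a more robust parser or an LLM-based extractor.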

Key Findings

HIPO with SFT greatly increases correct statute citation rate on GLM4-Chat-9B.

Numbers: NHSR 38.353% (with SFT+HIPO) vs 6.541% (vanilla) on LegalHalBench

Practical Use: If you fine-tune GLM4-Chat-9B with SFT then HIPO, expect a large practical reduction in fabricated statute citations on similar legal QA tasks.

Evidence Ref: Table 1, Table 3

Statute relevance and claim truthfulness improved after HIPO.

Numbers: Rel 7.025 and T_LC 9.079 (GLM4-Chat-9B with SFT+HIPO); Rel improved by ~37.13% over base

Practical Use: HIPO not only reduces hallucinated statutes but also boosts the relevance of cited law and the factuality of legal claims; use it when you need both citation accuracy and helpful answers.

Evidence Ref: Table 1, Section 5.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Non-Hallucinated Statute Rate (NHSR)	38.353% (GLM4-Chat-9B, w/ SFT+HIPO)	6.541% (GLM4-Chat-9B vanilla)	+31.812 percentage points	LegalHalBench	Table 1 and Table 3 report NHSR improvements after SFT and HIPO	Table 1, Table 3
Statute Relevance Rate (Rel)	7.025 (GLM4-Chat-9B, w/ SFT+HIPO; scale 0-10)	5.123 (GLM4-Chat-9B vanilla)	+1.902 (≈37.13% relative increase)	LegalHalBench	Table 1 and Section 5.2 quantify relevance gains	Table 1
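The two deltas in the table use different units: the NHSR gain is in percentage points, while the Rel gain is an absolute change on a 0-10 scale plus a relative increase. A minimal sketch that recomputes them from the reported values (the helper name `deltas` is ours):

```python
def deltas(value, baseline):
    """Return (absolute change, relative change in percent), rounded."""
    absolute = value - baseline
    relative_pct = 100.0 * absolute / baseline
    return round(absolute, 3), round(relative_pct, 2)


# NHSR values are percentages, so the absolute change is in percentage points.
nhsr_abs, nhsr_rel = deltas(38.353, 6.541)

# Rel is scored on a 0-10 scale, so the absolute change is in score points.
rel_abs, rel_rel = deltas(7.025, 5.123)

print(f"NHSR: +{nhsr_abs} percentage points")
print(f"Rel:  +{rel_abs} points ({rel_rel}% relative increase)")
```

The recomputed values match the table: +31.812 percentage points for NHSR and +1.902 (about 37.13% relative) for Rel.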

What To Try In 7 Days

Run LegalHalBench on your model to measure NHSR, Rel, and T_LC.

Create a small statute-aware SFT set: pair questions with correct statute contexts.

Apply LoRA SFT, then one HIPO iteration using automatic NHSR-based filtering; compare NHSR and BERTScore.
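The filtering step in the last item could be sketched as follows. This is not the authors' HIPO implementation; it assumes that "hard" samples are questions where the current model's answer scores below a threshold on a per-answer statute check, and that the statute-checked reference answer serves as the preferred response. The `Sample` fields, the threshold, and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    question: str
    reference_answer: str  # statute-checked answer from the SFT corpus
    model_answer: str      # answer sampled from the current model
    model_nhsr: float      # share of verifiable citations in model_answer, 0..1


def build_preference_pairs(samples, threshold=0.5):
    """Keep only hard samples and pair the reference against the model answer."""
    pairs = []
    for s in samples:
        if s.model_nhsr < threshold:  # the model hallucinated statutes here
            pairs.append({
                "prompt": s.question,
                "chosen": s.reference_answer,
                "rejected": s.model_answer,
            })
    return pairs


samples = [
    Sample("Q1", "reference answer 1", "model answer 1", model_nhsr=0.0),
    Sample("Q2", "reference answer 2", "model answer 2", model_nhsr=1.0),
]
pairs = build_preference_pairs(samples)
print(len(pairs), "hard sample(s) selected")
```

The `prompt` / `chosen` / `rejected` layout is a common input format for DPO-style trainers, which is why it is used here.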

Optimization Features

Training Optimization

SFT, Hard sample-aware Iterative Direct Preference Optimization (HIPO), LoRA
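HIPO builds on Direct Preference Optimization. For orientation, the sketch below shows the standard DPO loss for a single preference pair, not the HIPO-specific variant, whose details are in the paper. The inputs are summed token log-probabilities of the chosen and rejected answers under the trained policy and a frozen reference model; the `beta` default is an assumption.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair."""
    # How much more the policy prefers "chosen" over "rejected"
    # than the reference model does, scaled by beta.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is lower when the policy favors the chosen answer.
good = dpo_loss(-1.0, -5.0, -3.0, -3.0)
bad = dpo_loss(-5.0, -1.0, -3.0, -3.0)
print(f"policy prefers chosen: {good:.4f}, prefers rejected: {bad:.4f}")
```

In practice the log-probabilities would come from batched forward passes of the LoRA-adapted model and its frozen base.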

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/YinghaoHu/LegalHalBench

Data URLs

https://github.com/YinghaoHu/LegalHalBench

Risks & Boundaries

Limitations

Improvements plateau after ~3 HIPO iterations; more iterations give diminishing or negative returns.

Training relies on LLM-generated data plus retrieval-based statute correction; coverage may miss rare statutes.
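Because gains reportedly plateau after about three HIPO iterations, an iterative run can stop as soon as the validation NHSR no longer improves. The loop below is a minimal sketch of that idea; `train_one_iteration` and `evaluate_nhsr` are hypothetical callables you would supply, and the scores in the demo are toy numbers, not results from the paper.

```python
def run_iterations(train_one_iteration, evaluate_nhsr, max_iters=5, min_gain=0.0):
    """Run preference-optimization rounds until NHSR stops improving."""
    best = evaluate_nhsr()
    history = [best]
    for _ in range(max_iters):
        train_one_iteration()
        score = evaluate_nhsr()
        history.append(score)
        if score - best <= min_gain:
            break  # no improvement: stop (and keep the previous checkpoint)
        best = score
    return history


# Toy demo: scores improve for three rounds, then drop.
toy_scores = iter([10.0, 20.0, 30.0, 40.0, 39.0, 38.0])
history = run_iterations(lambda: None, lambda: next(toy_scores))
print(history)
```

A real run would also restore the best checkpoint after stopping, which this sketch leaves out.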

When Not To Use

When you need real-time retrieval-only solutions rather than model-internal knowledge.

When you lack access to a reliable statute database for semantic replacement.

Failure Modes

Model still fabricates statutes outside the training distribution.

Conflicts between retrieved statutes and internal model knowledge can reduce answer quality.

Core Entities

Models

GLM4-Chat-9B, Qwen2-Instruct-7B, Llama3.1-70B, Llama3.1-405B, GPT4o, GPT4o-mini, wisdomInterrogatory, LexiLaw

Metrics

NHSR, Statute Relevance Rate (Rel), Legal Claim Truthfulness (T_LC), METEOR, BERTScore, ROUGE-L

Datasets

LegalHalBench, CAIL2018, Lawbench

Benchmarks

LegalHalBench, Lawbench

