A two-stage fine-tuning recipe (SFT + HIPO) and a new LegalHalBench to cut legal hallucinations in LLMs

January 11, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.55

Cost Impact Score

0.65

Citation Count

0

Authors

Yinghao Hu, Leilei Gan, Wenyi Xiao, Kun Kuang, Fei Wu

Why It Matters For Business

Legal applications need accurate citations and factual advice; this training recipe reduces fabricated statutes and raises answer usefulness, lowering legal risk and manual review costs.

Summary TLDR

The authors introduce LegalHalBench, three automatic legal-factuality metrics, and a two-stage fine-tuning pipeline: supervised fine-tuning (SFT) on an automated statute-aware QA corpus, followed by Hard sample-aware Iterative Direct Preference Optimization (HIPO). On their benchmark, SFT+HIPO on GLM4-Chat-9B raises the Non-Hallucinated Statute Rate (NHSR) from 6.54% to 38.35% and improves the relevance and truthfulness metrics over the vanilla model. The automatic metrics align closely with lawyer judgments. The authors release code and the benchmark.

Problem Statement

General LLMs and existing legal LLMs sometimes fabricate law names, numbers, or provisions, or give advice that contradicts regulations. There is no dedicated benchmark for legal hallucinations in question answering, and few fine-tuning recipes target them.

Main Contribution

LegalHalBench: a 1,988-sample benchmark for legal QA with reference answers and statutes, covering civil and criminal law.

Three automatic hallucination metrics: Non-Hallucinated Statute Rate (NHSR), Statute Relevance Rate (Rel), and Legal Claim Truthfulness (T_LC).

A two-stage training pipeline: supervised fine-tuning (SFT) on an automated statute-aware dataset, then HIPO (Hard sample-aware Iterative Direct Preference Optimization).

An automated dataset curation pipeline combining LLM generation, semantic retrieval of real statutes, and human review; ~12k+3.9k training samples.

Extensive experiments and ablations showing HIPO improves factuality and usefulness metrics versus SFT, DPO, SimPO, and some retrieval baselines.
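The second training stage builds on Direct Preference Optimization. As a minimal sketch, the block below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It does not reproduce HIPO's hard-sample handling or iteration scheme, which this summary does not specify; the function name, argument names, and the `beta` default are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) answer pair.

    Inputs are summed token log-probabilities of each answer under the
    trainable policy and under the frozen reference model.
    NOTE: plain DPO only; HIPO-specific weighting is not modelled here.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # answer over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(beta * margin)) written as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: loss equals log(2).
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy shifted toward the chosen answer: loss drops below log(2).
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

In a legal QA setting the chosen answer would plausibly be the one citing real, relevant statutes and the rejected answer the one with fabricated citations, but that pairing rule is an assumption here rather than a detail taken from the paper.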

Key Findings

HIPO with SFT greatly increases correct statute citation rate on GLM4-Chat-9B.

Numbers: NHSR 38.353% (w/ SFT+HIPO) vs 6.541% (vanilla) on LegalHalBench

Statute relevance and claim truthfulness improved after HIPO.

Numbers: Rel 7.025 and T_LC 9.079 (GLM4 w/ SFT+HIPO); Rel improved by ~37.13% over base

Helpfulness metrics (METEOR, BERTScore, ROUGE-L) improved substantially with SFT+HIPO.

Numbers: GLM4 w/ SFT+HIPO METEOR 0.407 (+42.8%), BERTScore 0.762 (+7.3%), ROUGE-L 0.340 (+126.7%) vs vanilla

Automated hallucination metrics align well with human experts.

Numbers: NHSR accuracy 98%; Rel Spearman ρ=0.820; T_LC Spearman ρ=0.617

Iterating HIPO keeps improving results through the 3rd round; gains plateau after that.

Numbers: GLM4 NHSR: SFT 26.94% → +HIPOM1 32.41% → +HIPOM2 33.92% → +HIPOM3 38.35%
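The per-round contribution of each HIPO iteration follows directly from the NHSR figures above. The short sketch below computes the gain of each stage over the previous one in percentage points; the variable and function names are illustrative, and the numbers are the ones reported in this summary.

```python
# NHSR (%) for GLM4-Chat-9B after each training stage, as reported above.
nhsr_by_stage = [
    ("SFT", 26.94),
    ("+HIPOM1", 32.41),
    ("+HIPOM2", 33.92),
    ("+HIPOM3", 38.35),
]


def stage_gains(stages):
    """Return (stage_name, gain over the previous stage in points)."""
    return [
        (name, round(value - prev_value, 2))
        for (_, prev_value), (name, value) in zip(stages, stages[1:])
    ]


gains = stage_gains(nhsr_by_stage)
total_gain = round(nhsr_by_stage[-1][1] - nhsr_by_stage[0][1], 2)
```

The three rounds add 5.47, 1.51, and 4.43 points respectively, for 11.41 points over SFT alone, so the third round still contributes a sizeable share of the total.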

Results

Non-Hallucinated Statute Rate (NHSR)

Value: 38.353% (GLM4-Chat-9B, w/ SFT+HIPO)

Baseline: 6.541% (GLM4-Chat-9B vanilla)

Statute Relevance Rate (Rel)

Value: 7.025 (GLM4-Chat-9B, w/ SFT+HIPO; scale 0-10)

Baseline: 5.123 (GLM4-Chat-9B vanilla)

Legal Claim Truthfulness (T_LC)

Value: 9.079 (GLM4-Chat-9B, w/ SFT+HIPO; scale 0-10)

Baseline: 8.520 (GLM4-Chat-9B vanilla)

METEOR / BERTScore / ROUGE-L

Value: 0.407 / 0.762 / 0.340 (GLM4-Chat-9B, w/ SFT+HIPO)

Baseline: 0.285 / 0.710 / 0.150 (GLM4-Chat-9B vanilla)
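The relative improvements quoted in the key findings can be re-derived from these value and baseline pairs. The sketch below applies the usual relative-gain formula, 100 × (value − baseline) / baseline, to each metric; the dictionary layout and function name are illustrative.

```python
# (baseline, value) pairs for GLM4-Chat-9B: vanilla vs SFT+HIPO.
results = {
    "NHSR (%)": (6.541, 38.353),
    "Rel": (5.123, 7.025),
    "T_LC": (8.520, 9.079),
    "METEOR": (0.285, 0.407),
    "BERTScore": (0.710, 0.762),
    "ROUGE-L": (0.150, 0.340),
}


def relative_gain(baseline, value):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


relative_gains = {
    name: round(relative_gain(baseline, value), 1)
    for name, (baseline, value) in results.items()
}

for name, gain in relative_gains.items():
    print(f"{name}: {gain:+.1f}%")
```

This reproduces the reported +42.8% (METEOR), +7.3% (BERTScore), and +126.7% (ROUGE-L), and about +37.1% for Rel. NHSR is itself a percentage, so its gain is easier to read in absolute terms: roughly 31.8 percentage points.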

Who Should Care

What To Try In 7 Days

Run LegalHalBench on your model to measure NHSR, Rel, and T_LC.

Create a small statute-aware SFT set: pair questions with correct statute contexts.

Apply LoRA SFT, then one HIPO iteration using automatic NHSR-based filtering; compare NHSR and BERTScore.
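The NHSR-based filtering step can be prototyped before any training run. The sketch below is an assumption-heavy illustration, not the paper's implementation: it treats NHSR for one answer as the share of cited statutes found in a trusted statute set, and it treats a sample as "hard" when that share falls below a threshold. The statute names, field names, the threshold, and the choice to score citation-free answers as 0.0 are all hypothetical.

```python
def nhsr(cited_statutes, statute_db):
    """Share of cited statutes that exist in a trusted statute set.

    ASSUMPTION: answers that cite nothing score 0.0; the paper may
    handle that case differently.
    """
    if not cited_statutes:
        return 0.0
    found = sum(1 for statute in cited_statutes if statute in statute_db)
    return found / len(cited_statutes)


def select_hard_samples(samples, statute_db, threshold=0.5):
    """Keep samples whose model answer scores below `threshold`."""
    hard = []
    for sample in samples:
        score = nhsr(sample["cited_statutes"], statute_db)
        if score < threshold:
            hard.append({**sample, "nhsr": score})
    return hard


# Placeholder statute identifiers, not real laws.
statute_db = {"Law A Art. 1", "Law A Art. 2", "Law B Art. 10"}
samples = [
    {"question": "q1", "cited_statutes": ["Law A Art. 1", "Law B Art. 10"]},
    {"question": "q2",
     "cited_statutes": ["Law A Art. 1", "Law C Art. 99", "Law D Art. 5"]},
    {"question": "q3", "cited_statutes": []},
]
hard = select_hard_samples(samples, statute_db)
```

With this toy data, `q2` (one of three citations verified) and `q3` (no citations) are kept as hard samples, while `q1` is dropped. In practice the citations would first have to be extracted from free-text answers and normalized, which this sketch leaves out.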

Optimization Features

Training Optimization

  • SFT
  • Hard sample-aware iterative Direct Preference Optimization (HIPO)
  • LoRA

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Improvements plateau after ~3 HIPO iterations; more iterations give diminishing or negative returns.
  • Training relies on LLM-generated data plus retrieval-based statute correction; coverage may miss rare statutes.
  • Both the metrics and the training pipeline depend on GPT-4-turbo for generation and judging; evaluator bias may influence results.
  • Constructing the high-quality dataset required 200+ hours of legal expert time despite automation.

When Not To Use

  • When you need real-time retrieval-only solutions rather than model-internal knowledge.
  • When you lack access to a reliable statute database for semantic replacement.
  • When you cannot afford iterative fine-tuning compute or human legal review.

Failure Modes

  • Model still fabricates statutes outside the training distribution.
  • Conflicts between retrieved statutes and internal model knowledge can reduce answer quality.
  • Overfitting to statute phrasing, producing verbatim citations without correct application.
  • Metric or GPT-4 evaluator preferences may not generalize to end-user legal standards.
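One way to watch for the last failure mode is to spot-check automated scores against expert ratings on your own questions, using the same rank-correlation statistic the paper reports. The sketch below is a self-contained Spearman ρ (average ranks for ties, then Pearson correlation on the ranks); the function names and the example scores are made up for illustration.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five answers (not data from the paper).
automated_scores = [1, 2, 3, 4, 5]
expert_scores = [2, 1, 4, 3, 5]
rho = spearman_rho(automated_scores, expert_scores)
```

A ρ well below the values reported in the paper on your own data would suggest that the automated judge and your reviewers disagree on what counts as a relevant or truthful answer.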

Core Entities

Models

  • GLM4-Chat-9B
  • Qwen2-Instruct-7B
  • Llama3.1-70B
  • Llama3.1-405B
  • GPT4o
  • GPT4o-mini
  • wisdomInterrogatory
  • LexiLaw

Metrics

  • NHSR
  • Statute Relevance Rate (Rel)
  • Legal Claim Truthfulness (T_LC)
  • METEOR
  • BERTScore
  • ROUGE-L

Datasets

  • LegalHalBench
  • CAIL2018
  • Lawbench

Benchmarks

  • LegalHalBench
  • Lawbench