Off-the-shelf abstractive models and LLMs score well on matching metrics but still hallucinate in legal judgment summaries

June 2, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper uses a real legal dataset (100 test cases), standard automatic metrics and manual error examples; results show metric gains but clear hallucinations, so short‑term use should be human‑supervised.

Citations: 24

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 30%

Novelty: 40%

Authors

Aniket Deroy, Kripabandhu Ghosh, Saptarshi Ghosh

Links

Abstract / PDF / Data

Why It Matters For Business

Abstractive models and LLMs can speed up drafting legal headnotes but still hallucinate people/dates/courts; use them for triage or first drafts with human review rather than final publication.

Summary TLDR

The authors test general LLMs (Text‑Davinci‑003, ChatGPT) and legal abstractive models (Legal‑Pegasus, Legal‑LED, plus India‑fine‑tuned variants) on 100 Indian Supreme Court cases. Domain‑fine‑tuned abstractive models score slightly higher on ROUGE/METEOR/BLEU than extractive baselines, but abstractive outputs and LLM summaries still show frequent inconsistencies and hallucinated facts (wrong names, dates, courts). Best practice today: use human‑in‑the‑loop checking and domain fine‑tuning; avoid fully automatic publication of legal summaries.

Problem Statement

Can pre-trained abstractive summarization models and general LLMs be used off the shelf to produce accurate, trustworthy abstractive summaries of long legal case judgements? The paper asks whether these models match expert summaries and whether they introduce hallucinations or factual inconsistencies that would mislead legal users.

Main Contribution

Empirical comparison of LLMs, legal abstractive models, and extractive baselines on 100 Indian Supreme Court cases from the IN-Abs dataset.

Use of both standard matching metrics (ROUGE, METEOR, BLEU) and consistency metrics (SummaC, NEPrec, NumPrec).
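To make the matching metrics concrete, here is a minimal sketch of a simplified ROUGE-2 F1 (clipped bigram overlap between a generated summary and an expert summary). It assumes lowercase whitespace tokenization with no stemming, so its scores will not match the paper's numbers exactly; the function names and example sentences are illustrative only.

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Count lowercase, whitespace-split bigrams (assumption: no stemming)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-2 F1: harmonic mean of bigram precision and recall."""
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())  # clipped count of shared bigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, not taken from the dataset.
reference = "the appeal was dismissed by the supreme court"
candidate = "the appeal was dismissed by the high court"
score = rouge2_f1(candidate, reference)  # 5 of 7 bigrams shared on each side
```

Note that the candidate names the wrong court yet still shares most bigrams with the reference, which is why matching metrics alone cannot certify factual consistency.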

Key Findings

Domain‑fine‑tuned abstractive models match expert summaries better than extractive models on measured metrics.

Numbers: ROUGE‑2 F1: LegLED‑IN 0.255 vs BertSum 0.2311 (Table 3)

Practical Use: Fine‑tune abstractive models on in‑domain legal data to improve closeness to expert headnotes; expect modest metric gains, not error elimination.

Evidence Ref: Table 3

Abstractive models and LLMs produce measurable factual inconsistencies and hallucinations.

Numbers: SummaC (consistency): chatgpt‑summ 0.5762, davinci‑summ 0.6356; for comparison, the abstractive model LegLED scores 0.6563

Practical Use: Do not deploy generated legal summaries without human review; add automatic checks for numbers and named entities to flag risky summaries.

Evidence Ref: Table 4

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
ROUGE-2 F1 (best abstractive vs best extractive) | LegLED‑IN 0.255 | BertSum 0.2311 | +0.0239 | IN-Abs test (100 docs) | Table 3 shows LegLED‑IN ROUGE‑2 F1 0.255 vs BertSum 0.2311 | Table 3
ROUGE-L F1 (statistically significant) | LegLED‑IN 0.2711 | BertSum 0.2082 | +0.0629 | IN-Abs test (100 docs) | Table 3 entries marked * indicate significant improvement over extractive models | Table 3
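
The deltas in the Results rows can be re-derived from the reported scores. The short sketch below recomputes the absolute delta and adds the relative gain over the extractive baseline; the relative gain is derived here and is not a figure reported in the source, and the helper name is illustrative.

```python
# Scores as listed in the Results rows (attributed there to Table 3).
results = {
    "ROUGE-2 F1": {"LegLED-IN": 0.255, "BertSum": 0.2311},
    "ROUGE-L F1": {"LegLED-IN": 0.2711, "BertSum": 0.2082},
}


def delta_and_relative(best, baseline):
    """Return (absolute delta, relative gain in percent), rounded for display."""
    delta = best - baseline
    return round(delta, 4), round(100 * delta / baseline, 1)


summary = {
    metric: delta_and_relative(scores["LegLED-IN"], scores["BertSum"])
    for metric, scores in results.items()
}
# summary["ROUGE-2 F1"] -> (0.0239, 10.3)
# summary["ROUGE-L F1"] -> (0.0629, 30.2)
```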

What To Try In 7 Days

Run existing legal abstractive models on a small set of your local cases and compare with expert summaries.

Implement simple post checks: verify all named entities and numeric tokens against the source text.

Fine‑tune a pre-trained legal model on a small, curated set of in‑domain summaries (a few hundred documents) and re-evaluate consistency.
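
The post checks in the second step could start as small as the sketch below, which flags numbers and name-like phrases in a summary that never occur in the source text. It is an assumption-laden stand-in: runs of capitalized words substitute for a real NER model, matching is plain substring lookup, and the function name and example texts are hypothetical. Expect false positives and misses; the goal is only to route risky summaries to a human reviewer.

```python
import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
# Crude stand-in for NER (assumption): runs of capitalized words, e.g. "Supreme Court".
NAME = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def unsupported_tokens(summary: str, source: str) -> dict:
    """Return numbers and name candidates in the summary that are absent from the source."""
    source_numbers = set(NUMBER.findall(source))
    source_lower = " ".join(source.lower().split())  # normalize whitespace
    numbers = [n for n in NUMBER.findall(summary) if n not in source_numbers]
    names = [
        " ".join(m.split())
        for m in NAME.findall(summary)
        if " ".join(m.lower().split()) not in source_lower
    ]
    return {"numbers": numbers, "names": names}


# Hypothetical texts illustrating a wrong year and a foreign court.
source = (
    "The appeal was heard by the Supreme Court of India in 1987. "
    "Mr Sharma appeared for the appellant."
)
summary = "In 1978 the Supreme Court of California dismissed the appeal by Mr Sharma."
flags = unsupported_tokens(summary, source)
# flags -> {"numbers": ["1978"], "names": ["California"]}
```

A summary with non-empty flags would go to manual review rather than being rejected outright, since the heuristic cannot tell a hallucination from a legitimate paraphrase.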

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

http://www.liiofindia.org/in/cases/cen/INSC/
Reference: Shukla et al. 2022 (IN-Abs dataset)

Risks & Boundaries

Limitations

Evaluation depends on automatic NER (spaCy), which struggles with some Indian legal names.

Chunking long documents creates boundary errors (merged or truncated sentences).
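
One common way to reduce such boundary errors is to split only at sentence boundaries. The sketch below is not the chunking used in the paper; it assumes a naive regex sentence splitter (which will mis-split legal abbreviations such as "v." or "Art.") and a word budget as a proxy for the model's token limit. Function names and the example text are illustrative.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_by_sentence(text: str, max_words: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own oversized chunk
    rather than being cut mid-sentence.
    """
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "The appeal is allowed. The order of the High Court is set aside. "
    "Costs are awarded to the appellant."
)
chunks = chunk_by_sentence(text, max_words=13)
# chunks[0] holds the first two sentences (13 words); chunks[1] holds the third.
```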

When Not To Use

Do not auto‑publish summaries for high‑stakes legal consumption without human review.

Avoid using out‑of‑the‑box LLM summaries as definitive case citations or facts.

Failure Modes

Hallucinated foreign courts/statutes and wrong years (e.g., US courts inserted into Indian cases).

Confusing lawyers' names with parties or judges, producing misleading role assignments.

Core Entities

Models

Text‑Davinci‑003, Turbo‑GPT‑3.5 (ChatGPT), Legal‑Pegasus, Legal‑LED, LegPegasus‑IN, LegLED‑IN, BertSum, CaseSummarizer, SummaRunner/RNN_RNN

Metrics

ROUGE-2, ROUGE-L, METEOR, BLEU, SummaC, NEPrec, NumPrec

Datasets

IN-Abs (7,130 judgements; train 7,030; test 100)

Context Entities

Models

LegPegasus trained on SEC litigation releases (US)
LegLED based on Longformer encoder-decoder

Metrics

Accuracy

Datasets

Legal Information Institute of India case pages (source of IN-Abs)