Overview
The paper evaluates on a real legal dataset (100 test cases) using standard automatic metrics and manually inspected error examples. Results show metric gains alongside clear hallucinations, so near-term use should be human-supervised.
Citations: 24
Evidence Strength: 0.80
Confidence: 0.86
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 30%
Production readiness: 30%
Novelty: 40%
Why It Matters For Business
Abstractive models and LLMs can speed up the drafting of legal headnotes but still hallucinate names, dates, and courts. Use them for triage or first drafts with human review rather than for final publication.
Summary TLDR
The authors test general LLMs (Text‑Davinci‑003, ChatGPT) and legal abstractive models (Legal‑Pegasus, Legal‑LED, plus India‑fine‑tuned variants) on 100 Indian Supreme Court cases. Domain‑fine‑tuned abstractive models score slightly higher on ROUGE/METEOR/BLEU than extractive baselines, but abstractive outputs and LLM summaries still show frequent inconsistencies and hallucinated facts (wrong names, dates, courts). Best practice today: use human‑in‑the‑loop checking and domain fine‑tuning; avoid fully automatic publication of legal summaries.
Problem Statement
Can pre-trained abstractive summarization models and general LLMs be used off the shelf to produce accurate, trustworthy abstractive summaries of long legal case judgements? The paper asks whether these models match expert summaries and whether they introduce hallucinations or factual inconsistencies that would mislead legal users.
Main Contribution
Empirical comparison of LLMs, legal abstractive models, and extractive baselines on 100 Indian Supreme Court cases from the IN-Abs dataset.
Use of both standard matching metrics (ROUGE, METEOR, BLEU) and consistency metrics (SummaC, NEPrec, NumPrec).
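To make the matching metrics concrete, here is a minimal sketch of an n-gram overlap F1 in the spirit of ROUGE-2. It is an illustration only: the function names are hypothetical, tokenization is plain lowercase whitespace splitting, and no stemming is applied, so its scores will not reproduce the figures reported in the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=2):
    """Simplified ROUGE-N F1: clipped n-gram overlap between two texts.

    Assumption: whitespace tokenization and lowercasing only.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    expert = "the court dismissed the appeal"
    generated = "the court allowed the appeal"
    print(rouge_n_f1(generated, expert))
```

Note that the two example sentences share half of their bigrams even though they state opposite outcomes, which is one reason overlap metrics are paired with consistency metrics.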
Key Findings
Domain‑fine‑tuned abstractive models match expert summaries better than extractive models on measured metrics.
Abstractive models and LLMs produce measurable factual inconsistencies and hallucinations.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| ROUGE-2 F1 (best abstractive vs best extractive) | LegLED‑IN 0.255 | BertSum 0.2311 | +0.0239 | IN-Abs test (100 docs) | Table 3 shows LegLED‑IN ROUGE‑2 F1 0.255 vs BertSum 0.2311 | Table 3 |
| ROUGE-L F1 (statistically significant) | LegLED‑IN 0.2711 | BertSum 0.2082 | +0.0629 | IN-Abs test (100 docs) | Table 3 entries marked * indicate significant improvement over extractive models | Table 3 |
What To Try In 7 Days
Run existing legal abstractive models on a small set of your local cases and compare with expert summaries.
Implement simple post checks: verify all named entities and numeric tokens against the source text.
Fine‑tune a pre-trained legal model on a small, curated set of in‑domain summaries (hundreds of documents) and re-evaluate consistency.
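The post-check step above could be prototyped roughly as follows. This is a sketch under stated assumptions, not the paper's method: `unsupported_tokens` is a hypothetical helper, and two regular expressions stand in for a real NER model (numbers are digit runs, names are runs of two or more capitalized words). Anything it flags is a candidate for human review, not a confirmed hallucination.

```python
import re

# Assumption: digit runs (optionally with . or , separators) approximate numeric tokens.
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
# Assumption: two or more consecutive capitalized words approximate a named entity.
NAME_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def unsupported_tokens(summary, source):
    """Return numbers and name-like phrases in the summary that are absent from the source."""
    source_numbers = set(NUMBER_RE.findall(source))
    missing_numbers = [n for n in NUMBER_RE.findall(summary) if n not in source_numbers]
    source_lower = source.lower()
    missing_names = [m for m in NAME_RE.findall(summary) if m.lower() not in source_lower]
    return {"numbers": missing_numbers, "names": missing_names}


if __name__ == "__main__":
    source = "The appeal was heard by the Supreme Court in 1962. Ram Kumar was the appellant."
    summary = "In 1972 the Supreme Court heard Ram Kumar before the District Court of Texas."
    print(unsupported_tokens(summary, source))
```

A regex check like this misses single-word names and differently formatted dates, so it is best treated as a cheap first filter ahead of a proper NER pass.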
Risks & Boundaries
Limitations
Evaluation depends on automatic NER (spaCy), which struggles with some Indian legal names.
Chunking long documents creates boundary errors (merged or truncated sentences).
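One common way to reduce boundary errors of this kind is to cut chunks only at sentence ends. The sketch below illustrates that idea; it is not the chunking procedure used in the paper, `chunk_by_sentence` and `max_words` are hypothetical names, and the sentence splitter is deliberately naive (it will also break at abbreviations such as "v." that are frequent in legal text).

```python
import re


def chunk_by_sentence(text, max_words=400):
    """Group whole sentences into chunks of at most max_words words.

    Assumption: a sentence ends at '.', '?' or '!' followed by whitespace.
    A single sentence longer than max_words is kept whole in its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk rather than splitting the sentence.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    text = "One two three. Four five six. Seven eight nine."
    for chunk in chunk_by_sentence(text, max_words=6):
        print(chunk)
```

In practice the word budget would be replaced by the model's own token count, and a legal-aware sentence splitter would be needed to handle citations and abbreviations.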
When Not To Use
Do not auto‑publish summaries for high‑stakes legal consumption without human review.
Avoid using out‑of‑the‑box LLM summaries as definitive case citations or facts.
Failure Modes
Hallucinated foreign courts/statutes and wrong years (e.g., US courts inserted into Indian cases).
Confusing lawyers' names with parties or judges, producing misleading role assignments.

