Train LLMs to say “I don't know”: integrate unanswerability detection and RLHF to cut hallucinations to ~1%

Overview

Decision Snapshot: Needs Validation

Authors show consistent gains across automatic metrics, human evaluation, hallucination rates, and latency; main caveats are a single custom Chinese dataset and no public code or dataset link.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.80

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Shuyuan Lin, Lei Duan, Philip Hughes, Yuxuan Sheng

Links

Abstract / PDF

Why It Matters For Business

Embedding unanswerability detection inside the LLM reduces harmful hallucinations and raises end-to-end trust, which is crucial for customer-facing QA and support bots where wrong answers cause reputational or legal risk.

Who Should Care

ML Engineer, Product Manager, Engineering Lead, Data Scientist

Summary TLDR

SALU trains a single LLM to both answer questions and explicitly abstain when the context lacks an answer. The method combines supervised multi-task fine-tuning (answer + abstain) with a confidence-score-guided RLHF phase that heavily penalizes confident hallucinations. On the authors' Chinese C-IR Answerability dataset, SALU reaches 90.8% overall accuracy and 0.931 unanswerability F1, and reduces hallucination on unanswerable questions to 1.3%, while keeping inference latency practical (~485 ms).
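As a rough illustration of the supervised multi-task setup described above, the sketch below maps answerable and unanswerable examples to one prompt/target format. The abstention phrase, the prompt template, and the names `QAExample` and `to_sft_record` are assumptions made for illustration; the paper's actual phrase and template are not given in this summary.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical abstention phrase; the paper's fixed phrase is not stated here.
ABSTAIN = "I don't know based on the given context."


@dataclass
class QAExample:
    context: str
    question: str
    answer: Optional[str]  # None marks an unanswerable (no-answer) example


def to_sft_record(ex: QAExample) -> dict:
    """Build one prompt/target pair; unanswerable examples get the abstention phrase."""
    prompt = f"Context: {ex.context}\nQuestion: {ex.question}\nAnswer:"
    target = ex.answer if ex.answer is not None else ABSTAIN
    return {"prompt": prompt, "target": target}


examples = [
    QAExample("The bridge opened in 1932.", "When did the bridge open?", "1932"),
    QAExample("The bridge opened in 1932.", "Who designed the bridge?", None),
]
records = [to_sft_record(ex) for ex in examples]
```

Because both example types share one target field, the same fine-tuning loss covers answering and abstaining without a separate classifier head.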

Problem Statement

Generative LLMs often fabricate answers when the context lacks the needed information. External classifiers can detect unanswerability, but splitting the decision between a classifier and a generator causes inconsistency and leaves residual hallucination. The goal is a single LLM that both generates answers and reliably abstains when the context does not support one.

Main Contribution

SALU: a multi-task fine-tuned LLM that outputs answers or a fixed abstention phrase to embed unanswerability detection directly in generation.

A confidence-score-guided RLHF stage that rewards correct abstention and strongly penalizes confident hallucinations.
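One way to picture the reward shaping in the second contribution is the sketch below. It is not the authors' reward function: the constants, the `penalty_scale` parameter, and the abstention phrase are all assumptions chosen only to show a penalty that grows with the model's confidence when it answers wrongly.

```python
# Hypothetical abstention phrase; the paper's fixed phrase is not stated here.
ABSTAIN = "I don't know based on the given context."


def shaped_reward(response: str, confidence: float, answerable: bool,
                  is_correct: bool, penalty_scale: float = 2.0) -> float:
    """Illustrative scalar reward; every constant here is an assumption."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    abstained = response.strip() == ABSTAIN
    if abstained:
        # Reward correct abstention; mildly penalize abstaining on an answerable question.
        return 1.0 if not answerable else -0.5
    if answerable and is_correct:
        return 1.0
    # Wrong or fabricated answer: the penalty grows with stated confidence.
    return -(1.0 + penalty_scale * confidence)


confident_wrong = shaped_reward("John Smith", 0.9, answerable=False, is_correct=False)
hesitant_wrong = shaped_reward("John Smith", 0.1, answerable=False, is_correct=False)
```

Under these assumed constants, a confident fabrication scores lower than a hesitant one, and both score below a correct abstention, which is the ordering the description above calls for.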

Key Findings

Integrating unanswerability into the LLM improves end-to-end correctness.

Numbers: Overall accuracy 0.908 vs hybrid baseline 0.847 (+0.061)

Practical Use: Use an integrated approach rather than a separate classifier to raise the chance the system either answers correctly or abstains.

Evidence Ref: Table I: Overall Acc.

SALU achieves state-of-the-art unanswerability detection on the evaluated set.

Numbers: Unanswerability F1 = 0.931 vs BERT-C 0.885 (+0.046)

Practical Use: Train the generative model with negative (no-answer) examples to improve detection, instead of relying only on external classifiers.

Evidence Ref: Table I: Unanswerability F1
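For readers who want to reproduce the unanswerability F1 figure on their own data, the sketch below computes precision, recall, and F1 with "unanswerable" as the positive class. Treating abstention as the positive prediction is an assumption about how the metric is defined; the function name and toy labels are illustrative.

```python
from typing import Sequence, Tuple


def unanswerability_prf(gold_unanswerable: Sequence[bool],
                        pred_abstained: Sequence[bool]) -> Tuple[float, float, float]:
    """Precision, recall, F1 with 'unanswerable' as the positive class."""
    if len(gold_unanswerable) != len(pred_abstained):
        raise ValueError("gold and prediction lists must have the same length")
    pairs = list(zip(gold_unanswerable, pred_abstained))
    tp = sum(1 for g, p in pairs if g and p)        # abstained, truly unanswerable
    fp = sum(1 for g, p in pairs if not g and p)    # abstained, but answerable
    fn = sum(1 for g, p in pairs if g and not p)    # answered, but unanswerable
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [True, True, True, False, False]
pred = [True, True, False, True, False]
precision, recall, f1 = unanswerability_prf(gold, pred)
```

In this toy case there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all come out to 2/3.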

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	0.908	FT for Standard QA w/ Post-hoc BERT-C = 0.847	+0.061	C-IR Answerability test set	Table I overall accuracy	Table I
Unanswerability F1	0.931	BERT-C = 0.885	+0.046	C-IR Answerability test set	Table I unanswerability F1	Table I

What To Try In 7 Days

Add negative (no-answer) examples and fine-tune your LLM in a multi-task setup with a fixed abstention phrase; target a no-answer (NA) ratio of about 50% in the training mix.

Train a small reward model and run a brief RLHF loop that penalizes confident incorrect answers and rewards correct abstentions.

Measure hallucination rate and overall answer-or-abstain accuracy on a held-out set and compare to a post-hoc classifier pipeline.
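The measurement step in the list above can be sketched as follows. The definitions are assumptions: hallucination rate is taken as the share of unanswerable questions where the model did not abstain, and overall accuracy counts a normalized exact match on answerable questions or an abstention on unanswerable ones. The abstention phrase and the `evaluate` helper are hypothetical.

```python
from typing import Optional, Sequence

# Hypothetical abstention phrase; use whatever fixed phrase you trained with.
ABSTAIN = "I don't know based on the given context."


def evaluate(predictions: Sequence[str], golds: Sequence[Optional[str]]) -> dict:
    """golds[i] is the reference answer, or None when the question is unanswerable."""
    if len(predictions) != len(golds) or not golds:
        raise ValueError("need equal-length, non-empty inputs")
    correct = unanswerable = hallucinated = 0
    for pred, gold in zip(predictions, golds):
        abstained = pred.strip() == ABSTAIN
        if gold is None:
            unanswerable += 1
            if abstained:
                correct += 1          # correct abstention
            else:
                hallucinated += 1     # answered although no answer exists
        elif not abstained and pred.strip().lower() == gold.strip().lower():
            correct += 1              # normalized exact match
    return {
        "overall_accuracy": correct / len(golds),
        "hallucination_rate": hallucinated / unanswerable if unanswerable else 0.0,
    }


preds = ["1932", ABSTAIN, "John Smith", "1931"]
golds = ["1932", None, None, "1932"]
report = evaluate(preds, golds)
```

Running the same function on the outputs of a post-hoc classifier pipeline gives a like-for-like comparison on the held-out set.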

Optimization Features

Training Optimization

SFT

RLHF with PPO policy optimization

Reward-model learning from human preferences

Inference Optimization

Integrated discriminative signal in a single forward pass, avoiding extra model calls
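A minimal sketch of what this looks like at serving time, assuming the model emits a fixed abstention phrase: one generation call, followed by a string check instead of a second classifier call. The `generate` callable, the prompt template, and the phrase are placeholders, and `fake_generate` is a stub standing in for a real model.

```python
from typing import Callable, Optional

# Hypothetical abstention phrase; use whatever fixed phrase the model was trained on.
ABSTAIN = "I don't know based on the given context."


def answer_or_abstain(generate: Callable[[str], str],
                      context: str, question: str) -> Optional[str]:
    """Call the model once; return None when it emits the abstention phrase."""
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    output = generate(prompt).strip()
    return None if output == ABSTAIN else output


def fake_generate(prompt: str) -> str:
    """Stub model for demonstration only."""
    return "1932" if "When" in prompt else ABSTAIN


ctx = "The bridge opened in 1932."
answered = answer_or_abstain(fake_generate, ctx, "When did the bridge open?")
abstained = answer_or_abstain(fake_generate, ctx, "Who designed the bridge?")
```

Returning `None` for abstention lets the calling application decide how to respond, for example by escalating to a human agent.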

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Evaluation is on a custom Chinese C-IR dataset; cross-lingual generality is untested.

Dataset and code are not provided in the paper, limiting reproducibility.

When Not To Use

When you cannot afford the human labeling and compute that RLHF requires.

When you need maximal recall on partial-information queries and cannot tolerate abstention.

Failure Modes

Over-abstention on questions requiring complex or implicit inference.

Residual errors on subtle semantic distinctions and very long-range dependencies.

Core Entities

Models

LLaMA-2 (referenced)

Baichuan (referenced)

BERT (BertForSequenceClassification)

SALU (proposed LLM fine-tuned model)

Metrics

Accuracy

Unanswerability Precision

Unanswerability Recall

Unanswerability F1

Answerable QA F1

Exact Match

Hallucination Rate (%)

Inference Latency (ms)

Human Likert scores (Factuality, Fluency, Abstention, Hallucination avoidance)

Datasets

C-IR Answerability (authors' custom dataset)

Existing Chinese QA datasets (extended for C-IR)

Benchmarks

C-IR Answerability test set
