Train LLMs to say “I don't know”: integrate unanswerability detection and RLHF to cut hallucinations to ~1%

July 22, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

Authors show consistent gains across automatic metrics, human evaluation, hallucination rates, and latency; main caveats are a single custom Chinese dataset and no public code or dataset link.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.80

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Shuyuan Lin, Lei Duan, Philip Hughes, Yuxuan Sheng

Why It Matters For Business

Embedding unanswerability detection inside the LLM reduces harmful hallucinations and raises end-to-end trust, which is crucial for customer-facing QA and support bots where wrong answers cause reputational or legal risk.

Who Should Care

Summary TLDR

SALU trains a single LLM to both answer questions and explicitly abstain when the context lacks an answer. The method combines supervised multi-task fine-tuning (answer + abstain) with a confidence-score-guided RLHF phase that heavily penalizes confident hallucinations. On the authors' Chinese C-IR Answerability dataset, SALU reaches 90.8% overall accuracy and 0.931 unanswerability F1, and reduces the hallucination rate on unanswerable questions to 1.3%, while keeping inference latency practical (~485 ms).

Problem Statement

Generative LLMs often fabricate answers when the context lacks the needed information. External classifiers can detect unanswerability, but splitting the decision logic between a classifier and a generator causes inconsistency and leaves residual hallucination. We need an LLM that both generates answers and reliably abstains when appropriate.

Main Contribution

SALU: a multi-task fine-tuned LLM that outputs answers or a fixed abstention phrase to embed unanswerability detection directly in generation.

A confidence-score-guided RLHF stage that rewards correct abstention and strongly penalizes confident hallucinations.
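The reward idea in the second contribution can be illustrated with a minimal sketch. It assumes the policy emits a scalar confidence in [0, 1] alongside its output; the abstention phrase, the `Sample` type, the exact-match check, and every numeric constant below are hypothetical choices for illustration, not values reported by the authors.

```python
from dataclasses import dataclass

ABSTAIN = "I don't know."  # hypothetical fixed abstention phrase


@dataclass
class Sample:
    answerable: bool  # does the context contain the answer?
    gold: str         # reference answer ("" when unanswerable)


def shaped_reward(sample: Sample, output: str, confidence: float) -> float:
    """Illustrative reward shaping; all constants are assumptions."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    abstained = output.strip() == ABSTAIN
    if not sample.answerable:
        if abstained:
            return 1.0                   # correct abstention is rewarded
        return -1.0 - 2.0 * confidence   # hallucination: penalty grows with confidence
    if abstained:
        return -0.5                      # over-abstention on an answerable question
    correct = output.strip() == sample.gold.strip()
    return 1.0 if correct else -1.0 - confidence
```

The asymmetry is the point of the design: a wrong answer given with high confidence costs more than the same wrong answer given hesitantly, so the policy is pushed toward abstaining when the context offers no support.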

Key Findings

Integrating unanswerability into the LLM improves end-to-end correctness.

Numbers: Overall accuracy 0.908 vs hybrid baseline 0.847 (+0.061)

Practical Use: Use an integrated approach rather than a separate classifier to raise the chance the system either answers correctly or abstains.

Evidence Ref: Table I: Overall Acc.

SALU achieves state-of-the-art unanswerability detection on the evaluated set.

Numbers: Unanswerability F1 = 0.931 vs BERT-C 0.885 (+0.046)

Practical Use: Train the generative model with negative (no-answer) examples to improve detection, instead of relying only on external classifiers.

Evidence Ref: Table I: Unanswerability F1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 0.908 | FT for Standard QA w/ Post-hoc BERT-C = 0.847 | +0.061 | C-IR Answerability test set | Table I overall accuracy | Table I
Unanswerability F1 | 0.931 | BERT-C = 0.885 | +0.046 | C-IR Answerability test set | Table I unanswerability F1 | Table I
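For readers who want to reproduce these kinds of numbers on their own data, here is a rough sketch of how the metrics could be computed. It treats "unanswerable" as the positive class, counts an output as an abstention only if it exactly matches a fixed phrase, uses exact match as the correctness check for answerable questions, and defines the hallucination rate as the share of unanswerable questions that received an answer. These definitions and the `evaluate` helper are assumptions for illustration; the paper's exact definitions may differ.

```python
ABSTAIN = "I don't know."  # hypothetical fixed abstention phrase


def evaluate(records, abstain=ABSTAIN):
    """records: iterable of (answerable: bool, gold: str, prediction: str)."""
    tp = fp = fn = correct = total = n_unanswerable = 0
    for answerable, gold, pred in records:
        total += 1
        abstained = pred.strip() == abstain
        if not answerable:
            n_unanswerable += 1
            if abstained:
                tp += 1        # correctly flagged as unanswerable
                correct += 1
            else:
                fn += 1        # answered anyway: counted as a hallucination
        elif abstained:
            fp += 1            # abstained although an answer existed
        elif pred.strip() == gold.strip():
            correct += 1       # exact-match correct answer
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "overall_accuracy": correct / total if total else 0.0,
        "unanswerability_f1": f1,
        "hallucination_rate": fn / n_unanswerable if n_unanswerable else 0.0,
    }
```

Running the same function over the outputs of an integrated model and a post-hoc classifier pipeline gives a like-for-like comparison.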

What To Try In 7 Days

Add negative (no-answer) examples and fine-tune your LLM in a multi-task setup with a fixed abstention phrase; target a no-answer (NA) ratio of about 50%.

Train a small reward model and run a brief RLHF loop that penalizes confident incorrect answers and rewards correct abstentions.

Measure hallucination rate and overall answer-or-abstain accuracy on a held-out set and compare to a post-hoc classifier pipeline.
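The first step above, mixing no-answer examples into the fine-tuning set, might look like the following sketch. The prompt template, the abstention phrase, and the `build_sft_set` helper are hypothetical; only the idea of a fixed abstention target and a roughly 50% NA ratio comes from the summary.

```python
import random

ABSTAIN = "I don't know."  # hypothetical fixed abstention phrase


def _prompt(context: str, question: str) -> str:
    # Hypothetical prompt template shared by both example types.
    return f"Context: {context}\nQuestion: {question}\nAnswer:"


def build_sft_set(answerable, unanswerable, na_ratio=0.5, seed=0):
    """Mix answerable and no-answer examples into one fine-tuning set.

    answerable:   list of (context, question, answer) tuples
    unanswerable: list of (context, question) tuples
    """
    if not 0.0 < na_ratio < 1.0:
        raise ValueError("na_ratio must be strictly between 0 and 1")
    rng = random.Random(seed)
    # NA count such that NA / (NA + answerable) is close to na_ratio,
    # capped by the number of no-answer examples actually available.
    n_na = round(len(answerable) * na_ratio / (1.0 - na_ratio))
    n_na = min(n_na, len(unanswerable))
    picked = rng.sample(unanswerable, n_na)
    rows = [{"prompt": _prompt(c, q), "target": a} for c, q, a in answerable]
    rows += [{"prompt": _prompt(c, q), "target": ABSTAIN} for c, q in picked]
    rng.shuffle(rows)
    return rows
```

Because both example types share one prompt format and differ only in the target, the model learns abstention as just another output of the same generation task.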

Optimization Features

Training Optimization
SFT; RLHF with PPO policy optimization; reward-model learning from human preferences
Inference Optimization
integrated discriminative signal in single forward pass to avoid extra model calls
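A minimal sketch of what the single-pass design means at inference time: the caller makes one model call and reads the abstention decision directly from the generated text, with no separate classifier call. The `generate` callable, the prompt template, and the abstention phrase are placeholders, not the authors' interface.

```python
from typing import Callable, NamedTuple

ABSTAIN = "I don't know."  # hypothetical fixed abstention phrase


class QAResult(NamedTuple):
    text: str
    abstained: bool


def answer_or_abstain(generate: Callable[[str], str],
                      context: str, question: str) -> QAResult:
    """Run one generation and derive the abstention flag from its output."""
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    output = generate(prompt).strip()  # the only model call
    return QAResult(text=output, abstained=(output == ABSTAIN))
```

Downstream code can branch on `abstained` to escalate to a human or to a fallback search, which is where the latency saved by skipping a second model call becomes useful.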

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Evaluation is on a custom Chinese C-IR dataset; cross-lingual generality is untested.

Dataset and code are not provided in the paper, limiting reproducibility.

When Not To Use

When you cannot afford the human labeling and compute budget that RLHF requires.

If you need maximal recall on partial-information queries and cannot tolerate abstention.

Failure Modes

Over-abstention on questions requiring complex or implicit inference.

Residual errors on subtle semantic distinctions and very long-range dependencies.

Core Entities

Models

LLaMA-2 (referenced); Baichuan (referenced); BERT (BertForSequenceClassification); SALU (proposed LLM fine-tuned model)

Metrics

Accuracy; Unanswerability Precision; Unanswerability Recall; Unanswerability F1; Answerable QA F1; Exact Match; Hallucination Rate (%); Inference Latency (ms); Human Likert scores (Factuality, Fluency, Abstention, Hallucination avoidance)

Datasets

C-IR Answerability (authors' custom dataset); Existing Chinese QA datasets (extended for C-IR)

Benchmarks

C-IR Answerability test set