Train LLMs to say “I don't know”: integrate unanswerability detection and RLHF to cut hallucinations to ~1%

July 22, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Shuyuan Lin, Lei Duan, Philip Hughes, Yuxuan Sheng

Why It Matters For Business

Embedding unanswerability detection inside the LLM reduces harmful hallucinations and raises end-to-end trust, which is crucial for customer-facing QA and support bots where wrong answers cause reputational or legal risk.

Summary TLDR

SALU trains a single LLM to both answer questions and explicitly abstain when the context lacks an answer. The method combines supervised multi-task fine-tuning (answer + abstain) with a confidence-score-guided RLHF phase that heavily penalizes confident hallucinations. On the authors' Chinese C-IR Answerability dataset, SALU reaches 90.8% overall accuracy and 0.931 unanswerability F1, and reduces the hallucination rate on unanswerable questions to 1.3%, with an average inference latency of about 485 ms per query.

Problem Statement

Generative LLMs often fabricate answers when the context lacks the needed information. External classifiers can detect unanswerability, but splitting the decision between a classifier and a generator causes inconsistency and leaves residual hallucinations. The goal is a single LLM that both generates answers and reliably abstains when appropriate.

Main Contribution

SALU: a multi-task fine-tuned LLM that outputs either an answer or a fixed abstention phrase, embedding unanswerability detection directly in generation.

A confidence-score-guided RLHF stage that rewards correct abstention and strongly penalizes confident hallucinations.

A new Chinese C-IR Answerability dataset with sentence/paragraph/ranked-list answerability labels and a set of evaluation protocols.
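The confidence-score-guided reward described above can be illustrated with a minimal sketch. The summary does not give the paper's actual reward formula, so the abstention phrase, the `Sample` fields, and every constant below are hypothetical; the only property carried over from the source is that correct abstention is rewarded and a hallucination is penalized more heavily the more confident the model is.

```python
from dataclasses import dataclass

# Hypothetical fixed abstention phrase; the paper's actual phrase is not given here.
ABSTAIN = "I don't know."


@dataclass
class Sample:
    response: str      # text produced by the policy model
    confidence: float  # model confidence in [0, 1] (assumed to be available)
    answerable: bool   # gold label: does the context contain the answer?
    correct: bool      # does a non-abstaining response match the gold answer?


def reward(s: Sample, penalty_scale: float = 2.0) -> float:
    """Illustrative reward shaping; all constants are assumptions."""
    abstained = s.response.strip() == ABSTAIN
    if abstained:
        # Reward correct abstention; mildly penalize abstaining on an answerable question.
        return 1.0 if not s.answerable else -0.5
    if s.answerable and s.correct:
        return 1.0
    # Wrong or hallucinated answer: the penalty grows with confidence.
    return -(1.0 + penalty_scale * s.confidence)
```

With these placeholder constants, a hallucination at confidence 0.95 scores lower than the same hallucination at confidence 0.10, which is the ordering a PPO-style update would need to push the policy toward abstaining instead of guessing confidently.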

Key Findings

Integrating unanswerability into the LLM improves end-to-end correctness.

Numbers: Overall accuracy 0.908 vs hybrid baseline 0.847 (+0.061)

SALU achieves state-of-the-art unanswerability detection on the evaluated set.

Numbers: Unanswerability F1 = 0.931 vs BERT-C 0.885 (+0.046)

RLHF dramatically cuts hallucinations on unanswerable queries.

Numbers: Hallucination rate drops to 1.3% from 88.7% (FT QA baseline)

A balanced training data mix gives the best trade-off.

Numbers: 50% NA examples yield the best Overall Acc (0.908) and Unanswerability F1 (0.931)
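One way to reach a target NA share is to subsample whichever pool is larger. The helper below is a sketch under assumptions: the function name, the list-of-examples input format, and the subsampling strategy are all illustrative and are not taken from the paper.

```python
import random


def mix_to_na_ratio(answerable, unanswerable, na_ratio=0.5, seed=0):
    """Subsample so unanswerable (NA) examples make up about na_ratio of the mix."""
    if not 0.0 < na_ratio < 1.0:
        raise ValueError("na_ratio must be strictly between 0 and 1")
    rng = random.Random(seed)
    # Largest total size for which both pools can still supply their share.
    n_total = min(len(unanswerable) / na_ratio, len(answerable) / (1.0 - na_ratio))
    n_na = min(len(unanswerable), int(n_total * na_ratio))
    n_ans = min(len(answerable), int(n_total * (1.0 - na_ratio)))
    mixed = rng.sample(unanswerable, n_na) + rng.sample(answerable, n_ans)
    rng.shuffle(mixed)  # avoid ordering the batch by class
    return mixed
```

For example, with 100 answerable and 30 unanswerable examples and `na_ratio=0.5`, this keeps all 30 unanswerable examples and samples 30 answerable ones. Generating more negatives instead of discarding positives is an alternative when data is scarce.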

Results

Accuracy

Value: 0.908

Baseline: FT for Standard QA w/ Post-hoc BERT-C = 0.847

Unanswerability F1

Value: 0.931

Baseline: BERT-C = 0.885

Hallucination rate on unanswerable questions

Value: 1.3%

Baseline: FT for Standard QA = 88.7%

Inference latency (average per query)

Value: 485 ms

Baseline: FT for Standard QA w/ Post-hoc BERT-C = 530 ms

Human eval - Appropriateness of abstention (5-point scale)

Value: 4.8

Baseline: FT for Standard QA w/ Post-hoc BERT-C = 4.0

Who Should Care

What To Try In 7 Days

Add negative (no-answer, NA) examples and fine-tune your LLM in a multi-task setup with a fixed abstention phrase; target an NA ratio of roughly 50%.

Train a small reward model and run a brief RLHF loop that penalizes confident incorrect answers and rewards correct abstentions.

Measure hallucination rate and overall answer-or-abstain accuracy on a held-out set and compare to a post-hoc classifier pipeline.
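The measurement step can be sketched as follows. The record format, the abstention phrase, and the exact-match correctness rule are assumptions for illustration; the paper's evaluation protocol may score answers differently.

```python
# Hypothetical fixed abstention phrase used during fine-tuning.
ABSTAIN = "I don't know."


def _norm(text):
    return text.strip().lower()


def is_correct(record):
    """A record is a dict with 'prediction' and 'gold'; gold is None if unanswerable."""
    if record["gold"] is None:
        return record["prediction"].strip() == ABSTAIN
    # Assumed scoring rule: normalized exact match against the gold answer.
    return _norm(record["prediction"]) == _norm(record["gold"])


def evaluate(records):
    """Return overall answer-or-abstain accuracy and hallucination rate on unanswerable items."""
    unanswerable = [r for r in records if r["gold"] is None]
    # A hallucination here means any non-abstaining output on an unanswerable question.
    hallucinated = sum(1 for r in unanswerable if r["prediction"].strip() != ABSTAIN)
    correct = sum(1 for r in records if is_correct(r))
    return {
        "overall_accuracy": correct / len(records) if records else 0.0,
        "hallucination_rate": hallucinated / len(unanswerable) if unanswerable else 0.0,
    }
```

Running the same function over the outputs of the integrated model and of a generator-plus-classifier pipeline (with rejected answers mapped to the abstention phrase) puts both systems on the same footing.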

Optimization Features

Training Optimization

  • SFT
  • RLHF with PPO policy optimization
  • reward-model learning from human preferences

Inference Optimization

  • Integrated discriminative signal in a single forward pass, avoiding extra model calls
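The difference in model calls can be made concrete with a small sketch. `generate` and `is_answerable` are hypothetical stand-ins for a fine-tuned generator and a post-hoc classifier, and the abstention phrase is assumed; none of these names come from the paper.

```python
from typing import Callable, Optional

# Hypothetical fixed abstention phrase emitted by the integrated model.
ABSTAIN = "I don't know."


def integrated_qa(generate: Callable[[str], str], prompt: str) -> Optional[str]:
    """One model call: the generator itself emits either an answer or ABSTAIN."""
    output = generate(prompt).strip()
    return None if output == ABSTAIN else output


def posthoc_qa(
    generate: Callable[[str], str],
    is_answerable: Callable[[str], bool],
    prompt: str,
) -> Optional[str]:
    """Two model calls: generate an answer, then let a separate classifier veto it."""
    output = generate(prompt).strip()
    return output if is_answerable(prompt) else None
```

Both functions return `None` for an abstention, so downstream code can treat the two pipelines identically while the integrated path skips the second call.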

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Evaluation is on a custom Chinese C-IR dataset; cross-lingual generality is untested.
  • Dataset and code are not provided in the paper, limiting reproducibility.
  • Method can be over-conservative (over-abstention) on subtle or highly implicit answers.
  • RLHF requires human preference labels and extra compute, increasing cost and complexity.
  • Reward model bias can shape abstention behavior in unintended ways.

When Not To Use

  • When you cannot afford human labeling or the RLHF compute budget.
  • When you need maximal recall on partial-information queries and cannot tolerate abstention.
  • When operating in a language/domain not covered by the training data without adaptation.

Failure Modes

  • Over-abstention on questions requiring complex or implicit inference.
  • Residual errors on subtle semantic distinctions and very long-range dependencies.
  • Bias introduced by the reward model or annotator preferences.
  • Dependence on the quality and domain coverage of retrieved passages.

Core Entities

Models

  • LLaMA-2 (referenced)
  • Baichuan (referenced)
  • BERT (BertForSequenceClassification)
  • SALU (proposed fine-tuned LLM)

Metrics

  • Accuracy
  • Unanswerability Precision
  • Unanswerability Recall
  • Unanswerability F1
  • Answerable QA F1
  • Exact Match
  • Hallucination Rate (%)
  • Inference Latency (ms)
  • Human Likert scores (Factuality, Fluency, Abstention, Hallucination avoidance)
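The unanswerability precision, recall, and F1 listed above are standard binary-classification metrics if "unanswerable" is treated as the positive class. The sketch below assumes that convention and assumes an abstention counts as a positive prediction; the function name and input format are illustrative.

```python
def unanswerability_prf(gold_unanswerable, pred_abstained):
    """Precision, recall, and F1 with 'unanswerable' as the positive class.

    gold_unanswerable: list of bools, True if the question has no answer in context.
    pred_abstained:    list of bools, True if the model abstained.
    """
    if len(gold_unanswerable) != len(pred_abstained):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(gold_unanswerable, pred_abstained))
    tp = sum(1 for g, p in pairs if g and p)      # correctly abstained
    fp = sum(1 for g, p in pairs if not g and p)  # abstained on an answerable question
    fn = sum(1 for g, p in pairs if g and not p)  # answered an unanswerable question
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Low precision here corresponds to over-abstention, while low recall corresponds to answering questions that should have been declined.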

Datasets

  • C-IR Answerability (authors' custom dataset)
  • Existing Chinese QA datasets (extended for C-IR)

Benchmarks

  • C-IR Answerability test set