Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
Embedding unanswerability detection inside the LLM reduces harmful hallucinations and raises end-to-end trust, which is crucial for customer-facing QA and support bots where wrong answers cause reputational or legal risk.
Summary TLDR
SALU trains a single LLM to both answer questions and explicitly abstain when the context lacks an answer. The method combines supervised multi-task fine-tuning (answer + abstain) with a confidence-score-guided RLHF phase that heavily penalizes confident hallucinations. On the authors' Chinese C-IR Answerability dataset, SALU reaches 90.8% overall accuracy and 0.931 unanswerability F1, and reduces hallucination on unanswerable questions to 1.3%, while keeping inference latency practical (about 485 ms per query on average).
Problem Statement
Generative LLMs often fabricate answers when the context lacks the needed information. External classifiers can detect unanswerability, but splitting the decision logic between a classifier and a generator causes inconsistency and leaves residual hallucination. The goal is a single LLM that both generates answers and reliably abstains when appropriate.
Main Contribution
SALU: a multi-task fine-tuned LLM that outputs either an answer or a fixed abstention phrase, so that unanswerability detection is embedded directly in generation.
A confidence-score-guided RLHF stage that rewards correct abstention and strongly penalizes confident hallucinations.
A new Chinese C-IR Answerability dataset with sentence/paragraph/ranked-list answerability labels and a set of evaluation protocols.
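The idea behind the confidence-score-guided reward can be sketched in a few lines. This summary does not give the actual reward formula, so the function name `abstention_reward`, the constants, and the linear scaling by confidence are all assumptions chosen only to show the shape: correct abstention is rewarded, and a wrong answer costs more the more confident the model was.

```python
def abstention_reward(abstained: bool, answerable: bool,
                      correct: bool, confidence: float) -> float:
    """Hypothetical reward shaping; NOT the authors' formula.

    confidence: the model's self-reported confidence in [0, 1].
    correct: whether a produced answer matches the reference
             (ignored when the model abstained).
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if abstained:
        # Abstaining is right only when the context has no answer.
        return 1.0 if not answerable else -0.5
    if answerable and correct:
        # Correct answers earn slightly more when stated confidently.
        return 0.5 + 0.5 * confidence
    # Wrong answer, or any answer to an unanswerable question:
    # the penalty grows with confidence (assumed linear scaling).
    return -1.0 - 2.0 * confidence
```

With these illustrative constants, a fully confident hallucination scores -3.0 while a hesitant one scores -1.0, which is the asymmetry the contribution describes.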
Key Findings
Integrating unanswerability into the LLM improves end-to-end correctness.
SALU achieves state-of-the-art unanswerability detection on the evaluated set.
RLHF dramatically cuts hallucinations on unanswerable queries.
A balanced mix of answerable and no-answer training data matters for the best trade-off; a no-answer ratio of about 50% is the suggested target.
Results
Accuracy
90.8%
Unanswerability F1
0.931
Hallucination rate on unanswerable questions
1.3%
Inference latency (average per query)
~485 ms
Human eval - Appropriateness of abstention (5-point scale)
Who Should Care
What To Try In 7 Days
Add negative (no-answer) examples and fine-tune your LLM in a multi-task setup with a fixed abstention phrase; target ~50% NA ratio.
Train a small reward model and run a brief RLHF loop that penalizes confident incorrect answers and rewards correct abstentions.
Measure hallucination rate and overall answer-or-abstain accuracy on a held-out set and compare to a post-hoc classifier pipeline.
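The first step above (mixing no-answer examples into the fine-tuning set at a target ratio) can be sketched as follows. The abstention phrase, the prompt template, and the helper name `build_sft_mix` are hypothetical; the source only says that a fixed phrase and a roughly 50% no-answer ratio are used.

```python
import random

# Hypothetical fixed abstention phrase; the source does not give the wording.
ABSTAIN_PHRASE = "I cannot answer this from the given context."


def build_sft_mix(answerable, unanswerable, na_ratio=0.5, seed=0):
    """Build prompt/target pairs with roughly `na_ratio` no-answer examples.

    answerable:   list of (question, context, answer) tuples
    unanswerable: list of (question, context) tuples
    """
    if not 0.0 < na_ratio < 1.0:
        raise ValueError("na_ratio must be strictly between 0 and 1")
    rng = random.Random(seed)

    # Solve n_na / (n_ans + n_na) = na_ratio for n_na.
    n_ans = len(answerable)
    n_na = round(n_ans * na_ratio / (1.0 - na_ratio))
    if n_na > len(unanswerable):
        # Too few no-answer examples: downsample the answerable side instead.
        n_na = len(unanswerable)
        n_ans = min(len(answerable), round(n_na * (1.0 - na_ratio) / na_ratio))

    def prompt(question, context):
        # Assumed prompt template, for illustration only.
        return f"Context: {context}\nQuestion: {question}\nAnswer:"

    examples = [{"prompt": prompt(q, c), "target": a}
                for q, c, a in rng.sample(answerable, n_ans)]
    examples += [{"prompt": prompt(q, c), "target": ABSTAIN_PHRASE}
                 for q, c in rng.sample(unanswerable, n_na)]
    rng.shuffle(examples)
    return examples
```

The resulting list can be fed to whatever supervised fine-tuning loop is already in place; both task types share one prompt format, so the model learns to choose between answering and abstaining from the context alone.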
Optimization Features
Training Optimization
- SFT
- RLHF with PPO policy optimization
- reward-model learning from human preferences
Inference Optimization
- answerability signal integrated into a single forward pass, avoiding extra model calls
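A minimal sketch of what single-call inference could look like at the application layer, assuming the fine-tuned model emits a fixed abstention phrase. The phrase, the prompt template, and the `generate` callable are placeholders for whatever model client is in use; none of them come from the source.

```python
from typing import Callable, NamedTuple

# Hypothetical fixed abstention phrase; the source does not give the wording.
ABSTAIN_PHRASE = "I cannot answer this from the given context."


class QAResult(NamedTuple):
    abstained: bool
    text: str


def answer_or_abstain(generate: Callable[[str], str],
                      question: str, context: str) -> QAResult:
    """Call the model once and read the abstention decision from its output."""
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    output = generate(prompt).strip()  # the only model call
    # No separate classifier: the decision is carried by the generated text.
    return QAResult(abstained=output.startswith(ABSTAIN_PHRASE), text=output)
```

Compared with a post-hoc classifier pipeline, the caller only has to check the returned `abstained` flag, so the answer and the answerability decision cannot disagree.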
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Evaluation is on a custom Chinese C-IR dataset; cross-lingual generality is untested.
- Dataset and code are not provided in the paper, limiting reproducibility.
- Method can be over-conservative (over-abstention) on subtle or highly implicit answers.
- RLHF requires human preference labels and extra compute, increasing cost and complexity.
- Reward model bias can shape abstention behavior in unintended ways.
When Not To Use
- When you cannot afford the human labeling and compute budget that RLHF requires.
- If you need maximal recall on partial-information queries and cannot tolerate abstention.
- When operating in a language/domain not covered by the training data without adaptation.
Failure Modes
- Over-abstention on questions requiring complex or implicit inference.
- Residual errors on subtle semantic distinctions and very long-range dependencies.
- Bias introduced by the reward model or annotator preferences.
- Dependence on the quality and domain coverage of retrieved passages.
Core Entities
Models
- LLaMA-2 (referenced)
- Baichuan (referenced)
- BERT (BertForSequenceClassification)
- SALU (proposed fine-tuned LLM)
Metrics
- Accuracy
- Unanswerability Precision
- Unanswerability Recall
- Unanswerability F1
- Answerable QA F1
- Exact Match
- Hallucination Rate (%)
- Inference Latency (ms)
- Human Likert scores (Factuality, Fluency, Abstention, Hallucination avoidance)
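The automatic metrics in this list can be computed from per-question records, as in the sketch below. The definitions are assumptions, since the summary does not spell them out: "unanswerable" is treated as the positive class for precision, recall, and F1, and the hallucination rate is taken as the share of unanswerable questions that received an answer instead of an abstention. The record keys are hypothetical.

```python
def abstention_metrics(records):
    """records: iterable of dicts with boolean keys
    'answerable', 'abstained', and 'correct' (answer matches reference)."""
    records = list(records)
    # Positive class = unanswerable question, predicted positive = abstained.
    tp = sum(1 for r in records if not r["answerable"] and r["abstained"])
    fp = sum(1 for r in records if r["answerable"] and r["abstained"])
    fn = sum(1 for r in records if not r["answerable"] and not r["abstained"])
    answered_right = sum(1 for r in records
                         if r["answerable"] and not r["abstained"] and r["correct"])

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        # Correct = right answer when answerable, abstention when not.
        "accuracy": (tp + answered_right) / len(records) if records else 0.0,
        "unans_precision": precision,
        "unans_recall": recall,
        "unans_f1": f1,
        # Assumed definition: answered although no answer exists.
        "hallucination_rate": fn / (tp + fn) if tp + fn else 0.0,
    }
```

Under these assumed definitions the hallucination rate is one minus unanswerability recall, so it is worth reporting precision alongside it to catch over-abstention.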
Datasets
- C-IR Answerability (authors' custom dataset)
- Existing Chinese QA datasets (extended for C-IR)
Benchmarks
- C-IR Answerability test set

