Prevent hallucinations by checking whether the model 'knows' concepts before answering

September 6, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The approach is novel and evaluated on four open models and a new dataset, with both automated and human labels; it is practical but adds inference cost and needs access to model internals for constrained decoding.

Citations: 10

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 70%

Authors

Junyu Luo, Cao Xiao, Fenglong Ma


Why It Matters For Business

SELF-FAMILIARITY can reduce incorrect or fabricated outputs by blocking low-familiarity prompts before generation, improving customer trust and reducing downstream fact-checking costs.

Who Should Care

Summary TLDR

This paper introduces SELF-FAMILIARITY, a zero-resource, pre-generation guard that checks whether an LLM is familiar with the concepts in an instruction and withholds the answer if familiarity is low. It extracts concepts with NER, asks the model to explain each concept, masks the concept wherever it appears in that explanation, and then uses constrained beam search to try to regenerate the concept from the masked text. The resulting per-concept probability scores are weighted and aggregated into an instruction-level familiarity score. Evaluated on four open models and a new Concept-7 dataset, SELF-FAMILIARITY yields higher AUC and accuracy than the baselines (for example, AUC 0.927 vs 0.872 on Vicuna) and flags unfamiliar instructions before the model produces possibly hallucinated text.
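The masking step described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function name `mask_concept`, the `[MASK]` placeholder string, and the case-insensitive whole-phrase matching are all assumptions made for the example.

```python
import re


def mask_concept(explanation: str, concept: str, mask: str = "[MASK]") -> str:
    """Replace every whole-phrase mention of `concept` in `explanation` with `mask`.

    Assumption: mentions are matched case-insensitively, and a match may not be
    glued to other word characters (so "cat" is not masked inside "category").
    """
    pattern = re.compile(
        r"(?<!\w)" + re.escape(concept) + r"(?!\w)",
        flags=re.IGNORECASE,
    )
    return pattern.sub(mask, explanation)


# Hypothetical explanation text that a model might produce for the concept.
explanation = "The Eiffel Tower is in Paris. The eiffel tower was built in 1889."
masked = mask_concept(explanation, "Eiffel Tower")
print(masked)
```

The masked text is what the model would then be asked to complete; if it cannot recover the original concept with high probability, the concept is treated as unfamiliar.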

Problem Statement

Large language models can confidently produce fabricated facts (hallucinations). Existing detectors work after generation or rely on external knowledge, making them reactive, brittle to prompt style, or unavailable in zero-resource settings. We need a proactive, zero-resource way to stop the model from answering on topics it likely does not know.

Main Contribution

SELF-FAMILIARITY: a zero-resource, pre-generation self-evaluation that flags instructions with unfamiliar concepts to prevent hallucinations.

A three-step pipeline: concept extraction (NER + grouping/filtering), concept guessing (explain then mask and constrained-beam recover), and weighted aggregation by word-frequency importance.
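As a rough sketch of the aggregation step, the snippet below combines per-concept familiarity scores into one instruction-level score, giving rarer concepts more weight. The exact weighting in the paper is not reproduced here: the `-log(frequency)` weight, the function name, and the example frequencies are assumptions chosen only to illustrate "weighted aggregation by word-frequency importance".

```python
import math
from typing import Dict


def instruction_familiarity(
    concept_scores: Dict[str, float],
    concept_freqs: Dict[str, float],
) -> float:
    """Weighted average of per-concept scores.

    concept_scores: familiarity score per concept, assumed to lie in [0, 1].
    concept_freqs:  relative corpus frequency per concept, in (0, 1).
    Assumption: importance weight = -log(frequency), so rare concepts count more.
    """
    if not concept_scores:
        raise ValueError("no concepts were extracted from the instruction")
    weights = {c: -math.log(concept_freqs[c]) for c in concept_scores}
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[c] * concept_scores[c] for c in concept_scores) / total


# Hypothetical numbers: one common, well-known concept and one rare, unknown one.
scores = {"Paris": 0.9, "quokkaflux": 0.1}
freqs = {"Paris": 0.01, "quokkaflux": 0.0001}
overall = instruction_familiarity(scores, freqs)
print(round(overall, 4))
```

With these made-up inputs the rare concept carries twice the weight of the common one, so the instruction-level score is pulled toward the low score of the unfamiliar concept.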

Key Findings

SELF-FAMILIARITY outperforms baselines on hallucinatory-instruction classification.

Numbers: Vicuna AUC=0.927 vs best baseline 0.872 (Table 2)

Practical Use: Run SELF-FAMILIARITY before generation to catch many hallucination-prone prompts and avoid producing wrong answers.

Evidence Ref: Table 2

Performance is consistent across model styles.

Numbers: AUC 0.918 to 0.927 across four tested models (Table 2)

Practical Use: The method transfers well; the same guard can be used for different open models without heavy retuning.

Evidence Ref: Table 2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| SELF-FAMILIARITY on hallucinatory-instruction classification (Vicuna-13b-v1.3) | AUC 0.927, ACC 0.868, F1 0.854, Pearson 0.693 | Best baseline Sample-BERTScore AUC 0.872 (Table 2) | AUC +0.055 | Concept-7 test set | Table 2 main classification results | Table 2 |
| SELF-FAMILIARITY across models (AUC range) | AUC 0.918 to 0.927 | Various baselines with larger variance | Consistently higher AUC vs baselines across models | Concept-7 test set | Table 2 cross-model comparison | Table 2 |
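For readers who want to compute the same kind of AUC on their own prompts, here is a small sketch. It uses the standard pairwise definition of AUC: the probability that a randomly chosen familiar instruction gets a higher familiarity score than a randomly chosen unfamiliar one, with ties counted as half. The function name and the example scores are illustrative assumptions, not values from the paper.

```python
from typing import Sequence


def familiarity_auc(
    familiar_scores: Sequence[float],
    unfamiliar_scores: Sequence[float],
) -> float:
    """Pairwise (Mann-Whitney) AUC for a score where higher means more familiar."""
    if not familiar_scores or not unfamiliar_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for f in familiar_scores:
        for u in unfamiliar_scores:
            if f > u:
                wins += 1.0   # correctly ordered pair
            elif f == u:
                wins += 0.5   # tie counts as half
    return wins / (len(familiar_scores) * len(unfamiliar_scores))


# Hypothetical scores for a handful of labeled instructions.
auc = familiarity_auc([0.9, 0.8, 0.4], [0.5, 0.2, 0.1])
print(auc)
```

An AUC of 1.0 means the score separates the two groups perfectly, and 0.5 means it does no better than chance. The quadratic loop is fine for a small labeled set; a rank-based implementation would be preferable for large evaluations.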

What To Try In 7 Days

Run the three-step pipeline (NER → explain & mask → constrained beam search) on one model and flag low-familiarity prompts.

Set a threshold using a small set of known concepts and bootstrap intervals as in the paper.

If flagged, either withhold the automated reply or trigger a retrieval step to gather background knowledge before answering.
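The thresholding and routing steps above might look like the following sketch. The paper's exact bootstrap procedure is not reproduced here; this version assumes you have familiarity scores for a few concepts the model clearly knows and a few made-up concepts it cannot know, bootstraps a percentile interval for each group mean, and places the threshold midway between the two intervals. All names and numbers are hypothetical.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_interval(
    scores: Sequence[float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of `scores`."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def pick_threshold(familiar: Sequence[float], unfamiliar: Sequence[float]) -> float:
    """Assumption: threshold = midpoint between the two groups' intervals."""
    familiar_lo, _ = bootstrap_interval(familiar)
    _, unfamiliar_hi = bootstrap_interval(unfamiliar)
    return (familiar_lo + unfamiliar_hi) / 2


def route(score: float, threshold: float, retrieval_available: bool = False) -> str:
    """Decide what to do with an instruction given its familiarity score."""
    if score >= threshold:
        return "answer"
    return "retrieve" if retrieval_available else "withhold"


# Hypothetical calibration scores.
threshold = pick_threshold(
    familiar=[0.82, 0.90, 0.77, 0.88, 0.85],
    unfamiliar=[0.10, 0.20, 0.15, 0.05, 0.12],
)
print(round(threshold, 3), route(0.9, threshold), route(0.1, threshold))
```

The midpoint rule is only one reasonable choice; if false alarms are cheaper than hallucinated answers in your setting, moving the threshold closer to the familiar group would be a natural adjustment.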

Agent Features

Tool Use
Constrained beam search for controlled generation
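To make the idea of constrained beam search concrete, the toy sketch below searches for the most probable output that contains a target concept. It is not the paper's implementation: the bigram table stands in for an LLM conditioned on the masked explanation, the function names are invented, the prefix matcher is simplified, and the beam is kept small. Hypotheses are grouped by how much of the target they have matched so that promising partial matches are not pruned away.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[float, Tuple[str, ...], int]  # (log-prob, tokens, matched count)

# Hypothetical toy "model": P(next token | previous token).
BIGRAM: Dict[str, Dict[str, float]] = {
    "<s>": {"the": 0.5, "eiffel": 0.4, "big": 0.1},
    "the": {"eiffel": 0.5, "big": 0.3, "tower": 0.2},
    "eiffel": {"tower": 0.9, "the": 0.1},
    "big": {"tower": 0.5, "the": 0.5},
    "tower": {"the": 1.0},
}


def toy_next_logprobs(prefix: Tuple[str, ...]) -> Dict[str, float]:
    last = prefix[-1] if prefix else "<s>"
    return {tok: math.log(p) for tok, p in BIGRAM[last].items()}


def constrained_beam_score(
    next_logprobs: Callable[[Tuple[str, ...]], Dict[str, float]],
    target: Tuple[str, ...],
    beam_size: int = 4,
    max_len: int = 6,
) -> float:
    """Best log-probability found for a sequence that ends by completing `target`."""
    if not target:
        raise ValueError("target must contain at least one token")
    n = len(target)
    beams: List[Hypothesis] = [(0.0, (), 0)]
    best = -math.inf
    for _ in range(max_len):
        banks: Dict[int, List[Hypothesis]] = {}
        for logp, tokens, matched in beams:
            for tok, tok_logp in next_logprobs(tokens).items():
                # Simplified matcher: extend the match, restart it, or reset it.
                if tok == target[matched]:
                    new_matched = matched + 1
                else:
                    new_matched = 1 if tok == target[0] else 0
                new_logp = logp + tok_logp
                if new_matched == n:
                    best = max(best, new_logp)  # constraint satisfied: record score
                else:
                    banks.setdefault(new_matched, []).append(
                        (new_logp, tokens + (tok,), new_matched)
                    )
        for bank in banks.values():
            bank.sort(key=lambda h: h[0], reverse=True)
        # Fill the beam round-robin across banks, most-matched bank first.
        beams, rank, order = [], 0, sorted(banks, reverse=True)
        while len(beams) < beam_size and any(rank < len(banks[k]) for k in order):
            for k in order:
                if rank < len(banks[k]) and len(beams) < beam_size:
                    beams.append(banks[k][rank])
            rank += 1
        if not beams:
            break
    return best


known = math.exp(constrained_beam_score(toy_next_logprobs, ("eiffel", "tower")))
unknown = math.exp(constrained_beam_score(toy_next_logprobs, ("quokka",)))
print(known, unknown)
```

In this toy setup a concept the model can produce gets a clearly positive probability, while a concept outside its vocabulary gets zero. Each search step expands every hypothesis in the beam, so cost grows with both beam size and output length, which is why a large beam adds noticeable inference cost.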

Optimization Features

Infra Optimization
Requires beam search and max-prob decoding; needs GPUs for speed
Inference Optimization
Uses constrained beam search; higher compute at inference

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Requires constrained beam search and access to generation probabilities; does not work with API-only models that hide decoding control.

Constrained search with beam size 30 and masking increases inference cost and latency.

When Not To Use

Low-latency production paths where beam search is too slow.

Black-box API models without constrained decoding controls.

Failure Modes

False negatives when the model is familiar with a concept but expresses it in a different phrasing that the constrained search misses.

False positives when NER splits or filters a concept improperly, changing the evaluated concept.

Core Entities

Models

Vicuna-13b-v1.3, Falcon-7b-instruct, mpt-7b-instruct, Alpaca-7b

Metrics

AUC, ACC, F1, Pearson

Datasets

Concept-7