Prevent hallucinations by checking whether the model 'knows' concepts before answering

September 6, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.4

Citation Count

10

Authors

Junyu Luo, Cao Xiao, Fenglong Ma

Why It Matters For Business

SELF-FAMILIARITY can reduce incorrect or fabricated outputs by blocking low-familiarity prompts before generation, improving customer trust and reducing downstream fact-checking costs.

Summary TLDR

This paper introduces SELF-FAMILIARITY, a zero-resource, pre-generation guard that checks whether an LLM is familiar with the concepts in an instruction and withholds the answer when familiarity is low. It extracts concepts with NER, asks the model to explain each concept, masks the concept inside that explanation, and uses constrained beam search to try to regenerate the concept from the masked text. Per-concept probability scores are weighted and aggregated into an instruction-level familiarity score. Evaluated on four open models and a new Concept-7 dataset, SELF-FAMILIARITY achieves higher AUC and accuracy than the baselines (for example, AUC 0.927 vs 0.872 for the best baseline on Vicuna) and flags unfamiliar instructions before the model produces possibly hallucinated text.
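
The explain, mask, and regenerate step can be illustrated with a minimal sketch. It assumes the explanation and the beam-search candidates already exist as plain strings with log-probabilities; the names `mask_concept`, `concept_score`, and the `[MASK]` token are hypothetical, and scoring by the most probable candidate that contains the concept is one plausible rule, not necessarily the paper's exact formula.

```python
import math
import re
from typing import List, Tuple

MASK = "[MASK]"  # hypothetical placeholder token (assumption)


def mask_concept(explanation: str, concept: str, mask: str = MASK) -> str:
    """Hide the concept phrase and its individual words inside an explanation."""
    if not concept.strip():
        raise ValueError("concept must be non-empty")
    # Longest terms first so the full phrase is matched before its parts.
    terms = sorted({concept, *concept.split()}, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in terms)
    # Lookarounds instead of \b so terms such as "C++" still match.
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)
    return pattern.sub(mask, explanation)


def concept_score(candidates: List[Tuple[str, float]], concept: str) -> float:
    """Score a concept from (text, log_prob) beam candidates.

    Assumed rule: probability of the best candidate that contains the
    concept, or 0.0 if no candidate recovers it.
    """
    probs = [math.exp(lp) for text, lp in candidates
             if concept.lower() in text.lower()]
    return max(probs, default=0.0)


masked = mask_concept("Aspirin is a pain reliever. Aspirin thins blood.", "aspirin")
print(masked)
```

In a real setup the candidates would come from the model's constrained beam search over the masked explanation; here they are passed in directly so the scoring logic stays testable on its own.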

Problem Statement

Large language models can confidently produce fabricated facts (hallucinations). Existing detectors work after generation or rely on external knowledge, making them reactive, brittle to prompt style, or unavailable in zero-resource settings. We need a proactive, zero-resource way to stop the model from answering on topics it likely does not know.

Main Contribution

SELF-FAMILIARITY: a zero-resource, pre-generation self-evaluation that flags instructions with unfamiliar concepts to prevent hallucinations.

A three-step pipeline: concept extraction (NER plus grouping and filtering), concept guessing (explain the concept, mask it in the explanation, then recover it with constrained beam search), and weighted aggregation by word-frequency importance.

A new evaluation dataset, Concept-7 (192 basic concepts; 515 test instructions) and experiments on four open LMs showing consistent gains vs parameter- and prompt-based baselines.

Human and GPT-4 annotations used to build familiarity labels and thresholds; ablations show each processing step helps.
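
The weighted-aggregation step can be sketched as below. This is an illustration under stated assumptions, not the paper's formula: the weight `log1p(rank)` (rarer words get larger weights), the `freq_rank` lookup, and `default_rank` for out-of-vocabulary concepts are all hypothetical choices.

```python
import math
from typing import Dict


def aggregate_familiarity(
    scores: Dict[str, float],
    freq_rank: Dict[str, int],
    default_rank: int = 100_000,  # assumed rank for words missing from the list
) -> float:
    """Combine per-concept scores into one instruction-level score.

    Concepts with a higher frequency rank (rarer words) get more weight,
    so one unfamiliar rare concept pulls the instruction score down.
    """
    if not scores:
        raise ValueError("need at least one concept score")
    weights = {
        concept: math.log1p(freq_rank.get(concept.lower(), default_rank))
        for concept in scores
    }
    total = sum(weights.values())
    return sum(weights[c] * scores[c] for c in scores) / total


# Toy values for illustration only.
toy_scores = {"water": 0.9, "zorblax": 0.1}
toy_ranks = {"water": 100}  # "zorblax" is absent, so it gets default_rank
print(aggregate_familiarity(toy_scores, toy_ranks))
```

With these toy inputs the result falls below the unweighted mean of 0.5, because the rare, low-scoring concept carries more weight than the common one.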

Key Findings

SELF-FAMILIARITY outperforms baselines on hallucinatory-instruction classification.

Numbers: Vicuna AUC=0.927 vs best baseline 0.872 (Table 2)

Performance is consistent across model styles.

Numbers: AUC 0.918–0.927 across four tested models (Table 2)

Concept Guessing alone is very strong in controlled settings.

Numbers: Concept-only Vicuna AUC=0.966, ACC=0.928, F1=0.921 (Table 5)

Results

SELF-FAMILIARITY on hallucinatory-instruction classification (Vicuna-13b-v1.3)

Value: AUC 0.927, ACC 0.868, F1 0.854, Pearson 0.693

Baseline: Best baseline Sample-BERTScore AUC 0.872 (Table 2)

SELF-FAMILIARITY across models (AUC range)

Value: AUC 0.918–0.927

Baseline: various baselines with larger variance

Concept-only evaluation (Vicuna-13b-v1.3)

Value: AUC 0.966, ACC 0.928, F1 0.921, Pearson 0.844

Baseline: Sample-BERTScore AUC 0.920 (Table 5)
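
For readers who want to reproduce this kind of evaluation on their own labels, the four reported metrics can be computed with the standard library alone. The sketch below assumes binary labels where 1 means "the model is familiar" and that higher scores mean more familiar; that label convention and the toy numbers are assumptions for illustration, not data from the paper.

```python
import math
from typing import Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUC: chance that a positive outscores a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def acc_f1(scores: Sequence[float], labels: Sequence[int],
           threshold: float) -> Tuple[float, float]:
    """Accuracy and F1 after thresholding scores into 0/1 predictions."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    correct = sum(p == y for p, y in zip(preds, labels))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return correct / len(labels), f1


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy data for illustration only.
toy_scores = [0.9, 0.8, 0.35, 0.4, 0.1]
toy_labels = [1, 1, 1, 0, 0]
print(auc(toy_scores, toy_labels), acc_f1(toy_scores, toy_labels, 0.5))
```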

What To Try In 7 Days

Run the three-step pipeline (NER → explain & mask → constrained beam search) on one model and flag low-familiarity prompts.

Set a threshold using a small set of known concepts and bootstrap intervals as in the paper.

If flagged, either withhold the automated reply or trigger a retrieval step to gather background knowledge before answering.
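
The thresholding and routing steps above can be prototyped in a few lines. This is a sketch under assumptions: it takes the familiarity scores of concepts the model is known to handle, bootstraps a low quantile of that distribution, and uses the lower end of a 95% interval as the cut-off. The function names, the 5% quantile, and the routing labels are hypothetical, and the paper's exact procedure may differ.

```python
import random
from typing import List, Sequence


def bootstrap_threshold(
    known_scores: Sequence[float],
    quantile: float = 0.05,  # assumed: low quantile of known-concept scores
    n_boot: int = 1000,
    seed: int = 0,
) -> float:
    """Estimate a conservative familiarity threshold from known concepts."""
    if not known_scores:
        raise ValueError("need at least one known-concept score")
    rng = random.Random(seed)  # seeded so the result is reproducible
    n = len(known_scores)
    estimates: List[float] = []
    for _ in range(n_boot):
        # Resample with replacement and record the chosen quantile.
        sample = sorted(rng.choices(known_scores, k=n))
        estimates.append(sample[int(quantile * (n - 1))])
    estimates.sort()
    # Lower end of a 95% bootstrap interval for that quantile.
    return estimates[int(0.025 * (n_boot - 1))]


def route(score: float, threshold: float, retrieval_available: bool) -> str:
    """Decide what to do with an instruction given its familiarity score."""
    if score >= threshold:
        return "answer"
    return "retrieve" if retrieval_available else "withhold"


# Toy scores for illustration only.
known = [0.6, 0.7, 0.8, 0.9, 0.75, 0.85, 0.65, 0.95]
threshold = bootstrap_threshold(known)
print(threshold, route(0.2, threshold, retrieval_available=True))
```

Taking the lower end of the interval keeps the guard from rejecting instructions that are only slightly below typical known-concept scores; a stricter deployment could use a higher quantile instead.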

Agent Features

Tool Use

  • Constrained beam search for controlled generation

Optimization Features

Infra Optimization

  • Requires beam search and max-prob decoding; needs GPUs for speed

Inference Optimization

  • Uses constrained beam search; higher compute at inference

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires constrained beam search and access to generation probabilities; does not work with API-only models that hide decoding control.
  • Constrained search with beam size 30 and masking increases inference cost and latency.
  • NER + Wiktionary heuristics can miss or mis-group concepts in noisy instructions.
  • Concept-7 is limited in size and uses fabricated concepts to balance unfamiliar examples.

When Not To Use

  • Low-latency production paths where beam search is too slow.
  • Black-box API models without constrained decoding controls.
  • Use-cases that require full external knowledge access rather than internal familiarity checks.

Failure Modes

  • False negatives when model is familiar but expresses the concept in a different phrasing that constrained search misses.
  • False positives when NER splits or filters a concept improperly, changing the evaluated concept.
  • Flags unfamiliar but relevant domain concepts that could be answered after a quick retrieval step; the guard needs integration with retrieval to recover.

Core Entities

Models

  • Vicuna-13b-v1.3
  • Falcon-7b-instruct
  • mpt-7b-instruct
  • Alpaca-7b

Metrics

  • AUC
  • ACC
  • F1
  • Pearson

Datasets

  • Concept-7