Overview
The approach is novel and is evaluated on four open models and a new dataset (Concept-7), with both automated and human labels. It is practical, but it adds inference cost and requires access to model internals for constrained decoding.
Citations: 10
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
SELF-FAMILIARITY can reduce incorrect or fabricated outputs by blocking low-familiarity prompts before generation, improving customer trust and reducing downstream fact-checking costs.
Who Should Care
Summary TLDR
This paper introduces SELF-FAMILIARITY, a zero-resource, pre-generation guard that checks whether an LLM is familiar with the concepts in an instruction and withholds the answer if familiarity is low. It extracts concepts with NER, asks the model to explain each concept, masks the concept within that explanation, and uses constrained beam search to try to regenerate the concept from the masked text. Per-concept probability scores are weighted and aggregated into an instruction-level familiarity score. Evaluated on four open models and a new Concept-7 dataset, SELF-FAMILIARITY yields substantially higher AUC and accuracy than baselines and flags unfamiliar instructions before the model produces possibly hallucinated text.
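The masking step in the summary above can be illustrated with a minimal sketch. It assumes that masking means replacing every literal, case-insensitive mention of the concept in the model's explanation with a placeholder; the function name `mask_concept` and the `[MASK]` token are hypothetical and are not taken from the paper.

```python
import re


def mask_concept(explanation: str, concept: str, mask: str = "[MASK]") -> str:
    """Replace every case-insensitive mention of `concept` with `mask`.

    Assumption: literal string matching is enough. Paraphrases or
    inflected forms of the concept are not handled in this sketch.
    """
    pattern = re.compile(re.escape(concept), flags=re.IGNORECASE)
    return pattern.sub(mask, explanation)


masked = mask_concept(
    "Python is a language. Many people like python.", "Python"
)
```

The masked explanation is what the model would then be asked to recover the concept from; the recovery itself needs constrained decoding and is not shown here.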
Problem Statement
Large language models can confidently produce fabricated facts (hallucinations). Existing detectors work after generation or rely on external knowledge, making them reactive, brittle to prompt style, or unavailable in zero-resource settings. We need a proactive, zero-resource way to stop the model from answering on topics it likely does not know.
Main Contribution
SELF-FAMILIARITY: a zero-resource, pre-generation self-evaluation that flags instructions with unfamiliar concepts to prevent hallucinations.
A three-step pipeline: concept extraction (NER + grouping/filtering), concept guessing (explain then mask and constrained-beam recover), and weighted aggregation by word-frequency importance.
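The aggregation step can be sketched as a weighted average of per-concept scores. This is a sketch under stated assumptions, not the paper's formula: the weight used here (negative log of a relative word frequency, so rarer concepts count more) and the names `concept_weight` and `aggregate_familiarity` are illustrative choices.

```python
import math


def concept_weight(relative_freq: float) -> float:
    """Hypothetical importance weight: rarer concepts get larger weights.

    `relative_freq` is assumed to be a corpus frequency in (0, 1].
    """
    if not 0.0 < relative_freq <= 1.0:
        raise ValueError("relative_freq must be in (0, 1]")
    return -math.log(relative_freq)


def aggregate_familiarity(concepts: list[tuple[float, float]]) -> float:
    """Combine (familiarity_score, relative_freq) pairs into one score."""
    if not concepts:
        raise ValueError("at least one concept is required")
    weights = [concept_weight(freq) for _, freq in concepts]
    total = sum(weights)
    if total == 0.0:
        # Every concept has frequency 1.0; fall back to a plain mean.
        return sum(score for score, _ in concepts) / len(concepts)
    return sum(w * score for w, (score, _) in zip(weights, concepts)) / total


# A common, well-known concept and a rare, poorly recovered one.
instruction_score = aggregate_familiarity([(1.0, 0.1), (0.0, 0.001)])
```

With this weighting, one rare concept that the model cannot recover pulls the instruction-level score down more than a common concept pulls it up.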
Key Findings
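SELF-FAMILIARITY outperforms baselines on hallucinatory-instruction classification: on Vicuna-13b-v1.3 it reaches AUC 0.927, against 0.872 for the best baseline, Sample-BERTScore (Table 2).
Performance is consistent across models: AUC stays within 0.918–0.927 on the Concept-7 test set, while the baselines vary more (Table 2).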
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| SELF-FAMILIARITY on hallucinatory-instruction classification (Vicuna-13b-v1.3) | AUC 0.927, ACC 0.868, F1 0.854, Pearson 0.693 | Best baseline Sample-BERTScore AUC 0.872 (Table 2) | AUC +0.055 | Concept-7 test set | Table 2 main classification results | Table 2 |
| SELF-FAMILIARITY across models (AUC range) | AUC 0.918–0.927 | various baselines with larger variance | Consistently higher AUC vs baselines across models | Concept-7 test set | Table 2 cross-model comparison | Table 2 |
What To Try In 7 Days
Run the three-step pipeline (NER → explain & mask → constrained beam search) on one model and flag low-familiarity prompts.
Set a threshold using a small set of known concepts and bootstrap intervals as in the paper.
If flagged, either withhold the automated reply or trigger a retrieval step to gather background knowledge before answering.
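The thresholding and routing steps above can be sketched as follows. The paper's exact bootstrap procedure is not described here, so this assumes one plausible reading: resample the familiarity scores of known concepts, take a low quantile of each resample as a candidate threshold, and report the median candidate with a 95% interval. The names `bootstrap_threshold` and `route` and all default values are hypothetical.

```python
import random


def bootstrap_threshold(known_scores, n_resamples=1000, quantile=0.05, seed=0):
    """Estimate a threshold and a 95% interval from known-concept scores."""
    if not known_scores:
        raise ValueError("known_scores must not be empty")
    rng = random.Random(seed)
    n = len(known_scores)
    estimates = []
    for _ in range(n_resamples):
        # Resample with replacement and take a low quantile of the resample.
        sample = sorted(rng.choice(known_scores) for _ in range(n))
        estimates.append(sample[int(quantile * (n - 1))])
    estimates.sort()
    low = estimates[int(0.025 * (n_resamples - 1))]
    high = estimates[int(0.975 * (n_resamples - 1))]
    point = estimates[n_resamples // 2]
    return point, (low, high)


def route(score: float, threshold: float) -> str:
    """Answer when familiar; otherwise retrieve background or withhold."""
    return "answer" if score >= threshold else "retrieve_or_withhold"


scores = [0.62, 0.71, 0.55, 0.80, 0.67, 0.74, 0.59, 0.90]
threshold, (low, high) = bootstrap_threshold(scores)
decision = route(0.30, threshold)
```

A wide interval suggests that the set of known concepts is too small to fix a stable threshold.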
Agent Features
Tool Use
Optimization Features
Infra Optimization
Inference Optimization
Risks & Boundaries
Limitations
Requires constrained beam search and access to generation probabilities; does not work with API-only models that hide decoding control.
Constrained search with beam size 30 and masking increases inference cost and latency.
When Not To Use
Low-latency production paths where beam search is too slow.
Black-box API models without constrained decoding controls.
Failure Modes
False negatives when the model is familiar with a concept but expresses it in a phrasing that the constrained search misses.
False positives when NER splits or filters a concept incorrectly, so that a different concept is evaluated.

