Let LLMs label and correct themselves: filter unknowns, prefer better answers, and reduce hallucinations

June 17, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The approach is practical and low-cost: it uses off-the-shelf LLMs and a reference-free detector, and it shows consistent gains on Wiki-derived tests and leaderboard tasks. However, it is demonstrated only on Wikipedia topics and with a single iteration, so expect to tune it per model and domain.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 50%

Novelty: 60%

Authors

Wei Jie Yeo, Teddy Ferdinan, Przemyslaw Kazienko, Ranjan Satapathy, Erik Cambria

Links

Abstract / PDF / Code

Why It Matters For Business

You can reduce dependence on human labels and third-party data by having models generate and clean their own training pairs; selective filtering keeps corrections focused and preserves existing capabilities, lowering labeling cost and privacy exposure.

Summary TLDR

The paper presents a low-human-effort self-training pipeline where an LLM generates instructions and labels, then preference-tunes itself only on samples marked as "unknown" by a reference-free consistency check (SelfCheckGPT). Filtering for low-confidence and high-knowledge samples produces a smaller high-signal preference dataset used with Direct Preference Optimization (DPO). Experiments on Wikipedia-derived topics and three model sizes (1.1B, 7B, 13B) show reduced hallucination on held-out Wiki questions and preserved or slightly improved accuracy on out-of-distribution benchmarks. The method needs only one iteration, uses GPT-3.5/GPT-4 for generation and judging, and publicly shares the

Problem Statement

Fine-tuning LLMs needs lots of labeled data and compute. Self-generated data risks hallucinations and noisy training. Naïve preference tuning can degrade earlier knowledge (catastrophic forgetting). The paper asks whether an LLM can safely self-train by detecting its own unknowns and selectively preference-tuning to reduce hallucinations while avoiding forgetting.

Main Contribution

A four-step self-training pipeline: instruction generation, supervised fine-tuning (SFT), preference labeling, and two-stage filtering (consistency + knowledge).

A reference-free knowledge detector based on contradiction scores (SelfCheckGPT / DeBERTa NLI encoder) to mark samples as unknown and to filter preference data.
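The idea behind a contradiction-based detector can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `contradiction_score` and `toy_nli` are hypothetical names, and `toy_nli` is a string-matching stand-in for a real NLI model (such as a DeBERTa-based classifier) that would return a contradiction probability for a premise and hypothesis.

```python
from statistics import mean
from typing import Callable, Sequence

# Assumed interface: returns P(contradiction) in [0, 1] for (premise, hypothesis).
ContradictionFn = Callable[[str, str], float]


def contradiction_score(answer: str, samples: Sequence[str], nli: ContradictionFn) -> float:
    """Average contradiction of `answer` against independently sampled answers.

    A high score means the model's resampled answers disagree with `answer`,
    which is treated here as a sign that the model does not know the fact.
    """
    if not samples:
        raise ValueError("need at least one sampled answer")
    # Each sampled answer acts as the premise; the assessed answer is the hypothesis.
    return mean(nli(sample, answer) for sample in samples)


def toy_nli(premise: str, hypothesis: str) -> float:
    """Stand-in for a real NLI model: 0.0 if the strings match, else 1.0."""
    return 0.0 if premise.strip().lower() == hypothesis.strip().lower() else 1.0


samples = ["Paris", "Paris", "Lyon", "paris"]
score = contradiction_score("Paris", samples, toy_nli)  # one of four samples disagrees
```

In practice the toy function would be replaced by a model call, and the score would be compared against a per-model threshold to mark a sample as unknown.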

Key Findings

Self-training with filtering increases truthfulness on Wiki-Test (GPT-4 judged pairwise).

Numbers: Wiki-Test wins: 1.1B 54% win vs 16% lose; 7B 40.4% win vs 9.9% lose; 13B 36.9% win vs 12% lose

Practical Use: If you want fewer hallucinations on domain documents, self-generate training pairs and preference-tune only on samples flagged as unknown; it's effective across small and mid-sized models.

Evidence Ref: Table 2; Figure 2; evaluation uses GPT-4 pairwise LLM-Judge

Selective filtering (consistency + knowledge) produces better or similar downstream accuracy than training on the full preference set.

Numbers: Open LLM leaderboard average acc: 7B Ours 53.3% vs w/o filtering 50.9% (+2.4); 13B Ours 56.6% vs w/o 54.9% (+1.7); 1.1B

Practical Use: Filter preference pairs before DPO to get better truthfulness while avoiding degradation on unrelated tasks; you may need to tune the knowledge threshold per model size.

Evidence Ref: Table 1 (accuracy averages across ARC, HellaSwag, TruthfulQA, Winogrande, MMLU)
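Filtering preference pairs before DPO could look roughly like the sketch below. Everything here is illustrative: the `PreferencePair` fields, the `select_unknown` helper, and both threshold values are assumptions, since this summary does not give the paper's exact scores or cutoffs. The sketch keeps a pair only when the consistency check disagrees strongly and the knowledge signal is weak.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    contradiction: float  # consistency signal; higher means less self-consistent
    knowledge: float      # hypothetical knowledge signal; higher means better known


def select_unknown(
    pairs: List[PreferencePair],
    tau_consistency: float = 0.5,  # assumed threshold, tune per model
    tau_knowledge: float = 0.5,    # assumed threshold, tune per model
) -> List[PreferencePair]:
    """Keep pairs flagged as unknown: inconsistent answers and a weak knowledge signal."""
    return [
        p for p in pairs
        if p.contradiction >= tau_consistency and p.knowledge < tau_knowledge
    ]


pairs = [
    PreferencePair("q1", "a", "b", contradiction=0.8, knowledge=0.2),  # unknown: kept
    PreferencePair("q2", "a", "b", contradiction=0.1, knowledge=0.9),  # known: dropped
    PreferencePair("q3", "a", "b", contradiction=0.8, knowledge=0.9),  # mixed: dropped
]
kept = select_unknown(pairs)
```

Keeping the two thresholds as separate parameters makes it easier to tune the knowledge cutoff per model size, as the practical-use note suggests.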

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Win/lose rate vs. SFT | 1.1B: 54% win / 16% lose; 7B: 40.4% win / 9.9% lose; 13B: 36.9% win / 12% lose | SFT (instruct-tuned) | Improved win rates over SFT on held-out Wiki questions | Wiki-Test (200 questions) | Table 2; Figure 2 | Table 2, Figure 2 |
| Accuracy | 1.1B Ours 42.1%, 7B Ours 53.3%, 13B Ours 56.6% | w/o filtering (preference tuning on full set) and SFT | 1.1B +2.2 vs w/o filtering; 7B +2.4; 13B +1.7 | ARC, HellaSwag, TruthfulQA, Winogrande, MMLU (mixed) | Table 1 (accuracy per task and averages) | Table 1 |
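The win and lose percentages reported for Wiki-Test can be turned into a net margin per model with simple arithmetic. The sketch below also reports the remainder that is neither a win nor a loss; treating that remainder as ties is an assumption, since the summary does not say how non-win, non-lose comparisons were labeled.

```python
# (win %, lose %) on Wiki-Test, copied from the results above.
results = {
    "1.1B": (54.0, 16.0),
    "7B": (40.4, 9.9),
    "13B": (36.9, 12.0),
}


def summarize(win: float, lose: float) -> dict:
    """Net margin (win minus lose) and the share that is neither win nor lose."""
    return {
        "net_margin": round(win - lose, 1),
        "remainder": round(100.0 - win - lose, 1),  # assumed to be ties
    }


summary = {model: summarize(win, lose) for model, (win, lose) in results.items()}

for model, stats in summary.items():
    print(model, stats)
```

On these numbers the net margin shrinks as model size grows (38.0, 30.5, and 24.9 points), while the share of comparisons that are neither wins nor losses rises.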

What To Try In 7 Days

Pick 1–2 target topics and sample ~100 documents each from a trusted source (e.g., Wikipedia).

Use an instruction-tuned LLM (e.g., GPT-3.5) to auto-generate instructions, then run a few-shot SFT pass to create baseline outputs.

Compute contradiction scores (NLI encoder or SelfCheckGPT) across K=5–10 samples per prompt and flag "unknown" samples where consistency is low and knowledge signal is weak; set τ_

Optimization Features

Training Optimization
Selective preference tuning (DPO) on filtered high-signal pairs
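For readers unfamiliar with DPO, the per-pair objective can be written out directly. This is a sketch of the standard DPO loss for a single preference pair, assuming you already have summed token log-probabilities from the policy and a frozen reference model; the function name, argument names, and the default `beta` are illustrative and not taken from the paper's code.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may move from the reference
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair: -log(sigmoid(beta * margin))."""
    # How much more the policy favors each answer than the reference model does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(z)) equals log(1 + exp(-z)); branch to avoid overflow for large |z|.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: the loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy raises the chosen answer and lowers the rejected one: the loss drops.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

A real training run would average this loss over the filtered pairs and backpropagate through the policy log-probabilities, typically via an existing trainer rather than hand-written code.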

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Single-iteration self-training only; no multi-round continual tuning experiments.

Experiments limited to 10 Wikipedia topics; model may have seen some content during pretraining.

When Not To Use

In high-stakes domains without human verification (medical, legal) where hallucinations must be strictly audited.

If you have ample high-quality human-labeled data, since human labels may be safer than self-generated pairs.

Failure Modes

DPO degeneration when preferred and dispreferred answers differ only marginally, producing noisy gradients.

False positives in knowledge detection (misclassifying unfamiliar task formatting as unknown) leading to over-filtering.
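One possible guard against the first failure mode is to drop pairs whose two answers are nearly identical before training. This check is not described in the summary above; it is a hypothetical illustration using character-level similarity from the standard library, and both the `is_marginal_pair` helper and the 0.9 cutoff are assumptions.

```python
from difflib import SequenceMatcher


def is_marginal_pair(chosen: str, rejected: str, max_similarity: float = 0.9) -> bool:
    """True when the two answers are so similar that the preference signal is likely noise."""
    similarity = SequenceMatcher(None, chosen, rejected).ratio()
    return similarity >= max_similarity


candidates = [
    ("The capital of France is Paris.", "The capital of France is Paris."),
    ("The capital of France is Paris.", "It is Lyon, a city in the southeast."),
]
# Keep only pairs whose answers differ enough to carry a usable preference signal.
usable = [(c, r) for c, r in candidates if not is_marginal_pair(c, r)]
```

Character overlap is a crude proxy; a semantic similarity measure would catch paraphrases that this check misses.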

Core Entities

Models

TinyLlama-1.1B, Llama2-7B, Llama2-13B

Metrics

LLM-Judge pairwise ranking (GPT-4), Accuracy

Datasets

Wikipedia (source), Wiki-Test (200 questions), SFT, D_DPO (preference candidates), D* (filtered preference set)

Benchmarks

Open LLM leaderboard, ARC, HellaSwag, TruthfulQA, Winogrande, MMLU