Let LLMs label and correct themselves: filter unknowns, prefer better answers, and reduce hallucinations

Overview

Decision Snapshot: Needs Validation

The approach is practical and low-cost: it uses off-the-shelf LLMs and a reference-free detector, and it shows consistent gains on Wiki-derived tests and leaderboard tasks. However, it is demonstrated only on Wikipedia topics and with a single self-training iteration, so expect to tune it per model and domain.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 50%

Novelty: 60%

Authors

Wei Jie Yeo, Teddy Ferdinan, Przemyslaw Kazienko, Ranjan Satapathy, Erik Cambria


Why It Matters For Business

You can reduce dependence on human labels and third-party data by having models generate and clean their own training pairs; selective filtering keeps corrections focused and preserves existing capabilities, lowering labeling cost and privacy exposure.

Who Should Care

ML Engineer, Product Manager, Data Scientist, Engineering Lead, CTO, Founder

Summary TLDR

The paper presents a low-human-effort self-training pipeline where an LLM generates instructions and labels, then preference-tunes itself only on samples marked as "unknown" by a reference-free consistency check (SelfCheckGPT). Filtering for low-confidence and high-knowledge samples produces a smaller high-signal preference dataset used with Direct Preference Optimization (DPO). Experiments on Wikipedia-derived topics and three model sizes (1.1B, 7B, 13B) show reduced hallucination on held-out Wiki questions and preserved or slightly improved accuracy on out-of-distribution benchmarks. The method needs only one iteration, uses GPT-3.5/GPT-4 for generation and judging, and publicly shares the

Problem Statement

Fine-tuning LLMs needs lots of labeled data and compute. Self-generated data risks hallucinations and noisy training. Naïve preference tuning can degrade earlier knowledge (catastrophic forgetting). The paper asks whether an LLM can safely self-train by detecting its own unknowns and selectively preference-tuning to reduce hallucinations while avoiding forgetting.

Main Contribution

A four-step self-training pipeline: instruction generation, supervised fine-tuning (SFT), preference labeling, and two-stage filtering (consistency + knowledge).

A reference-free knowledge detector based on contradiction scores (SelfCheckGPT / DeBERTa NLI encoder) to mark samples as unknown and to filter preference data.
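A minimal sketch of the reference-free detection idea described above, assuming the contradiction scorer is supplied by the caller. The names `contradiction_fn`, `unknown_score`, `is_unknown`, and the threshold `tau` are illustrative and are not taken from the paper or its repository. The toy scorer is a lexical stand-in so the example runs without a model; in practice an NLI encoder such as the DeBERTa model used by SelfCheckGPT would take its place.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical signature: returns a contradiction probability in [0, 1]
# for (premise, hypothesis). A real setup would wrap an NLI model here.
ContradictionFn = Callable[[str, str], float]


def toy_contradiction(premise: str, hypothesis: str) -> float:
    """Lexical stand-in, NOT a real NLI model: 1 - Jaccard token overlap."""
    a, b = set(premise.lower().split()), set(hypothesis.lower().split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def unknown_score(answer: str, samples: Sequence[str],
                  contradiction_fn: ContradictionFn) -> float:
    """Average contradiction between the main answer and K resampled answers."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    return mean(contradiction_fn(s, answer) for s in samples)


def is_unknown(answer: str, samples: Sequence[str],
               contradiction_fn: ContradictionFn, tau: float = 0.5) -> bool:
    """Flag the prompt as 'unknown' when samples disagree with the answer.

    tau is an assumed threshold; it would need tuning per model.
    """
    return unknown_score(answer, samples, contradiction_fn) > tau


consistent = ["paris is the capital", "paris is the capital"]
inconsistent = ["lyon is the capital", "it might be marseille"]
print(is_unknown("paris is the capital", consistent, toy_contradiction))    # False
print(is_unknown("paris is the capital", inconsistent, toy_contradiction))  # True
```

Keeping the scorer as a parameter means the thresholding logic can be tested on its own before a real NLI model is plugged in.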

Key Findings

Self-training with filtering increases truthfulness on Wiki-Test (GPT-4 judged pairwise).

Numbers: Wiki-Test wins: 1.1B 54% win vs 16% lose; 7B 40.4% win vs 9.9% lose; 13B 36.9% win vs 12% lose

Practical Use: If you want fewer hallucinations on domain documents, self-generate training pairs and preference-tune only on samples flagged as unknown; it's effective across small and mid-sized models.

Evidence Ref: Table 2; Figure 2; evaluation uses GPT-4 pairwise LLM-Judge
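To make the win/lose figures easier to compare across model sizes, the short sketch below derives a net margin (win minus lose) and the remaining share for each model. It assumes that whatever is neither a win nor a loss is a tie, which this summary does not state explicitly; `summarize` is an illustrative helper name.

```python
# Win/lose percentages as reported for Wiki-Test (GPT-4 pairwise judge).
wiki_test = {
    "1.1B": {"win": 54.0, "lose": 16.0},
    "7B": {"win": 40.4, "lose": 9.9},
    "13B": {"win": 36.9, "lose": 12.0},
}


def summarize(win: float, lose: float) -> dict:
    """Derive tie share and net margin from win/lose percentages."""
    return {
        # Assumption: win + tie + lose = 100
        "tie": round(100.0 - win - lose, 1),
        "net_margin": round(win - lose, 1),
    }


for size, r in wiki_test.items():
    print(size, summarize(r["win"], r["lose"]))
```

Under that assumption the net margins are 38.0, 30.5, and 24.9 points for the 1.1B, 7B, and 13B models, so the gain over SFT narrows as model size grows while a larger share of comparisons end in ties.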

Selective filtering (consistency + knowledge) produces better or similar downstream accuracy than training on the full preference set.

Numbers: Open LLM leaderboard average acc: 7B Ours 53.3% vs w/o filtering 50.9% (+2.4); 13B Ours 56.6% vs w/o 54.9% (+1.7); 1.1B

Practical Use: Filter preference pairs before DPO to get better truthfulness while avoiding degradation on unrelated tasks; you may need to tune the knowledge threshold per model size.

Evidence Ref: Table 1 (accuracy averages across ARC, HellaSwag, TruthfulQA, Winogrande, MMLU)
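As a quick consistency check on the figures quoted for this finding, the snippet below recomputes the deltas from the reported averages. Only the 7B and 13B baselines are quoted in this finding, so the 1.1B model is left out; the variable names are illustrative.

```python
# Open LLM leaderboard averages (percent) as quoted in this finding.
ours = {"7B": 53.3, "13B": 56.6}
without_filtering = {"7B": 50.9, "13B": 54.9}

# Delta in percentage points, rounded to one decimal like the reported values.
deltas = {size: round(ours[size] - without_filtering[size], 1) for size in ours}

for size, delta in deltas.items():
    print(f"{size}: {ours[size]} vs {without_filtering[size]} -> {delta:+.1f} points")
```

The recomputed deltas (+2.4 and +1.7 points) match the reported ones.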

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
LLM-Judge pairwise win rate (GPT-4)	1.1B: 54% win / 16% lose; 7B: 40.4% win / 9.9% lose; 13B: 36.9% win / 12% lose	SFT (instruct-tuned)	Improved win rates over SFT on held-out Wiki questions	Wiki-Test (200 questions)	Table 2; Figure 2	Table 2, Figure 2
Accuracy	1.1B Ours 42.1%, 7B Ours 53.3%, 13B Ours 56.6%	w/o filtering (preference tuning on full set) and SFT	1.1B +2.2 vs w/o filtering; 7B +2.4; 13B +1.7	ARC, HellaSwag, TruthfulQA, Winogrande, MMLU (mixed)	Table 1 (accuracy per task and averages)	Table 1

What To Try In 7 Days

Pick 1–2 target topics and sample ~100 documents each from a trusted source (e.g., Wikipedia).

Use an instructed LLM (GPT-3.5) to auto-generate instructions and a few-shot SFT pass to create baseline outputs.

Compute contradiction scores (NLI encoder or SelfCheckGPT) across K=5–10 samples per prompt and flag "unknown" samples where consistency is low and knowledge signal is weak; set τ_

Optimization Features

Training Optimization

Selective preference tuning (DPO) on filtered high-signal pairs
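For readers unfamiliar with DPO, the sketch below computes the standard per-pair DPO loss from sequence log-probabilities under the policy and a frozen reference model. It is a plain-Python illustration, not the paper's training code; `beta=0.1` is an assumed value rather than one reported here, and the log-probabilities are made-up numbers.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin)))."""
    chosen_margin = logp_chosen - ref_logp_chosen        # policy vs reference, preferred answer
    rejected_margin = logp_rejected - ref_logp_rejected  # policy vs reference, dispreferred answer
    z = beta * (chosen_margin - rejected_margin)
    # Numerically stable -log(sigmoid(z)), i.e. softplus(-z)
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy already favours the preferred answer: the loss drops below log(2).
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

In a selective setup, this loss would be averaged only over pairs that survive filtering, so gradient signal is spent on prompts flagged as unknown rather than on the full preference set.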

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/wj210/Self-Training-LLM

Risks & Boundaries

Limitations

Single-iteration self-training only; no multi-round continual tuning experiments.

Experiments limited to 10 Wikipedia topics; model may have seen some content during pretraining.

When Not To Use

In high-stakes domains without human verification (medical, legal) where hallucinations must be strictly audited.

If you have ample high-quality human-labeled data—human labels may be safer than self-generated pairs.

Failure Modes

DPO degeneration when preferred and dispreferred answers differ only marginally, producing noisy gradients.

False positives in knowledge detection (misclassifying unfamiliar task formatting as unknown) leading to over-filtering.

Core Entities

Models

TinyLlama-1.1B, Llama2-7B, Llama2-13B

Metrics

LLM-Judge pairwise ranking (GPT-4), Accuracy

Datasets

Wikipedia (source), Wiki-Test (200 questions), SFT, D_DPO (preference candidates), D* (filtered preference set)

Benchmarks

Open LLM leaderboard, ARC, HellaSwag, TruthfulQA, Winogrande, MMLU

