Let LLMs label and correct themselves: filter unknowns, prefer better answers, and reduce hallucinations

June 17, 2024 · 8 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

1

Authors

Wei Jie Yeo, Teddy Ferdinan, Przemyslaw Kazienko, Ranjan Satapathy, Erik Cambria

Why It Matters For Business

You can reduce dependence on human labels and third-party data by having models generate and clean their own training pairs; selective filtering keeps corrections focused and preserves existing capabilities, lowering labeling cost and privacy exposure.

Summary TLDR

The paper presents a low-human-effort self-training pipeline where an LLM generates instructions and labels, then preference-tunes itself only on samples marked as "unknown" by a reference-free consistency check (SelfCheckGPT). Filtering for low-confidence and high-knowledge samples produces a smaller high-signal preference dataset used with Direct Preference Optimization (DPO). Experiments on Wikipedia-derived topics and three model sizes (1.1B, 7B, 13B) show reduced hallucination on held-out Wiki questions and preserved or slightly improved accuracy on out-of-distribution benchmarks. The method needs only one iteration, uses GPT-3.5/GPT-4 for generation and judging, and publicly shares the

Problem Statement

Fine-tuning LLMs needs lots of labeled data and compute. Self-generated data risks hallucinations and noisy training. Naïve preference tuning can degrade earlier knowledge (catastrophic forgetting). The paper asks whether an LLM can safely self-train by detecting its own unknowns and selectively preference-tuning to reduce hallucinations while avoiding forgetting.

Main Contribution

A four-step self-training pipeline: instruction generation, supervised fine-tuning (SFT), preference labeling, and two-stage filtering (consistency + knowledge).

A reference-free knowledge detector based on contradiction scores (SelfCheckGPT / DeBERTa NLI encoder) to mark samples as unknown and to filter preference data.

Evidence that DPO on filtered, high-signal preference pairs reduces hallucination on held-out Wiki-based questions and retains performance on external benchmarks better than unfiltered preference tuning.
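The reference-free detector described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `nli_contradiction` callable stands in for a DeBERTa-style NLI encoder, `toy_nli` is a hypothetical placeholder so the example runs without a model, and treating a score at or above τ_K as "unknown" is an assumption about the threshold direction.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical interface: returns P(contradiction) in [0, 1] for (premise, hypothesis).
NliFn = Callable[[str, str], float]


def contradiction_score(
    answer_sentences: Sequence[str],
    sampled_answers: Sequence[str],
    nli_contradiction: NliFn,
) -> float:
    """Average contradiction probability of each answer sentence against each sample."""
    if not answer_sentences or not sampled_answers:
        raise ValueError("need at least one sentence and one sampled answer")
    per_sentence = [
        mean(nli_contradiction(sample, sentence) for sample in sampled_answers)
        for sentence in answer_sentences
    ]
    return mean(per_sentence)


def is_unknown(score: float, tau_k: float = 0.5) -> bool:
    """Assumption: a high contradiction score means the model does not know the answer."""
    return score >= tau_k


def toy_nli(premise: str, hypothesis: str) -> float:
    """Placeholder scorer (NOT a real NLI model): 0 if the sentence appears in the sample."""
    return 0.0 if hypothesis.lower() in premise.lower() else 1.0


sentences = ["Paris is the capital of France."]
samples = [
    "Paris is the capital of France.",
    "Lyon is the capital of France.",
    "Paris is the capital of France. It is large.",
]
score = contradiction_score(sentences, samples, toy_nli)
print(round(score, 3), is_unknown(score))
```

In practice the placeholder would be swapped for a real NLI model, and the samples would come from several stochastic generations of the same prompt.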

Key Findings

Self-training with filtering increases truthfulness on Wiki-Test (GPT-4 judged pairwise).

Numbers: Wiki-Test wins: 1.1B 54% win vs 16% lose; 7B 40.4% win vs 9.9% lose; 13B 36.9% win vs 12% lose
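To compare the three model sizes on one scale, the win and lose rates can be turned into a net margin. The short calculation below uses only the figures quoted above; treating the remainder (100 minus win minus lose) as ties or otherwise unreported outcomes is an assumption, since that share is not stated here.

```python
# Win/lose percentages on Wiki-Test as quoted above (GPT-4 judged, vs. the SFT baseline).
results = {"1.1B": (54.0, 16.0), "7B": (40.4, 9.9), "13B": (36.9, 12.0)}


def summarize(win: float, lose: float) -> dict:
    """Net margin (win - lose) and the share that is neither a win nor a loss."""
    return {
        "net": round(win - lose, 1),
        "other": round(100.0 - win - lose, 1),  # assumed to be ties / unreported
    }


summary = {model: summarize(win, lose) for model, (win, lose) in results.items()}
for model, row in summary.items():
    print(f"{model}: net {row['net']:+.1f} pts, neither win nor lose {row['other']:.1f}%")
```

The net margin shrinks as model size grows (38.0, 30.5, and 24.9 points), while the share of comparisons that are neither a win nor a loss grows.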

Selective filtering (consistency + knowledge) produces better or similar downstream accuracy than training on the full preference set.

Numbers: Open LLM leaderboard average acc: 7B Ours 53.3% vs w/o filtering 50.9% (+2.4); 13B Ours 56.6% vs w/o 54.9% (+1.7); 1.1B

Raising the knowledge threshold τ_K shrinks the filtered set but raises win rate.

Numbers: D* sizes at τ_K=0.5: 1.1B 4172, 7B 2379, 13B 2234 (original 5780); win rate increases with τ_K (Figure 4).
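The threshold behaviour can be illustrated with a small filter. This is a sketch under stated assumptions rather than the paper's code: the `PreferencePair` fields, the score names, and the rule "keep a pair when both scores are at or above their thresholds" are hypothetical choices made so that raising τ_K shrinks the kept set, as described above.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    label_score: float      # hypothetical stage-1 (consistency) score
    knowledge_score: float  # hypothetical stage-2 (contradiction) score


def filter_pairs(
    pairs: Sequence[PreferencePair], tau_l: float = 0.5, tau_k: float = 0.5
) -> List[PreferencePair]:
    """Keep pairs that pass both thresholds (the direction of each test is an assumption)."""
    return [
        p for p in pairs
        if p.label_score >= tau_l and p.knowledge_score >= tau_k
    ]


# Toy data, invented purely for illustration.
pairs = [
    PreferencePair("q1", "a", "b", label_score=0.8, knowledge_score=0.20),
    PreferencePair("q2", "a", "b", label_score=0.9, knowledge_score=0.55),
    PreferencePair("q3", "a", "b", label_score=0.7, knowledge_score=0.90),
    PreferencePair("q4", "a", "b", label_score=0.3, knowledge_score=0.95),
]

for tau_k in (0.5, 0.8):
    kept = filter_pairs(pairs, tau_l=0.5, tau_k=tau_k)
    print(tau_k, [p.prompt for p in kept])
```

With the toy data, τ_K=0.5 keeps q2 and q3, and τ_K=0.8 keeps only q3, mirroring the trade-off between dataset size and signal quality.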

Including the document context when creating the preferred response improves preference tuning.

Numbers: Win rates with document vs without: 7B w doc 40.4% vs w/o doc 25.4%; 13B w doc 36.9% vs w/o doc 27.5%

Results

Win/lose rate vs. SFT on Wiki-Test (GPT-4 judged pairwise)

Value: 1.1B: 54% win / 16% lose; 7B: 40.4% win / 9.9% lose; 13B: 36.9% win / 12% lose

Baseline: SFT (instruct-tuned)

Accuracy (Open LLM leaderboard average)

Value: 1.1B Ours 42.1%, 7B Ours 53.3%, 13B Ours 56.6%

Baseline: w/o filtering (preference tuning on full set) and SFT

Filtered dataset size D* (after τ_L=0.5 consistency and varying τ_K)

Value: Original 5780 → τ_K=0.5: 1.1B 4172, 7B 2379, 13B 2234

Baseline: No filtering (5780)
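As a quick check on how aggressive the filter is, the retained fraction of the original 5780 pairs can be computed directly from the sizes above. Nothing here goes beyond that arithmetic.

```python
original = 5780
kept = {"1.1B": 4172, "7B": 2379, "13B": 2234}

# Fraction of the original preference set that survives filtering, per model.
retained = {model: n / original for model, n in kept.items()}
dropped = {model: original - n for model, n in kept.items()}

for model in kept:
    print(f"{model}: kept {kept[model]}/{original} = {retained[model]:.1%}, "
          f"dropped {dropped[model]}")
# 1.1B: kept 4172/5780 = 72.2%, dropped 1608
# 7B: kept 2379/5780 = 41.2%, dropped 3401
# 13B: kept 2234/5780 = 38.7%, dropped 3546
```

The larger models keep well under half of the candidate pairs at τ_K=0.5, whereas the 1.1B model keeps roughly 72%.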

Preference labeling ablation (with vs without document)

Value: 7B w doc win 40.4% vs w/o doc 25.4%; 13B w doc 36.9% vs w/o doc 27.5%

Baseline: Preference dataset constructed without document

What To Try In 7 Days

Pick 1–2 target topics and sample ~100 documents each from a trusted source (e.g., Wikipedia).

Use an instruction-tuned LLM (GPT-3.5) to auto-generate instructions, then run a few-shot SFT pass to create baseline outputs.

Compute contradiction scores (NLI encoder or SelfCheckGPT) across K=5–10 samples per prompt and flag "unknown" samples where consistency is low and knowledge signal is weak; set τ_

Optimization Features

Training Optimization

  • Selective preference tuning (DPO) on filtered high-signal pairs
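For readers unfamiliar with DPO, the per-pair objective can be sketched as below. This is the standard DPO loss written for a single preference pair using plain floats. The function name, the argument names, and the default β=0.1 are illustrative assumptions, since the value of β is not given here. A real training run would compute the sequence log-probabilities with the policy and a frozen reference model.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; not taken from the source
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * (chosen_gain - rejected_gain)))."""
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # Numerically stable form of -log(sigmoid(margin)) = log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Shifting probability mass toward the chosen answer lowers the loss.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

Selective tuning then amounts to averaging this loss over the filtered set D* only, rather than over every candidate pair.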

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Single-iteration self-training only; no multi-round continual tuning experiments.
  • Experiments are limited to 10 Wikipedia topics; the models may have seen some of this content during pretraining.
  • Evaluation uses GPT-4 as the judge, which may introduce judge bias.
  • Knowledge detector and thresholds (τ_L, τ_K) need per-model tuning; default 0.5 may not generalize.

When Not To Use

  • In high-stakes domains without human verification (medical, legal) where hallucinations must be strictly audited.
  • If you have ample high-quality human-labeled data, since human labels may be safer than self-generated pairs.
  • When the knowledge source is noisy or ambiguous; the method assumes a mostly reliable reference.

Failure Modes

  • DPO degeneration when preferred and dispreferred answers differ only marginally, producing noisy gradients.
  • False positives in knowledge detection (misclassifying unfamiliar task formatting as unknown) leading to over-filtering.
  • Judge bias from GPT-4 evaluations could overstate gains if judge preferences differ from human users.
  • Pretraining leakage: if documents are in the model's pretraining data, measured gains may overestimate generalization.

Core Entities

Models

  • TinyLlama-1.1B
  • Llama2-7B
  • Llama2-13B

Metrics

  • LLM-Judge pairwise ranking (GPT-4)
  • Accuracy

Datasets

  • Wikipedia (source)
  • Wiki-Test (200 questions)
  • SFT
  • D_DPO (preference candidates)
  • D* (filtered preference set)

Benchmarks

  • Open LLM leaderboard
  • ARC
  • HellaSwag
  • TruthfulQA
  • Winogrande
  • MMLU