A unified survey that frames reasoning and hallucination as internal consistency problems and presents a Self-Feedback framework

Overview

Decision Snapshot: Needs Validation

The survey synthesizes many existing methods into a clear framework. Techniques vary in maturity, but several (sampling plus aggregation, self-evaluation prompts, contrastive decoding) can be put into practice quickly.

Citations: 6

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Xun Liang, Shichao Song, Zifan Zheng, Hanyu Wang, Qingchen Yu, Xunkai Li, Rong-Hua Li, Yi Wang, Zhonghao Wang, Feiyu Xiong, Zhiyu Li

Why It Matters For Business

Self-Feedback techniques can reduce contradictory or hallucinated outputs without large model scale-ups, improving reliability for customer-facing QA and code assistants.

Who Should Care

ML Engineer, Product Manager, Engineering Lead, Data Scientist

Summary TLDR

This survey reframes LLM failures (bad reasoning, hallucinations) as problems of internal consistency across three layers: latent (internal states), decoding (token choice), and response (final text). It proposes a compact Self-Feedback framework—Self-Evaluation (measure consistency) + Self-Update (change response or model)—and organizes the literature into lines of work (decoding, latent interventions, iterative refinement, multi-agent debate, distillation, etc.). The paper reports concrete signals and methods (uncertainty/confidence scores, self-critique, contrastive decoding, probing attention heads) and argues Self-Feedback usually improves internal consistency but does not automatically

Problem Statement

Large language models often give inconsistent or confident-but-wrong outputs. These failures arise from sampling variability, weak latent reasoning, and noisy decoding. The paper argues we need a unified view ("internal consistency" across latent, decoding, and response layers) and practical methods that let models evaluate and update themselves without relying on scale alone.

Main Contribution

Proposes internal consistency as a single lens for reasoning elevation and hallucination reduction.

Defines a compact Self-Feedback framework: Self-Evaluation (get signals) + Self-Update (rewrite or fine-tune).
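The two-step framework above can be sketched as a small loop. This is a minimal illustration and not the survey's implementation: `generate`, `evaluate`, and `update` are hypothetical callables you would back with your own model calls, and the `threshold` and `max_rounds` defaults are arbitrary assumptions.

```python
from typing import Callable, Tuple


def self_feedback(
    prompt: str,
    generate: Callable[[str], str],            # hypothetical: prompt -> response
    evaluate: Callable[[str, str], float],     # hypothetical: (prompt, response) -> score in [0, 1]
    update: Callable[[str, str, float], str],  # hypothetical: (prompt, response, score) -> revised response
    threshold: float = 0.8,                    # assumed cut-off, tune per task
    max_rounds: int = 3,                       # assumed budget for revisions
) -> Tuple[str, float]:
    """Alternate Self-Evaluation and Self-Update until the score passes or the budget runs out."""
    response = generate(prompt)
    score = evaluate(prompt, response)  # Self-Evaluation: get a consistency signal
    for _ in range(max_rounds):
        if score >= threshold:
            break
        response = update(prompt, response, score)  # Self-Update: rewrite the response
        score = evaluate(prompt, response)          # re-measure after the rewrite
    return response, score
```

Here Self-Update only rewrites the response; the fine-tuning variant mentioned above would replace `update` with a training step and is outside this sketch.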

Key Findings

Self-consistency-style sampling plus majority voting can raise reasoning accuracy on math benchmarks.

Numbers: GSM8K accuracy up ≈ 17.9%

Practical Use: If you can sample multiple chain-of-thought runs, use aggregation (majority or scoring) to boost QA accuracy on similar datasets.

Evidence Ref: [2] Self-Consistency; GSM8K result reported in survey
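A minimal sketch of the aggregation step, assuming the final answer has already been extracted from each sampled chain-of-thought run as a string. The function name `majority_vote` is illustrative, and the returned vote share is only a rough agreement signal, not a calibrated confidence.

```python
from collections import Counter
from typing import List, Optional, Tuple


def majority_vote(answers: List[Optional[str]]) -> Tuple[Optional[str], float]:
    """Return the most common non-empty answer and its share of the valid votes."""
    # Drop runs where no answer could be extracted, and normalize whitespace.
    valid = [a.strip() for a in answers if a is not None and a.strip()]
    if not valid:
        return None, 0.0
    answer, count = Counter(valid).most_common(1)[0]
    return answer, count / len(valid)


# Example: five sampled runs, one of which produced no parseable answer.
winner, share = majority_vote(["18", "18", "20", None, " 18 "])
```

In this example `winner` is `"18"` with a vote share of 0.75. A low share suggests the samples disagree and the question may deserve a fallback or human review.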

Even strong models can produce internal contradictions during generation.

Numbers: GPT-4 self-contradiction rate ≈ 15.7%

Practical Use: Detect self-contradiction via multiple-sample checks to catch hallucinations that single outputs miss.

Evidence Ref: [18] Self-Contradict experiments cited in survey
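One way to run a multiple-sample check is to compare every pair of sampled responses with a contradiction detector. The sketch below is an assumption-laden illustration, not the procedure used in the cited experiments: `contradicts` is a hypothetical callable (for instance a wrapper around an NLI model or an LLM judge) that you must supply.

```python
from itertools import combinations
from typing import Callable, List, Tuple


def contradiction_rate(
    samples: List[str],
    contradicts: Callable[[str, str], bool],  # hypothetical detector: does a contradict b?
) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the fraction of sample pairs flagged as contradictory, plus the flagged index pairs."""
    pairs = list(combinations(range(len(samples)), 2))
    if not pairs:
        return 0.0, []
    flagged = [
        (i, j)
        for i, j in pairs
        # Check both directions because a detector may be asymmetric.
        if contradicts(samples[i], samples[j]) or contradicts(samples[j], samples[i])
    ]
    return len(flagged) / len(pairs), flagged
```

The pairwise loop costs O(n²) detector calls for n samples, so small sample counts are the practical regime.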

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	≈ +17.9% (Self-Consistency)	—	—	GSM8K (reasoning benchmark)	Survey cites Self-Consistency experiments showing ~17.9% improvement	[2]
Self-contradiction rate induced	≈ 15.7% (GPT-4)	—	—	Open-generation/hallucination tests	Survey reports Mündler et al. induced self-contradictions at 15.7% in GPT-4	[18]

What To Try In 7 Days

Sample multiple model runs and apply majority or scoring aggregation to see accuracy gains on your QA tasks.

Add a simple self-evaluation prompt (ask model if its answer is True/False) to get a confidence signal.

Test contrastive decoding or logit adjustments on one generation pipeline to reduce obvious hallucinations.
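The second and third items can be prototyped on raw token log-probabilities before touching a real pipeline. The sketch below makes two assumptions: that your model API exposes log-probabilities for the "True" and "False" answer tokens, and that "contrastive decoding" means the expert/amateur formulation (score each token by the expert's log-probability minus a weaker model's, restricted to tokens the expert finds plausible). All names and the `alpha` default are illustrative.

```python
import math
from typing import Dict


def p_true(logprob_true: float, logprob_false: float) -> float:
    """Turn the log-probabilities of the 'True' and 'False' tokens into a normalized P(True)."""
    m = max(logprob_true, logprob_false)  # subtract the max for numerical stability
    e_true = math.exp(logprob_true - m)
    e_false = math.exp(logprob_false - m)
    return e_true / (e_true + e_false)


def contrastive_scores(
    expert_logprobs: Dict[str, float],   # next-token log-probs from the stronger model
    amateur_logprobs: Dict[str, float],  # next-token log-probs from the weaker model
    alpha: float = 0.1,                  # assumed plausibility cut-off
) -> Dict[str, float]:
    """Score plausible tokens by log p_expert - log p_amateur."""
    # Plausibility constraint: keep tokens with p_expert >= alpha * max p_expert.
    cutoff = max(expert_logprobs.values()) + math.log(alpha)
    return {
        tok: lp - amateur_logprobs[tok]
        for tok, lp in expert_logprobs.items()
        if lp >= cutoff and tok in amateur_logprobs
    }


# Toy next-token distributions (made-up numbers, for illustration only).
expert = {"Paris": math.log(0.5), "London": math.log(0.4), "xyz": math.log(0.01)}
amateur = {"Paris": math.log(0.2), "London": math.log(0.4), "xyz": math.log(0.001)}
scores = contrastive_scores(expert, amateur)
best = max(scores, key=scores.get)
```

In the toy data the plausibility cut-off removes "xyz", which would otherwise win on the raw log-ratio despite being very unlikely under the expert, and `best` is "Paris".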

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/IAAR-Shanghai/ICSFSurvey

Risks & Boundaries

Limitations

Focuses on internal consistency only; interactions with retrieval/RAG and external evidence are largely out of scope

Survey emphasizes methods that adopt model-in-the-loop feedback; human-in-loop and hybrid systems are less covered

When Not To Use

For tasks that require fresh external facts not in the training corpus (use RAG/external retrieval instead)

When you cannot afford the compute of sampling multiple runs or multiple agents

Failure Modes

Model self-evaluators give false positives and overconfident pass-throughs

Self-Feedback amplifies erroneous corpus priors if the training data is biased

Core Entities

Models

GPT-4o, GPT-4, Llama3-8B-Instruct, GPT-4-based verifiers, teacher/student LLM pairs (e.g., GPT-4 -> student)

Metrics

Accuracy, self-contradiction rate, expected calibration error (ECE), Brier score, consistency measures (entropy/variance of samples)
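For reference, three of these metrics can be computed in a few lines. This is a sketch under stated assumptions: confidences are in [0, 1], outcomes are 1 for a correct answer and 0 otherwise, ECE uses equal-width bins (the bin count of 10 is a common but arbitrary choice), and entropy is taken over the empirical distribution of sampled answers.

```python
import math
from collections import Counter
from typing import List, Sequence


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)


def expected_calibration_error(
    confidences: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width confidence bins."""
    bins: List[list] = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((c, o))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(o for _, o in b) / len(b)
            ece += len(b) / n * abs(avg_conf - accuracy)
    return ece


def answer_entropy(answers: Sequence[str]) -> float:
    """Shannon entropy (in nats) of the sampled answers; 0 means all samples agree."""
    n = len(answers)
    return -sum((k / n) * math.log(k / n) for k in Counter(answers).values())
```

Lower is better for all three: Brier score and ECE measure how well confidence tracks correctness, while answer entropy measures disagreement across samples.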

Datasets

GSM8K, TruthfulQA, MMLU, HumanEval, MATH

Benchmarks

ConsisEval, LLM-Uncertainty-Bench, UBench, CriticBench, BECEL, PopQA-TP

Context Entities

Models

NLI models used for scoring, external teacher models used in distillation

Metrics

majority-vote gain (e.g., +17.9% on GSM8K reported for Self-Consistency), self-contradiction percentage (e.g., 15.7%)

Datasets

Consis-style paraphrase sets, benchmarks for uncertainty/IDK

Benchmarks

MMLU, BBH, ARC (common eval benchmarks referenced)
