Overview
The survey synthesizes many existing methods into a clear framework. Techniques vary in maturity, but several (sampling plus aggregation, self-evaluation prompts, contrastive decoding) can be put into practice quickly.
Citations: 6
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Self-Feedback techniques can reduce contradictory or hallucinated outputs without large model scale-ups, improving reliability for customer-facing QA and code assistants.
Who Should Care
Summary TLDR
This survey reframes LLM failures (bad reasoning, hallucinations) as problems of internal consistency across three layers: latent (internal states), decoding (token choice), and response (final text). It proposes a compact Self-Feedback framework—Self-Evaluation (measure consistency) + Self-Update (change response or model)—and organizes the literature into lines of work (decoding, latent interventions, iterative refinement, multi-agent debate, distillation, etc.). The paper reports concrete signals and methods (uncertainty/confidence scores, self-critique, contrastive decoding, probing attention heads) and argues Self-Feedback usually improves internal consistency but does not automatically
Problem Statement
Large language models often give inconsistent or confident-but-wrong outputs. These failures arise from sampling variability, weak latent reasoning, and noisy decoding. The paper argues that we need a unified view, "internal consistency" across the latent, decoding, and response layers, together with practical methods that let models evaluate and update themselves rather than relying on scale alone.
Main Contribution
Proposes internal consistency as a single lens for reasoning elevation and hallucination reduction.
Defines a compact Self-Feedback framework: Self-Evaluation (get signals) + Self-Update (rewrite or fine-tune).
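The two-step framework above can be read as a simple loop. The following is a minimal sketch, not the survey's implementation: `generate`, `evaluate`, and `revise` are hypothetical callables standing in for a model call, a consistency scorer, and a response rewriter, and `threshold` and `max_rounds` are illustrative parameters.

```python
from typing import Callable, Tuple


def self_feedback(
    prompt: str,
    generate: Callable[[str], str],            # hypothetical: produces a first response
    evaluate: Callable[[str, str], float],     # hypothetical: consistency score in [0, 1]
    revise: Callable[[str, str, float], str],  # hypothetical: rewrites a weak response
    threshold: float = 0.8,                    # assumed acceptance level
    max_rounds: int = 3,                       # assumed cap on update rounds
) -> Tuple[str, float]:
    """Alternate Self-Evaluation and Self-Update until the score is acceptable."""
    response = generate(prompt)
    score = evaluate(prompt, response)          # Self-Evaluation: get a signal
    for _ in range(max_rounds):
        if score >= threshold:
            break
        response = revise(prompt, response, score)  # Self-Update: rewrite the response
        score = evaluate(prompt, response)
    return response, score
```

This sketch covers only the "rewrite" branch of Self-Update. The fine-tuning branch would replace `revise` with a training step and is not shown.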
Key Findings
Self-consistency-style sampling plus majority voting can raise reasoning accuracy on math benchmarks.
Even strong models can produce internal contradictions during generation.
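As an illustration of the sampling-plus-majority-voting finding, here is a minimal sketch. It assumes a hypothetical `sample_answer` callable that runs the model once with a non-zero temperature and returns only the final answer string. The agreement ratio it returns can also serve as a rough consistency signal.

```python
from collections import Counter
from typing import Callable, Tuple


def self_consistency(
    sample_answer: Callable[[], str],  # hypothetical: one stochastic model run -> final answer
    n_samples: int = 10,               # assumed sample budget
) -> Tuple[str, float]:
    """Sample several answers and return the most common one with its vote share."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [sample_answer().strip() for _ in range(n_samples)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n_samples
```

A low vote share (for example, a winner backed by only a third of the samples) suggests the answer should be treated as low-confidence rather than accepted outright.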
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | ≈ +17.9% (Self-Consistency) | — | — | GSM8K (reasoning benchmark) | Survey cites Self-Consistency experiments showing ~17.9% improvement | [2] |
| Induced self-contradiction rate | ≈ 15.7% (GPT-4) | — | — | Open-generation/hallucination tests | Survey reports that Mündler et al. induced self-contradictions at a rate of 15.7% in GPT-4 | [18] |
What To Try In 7 Days
Sample multiple model runs and apply majority or scoring aggregation to see accuracy gains on your QA tasks.
Add a simple self-evaluation prompt (ask the model whether its answer is True or False) to get a confidence signal.
Test contrastive decoding or logit adjustments on one generation pipeline to reduce obvious hallucinations.
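For the contrastive decoding item, the core scoring rule can be sketched for a single decoding step. This is a simplified illustration under stated assumptions: the two dictionaries are hypothetical per-token log-probabilities from a stronger "expert" model and a weaker "amateur" model over the same vocabulary, and `alpha` is an assumed plausibility cutoff. A real pipeline would apply this to full logit tensors at every step.

```python
import math
from typing import Dict


def contrastive_pick(
    expert_logprobs: Dict[str, float],   # hypothetical: token -> log p_expert
    amateur_logprobs: Dict[str, float],  # hypothetical: token -> log p_amateur (same keys)
    alpha: float = 0.1,                  # assumed plausibility cutoff
) -> str:
    """Pick the token with the largest expert-minus-amateur log-probability gap."""
    # Keep only tokens the expert finds plausible: p_expert >= alpha * max p_expert.
    cutoff = max(expert_logprobs.values()) + math.log(alpha)
    candidates = [t for t, lp in expert_logprobs.items() if lp >= cutoff]
    # Among plausible tokens, prefer those the expert likes much more than the amateur.
    return max(candidates, key=lambda t: expert_logprobs[t] - amateur_logprobs[t])
```

The plausibility cutoff matters: without it, a rare token that both models consider unlikely can win purely because the ratio between two tiny probabilities happens to be large.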
Risks & Boundaries
Limitations
Focuses on internal consistency only; interactions with retrieval/RAG and external evidence are largely out of scope
The survey emphasizes methods that use model-in-the-loop feedback; human-in-the-loop and hybrid systems are less covered
When Not To Use
For tasks that require fresh external facts not in the training corpus (use RAG/external retrieval instead)
When you cannot afford the compute of sampling multiple runs or multiple agents
Failure Modes
Model self-evaluators can give false positives, letting wrong answers pass with high confidence
Self-Feedback amplifies erroneous corpus priors if the training data is biased

