Use an LLM to spot its own factual claims and auto-check them against Wikidata to cut hallucinations

Overview

Decision Snapshot: Needs Validation

KGR is practical: it runs on prompts and a public KG and shows consistent F1 gains on the evaluated QA benchmarks. It depends, however, on good entity linking and triple filtering, both of which need engineering work.

Citations: 10

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Xinyan Guan, Yanjiang Liu, Hongyu Lin, Yaojie Lu, Ben He, Xianpei Han, Le Sun

Links

Abstract / PDF / Data

Why It Matters For Business

KGR can reduce factual errors in model outputs, especially for multi-step reasoning tasks, lowering risk in customer-facing answers and automated reporting without retraining large models.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

The paper introduces KGR, an automated loop that (1) extracts atomic factual claims from an LLM's draft answer, (2) finds related facts in a knowledge graph (Wikidata), (3) verifies claims, and (4) asks the LLM to retrofit its answer. KGR runs the whole cycle with the LLM (few-shot prompts) and chunked KG triples. On three QA benchmarks (SimpleQuestion, Mintaka, HotpotQA) and three LLMs (ChatGPT, text-davinci-003, Vicuna 13B), KGR improves factual scores—especially on complex, multi-hop problems—by systematically checking facts used during reasoning rather than only query-related facts.
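The summary does not say how the related Wikidata facts are fetched. As one possible illustration, the sketch below builds a query for the public Wikidata SPARQL endpoint that lists the direct statements of a single entity. The helper names `build_triple_query` and `build_request_url` are hypothetical, the entity ID `Q42` is only a sample, and nothing here is taken from the paper's implementation. The code only constructs the request and does not send it.

```python
import re
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint; the wd: prefix is predefined there.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_triple_query(qid: str, limit: int = 50) -> str:
    """Return a SPARQL query listing direct (truthy) statements about one entity.

    Hypothetical helper for illustration; not the paper's retrieval code.
    """
    if not re.fullmatch(r"Q[1-9]\d*", qid):
        raise ValueError(f"not a Wikidata item ID: {qid!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "SELECT ?p ?o WHERE { "
        f"wd:{qid} ?p ?o . "
        # Keep only direct-claim predicates, dropping labels, sitelinks, etc.
        'FILTER(STRSTARTS(STR(?p), "http://www.wikidata.org/prop/direct/")) '
        f"}} LIMIT {limit}"
    )


def build_request_url(qid: str, limit: int = 50) -> str:
    """Return a GET URL asking the endpoint for JSON results."""
    params = {"query": build_triple_query(qid, limit), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)


query = build_triple_query("Q42", limit=5)
url = build_request_url("Q42", limit=5)
```

The `limit` argument is the simplest lever on how many triples reach the later fact-selection step, which matters because popular entities can carry hundreds of statements.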

Problem Statement

Large language models often state false facts during multi-step reasoning. Previous KG-augmentation only retrieves facts tied to entities in the user query, so it misses false intermediate facts that appear in the model's reasoning. The paper asks: can we automatically extract the model's internal factual claims, verify them against a knowledge graph, and edit responses to reduce hallucinations?

Main Contribution

KGR: a 5-step, LLM-driven pipeline (claim extraction, entity detection and KG retrieval, fact selection, claim verification, retrofitting) that checks and revises model-generated facts.

An implementation that uses only LLM prompts plus Wikidata (no extra supervised models) and supports iterative multi-turn retrofitting.
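The control flow of such a loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: every stage is passed in as a plain callable (in KGR these would be few-shot LLM prompts and a Wikidata lookup), and the names `kgr_retrofit`, `Verdict`, and the `toy_*` stand-ins are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class Verdict:
    claim: str
    supported: bool
    evidence: List[Triple]


def kgr_retrofit(
    draft: str,
    extract_claims: Callable[[str], List[str]],
    detect_entities: Callable[[str], List[str]],
    retrieve_triples: Callable[[str], List[Triple]],
    select_facts: Callable[[str, List[Triple]], List[Triple]],
    verify: Callable[[str, List[Triple]], bool],
    retrofit: Callable[[str, List[Verdict]], str],
    max_turns: int = 1,
) -> str:
    """Check every claim in the answer against KG facts, then revise the answer."""
    answer = draft
    for _ in range(max_turns):
        verdicts = []
        for claim in extract_claims(answer):
            triples = [t for e in detect_entities(claim) for t in retrieve_triples(e)]
            evidence = select_facts(claim, triples)
            verdicts.append(Verdict(claim, verify(claim, evidence), evidence))
        if all(v.supported for v in verdicts):
            break  # nothing left to fix
        answer = retrofit(answer, verdicts)
    return answer


# Toy stand-ins so the sketch runs without an LLM or network access.
KG = {"Paris": [("Paris", "capital of", "France")]}


def toy_extract(answer: str) -> List[str]:
    return [s.strip() for s in answer.split(".") if s.strip()]


def toy_entities(claim: str) -> List[str]:
    return [e for e in KG if e in claim]


def toy_verify(claim: str, evidence: List[Triple]) -> bool:
    return any(obj in claim for _, _, obj in evidence)


def toy_retrofit(answer: str, verdicts: List[Verdict]) -> str:
    for v in verdicts:
        if not v.supported and v.evidence:
            s, p, o = v.evidence[0]
            answer = answer.replace(v.claim, f"{s} is the {p} {o}")
    return answer


result = kgr_retrofit(
    "Paris is the capital of Germany.",
    toy_extract, toy_entities, lambda e: KG.get(e, []),
    lambda claim, triples: triples[:5], toy_verify, toy_retrofit,
    max_turns=2,
)
print(result)
```

Keeping each stage behind a callable makes it easy to log the intermediate claims and evidence, and to swap a prompt for a cheaper heuristic when measuring where failures originate.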

Key Findings

KGR raises ChatGPT F1 on Mintaka (complex reasoning) by about 6.2 points over question-relevant KG retrieval (QKR).

Numbers: ChatGPT Mintaka F1: QKR 54.6 -> KGR 60.8 (+6.2)

Practical Use: If your use case involves multi-step reasoning, retrofit model answers with KG checks to materially improve factual correctness on evaluated benchmarks.

Evidence Ref: Table 1

KGR yields large gains for text-davinci-003 on open-domain multi-hop HotpotQA: F1 +15.3 points over QKR.

Numbers: text-davinci-003 HotpotQA F1: QKR 31.9 -> KGR 47.2 (+15.3)

Practical Use: Using KG-based retrofitting can correct chained factual errors that web-IR editing misses, giving big accuracy wins on multi-hop questions.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ChatGPT Mintaka F1 (KGR vs QKR)	60.8 (KGR)	54.6 (QKR)	+6.2	Mintaka (complex reasoning)	Table 1 reports EM/F1 for ChatGPT across methods	Table 1
text-davinci-003 HotpotQA F1 (KGR vs QKR)	47.2 (KGR)	31.9 (QKR)	+15.3	HotpotQA (open-domain multi-hop)	Large F1 jump when retrofitting model answers with Wikidata evidence	Table 1
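The table reports EM and F1 without saying how they are computed. The sketch below assumes the token-overlap definition commonly used for extractive QA (lowercasing, stripping punctuation and English articles, then comparing token multisets); the paper's exact normalization may differ. The function names are illustrative. The last two lines reproduce the table's deltas from its own values.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against one gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        return float(p == g)  # both empty counts as a match
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


# Deltas recomputed from the values in the table above.
mintaka_delta = round(60.8 - 54.6, 1)
hotpot_delta = round(47.2 - 31.9, 1)
```

Scores are usually averaged over all questions and reported as percentages, which is how values such as 60.8 arise.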

What To Try In 7 Days

Run KGR-style retrofitting on a small sample of your LLM outputs using Wikidata to measure F1 or precision gains.

Add a claim-extraction prompt to your pipeline and log extracted claims to quantify where the model hallucinates.

Test chunk size and retrieved-triple limits to find a cost-accuracy sweet spot for fact selection.
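For the chunk-size experiment, a harness along these lines may help. It is a sketch under assumptions: `pick` stands in for whatever LLM prompt selects relevant triples from one chunk, and `chunk_size` and `max_selected` are hypothetical knob names, not parameters from the paper. The function returns the number of `pick` calls made so cost can be logged next to accuracy.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def chunk(triples: Sequence[Triple], size: int) -> Iterator[List[Triple]]:
    """Yield consecutive slices of at most `size` triples."""
    if size <= 0:
        raise ValueError("size must be positive")
    for i in range(0, len(triples), size):
        yield list(triples[i:i + size])


def select_facts(
    claim: str,
    triples: Sequence[Triple],
    pick: Callable[[str, List[Triple]], List[Triple]],
    chunk_size: int = 20,
    max_selected: int = 10,
) -> Tuple[List[Triple], int]:
    """Run `pick` chunk by chunk; return (selected triples, number of pick calls)."""
    selected: List[Triple] = []
    calls = 0
    for block in chunk(triples, chunk_size):
        calls += 1
        for t in pick(claim, block):
            # Keep only triples that really were in the chunk, without duplicates.
            if t in block and t not in selected:
                selected.append(t)
        if len(selected) >= max_selected:
            break  # stop early once the cap is reached
    return selected[:max_selected], calls


# Toy selector: keep triples whose predicate is mentioned in the claim.
def toy_pick(claim: str, block: List[Triple]) -> List[Triple]:
    words = claim.split()
    return [t for t in block if t[1] in words]


triples = [("Q1", f"p{i}", f"o{i}") for i in range(45)]
facts, calls = select_facts("p3 p25 p44", triples, toy_pick, chunk_size=20)
```

Smaller chunks mean more calls but shorter prompts; sweeping `chunk_size` and `max_selected` while recording `calls` gives the cost side of the trade-off.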

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://www.wikidata.org/

https://wikipedia.org/

SimpleQuestion, Mintaka, HotpotQA (public datasets referenced in paper)

Risks & Boundaries

Limitations

Relies on KG coverage: facts not in Wikidata remain unverifiable.

Entity detection and fact selection are error-prone and drive most failures.

When Not To Use

Low-latency applications where extra KG checks break SLAs.

Domains lacking a structured KG or with mostly private facts.

Failure Modes

Wrong or overly broad entity detection returns irrelevant triples and prevents correct verification.

Fact selection includes noisy triples, causing incorrect verification signals.

Core Entities

Models

ChatGPT (gpt-3.5-turbo-0301), text-davinci-003, Vicuna-13B

Metrics

EM, F1

Datasets

SimpleQuestion, Mintaka, HotpotQA

Benchmarks

SimpleQuestion, Mintaka, HotpotQA
