Use an LLM to spot its own factual claims and auto-check them against Wikidata to cut hallucinations

November 22, 2023 | 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

10

Authors

Xinyan Guan, Yanjiang Liu, Hongyu Lin, Yaojie Lu, Ben He, Xianpei Han, Le Sun

Why It Matters For Business

KGR can reduce factual errors in model outputs, especially for multi-step reasoning tasks, lowering risk in customer-facing answers and automated reporting without retraining large models.

Summary TLDR

The paper introduces KGR, an automated loop that (1) extracts atomic factual claims from an LLM's draft answer, (2) finds related facts in a knowledge graph (Wikidata), (3) verifies the claims against those facts, and (4) asks the LLM to retrofit its answer. The LLM itself runs every step through few-shot prompts, and retrieved KG triples are processed in chunks. On three QA benchmarks (SimpleQuestion, Mintaka, HotpotQA) and three LLMs (ChatGPT, text-davinci-003, Vicuna-13B), KGR improves factual scores, especially on complex, multi-hop problems, because it checks the facts used during reasoning rather than only the facts related to the query.
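The loop described above can be sketched as a small driver that wires the steps together. This is a minimal illustration, not the authors' implementation: the `KGRSteps` container, the `kgr_loop` function, the `max_turns` cap, and the toy stand-ins for each step are all hypothetical names introduced here. In the paper each step is an LLM few-shot prompt and the knowledge graph is Wikidata; here a three-word toy graph and string matching stand in for both.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class KGRSteps:
    """Hypothetical bundle of step functions (each is an LLM prompt in the paper)."""
    extract_claims: Callable[[str], List[str]]
    retrieve_triples: Callable[[str], List[Triple]]  # entity detection + KG lookup
    select_facts: Callable[[str, List[Triple]], List[Triple]]
    verify: Callable[[str, List[Triple]], str]  # "" if supported, else a correction hint
    retrofit: Callable[[str, List[str]], str]


def kgr_loop(draft: str, steps: KGRSteps, max_turns: int = 3) -> str:
    """Repeat extract -> retrieve -> select -> verify -> retrofit until no hints remain."""
    answer = draft
    for _ in range(max_turns):
        hints = []
        for claim in steps.extract_claims(answer):
            triples = steps.retrieve_triples(claim)
            facts = steps.select_facts(claim, triples)
            hint = steps.verify(claim, facts)
            if hint:
                hints.append(hint)
        if not hints:  # every claim is supported (or unverifiable): stop iterating
            break
        answer = steps.retrofit(answer, hints)
    return answer


# --- Toy stand-ins (assumptions for illustration only) ---
TOY_KG: List[Triple] = [("Paris", "capital of", "France")]


def toy_verify(claim: str, facts: List[Triple]) -> str:
    for s, p, o in facts:
        # The claim mentions the subject and relation but not the KG's object.
        if s in claim and p in claim and o not in claim:
            return f"{s} is the {p} {o}."
    return ""


TOY_STEPS = KGRSteps(
    extract_claims=lambda text: [s.strip() + "." for s in text.split(".") if s.strip()],
    retrieve_triples=lambda claim: [t for t in TOY_KG if t[0] in claim],
    select_facts=lambda claim, triples: triples[:5],
    verify=toy_verify,
    retrofit=lambda answer, hints: " ".join(hints),
)

revised = kgr_loop("Paris is the capital of Germany.", TOY_STEPS)
```

Capping the number of turns is one simple guard against the added latency of iterative retrofitting; the right cap for a real deployment is an open tuning question.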

Problem Statement

Large language models often state false facts during multi-step reasoning. Previous KG-augmentation only retrieves facts tied to entities in the user query, so it misses false intermediate facts that appear in the model's reasoning. The paper asks: can we automatically extract the model's internal factual claims, verify them against a knowledge graph, and edit responses to reduce hallucinations?

Main Contribution

KGR: a five-step, LLM-driven pipeline (claim extraction, entity detection and KG retrieval, fact selection, claim verification, retrofitting) that checks and revises model-generated facts.

An implementation that uses only LLM prompts plus Wikidata (no extra supervised models) and supports iterative multi-turn retrofitting.

Empirical evaluation on three QA datasets and three LLMs showing consistent F1 gains on complex reasoning tasks versus baselines including query-based KG retrieval and web-IR edit methods.

Key Findings

KGR raises ChatGPT F1 on Mintaka (complex reasoning) by about 6.2 points over question-relevant KG retrieval (QKR).

Numbers: ChatGPT Mintaka F1: QKR 54.6 -> KGR 60.8 (+6.2)

KGR yields large gains for text-davinci-003 on open-domain multi-hop HotpotQA: F1 +15.3 points over QKR.

Numbers: text-davinci-003 HotpotQA F1: QKR 31.9 -> KGR 47.2 (+15.3)

KGR also yields modest gains for a smaller open model (Vicuna-13B) across datasets.

Numbers: Vicuna-13B SimpleQuestion F1: QKR 44.0 -> KGR 46.9 (+2.9); HotpotQA F1 +3.0

Entity detection and fact selection are the main failure points in KGR’s pipeline.

Numbers: Error analysis points to entity detection and fact selection as primary causes of failed revisions (Figure 6, Appendix)

Results

ChatGPT Mintaka F1 (KGR vs QKR)

Value: 60.8 (KGR)

Baseline: 54.6 (QKR)

text-davinci-003 HotpotQA F1 (KGR vs QKR)

Value: 47.2 (KGR)

Baseline: 31.9 (QKR)

ChatGPT SimpleQuestion F1 (KGR vs QKR)

Value: 60.7 (KGR)

Baseline: 60.2 (QKR)

Vicuna-13B SimpleQuestion F1 (KGR vs QKR)

Value: 46.9 (KGR)

Baseline: 44.0 (QKR)
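To compare the four results above on a common footing, the short sketch below computes the absolute gain in F1 points and the gain relative to the QKR baseline. The F1 values are copied from this section; the `reported` dictionary and the `gains` helper are names introduced here for illustration.

```python
# (model, dataset) -> (QKR F1, KGR F1), copied from the Results section.
reported = {
    ("ChatGPT", "Mintaka"): (54.6, 60.8),
    ("text-davinci-003", "HotpotQA"): (31.9, 47.2),
    ("ChatGPT", "SimpleQuestion"): (60.2, 60.7),
    ("Vicuna-13B", "SimpleQuestion"): (44.0, 46.9),
}


def gains(baseline: float, value: float) -> tuple:
    """Return (absolute gain in F1 points, relative gain in percent), rounded to 1 decimal."""
    absolute = round(value - baseline, 1)
    relative = round(100 * (value - baseline) / baseline, 1)
    return absolute, relative


summary = {key: gains(qkr, kgr) for key, (qkr, kgr) in reported.items()}

for (model, dataset), (absolute, relative) in summary.items():
    print(f"{model} / {dataset}: +{absolute} F1 points ({relative}% over QKR)")
```

The spread is wide: the single-hop SimpleQuestion result for ChatGPT moves by half a point, while the multi-hop HotpotQA result for text-davinci-003 moves by 15.3 points, which matches the summary's claim that the gains concentrate on complex reasoning.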

Who Should Care

What To Try In 7 Days

Run KGR-style retrofitting on a small sample of your LLM outputs using Wikidata to measure F1 or precision gains.

Add a claim-extraction prompt to your pipeline and log extracted claims to quantify where the model hallucinates.

Test chunk size and retrieved-triple limits to find a cost-accuracy sweet spot for fact selection.
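For the chunk-size experiment, a rough sketch like the following may help estimate cost before running anything. It assumes that fact selection issues one LLM prompt per chunk of triples; that cost model, along with the names `chunk_triples`, `prompt_calls`, `chunk_size`, and `max_triples`, is an assumption made here and is not taken from the paper.

```python
from typing import Iterator, List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def chunk_triples(
    triples: Sequence[Triple], chunk_size: int, max_triples: int
) -> Iterator[List[Triple]]:
    """Cap the retrieved triples, then yield them in fixed-size chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    capped = list(triples[: max(max_triples, 0)])
    for start in range(0, len(capped), chunk_size):
        yield capped[start : start + chunk_size]


def prompt_calls(n_triples: int, chunk_size: int, max_triples: int) -> int:
    """Number of fact-selection prompts per claim under the one-prompt-per-chunk assumption."""
    kept = min(n_triples, max(max_triples, 0))
    return -(-kept // chunk_size)  # ceiling division


# Example: 25 retrieved triples, capped at 20, sent 10 at a time.
triples = [(f"entity_{i}", "related to", f"value_{i}") for i in range(25)]
chunks = list(chunk_triples(triples, chunk_size=10, max_triples=20))
```

Smaller chunks mean more prompts per claim but shorter contexts; a tighter cap lowers cost but may drop the one triple that would have verified the claim.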

Reproducibility

Data Urls

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on KG coverage: facts not in Wikidata remain unverifiable.
  • Entity detection and fact selection are error-prone and drive most failures.
  • Adds latency because it retrieves and verifies multiple claim-level triples and may iterate.
  • Risk of multi-turn drift if retrieval returns inconsistent or noisy evidence.

When Not To Use

  • Low-latency applications where extra KG checks break SLAs.
  • Domains lacking a structured KG or with mostly private facts.
  • When entity linking precision is too poor for your domain.

Failure Modes

  • Wrong or overly broad entity detection returns irrelevant triples and prevents correct verification.
  • Fact selection includes noisy triples, causing incorrect verification signals.
  • KG contains outdated or conflicting facts, leading to incorrect retrofitting.
  • Claim extraction misses the core fact or mis-parses pronouns, stopping verification.
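Because entity detection is a main failure point, it can be useful to inspect which Wikidata candidates a mention actually resolves to. The sketch below builds a request URL for Wikidata's public `wbsearchentities` API action and parses a response of that shape. The helper names and the hand-written `sample_payload` are illustrative assumptions; the paper does its entity detection with LLM prompts, and this is only a debugging aid, not its method. No network call is made here.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_entity_search_url(mention: str, limit: int = 5, language: str = "en") -> str:
    """Build a wbsearchentities URL for a surface mention; a small limit curbs overly broad matches."""
    params = {
        "action": "wbsearchentities",
        "search": mention,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def parse_candidates(payload: Dict) -> List[Tuple[str, str]]:
    """Extract (entity id, label) pairs from a wbsearchentities JSON response."""
    return [(hit["id"], hit.get("label", "")) for hit in payload.get("search", [])]


url = build_entity_search_url("Paris", limit=3)

# Hand-written stand-in for a response body, trimmed to the fields used above.
sample_payload = {"search": [{"id": "Q90", "label": "Paris"}]}
candidates = parse_candidates(sample_payload)
```

Logging the candidate list next to each extracted claim makes it easier to tell a wrong entity apart from a correct entity whose relevant triple was dropped during fact selection.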

Core Entities

Models

  • ChatGPT (gpt-3.5-turbo-0301)
  • text-davinci-003
  • Vicuna-13B

Metrics

  • EM
  • F1
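For readers who want to score their own outputs, here is a minimal sketch of exact match and token-level F1 as they are commonly computed for extractive QA (lowercasing, stripping punctuation and English articles, then comparing token bags). Whether the paper uses exactly this normalization is an assumption; the function names are introduced here.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

F1 gives partial credit when a prediction contains the gold answer plus extra tokens, which is why it is usually reported alongside the stricter EM.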

Datasets

  • SimpleQuestion
  • Mintaka
  • HotpotQA

Benchmarks

  • SimpleQuestion
  • Mintaka
  • HotpotQA