Turn an LLM output into a mini knowledge graph, check each fact with an NLI model, and get explainable hallucination flags

July 15, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

6

Authors

Hannah Sansford, Nicholas Richardson, Hermina Petric Maretic, Juba Nait Saada


Why It Matters For Business

GraphEval pinpoints which facts in an LLM output are ungrounded and raises automatic detector accuracy, enabling targeted fixes and cheaper, explainable QA for production systems.

Summary TLDR

GraphEval converts an LLM's generated text into a small knowledge graph (KG) of triples, then checks each triple against the provided grounding context using off-the-shelf NLI (natural language inference) models. This preprocessing step improves balanced accuracy on three summarization hallucination benchmarks (average gain 6.2 points, SE=1.3) and highlights the exact triples that appear ungrounded. It also enables a targeted correction method (GraphCorrect) that keeps corrected outputs closer to the original text than a direct-prompt rewrite while fixing many flagged hallucinations.
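The detection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `NliScorer` interface, the 0.5 threshold, the space-joined rendering of a triple, and the `toy_nli` stand-in are all assumptions made for the example. In practice `nli_score` would wrap a real NLI model such as HHEM.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical interface: (premise, hypothesis) -> consistency score in [0, 1].
NliScorer = Callable[[str, str], float]


@dataclass
class GraphEvalResult:
    has_hallucination: bool  # binary verdict for the whole output
    flagged: List[Triple]    # triples the NLI model judged ungrounded


def graph_eval(
    triples: Sequence[Triple],
    context: str,
    nli_score: NliScorer,
    threshold: float = 0.5,  # assumed cut-off, tune per NLI model
) -> GraphEvalResult:
    """Flag every triple whose consistency score falls below the threshold."""
    flagged = [t for t in triples if nli_score(context, " ".join(t)) < threshold]
    # The output counts as inconsistent if at least one triple is flagged.
    return GraphEvalResult(has_hallucination=bool(flagged), flagged=flagged)


def toy_nli(premise: str, hypothesis: str) -> float:
    """Toy stand-in for an NLI model: 1.0 if every hypothesis word occurs in the premise."""
    premise = premise.lower()
    return 1.0 if all(w in premise for w in hypothesis.lower().split()) else 0.0


context = "Marie Curie was born in Warsaw and won the Nobel Prize in Physics in 1903."
triples = [
    ("Marie Curie", "born in", "Warsaw"),
    ("Marie Curie", "born in", "Paris"),
]
result = graph_eval(triples, context, toy_nli)
print(result.has_hallucination, result.flagged)
```

Returning the flagged triples alongside the binary verdict is what makes the result explainable: a reviewer sees which specific facts failed the check.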

Problem Statement

Detect whether an LLM output contains any factual inconsistencies relative to the explicit context supplied with the prompt (closed-domain hallucination detection). The aim is binary classification (consistent vs. contains >=1 inconsistency) while also surfacing which pieces of the output are ungrounded.
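The benchmark numbers in this summary are balanced accuracy on this binary task. As a reminder of what that metric measures, here is a small sketch using the standard definition (the mean of recall on each class). The function name and the boolean label convention (`True` means "contains an inconsistency") are choices made for this example.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Mean of recall on the positive class and recall on the negative class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = [(bool(t), bool(p)) for t, p in zip(y_true, y_pred)]
    pos = sum(t for t, _ in pairs)          # examples that truly contain an inconsistency
    neg = len(pairs) - pos                  # examples that are truly consistent
    if pos == 0 or neg == 0:
        raise ValueError("need at least one example of each class")
    tp = sum(t and p for t, p in pairs)             # inconsistent, and flagged
    tn = sum((not t) and (not p) for t, p in pairs)  # consistent, and not flagged
    return 0.5 * (tp / pos + tn / neg)


# Illustrative labels only, not data from the paper.
labels = [True, True, False, False, False, False]
preds = [True, False, False, False, False, True]
print(balanced_accuracy(labels, preds))
```

Unlike plain accuracy, this score does not reward a detector for always predicting the majority class, which matters when consistent and inconsistent outputs are unevenly represented.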

Main Contribution

GraphEval: represent an LLM output as a knowledge graph (triples), then feed each triple + context to an NLI model to detect hallucinations and return the inconsistent triples for explainability.

Empirical evidence that adding GraphEval to existing NLI-based detectors improves balanced accuracy on SummEval, QAGS-C, and QAGS-X (weighted avg improvement 6.2 points, SE=1.3).

GraphCorrect: a two-step LLM-based correction pipeline that replaces only flagged triples, producing corrected outputs that keep high similarity to the original text and correct many hallucinations.
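One plausible reading of the two-step correction can be sketched as follows. This is an assumption-laden illustration: `fix_triple` and `rewrite` are hypothetical callables standing in for the two LLM calls, and the toy versions below use a lookup and a string replacement so that the example runs without a model.

```python
from typing import Callable, Sequence, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical stand-ins for the two LLM calls.
FixTriple = Callable[[Triple, str], Triple]     # (flagged triple, context) -> corrected triple
Rewrite = Callable[[str, Triple, Triple], str]  # (text, old triple, new triple) -> edited text


def graph_correct(
    output: str,
    context: str,
    flagged: Sequence[Triple],
    fix_triple: FixTriple,
    rewrite: Rewrite,
) -> str:
    """Correct only the flagged triples and leave the rest of the text untouched."""
    corrected = output
    for old in flagged:
        new = fix_triple(old, context)  # step 1: repair the triple against the context
        if new != old:
            corrected = rewrite(corrected, old, new)  # step 2: patch that fact into the text
    return corrected


def toy_fix(triple: Triple, context: str) -> Triple:
    """Toy fixer: swaps in a known-good object for the 'born in' relation."""
    subject, relation, obj = triple
    if relation == "born in" and "Warsaw" in context:
        return (subject, relation, "Warsaw")
    return triple


def toy_rewrite(text: str, old: Triple, new: Triple) -> str:
    """Toy rewriter: replaces the old object string with the new one."""
    return text.replace(old[2], new[2])


context = "Marie Curie was born in Warsaw and won the Nobel Prize in Physics in 1903."
output = "Marie Curie was born in Paris. She won the Nobel Prize in Physics."
flagged = [("Marie Curie", "born in", "Paris")]
fixed = graph_correct(output, context, flagged, toy_fix, toy_rewrite)
print(fixed)
```

Because only flagged triples are touched, the unflagged sentence survives unchanged, which is the property that keeps the corrected output similar to the original.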

Key Findings

Adding GraphEval to NLI-based detectors raises balanced accuracy on three summarization benchmarks.

Numbers: avg +6.2 balanced-accuracy points (SE=1.3) across SummEval, QAGS-C, QAGS-X

GraphEval + TrueTeacher improved SummEval balanced accuracy from 74.9 to 79.2.

Numbers: 74.9 -> 79.2 (SummEval, TrueTeacher)

GraphCorrect produces corrected summaries that are closer to the original outputs than a simple direct prompt baseline.

Numbers: HHEM+GraphEval SummEval ROUGE-1: 0.827 (Direct Prompt) -> 0.915 (GraphCorrect)

GraphCorrect often yields a higher percentage of hallucinations believed corrected than a direct-prompt baseline.

Numbers: HHEM+GraphEval QAGS-C: 38.5% (Direct Prompt) -> 58.7% (GraphCorrect) believed corrected
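The ROUGE-1 figures above measure unigram overlap between the corrected output and the original output. The sketch below is a simplified ROUGE-1 F1 for intuition only: it lowercases and splits on word characters with no stemming, so its values will differ from those of a standard ROUGE package, and this summary does not say which ROUGE-1 variant the paper reports.

```python
import re
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between two texts."""
    ref = Counter(re.findall(r"\w+", reference.lower()))
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    overlap = sum((ref & cand).values())  # Counter intersection clips repeated words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# reference = original LLM output, candidate = corrected output
original = "the cat sat on the mat"
corrected = "the cat lay on the mat"
print(rouge1_f1(original, corrected))
```

A score near 1.0 means the correction changed few words, which is the behaviour a targeted fix is meant to show.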

Results

Accuracy

Value: 66.0 -> 71.5 (HHEM + GraphEval)

Baseline: 66.0 (HHEM)

Accuracy

Value: 74.9 -> 79.2 (TrueTeacher + GraphEval)

Baseline: 74.9 (TrueTeacher)

Accuracy

Value: avg +6.2 balanced-accuracy (weighted by dataset size)

ROUGE-1 similarity after correction (HHEM+GraphEval, SummEval)

Value: Direct Prompt 0.827 -> GraphCorrect 0.915

Baseline: 0.827 (Direct Prompt)

Believed corrected hallucinations (HHEM+GraphEval, QAGS-C)

Value: Direct Prompt 38.5% -> GraphCorrect 58.7%

Baseline: 38.5% (Direct Prompt)
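The "believed corrected" percentage can be estimated by re-running the detector on the corrected outputs. The sketch below assumes that an output counts as believed corrected when the same detector no longer flags it after correction. That reading, the `Detector` interface, and the toy detector are assumptions made for illustration.

```python
from typing import Callable, Sequence

# Hypothetical interface: (output, context) -> True if the output is flagged as hallucinated.
Detector = Callable[[str, str], bool]


def believed_corrected_rate(
    corrected_outputs: Sequence[str],
    contexts: Sequence[str],
    detector: Detector,
) -> float:
    """Percentage of previously flagged outputs that the detector no longer flags."""
    if len(corrected_outputs) != len(contexts):
        raise ValueError("corrected_outputs and contexts must have the same length")
    if not corrected_outputs:
        raise ValueError("need at least one corrected output")
    cleared = sum(not detector(o, c) for o, c in zip(corrected_outputs, contexts))
    return 100.0 * cleared / len(corrected_outputs)


def toy_detector(output: str, context: str) -> bool:
    """Toy detector: flags an output that mentions a city absent from the context."""
    return "Paris" in output and "Paris" not in context


ctx = "Marie Curie was born in Warsaw."
after_correction = ["Marie Curie was born in Warsaw.", "Marie Curie was born in Paris."]
rate = believed_corrected_rate(after_correction, [ctx, ctx], toy_detector)
print(rate)
```

The word "believed" is the important caveat: the figure only shows that the detector stopped objecting, so a human spot check is still needed to confirm the fix is right.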

Who Should Care

What To Try In 7 Days

Run the paper's KG-construction prompt on several typical LLM outputs to generate triples.

Feed those triples and the context into an available NLI model (e.g., HHEM) and compare detection with/without GraphEval.

Use GraphCorrect on a handful of flagged outputs and measure ROUGE and whether the fixes are acceptable to a human reviewer.
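For the first step, the LLM's reply has to be turned into triples before anything can be checked. The paper's actual prompt and output format are not reproduced in this summary, so the sketch below assumes a hypothetical setup in which the model is asked to answer with a JSON list of `[subject, relation, object]` arrays. It parses defensively, since models often wrap JSON in extra prose.

```python
import json
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def parse_triples(raw: str) -> List[Triple]:
    """Extract well-formed triples from an LLM reply that contains a JSON list."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return []  # no JSON list found in the reply
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return []  # reply contained brackets but not valid JSON
    triples: List[Triple] = []
    for item in data:
        # Keep only three-element lists of non-empty strings.
        if isinstance(item, list) and len(item) == 3 and all(isinstance(x, str) for x in item):
            subject, relation, obj = (x.strip() for x in item)
            if subject and relation and obj:
                triples.append((subject, relation, obj))
    return triples


raw_reply = (
    "Here are the triples:\n"
    '[["Marie Curie", "born in", "Warsaw"], '
    '["Marie Curie", "won"], '
    '["Marie Curie", "won", "Nobel Prize in Physics"]]'
)
parsed = parse_triples(raw_reply)
print(parsed)
```

Malformed items are dropped silently here. For an evaluation run it may be worth logging them, since lost triples are one of the failure modes noted for KG construction.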

Reproducibility

Data Urls

  • SummEval
  • QAGS-C
  • QAGS-X
  • CNN/DailyMail
  • XSum

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Focused on closed-domain detection only; open-domain hallucinations are out of scope.
  • KG construction can lose or mis-extract facts, especially for short or tightly phrased outputs.
  • Relies on external NLI detectors, whose own errors and biases carry over into the judgments.
  • Evaluations are limited to summarization benchmarks (SummEval, QAGS‑C, QAGS‑X).

When Not To Use

  • When you need open-domain truth-checking against the web or external knowledge.
  • For very short outputs where single-sentence checks are cheaper and KG adds little.
  • If latency or cost constraints rule out even one extra LLM call for KG construction.

Failure Modes

  • Coreference or entity extraction errors produce wrong triples and false flags.
  • NLI model misclassification yields false positives or false negatives for triples.
  • GraphCorrect may introduce subtle meaning changes if the corrected triple is imperfect.
  • Short outputs can be degraded by KG construction, reducing detection signal.

Core Entities

Models

  • HHEM
  • TRUE
  • TrueTeacher
  • Claude 2

Metrics

  • Accuracy
  • ROUGE-1
  • ROUGE-2
  • ROUGE-L

Datasets

  • SummEval
  • QAGS-C
  • QAGS-X
  • CNN/DailyMail
  • XSum

Benchmarks

  • SummEval
  • QAGS-C
  • QAGS-X