Turn an LLM output into a mini knowledge graph, check each fact with an NLI model, and get explainable hallucination flags

Overview

Decision Snapshot: Needs Validation

GraphEval is simple to add (one LLM call to build triples plus cheap NLI checks) and shows consistent accuracy gains on summarization benchmarks; KG quality and NLI errors are the main risks to reliability.

Citations: 6

Evidence Strength: 0.75

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Hannah Sansford, Nicholas Richardson, Hermina Petric Maretic, Juba Nait Saada

Links

Abstract / PDF / Data

Why It Matters For Business

GraphEval pinpoints which facts in an LLM output are ungrounded and raises automatic detector accuracy, enabling targeted fixes and cheaper, explainable QA for production systems.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, CTO

Summary TLDR

GraphEval converts an LLM's generated text into a small knowledge graph (KG) of triples, then checks each triple against the provided grounding context using off-the-shelf NLI (natural language inference) models. This preprocessing step improves balanced accuracy on three summarization hallucination benchmarks (average gain 6.2 points, SE=1.3), highlights the exact triples that appear ungrounded, and enables a targeted correction method (GraphCorrect) that raises similarity to the original text while fixing many flagged hallucinations.

Problem Statement

Detect whether an LLM output contains any factual inconsistencies relative to the explicit context supplied with the prompt (closed-domain hallucination detection). The aim is binary classification (consistent vs. contains >=1 inconsistency) while also surfacing which pieces of the output are ungrounded.

Main Contribution

GraphEval: represent an LLM output as a knowledge graph (triples), then feed each triple + context to an NLI model to detect hallucinations and return the inconsistent triples for explainability.

Empirical results show that adding GraphEval to existing NLI-based detectors improves balanced accuracy on SummEval, QAGS-C, and QAGS-X (weighted average improvement of 6.2 points, SE=1.3).
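The pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the triples have already been extracted by the LLM call (that prompt is not reproduced here), the names `Triple`, `check_output`, and `toy_nli` are hypothetical, the 0.5 threshold is an assumed cut-off, and `toy_nli` is a word-overlap stand-in for a real NLI model such as HHEM, used only so the example runs without downloading a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Triple:
    """One edge of the knowledge graph: (subject, relation, object)."""
    subject: str
    relation: str
    obj: str

    def as_text(self) -> str:
        # Verbalize the triple so it can serve as the NLI hypothesis.
        return f"{self.subject} {self.relation} {self.obj}"


# An NLI scorer takes (premise, hypothesis) and returns a consistency
# score in [0, 1]. This signature is an assumption for illustration.
NliScorer = Callable[[str, str], float]


def toy_nli(premise: str, hypothesis: str) -> float:
    """Stand-in scorer: fraction of hypothesis words found in the premise.

    NOT a real NLI model; swap in a real one for actual use.
    """
    premise_words = set(premise.lower().split())
    hyp_words = hypothesis.lower().split()
    if not hyp_words:
        return 1.0
    return sum(w in premise_words for w in hyp_words) / len(hyp_words)


def check_output(
    triples: List[Triple],
    context: str,
    nli: NliScorer,
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Tuple[bool, List[Triple]]:
    """Return (has_hallucination, flagged_triples)."""
    flagged = [t for t in triples if nli(context, t.as_text()) < threshold]
    return bool(flagged), flagged


context = "marie curie won the nobel prize in physics in 1903"
triples = [
    Triple("curie", "won", "the nobel prize"),
    Triple("curie", "was born in", "london"),
]
has_hallucination, flagged = check_output(triples, context, toy_nli)
print(has_hallucination, [t.as_text() for t in flagged])
```

The output is flagged as soon as one triple falls below the threshold, which matches the binary framing in the Problem Statement, and the returned list of flagged triples is what provides the explanation.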

Key Findings

Adding GraphEval to NLI-based detectors raises balanced accuracy on three summarization benchmarks.

Numbers: avg +6.2 balanced-accuracy points (SE=1.3) across SummEval, QAGS-C, QAGS-X

Practical Use: Use GraphEval as a cheap preprocessor to boost off-the-shelf NLI detectors when checking multi-sentence or long outputs.

Evidence Ref: Table 2; Section 7.4.1

GraphEval + TrueTeacher improved SummEval balanced accuracy from 74.9 to 79.2.

Numbers: 74.9 -> 79.2 (SummEval, TrueTeacher)

Practical Use: If you use a strong NLI model, GraphEval still gives a measurable accuracy lift on summarization tasks.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Balanced accuracy	71.5 (HHEM + GraphEval)	66.0 (HHEM)	+5.5	SummEval	Table 2 HHEM vs HHEM+GraphEval	Table 2
Balanced accuracy	79.2 (TrueTeacher + GraphEval)	74.9 (TrueTeacher)	+4.3	SummEval	Table 2 TrueTeacher vs TrueTeacher+GraphEval	Table 2
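The Key Findings describe these figures as balanced accuracy, which is the mean of the recall on each class. The short sketch below shows that formula and recomputes the deltas from the two SummEval rows. The confusion counts are hypothetical and chosen only to illustrate the arithmetic; they are not taken from the paper.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of recall on the positive class and recall on the negative class."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical confusion counts, chosen only to illustrate the formula.
score = balanced_accuracy(tp=80, fn=20, tn=30, fp=20)
print(f"{100 * score:.1f}")

# Deltas recomputed from the SummEval rows: (baseline, with GraphEval).
rows = {"HHEM": (66.0, 71.5), "TrueTeacher": (74.9, 79.2)}
deltas = {name: round(after - before, 1) for name, (before, after) in rows.items()}
print(deltas)
```

Balanced accuracy is a natural choice when consistent and inconsistent outputs are unevenly represented, because a detector that always predicts the majority class scores only 50.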

What To Try In 7 Days

Run the paper's KG-construction prompt on several typical LLM outputs to generate triples.

Feed those triples and the context into an available NLI model (e.g., HHEM) and compare detection with/without GraphEval.

Use GraphCorrect on a handful of flagged outputs and measure ROUGE and whether the fixes are acceptable to a human reviewer.
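For the last step, a rough similarity check between an original output and its corrected version can be sketched with a simplified ROUGE-1 F1. This is an illustrative stand-in, not the scoring used in the paper: the function name `rouge1_f1` and the example sentences are hypothetical, and the tokenization is plain lowercase whitespace splitting, so the numbers will differ from those of standard ROUGE tooling.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between two texts (simplified ROUGE-1).

    Uses lowercase whitespace tokens only; standard ROUGE tooling adds
    its own tokenization and optional stemming, so scores will differ.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


original = "the cat sat on the red mat"
corrected = "the cat sat on the blue mat"
print(round(rouge1_f1(original, corrected), 3))
```

A high score here only says the correction stayed close to the original wording; whether the fix is factually acceptable still needs the human review mentioned above.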

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

SummEval, QAGS-C, QAGS-X, CNN/DailyMail, XSum

Risks & Boundaries

Limitations

Focused on closed-domain detection only; open-domain hallucinations are out of scope.

KG construction can lose or mis-extract facts, especially for short or tightly phrased outputs.

When Not To Use

When you need open-domain truth-checking against the web or external knowledge.

For very short outputs where single-sentence checks are cheaper and KG adds little.

Failure Modes

Coreference or entity extraction errors produce wrong triples and false flags.

NLI model misclassification yields false positives or false negatives for triples.

Core Entities

Models

HHEM, TRUE, TrueTeacher, Claude 2

Metrics

Accuracy, ROUGE-1, ROUGE-2, ROUGE-L

Datasets

SummEval, QAGS-C, QAGS-X, CNN/DailyMail, XSum

Benchmarks

SummEval, QAGS-C, QAGS-X
