Overview
GraphEval is simple to add (one LLM call to build triples plus cheap NLI checks) and shows consistent accuracy gains on summarization benchmarks; KG quality and NLI errors are the main risks to reliability.
Citations: 6
Evidence Strength: 0.75
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
GraphEval pinpoints which facts in an LLM output are ungrounded and raises automatic detector accuracy, enabling targeted fixes and cheaper, explainable QA for production systems.
Summary TLDR
GraphEval converts an LLM's generated text into a small knowledge graph (KG) of triples, then checks each triple against the provided grounding context using off-the-shelf natural language inference (NLI) models. This preprocessing step improves balanced accuracy on three summarization hallucination benchmarks (average gain 6.2 points, SE = 1.3) and highlights the exact triples that appear ungrounded. It also enables a targeted correction method, GraphCorrect, that fixes many flagged hallucinations while keeping the corrected text close to the original.
Problem Statement
Detect whether an LLM output contains any factual inconsistencies relative to the explicit context supplied with the prompt (closed-domain hallucination detection). The aim is binary classification (consistent vs. contains >=1 inconsistency) while also surfacing which pieces of the output are ungrounded.
Main Contribution
GraphEval: represent an LLM output as a knowledge graph (triples), then feed each triple + context to an NLI model to detect hallucinations and return the inconsistent triples for explainability.
Empirical evidence that adding GraphEval to existing NLI-based detectors improves balanced accuracy on SummEval, QAGS-C, and QAGS-X (weighted average improvement 6.2 points, SE = 1.3).
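The detection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Triple` type, `detect_hallucinations`, the `nli_score` callable, and the 0.5 threshold are all assumptions introduced here. In practice `nli_score` would wrap a real NLI model (for example HHEM or TrueTeacher) and the triples would come from the LLM-based KG-construction prompt; the toy scorer below only stands in for that model so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Triple:
    """One (subject, relation, object) edge of the knowledge graph."""
    subject: str
    relation: str
    obj: str

    def as_text(self) -> str:
        # Flatten the triple into a short hypothesis string for the NLI model.
        return f"{self.subject} {self.relation} {self.obj}"


def detect_hallucinations(
    triples: List[Triple],
    context: str,
    nli_score: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed cut-off; tune per NLI model
) -> Tuple[bool, List[Triple]]:
    """Return (has_hallucination, inconsistent_triples).

    nli_score(context, hypothesis) is assumed to return the probability
    that the hypothesis is supported by the context.
    """
    inconsistent = [
        t for t in triples if nli_score(context, t.as_text()) < threshold
    ]
    # The output is flagged if at least one triple is ungrounded.
    return bool(inconsistent), inconsistent


def toy_nli_score(context: str, hypothesis: str) -> float:
    """Stand-in scorer for demonstration only (not a real NLI model):
    1.0 when every hypothesis word appears in the context, else 0.0."""
    context_words = set(context.lower().split())
    words = hypothesis.lower().split()
    return 1.0 if all(w in context_words for w in words) else 0.0


# Hypothetical context and triples, invented for illustration.
context = "marie curie won the nobel prize in physics in 1903"
triples = [
    Triple("marie curie", "won", "the nobel prize"),
    Triple("marie curie", "won", "the fields medal"),
]
flagged, bad_triples = detect_hallucinations(triples, context, toy_nli_score)
```

Returning the list of inconsistent triples alongside the binary flag is what gives the explainability described above: a reviewer sees which specific claims failed the check rather than a single score for the whole output.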
Key Findings
Adding GraphEval to NLI-based detectors raises balanced accuracy on three summarization benchmarks.
GraphEval + TrueTeacher improved SummEval balanced accuracy from 74.9 to 79.2.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Balanced accuracy | 66.0 -> 71.5 (HHEM + GraphEval) | 66.0 (HHEM) | +5.5 | SummEval | Table 2 HHEM vs HHEM+GraphEval | Table 2 |
| Balanced accuracy | 74.9 -> 79.2 (TrueTeacher + GraphEval) | 74.9 (TrueTeacher) | +4.3 | SummEval | Table 2 TrueTeacher vs TrueTeacher+GraphEval | Table 2 |
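For readers reproducing these numbers, balanced accuracy is the mean of the recall on each class, which keeps the score meaningful when consistent and hallucinated examples are unevenly represented. The sketch below computes it from scratch and rechecks the two deltas in the table; the function name and the toy labels are assumptions made for illustration and are not benchmark data.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of per-class recall for binary labels
    (1 = contains a hallucination, 0 = consistent)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == 0)
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    return 0.5 * (tp / pos + tn / neg)


# Toy labels, invented for illustration (not benchmark data).
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
score = balanced_accuracy(y_true, y_pred)  # recall 1/2 and 3/4

# Deltas recomputed from the two SummEval rows in the table.
hhem_delta = round(71.5 - 66.0, 1)
trueteacher_delta = round(79.2 - 74.9, 1)
```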
What To Try In 7 Days
Run the paper's KG-construction prompt on several typical LLM outputs to generate triples.
Feed those triples and the context into an available NLI model (e.g., HHEM) and compare detection with/without GraphEval.
Use GraphCorrect on a handful of flagged outputs, then measure ROUGE against the original text and check whether the fixes are acceptable to a human reviewer.
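For the last step, a quick way to check how far a correction drifts from the original text is a unigram-overlap score. The sketch below is a simplified ROUGE-1 F1 (lowercasing and whitespace tokenization, no stemming), so its values will not exactly match a full ROUGE implementation; the function name and example strings are illustrative assumptions.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1 based on clipped unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical original output and its corrected version.
original = "the cat sat on the mat"
corrected = "the cat lay on the mat"
similarity = rouge1_f1(original, corrected)
```

A score near 1.0 suggests the correction changed only a few tokens, but it says nothing about whether the change is factually right, which is why the human acceptability check remains part of the step.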
Risks & Boundaries
Limitations
Focused on closed-domain detection only; open-domain hallucinations are out of scope.
KG construction can lose or mis-extract facts, especially for short or tightly phrased outputs.
When Not To Use
When you need open-domain truth-checking against the web or external knowledge.
For very short outputs where single-sentence checks are cheaper and KG adds little.
Failure Modes
Coreference or entity extraction errors produce wrong triples and false flags.
NLI model misclassification yields false positives or false negatives for triples.

