Combine LLMs with a medical knowledge graph to get more accurate, verifiable scientific answers

Overview

Decision Snapshot: Needs Validation

The system clearly improves factuality on the tested medical queries, but the evaluation is small and single-domain, so treat the results as a promising prototype rather than production-grade evidence.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Daniel Steinigen, Roman Teucher, Timm Heine Ruland, Max Rudat, Nicolas Flores-Herr, Peter Fischer, Nikola Milosevic, Christopher Schymura, Angelo Ziletti

Links

Abstract / PDF / Code / Data

Why It Matters For Business

FactFinder reduces hallucinations and improves completeness for domain questions by combining proprietary graph data with an LLM, making it useful for research teams that need verified, up-to-date facts quickly.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

FactFinder is a hybrid QA system that uses an LLM to turn natural-language questions into Cypher queries, executes them on PrimeKG (a medical knowledge graph), and asks an LLM to verbalize the graph results. On a curated benchmark of 69 text-to-Cypher samples, the system retrieves correct nodes with ~78% precision. Compared with an LLM-only pipeline, the hybrid system's answers are judged more correct in 94.12% of cases, with a reported completeness figure of 96.08%. The repo, prompts, and dataset are published. The method is a practical prototype for time-sensitive, domain-specific factual queries, but it has been tested only on a small, single-domain dataset.

Problem Statement

LLMs can answer natural-language questions but often lack up-to-date, domain-specific facts and can hallucinate. The paper asks: can we reliably combine an LLM with a knowledge graph to retrieve factual answers for life-science questions and make the system transparent and verifiable?

Main Contribution

A working hybrid QA pipeline that generates Cypher from text, runs queries on PrimeKG, and verbalizes graph results with an LLM.

A manually created dataset of 69 text-to-Cypher query pairs for medical questions, released together with the code and prompt templates.
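The three pipeline stages named above (text to Cypher, graph execution, verbalization) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `answer_question`, the prompt wording, and the `llm` and `run_cypher` callables are all hypothetical placeholders for whatever model client and graph driver you use.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces: an LLM is any function from prompt text to completion
# text, and a graph runner is any function from a Cypher string to result rows.
LLM = Callable[[str], str]
GraphRunner = Callable[[str], List[Dict[str, object]]]


def answer_question(
    question: str, schema: str, llm: LLM, run_cypher: GraphRunner
) -> Dict[str, object]:
    """Text -> Cypher -> graph rows -> verbalized answer (illustrative only)."""
    # Stage 1: ask the LLM for a Cypher query, given the question and the schema.
    cypher = llm(
        f"Graph schema:\n{schema}\n\n"
        f"Write one Cypher query that answers: {question}\n"
        "Return only the query."
    ).strip()

    # Stage 2: execute the generated query against the knowledge graph.
    rows = run_cypher(cypher)

    # Stage 3: ask the LLM to verbalize only what the graph returned.
    answer = llm(
        f"Question: {question}\n"
        f"Graph results: {rows}\n"
        "Answer using only the graph results above."
    ).strip()

    # Return the query and raw rows alongside the answer so it can be verified.
    return {"cypher": cypher, "rows": rows, "answer": answer}
```

Returning the generated query and the raw rows next to the final text is what makes the answer checkable: a reviewer can rerun the Cypher and compare the rows with the verbalization.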

Key Findings

Hybrid KG+LLM retrieval achieved good precision on node retrieval.

Numbers: Precision ≈ 78% on the 69-sample dataset

Practical Use: Use the KG retrieval path when factual node-level accuracy matters; expect roughly three out of four retrieved nodes to be correct on similar queries.

Evidence Ref: Abstract; Sec. 4.1

GPT-4o produced the strongest text-to-Cypher retrieval performance.

Numbers: GPT-4o precision 77.5%, recall 77.8%, IoU 75.2% (EE False)

Practical Use: Prompt GPT-4o for Cypher generation when available. Validate generated queries before trusting results.

Evidence Ref: Table 1 (Sec. 4.1)
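For readers unfamiliar with the three retrieval metrics, the sketch below computes precision, recall, and IoU (intersection over union) for one query from the set of retrieved node identifiers and the set of expected ones. These are the standard set-based definitions. How the paper averages across the 69 queries and how it treats empty result sets is not stated here, so the zero-on-empty convention and the function name `retrieval_scores` are assumptions.

```python
from typing import Dict, Iterable


def retrieval_scores(predicted: Iterable[str], expected: Iterable[str]) -> Dict[str, float]:
    """Set-based precision, recall and IoU for a single query."""
    pred, gold = set(predicted), set(expected)
    overlap = len(pred & gold)  # nodes that are both retrieved and expected
    union = len(pred | gold)    # nodes that appear in either set
    return {
        # Assumption: a score is 0.0 when its denominator would be zero.
        "precision": overlap / len(pred) if pred else 0.0,
        "recall": overlap / len(gold) if gold else 0.0,
        "iou": overlap / union if union else 0.0,
    }


# Illustrative identifiers only: 2 of 3 retrieved nodes are correct,
# and 2 of the 4 expected nodes were found.
scores = retrieval_scores(["n1", "n2", "n3"], ["n2", "n3", "n4", "n5"])
```

In this example precision is 2/3, recall is 2/4, and IoU is 2/5, since the union of the two sets holds five distinct nodes.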

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Text-to-Cypher node retrieval precision (best)	77.5% (GPT-4o, EE False)	—	—	69 text-to-Cypher pairs (PrimeKG)	Table 1, Sec.4.1	Table 1
Hybrid system correctness vs LLM-only	Hybrid more correct in 94.12% of cases	LLM-only	—	69 question set	Sec.4.2 - Hybrid vs LLM-only	Sec.4.2

What To Try In 7 Days

Run the released repo on PrimeKG and reproduce a few example queries.

Compare LLM-only answers to the KG-backed pipeline on your domain questions.

Inspect generated Cypher queries and subgraphs to validate retrieval behavior for critical queries.
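When inspecting generated Cypher, one cheap first check is that a query only reads from the graph. The sketch below is a conservative keyword filter and not a Cypher parser. The helper name `is_read_only` and the clause list are assumptions chosen for illustration, and the filter does not look inside `CALL` procedures, which can also write.

```python
import re

# Quoted strings and backtick-quoted identifiers are removed before the scan,
# so keywords appearing inside them are not mistaken for clauses.
_LITERALS = re.compile(
    r"""
    '(?:\\.|[^'\\])*'      # single-quoted string
    | "(?:\\.|[^"\\])*"    # double-quoted string
    | `[^`]*`              # backtick-quoted identifier
    """,
    re.VERBOSE,
)

# Writing clauses. The lookbehind skips property names (n.set) and labels (:Set).
_WRITE = re.compile(
    r"(?<![.:\w])(?:CREATE|MERGE|DELETE|SET|REMOVE|DROP|FOREACH|LOAD\s+CSV)\b",
    re.IGNORECASE,
)


def is_read_only(cypher: str) -> bool:
    """Return True if no writing clause is found outside quoted text."""
    stripped = _LITERALS.sub("", cypher)
    return _WRITE.search(stripped) is None
```

A filter like this is best combined with database-side protection such as a read-only user or read-only transactions, because keyword scans can miss constructs they were not written for.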

Agent Features

Tool Use

Text-to-Cypher generation, Neo4j graph queries, Entity extraction and mapping, LLM verbalization of graph results, Subgraph extraction and visualization

Frameworks

LangChain, Neo4j, Streamlit, Pyvis

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/chrschy/fact-finder

Data URLs

https://github.com/chrschy/fact-finder

Risks & Boundaries

Limitations

Evaluation uses a small, manually curated 69-sample dataset, which limits generalization.

Only a single KG (PrimeKG) and a single domain (medical) were tested.

When Not To Use

For high-stakes clinical decisions without expert review.

If you need broad multi-domain coverage beyond the KG's scope.

Failure Modes

The LLM relies on its internal knowledge and ignores the KG, producing incorrect mappings.

Generated Cypher is syntactically valid but queries wrong node types due to schema mismatch.
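The schema-mismatch failure can be caught early by comparing the node labels used in a generated query with the labels that exist in the graph. The following is a rough sketch based on regular expressions, not a Cypher parser. The helper `unknown_labels` and the example labels are hypothetical, and the pattern can be confused by text inside string literals.

```python
import re
from typing import Set

# A node pattern: "(" + optional variable + one or more ":Label" parts.
# Relationship patterns start with "[" and are therefore not matched.
_NODE = re.compile(r"\(\s*\w*\s*((?::\s*(?:`[^`]+`|\w+)\s*)+)")
_LABEL = re.compile(r":\s*(`[^`]+`|\w+)")


def unknown_labels(cypher: str, allowed: Set[str]) -> Set[str]:
    """Return node labels used in the query that are not in the allowed schema."""
    found: Set[str] = set()
    for chain in _NODE.findall(cypher):
        for label in _LABEL.findall(chain):
            found.add(label.strip("`"))  # drop backticks around quoted labels
    return found - allowed


# Hypothetical schema and query: "Illness" is not a label in this schema.
bad = unknown_labels(
    "MATCH (d:Drug)-[:TREATS]->(x:Illness) RETURN d.name",
    {"Drug", "Disease"},
)
```

A non-empty result is a signal to reject or regenerate the query before executing it, since a query against a label that does not exist typically returns no rows rather than raising an error.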

Core Entities

Models

gpt-4o, gpt-4-turbo

Metrics

IoU, precision, recall, correctness, completeness

Datasets

PrimeKG, text-to-Cypher 69-pair dataset (authors)

Benchmarks

text-to-Cypher 69-pair dataset
