Combine LLMs with a medical knowledge graph to get more accurate, verifiable scientific answers

August 6, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The system clearly improves factuality on the tested medical queries, but the evaluation is small and single-domain, so treat the results as a promising prototype rather than production-grade evidence.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Daniel Steinigen, Roman Teucher, Timm Heine Ruland, Max Rudat, Nicolas Flores-Herr, Peter Fischer, Nikola Milosevic, Christopher Schymura, Angelo Ziletti

Links

Abstract / PDF / Code / Data

Why It Matters For Business

FactFinder reduces hallucinations and improves completeness for domain questions by combining proprietary graph data with an LLM, making it useful for research teams that need verified, up-to-date facts quickly.

Who Should Care

Summary TLDR

FactFinder is a hybrid QA system that uses an LLM to turn natural-language questions into Cypher queries, executes them on PrimeKG (a medical knowledge graph), and asks an LLM to verbalize the graph results. On a curated benchmark of 69 text-to-Cypher pairs, the system retrieves the correct nodes with ~78% precision and outperforms an LLM-only pipeline on correctness (in 94.12% of cases) and completeness (96.08%). The repo, prompts, and dataset are published. The method is a practical prototype for time-sensitive, domain-specific factual queries, but it has been tested only on a small, single-domain dataset.
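The three stages described above can be sketched as a small function. This is a minimal illustration, not the released FactFinder code: `answer_question` and the three `fake_*` callables are hypothetical stand-ins for the LLM and the graph database, and the node labels and relationship type in the example query are assumed for illustration rather than taken from the PrimeKG schema.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def answer_question(
    question: str,
    generate_cypher: Callable[[str], str],
    run_query: Callable[[str], List[Row]],
    verbalize: Callable[[str, List[Row]], str],
) -> Dict[str, object]:
    """Text -> Cypher -> graph rows -> natural-language answer."""
    cypher = generate_cypher(question)   # stage 1: LLM writes the query
    rows = run_query(cypher)             # stage 2: graph database executes it
    answer = verbalize(question, rows)   # stage 3: LLM phrases the rows
    # Return the intermediate artifacts so a reader can verify the answer.
    return {"cypher": cypher, "rows": rows, "answer": answer}


# Hypothetical stand-ins; a real setup would call an LLM and a graph database.
def fake_generate_cypher(question: str) -> str:
    return (
        'MATCH (d:drug)-[:indication]->(x:disease {name: "asthma"}) '
        "RETURN d.name AS name"
    )


def fake_run_query(cypher: str) -> List[Row]:
    return [{"name": "salbutamol"}, {"name": "budesonide"}]


def fake_verbalize(question: str, rows: List[Row]) -> str:
    names = ", ".join(row["name"] for row in rows)
    return f"According to the graph: {names}."


result = answer_question(
    "Which drugs treat asthma?",
    fake_generate_cypher,
    fake_run_query,
    fake_verbalize,
)
print(result["answer"])
```

Keeping the generated query and the raw rows next to the final answer is what makes the output checkable: a reviewer can rerun the Cypher and compare the rows with the verbalized text.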

Problem Statement

LLMs can answer natural-language questions but often lack up-to-date, domain-specific facts and can hallucinate. The paper asks: can we reliably combine an LLM with a knowledge graph to retrieve factual answers for life-science questions and make the system transparent and verifiable?

Main Contribution

A working hybrid QA pipeline that generates Cypher from text, runs queries on PrimeKG, and verbalizes graph results with an LLM.

A manually created dataset of 69 text-to-Cypher query pairs for medical questions, released together with the code and prompt templates.

Key Findings

Hybrid KG+LLM retrieval achieved good precision on node retrieval.

Numbers: Precision ≈ 78% on the 69-sample dataset

Practical Use: Use the KG retrieval path when factual node-level accuracy matters; expect ~3/4 correct node hits on similar queries.

Evidence Ref: Abstract; Sec. 4.1

GPT-4o produced the strongest text-to-Cypher retrieval performance.

Numbers: GPT-4o precision 77.5%, recall 77.8%, IoU 75.2% (EE False)

Practical Use: Prompt GPT-4o for Cypher generation when available. Validate generated queries before trusting results.

Evidence Ref: Table 1 (Sec. 4.1)
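Precision, recall, and IoU for node retrieval can be read as set comparisons between the nodes an expected query returns and the nodes a generated query returns. The sketch below shows that reading for a single question. The function name `retrieval_scores`, the handling of empty sets, and the per-question framing are assumptions; this summary does not state how the paper aggregates scores across the 69 pairs.

```python
from typing import Dict, Iterable


def retrieval_scores(expected: Iterable[str], retrieved: Iterable[str]) -> Dict[str, float]:
    """Set-based precision, recall, and IoU for one question."""
    exp, ret = set(expected), set(retrieved)
    hits = len(exp & ret)    # nodes that are both expected and retrieved
    union = len(exp | ret)   # nodes that appear in either set
    return {
        # Assumption: an empty denominator scores 0.0 instead of raising.
        "precision": hits / len(ret) if ret else 0.0,
        "recall": hits / len(exp) if exp else 0.0,
        "iou": hits / union if union else 0.0,
    }


# Toy example: three of four retrieved nodes are correct, one expected node is missed.
scores = retrieval_scores(["a", "b", "c", "d"], ["a", "b", "c", "x"])
print(scores)
```

In the toy example precision and recall are both 0.75 while IoU is 0.6, which shows why IoU tends to be the strictest of the three: it penalizes both wrong and missing nodes at once.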

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Text-to-Cypher node retrieval precision (best) | 77.5% (GPT-4o, EE False) | - | - | 69 text-to-Cypher pairs (PrimeKG) | Table 1, Sec. 4.1 | Table 1
Hybrid system correctness vs LLM-only | Hybrid more correct in 94.12% of cases | LLM-only | - | 69 question set | Sec. 4.2 - Hybrid vs LLM-only | Sec. 4.2

What To Try In 7 Days

Run the released repo on PrimeKG and reproduce a few example queries.

Compare LLM-only answers to the KG-backed pipeline on your domain questions.

Inspect generated Cypher queries and subgraphs to validate retrieval behavior for critical queries.
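For the comparison step, one simple way to summarize results is a pairwise win rate: for each question, a reviewer records which pipeline gave the better answer, and the share of wins is reported. The sketch below assumes hypothetical labels (`"hybrid"`, `"llm_only"`, `"tie"`) and counts ties as non-wins; this summary does not say how the paper treats ties, so that choice is an assumption.

```python
from collections import Counter
from typing import Iterable


def win_rate(judgments: Iterable[str], winner: str = "hybrid") -> float:
    """Fraction of judged questions on which `winner` was preferred."""
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments to summarize")
    # Assumption: ties and losses both count against the win rate.
    return counts[winner] / total


# Toy labels from a manual side-by-side review of five questions.
judgments = ["hybrid", "hybrid", "llm_only", "hybrid", "tie"]
rate = win_rate(judgments)
print(f"hybrid preferred on {rate:.0%} of questions")
```

Recording one label per question also leaves an audit trail, so disputed cases can be revisited alongside the generated Cypher and the retrieved subgraph.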

Agent Features

Tool Use
Text-to-Cypher generation, Neo4j graph queries, Entity extraction and mapping, LLM verbalization of graph results, Subgraph extraction and visualization
Frameworks
LangChain, Neo4j, Streamlit, Pyvis

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Evaluation uses a small, manually curated 69-sample dataset, which limits generalization.

Single KG (PrimeKG) and single domain (medical) tested.

When Not To Use

For high-stakes clinical decisions without expert review.

If you need broad multi-domain coverage beyond the KG's scope.

Failure Modes

LLM uses internal knowledge and ignores KG, producing incorrect mappings.

Generated Cypher is syntactically valid but queries wrong node types due to schema mismatch.
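Part of the schema-mismatch failure mode can be caught before execution by comparing the labels and relationship types in a generated query against the graph's known schema. The sketch below is a rough regex-based check, not a Cypher parser and not part of the released code. The helper `unknown_schema_terms` and the example schema sets are hypothetical, and the check only flags names that are absent from the schema; it cannot detect a query that uses a valid but semantically wrong node type.

```python
import re
from typing import Dict, Set

# Rough patterns for "(var:Label" and "[var:TYPE"; the variable name is optional.
# They ignore multi-label nodes, "|" alternatives, and backtick-quoted names.
LABEL_RE = re.compile(r"\(\s*\w*\s*:\s*(\w+)")
REL_RE = re.compile(r"\[\s*\w*\s*:\s*(\w+)")


def unknown_schema_terms(
    cypher: str, node_labels: Set[str], rel_types: Set[str]
) -> Dict[str, Set[str]]:
    """Return labels and relationship types in the query that the schema lacks."""
    labels = set(LABEL_RE.findall(cypher))
    rels = set(REL_RE.findall(cypher))
    return {
        "labels": labels - node_labels,
        "relationships": rels - rel_types,
    }


# Hypothetical schema and query, for illustration only.
schema_labels = {"drug", "disease"}
schema_rels = {"indication"}
query = "MATCH (d:drug)-[:indication]->(x:illness) RETURN d.name"

problems = unknown_schema_terms(query, schema_labels, schema_rels)
print(problems)
```

A non-empty result is a cheap signal to regenerate the query or route it to a human before it reaches the database.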

Core Entities

Models

gpt-4o, gpt-4-turbo

Metrics

IoU, precision, recall, correctness, completeness

Datasets

PrimeKG, text-to-Cypher 69-pair dataset (authors)

Benchmarks

text-to-Cypher 69-pair dataset