Combine LLMs with a medical knowledge graph to get more accurate, verifiable scientific answers

August 6, 2024 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

1

Authors

Daniel Steinigen, Roman Teucher, Timm Heine Ruland, Max Rudat, Nicolas Flores-Herr, Peter Fischer, Nikola Milosevic, Christopher Schymura, Angelo Ziletti

Links

Abstract / PDF

Why It Matters For Business

FactFinder reduces hallucinations and improves completeness for domain questions by combining proprietary graph data with an LLM, making it useful for research teams that need verified, up-to-date facts quickly.

Summary TLDR

FactFinder is a hybrid QA system that uses an LLM to turn natural-language questions into Cypher queries, executes them on PrimeKG (a medical knowledge graph), and asks an LLM to verbalize the graph results. On a curated benchmark of 69 text-to-Cypher pairs, the system retrieves correct nodes with ~78% precision, and its answers are more correct than an LLM-only pipeline's in 94.12% of cases and more complete in 96.08%. The repo, prompts, and dataset are published. The method is a practical prototype for time-sensitive, domain-specific factual queries, but it has been tested only on a small, single-domain dataset.
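The three steps described above (generate Cypher, run it on the graph, verbalize the rows) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `LLM` and `GraphRunner` callables, the `Answer` container, and both prompt strings are hypothetical stand-ins, and the released repo's own prompt templates will differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interfaces: wrap any LLM client or graph driver to fit them.
LLM = Callable[[str], str]                  # prompt -> completion
GraphRunner = Callable[[str], List[Dict]]   # Cypher text -> result rows


@dataclass
class Answer:
    text: str         # verbalized answer shown to the user
    cypher: str       # generated query, kept as inspectable evidence
    rows: List[Dict]  # raw graph results backing the answer


def answer_question(question: str, schema: str, llm: LLM, run_cypher: GraphRunner) -> Answer:
    # Step 1: translate the question into Cypher, given the graph schema.
    # (Illustrative prompt, not the paper's template.)
    cypher = llm(
        f"Graph schema:\n{schema}\n\n"
        f"Write one Cypher query that answers: {question}\n"
        "Return only the query."
    ).strip()

    # Step 2: execute the generated query on the knowledge graph.
    rows = run_cypher(cypher)

    # Step 3: verbalize the rows, asking the model to stay within the evidence.
    text = llm(
        f"Question: {question}\nGraph results: {rows}\n"
        "Answer using only these results. "
        "If they do not answer the question, say so."
    ).strip()

    return Answer(text=text, cypher=cypher, rows=rows)
```

Returning the query and rows alongside the text is what makes the answer checkable: a reviewer can read the Cypher and the rows rather than trusting the verbalization alone.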

Problem Statement

LLMs can answer natural-language questions but often lack up-to-date, domain-specific facts and can hallucinate. The paper asks: can we reliably combine an LLM with a knowledge graph to retrieve factual answers for life-science questions and make the system transparent and verifiable?

Main Contribution

A working hybrid QA pipeline that generates Cypher from text, runs queries on PrimeKG, and verbalizes graph results with an LLM.

A manually created dataset of 69 text-to-Cypher query pairs for medical questions, released together with the code and prompt templates.

Evaluation showing state-of-the-art LLMs (GPT-4o/GPT-4-Turbo) can produce useful Cypher queries for medical KG retrieval.

Practical UI and evidence tools (subgraph visualization, Cypher evidence) to increase user trust and inspect results.

Key Findings

Hybrid KG+LLM retrieval achieved good precision on node retrieval.

Numbers: Precision ≈ 78% on the 69-sample dataset

GPT-4o produced the strongest text-to-Cypher retrieval performance.

Numbers: GPT-4o precision 77.5%, recall 77.8%, IoU 75.2% (EE False)

The hybrid system gave more accurate and more complete answers than an LLM alone on evaluated questions.

Numbers: Hybrid more correct in 94.12% and more complete in 96.08% of cases

LLM verbalization of graph results is usually reliable but not perfect.

Numbers: Verbalization correctness 89.13%, completeness 80.43%

LLMs can often detect irrelevant or incorrect KG responses and refuse to answer.

Numbers: Answer denied in 94.2% (gpt-4o) and 91.3% (gpt-4-turbo) of injected-wrong-query cases
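The retrieval metrics quoted in these findings (precision, recall, IoU) are set-overlap measures between the nodes a generated query returns and the nodes the reference query returns. A minimal sketch for a single question follows. The function name is hypothetical, and scoring empty sets as 0.0 is an assumption; this summary does not state how the paper handles empty results or averages across the 69 questions.

```python
from typing import Iterable, Tuple


def retrieval_scores(retrieved: Iterable[str], expected: Iterable[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, IoU) for one question's node sets."""
    r, e = set(retrieved), set(expected)
    overlap = len(r & e)  # nodes both queries returned
    union = len(r | e)    # nodes either query returned

    # Assumption: an empty denominator scores 0.0.
    precision = overlap / len(r) if r else 0.0
    recall = overlap / len(e) if e else 0.0
    iou = overlap / union if union else 0.0
    return precision, recall, iou


# Two of three retrieved nodes are expected; one expected node is missed.
p, r, iou = retrieval_scores({"a", "b", "c"}, {"b", "c", "d"})
```

In this example precision and recall are both 2/3, while IoU is 2/4 = 0.5, because the union counts the spurious node and the missed node together.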

Results

Text-to-Cypher node retrieval precision (best)

Value: 77.5% (GPT-4o, EE False)

Hybrid system correctness vs LLM-only

Value: Hybrid more correct in 94.12% of cases

Baseline: LLM-only

LLM verbalization correctness

Value: 89.13% correct, 80.43% complete

Detection of irrelevant KG responses (answer denied)

Value: 94.2% (gpt-4o); 91.3% (gpt-4-turbo)

Who Should Care

What To Try In 7 Days

Run the released repo on PrimeKG and reproduce a few example queries.

Compare LLM-only answers to the KG-backed pipeline on your domain questions.

Inspect generated Cypher queries and subgraphs to validate retrieval behavior for critical queries.

Agent Features

Tool Use

  • Text-to-Cypher generation
  • Neo4j graph queries
  • Entity extraction and mapping
  • LLM verbalization of graph results
  • Subgraph extraction and visualization

Frameworks

  • LangChain
  • Neo4j
  • Streamlit
  • Pyvis

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Evaluation uses a small, manually curated 69-sample dataset, which limits generalization.
  • Only one KG (PrimeKG) and one domain (medical) were tested.
  • PrimeKG merges genes and proteins, which can mislead entity mapping.
  • When the expected answer is a count, a boolean, or a long output, LLMs sometimes produce a full answer even though the KG results are irrelevant.

When Not To Use

  • For high-stakes clinical decisions without expert review.
  • If you need broad multi-domain coverage beyond the KG's scope.
  • When a large, validated benchmark is required for claims of general performance.

Failure Modes

  • The LLM relies on its internal knowledge and ignores the KG, producing incorrect mappings.
  • Generated Cypher is syntactically valid but queries the wrong node types because of a schema mismatch.
  • Aggregate, boolean, or long answers may be verbalized even when the KG evidence is irrelevant.
  • Entity mapping fails when preferred terms are missing or when child-parent node mappings are imperfect.
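One cheap guard against the schema-mismatch failure mode is to check the node labels in a generated query against the labels that actually exist in the graph before running it. The sketch below is a regex heuristic, not a Cypher parser, and is not part of the released code as far as this summary shows. The function name and the label names in the usage example are illustrative assumptions, so check the labels in your own Neo4j import.

```python
import re
from typing import Set

# Matches the label part of node patterns such as (d:disease) or (g:`gene/protein`).
# Anchoring on "(" keeps relationship types like [:indication] out of the result.
_NODE = re.compile(r"\(\s*\w*\s*((?::\s*(?:`[^`]+`|\w+)\s*)+)")
# Pulls individual labels (plain or backtick-quoted) out of a matched label part.
_LABEL = re.compile(r":\s*(?:`([^`]+)`|(\w+))")


def unknown_labels(cypher: str, allowed: Set[str]) -> Set[str]:
    """Return node labels used in `cypher` that are not in `allowed`."""
    found = set()
    for label_part in _NODE.findall(cypher):
        for quoted, plain in _LABEL.findall(label_part):
            found.add(quoted or plain)
    return found - allowed


# Illustrative labels only; the real schema may name them differently.
query = 'MATCH (d:disease)-[:indication]-(x:drugs) WHERE d.name = "asthma" RETURN x.name'
bad = unknown_labels(query, {"disease", "drug"})
```

Here `bad` contains the misspelled label `drugs`, so the query can be rejected or regenerated before it silently returns nothing. Because the check is pattern-based, string literals that happen to look like node patterns can trigger false positives.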

Core Entities

Models

  • gpt-4o
  • gpt-4-turbo

Metrics

  • IoU
  • precision
  • recall
  • correctness
  • completeness

Datasets

  • PrimeKG
  • text-to-Cypher 69-pair dataset (authors)

Benchmarks

  • text-to-Cypher 69-pair dataset