Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
FactFinder reduces hallucinations and improves completeness for domain questions by combining knowledge graph data with an LLM, making it useful for research teams that need verified, up-to-date facts quickly.
Summary TLDR
FactFinder is a hybrid QA system that uses an LLM to turn natural-language questions into Cypher queries, executes them on PrimeKG (a medical knowledge graph), and asks an LLM to verbalize the graph results. On a curated 69-sample text-to-Cypher benchmark, the system retrieves correct nodes with ~78% precision and outperforms an LLM-only pipeline in correctness (94.12%) and completeness (96.08%) of answers. The repo, prompts, and dataset are published. The method is a practical prototype for time-sensitive, domain-specific factual queries, but it has been tested only on a small, single-domain dataset.
Problem Statement
LLMs can answer natural-language questions but often lack up-to-date, domain-specific facts and can hallucinate. The paper asks: can we reliably combine an LLM with a knowledge graph to retrieve factual answers for life-science questions while keeping the system transparent and verifiable?
Main Contribution
A working hybrid QA pipeline that generates Cypher from text, runs queries on PrimeKG, and verbalizes graph results with an LLM.
A manually created dataset of 69 text-to-Cypher query pairs for medical questions, released together with the code and prompt templates.
An evaluation showing that state-of-the-art LLMs (GPT-4o/GPT-4-Turbo) can produce useful Cypher queries for medical KG retrieval.
Practical UI and evidence tools (subgraph visualization, Cypher evidence) to increase user trust and inspect results.
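The three pipeline stages listed above (generate Cypher, run it on the graph, verbalize the result) can be sketched as a small function that takes each stage as a callable. This is a minimal illustration, not the released implementation: the names `answer_question`, `Answer`, and the `fake_*` stubs are hypothetical, and the Cypher string uses made-up labels rather than the actual PrimeKG schema. In a real setup the stubs would be replaced by an LLM call and a Neo4j query.

```python
from dataclasses import dataclass
from typing import Any, Callable

Rows = list[dict[str, Any]]


@dataclass
class Answer:
    text: str    # verbalized answer shown to the user
    cypher: str  # generated query, kept as inspectable evidence
    rows: Rows   # raw graph result, also kept as evidence


def answer_question(
    question: str,
    generate_cypher: Callable[[str], str],
    run_query: Callable[[str], Rows],
    verbalize: Callable[[str, Rows], str],
) -> Answer:
    cypher = generate_cypher(question)  # stage 1: text -> Cypher (LLM in practice)
    rows = run_query(cypher)            # stage 2: execute the query on the graph
    text = verbalize(question, rows)    # stage 3: graph result -> prose (LLM in practice)
    return Answer(text=text, cypher=cypher, rows=rows)


# Hypothetical stand-ins so the sketch runs without an LLM or a database.
def fake_generate(question: str) -> str:
    # Labels and relationship type are illustrative, not the PrimeKG schema.
    return "MATCH (d:drug)-[:indication]->(x:disease {name: 'asthma'}) RETURN d.name AS name"


def fake_run(cypher: str) -> Rows:
    return [{"name": "drug_a"}, {"name": "drug_b"}]


def fake_verbalize(question: str, rows: Rows) -> str:
    names = ", ".join(row["name"] for row in rows)
    return f"The graph lists: {names}."


result = answer_question(
    "Which drugs treat asthma?", fake_generate, fake_run, fake_verbalize
)
print(result.text)
print(result.cypher)
```

Returning the query and the raw rows alongside the text mirrors the evidence-oriented design described above: the user can inspect what was actually retrieved instead of trusting the verbalization alone.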
Key Findings
Hybrid KG+LLM retrieval achieved roughly 78% precision on node retrieval in the best configuration.
GPT-4o produced the strongest text-to-Cypher retrieval performance.
The hybrid system gave more accurate and more complete answers than an LLM alone on evaluated questions.
LLM verbalization of graph results is usually reliable but not perfect.
LLMs can often detect irrelevant or incorrect KG responses and refuse to answer.
Results
Text-to-Cypher node retrieval precision (best)
Hybrid system correctness vs LLM-only
LLM verbalization correctness
Detection of irrelevant KG responses (answer denied)
Who Should Care
What To Try In 7 Days
Run the released repo on PrimeKG and reproduce a few example queries.
Compare LLM-only answers to the KG-backed pipeline on your domain questions.
Inspect generated Cypher queries and subgraphs to validate retrieval behavior for critical queries.
Agent Features
Tool Use
- Text-to-Cypher generation
- Neo4j graph queries
- Entity extraction and mapping
- LLM verbalization of graph results
- Subgraph extraction and visualization
Frameworks
- LangChain
- Neo4j
- Streamlit
- Pyvis
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- The evaluation uses a small, manually curated 69-sample dataset, which limits generalization.
- Only a single KG (PrimeKG) and a single domain (medical) were tested.
- PrimeKG merges genes and proteins, which can mislead entity mapping.
- LLMs sometimes produce full answers despite irrelevant KG results, particularly for count, boolean, or long outputs.
When Not To Use
- For high-stakes clinical decisions without expert review.
- If you need broad multi-domain coverage beyond the KG's scope.
- When a large, validated benchmark is required for claims of general performance.
Failure Modes
- The LLM relies on internal knowledge and ignores the KG, producing incorrect mappings.
- Generated Cypher is syntactically valid but queries the wrong node types due to schema mismatch.
- Aggregate, boolean, or long answers may be verbalized even when KG evidence is irrelevant.
- Entity mapping errors when preferred terms are missing or when child-parent node mappings are imperfect.
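One cheap guard against the schema-mismatch failure mode above is to compare the node labels in a generated query against the labels the graph actually defines before running it. The sketch below is an assumption-laden illustration, not part of the released code: `unknown_labels` is a hypothetical helper, the label set is invented, and the regular expression only handles a single plain label per node pattern (no backtick-quoted or multi-label nodes). It flags labels that are absent from the schema; it cannot tell that a valid label is the wrong one for the question.

```python
import re

# Matches the first label of a node pattern such as (d:drug) or (:disease).
# Relationship types like [:indication] are in brackets, so they are not matched.
NODE_LABEL = re.compile(r"\(\s*\w*\s*:\s*(\w+)")


def unknown_labels(cypher: str, schema_labels: set[str]) -> set[str]:
    """Return node labels used in the query that the schema does not define."""
    return set(NODE_LABEL.findall(cypher)) - schema_labels


# Hypothetical schema and queries for illustration only.
schema = {"drug", "disease"}
good_query = "MATCH (d:drug)-[:indication]->(x:disease) RETURN d.name"
bad_query = "MATCH (d:drug)-[:indication]->(x:illness) RETURN d.name"

good_missing = unknown_labels(good_query, schema)
bad_missing = unknown_labels(bad_query, schema)
print(good_missing, bad_missing)
```

A non-empty result could be used to reject the query or to re-prompt the LLM with the valid label list.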
Core Entities
Models
- gpt-4o
- gpt-4-turbo
Metrics
- IoU
- precision
- recall
- correctness
- completeness
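For the set-based retrieval metrics above, a minimal sketch of the standard definitions is shown here, treating the nodes returned by the generated query and the nodes returned by the reference query as sets. The summary does not state how the authors handle empty results or aggregate across queries, so the function name `retrieval_scores` and the convention of returning 0.0 for an empty denominator are assumptions.

```python
def retrieval_scores(retrieved: set[str], expected: set[str]) -> dict[str, float]:
    """Set-based precision, recall, and IoU for one query's node results."""
    overlap = len(retrieved & expected)
    union = len(retrieved | expected)
    return {
        # Share of retrieved nodes that are correct.
        "precision": overlap / len(retrieved) if retrieved else 0.0,
        # Share of expected nodes that were retrieved.
        "recall": overlap / len(expected) if expected else 0.0,
        # Intersection over union of the two node sets.
        "iou": overlap / union if union else 0.0,
    }


# Hypothetical node identifiers: three of four retrieved nodes are correct,
# and one expected node is missing.
scores = retrieval_scores({"a", "b", "c", "d"}, {"a", "b", "c", "e"})
print(scores)
```

In this example precision and recall are both 3/4 and IoU is 3/5, which shows why IoU is the stricter of the three: it penalizes both spurious and missing nodes at once.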
Datasets
- PrimeKG
- text-to-Cypher 69-pair dataset (authors)
Benchmarks
- text-to-Cypher 69-pair dataset

