Overview
The system improves factuality on the tested medical queries, but the evaluation is small and single-domain, so treat the results as a promising prototype rather than production-grade evidence.
Citations: 1
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
FactFinder reduces hallucinations and improves completeness for domain questions by combining proprietary graph data with an LLM, making it useful for research teams that need verified, up-to-date facts quickly.
Summary TLDR
FactFinder is a hybrid QA system that uses an LLM to turn natural-language questions into Cypher queries, executes them on PrimeKG (a medical knowledge graph), and asks an LLM to verbalize the graph results. On a curated benchmark of 69 text-to-Cypher pairs, the system retrieves the correct nodes with up to 77.5% precision (GPT-4o) and outperforms an LLM-only pipeline on answer correctness (in 94.12% of cases) and completeness (96.08%). The repo, prompts, and dataset are published. The method is a practical prototype for time-sensitive, domain-specific factual queries, but it has been tested only on a small, single-domain dataset.
Problem Statement
LLMs can answer natural-language questions but often lack up-to-date, domain-specific facts and can hallucinate. The paper asks: can an LLM be reliably combined with a knowledge graph to retrieve factual answers to life-science questions while keeping the system transparent and verifiable?
Main Contribution
A working hybrid QA pipeline that generates Cypher from text, runs queries on PrimeKG, and verbalizes graph results with an LLM.
A manually created dataset of 69 text-to-Cypher query pairs for medical questions, released together with the code and prompt templates.
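The three stages listed above (generate Cypher, run it on the graph, verbalize the rows) can be sketched as a single function with injected callables. This is a minimal illustration, not the released FactFinder code: `answer_with_graph`, `HybridAnswer`, the `fake_*` stand-ins, the node labels, and the drug names are all hypothetical, chosen only so the sketch runs without an LLM or a database.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class HybridAnswer:
    question: str
    cypher: str      # kept so a reviewer can inspect the generated query
    rows: List[Row]  # raw graph results the answer is grounded in
    answer: str


def answer_with_graph(
    question: str,
    generate_cypher: Callable[[str], str],
    run_cypher: Callable[[str], List[Row]],
    verbalize: Callable[[str, List[Row]], str],
) -> HybridAnswer:
    """Text -> Cypher -> graph rows -> natural-language answer."""
    cypher = generate_cypher(question)  # stage 1: LLM writes the query
    rows = run_cypher(cypher)           # stage 2: knowledge graph executes it
    answer = verbalize(question, rows)  # stage 3: LLM phrases the rows
    return HybridAnswer(question, cypher, rows, answer)


# Hypothetical stand-ins: a real setup would call an LLM and a graph database.
def fake_generate_cypher(question: str) -> str:
    return 'MATCH (d:drug)-[:indication]->(x:disease {name: "asthma"}) RETURN d.name AS name'


def fake_run_cypher(cypher: str) -> List[Row]:
    return [{"name": "drug_a"}, {"name": "drug_b"}]


def fake_verbalize(question: str, rows: List[Row]) -> str:
    names = ", ".join(str(row["name"]) for row in rows)
    return f"According to the graph: {names}."


result = answer_with_graph(
    "Which drugs are indicated for asthma?",
    fake_generate_cypher,
    fake_run_cypher,
    fake_verbalize,
)
print(result.cypher)
print(result.answer)
```

Returning the query and the raw rows alongside the answer is one way to support the transparency goal: a reviewer can check what was asked of the graph and what came back.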
Key Findings
Hybrid KG+LLM retrieval achieved up to 77.5% precision on node retrieval (Table 1).
GPT-4o produced the strongest text-to-Cypher retrieval performance.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Text-to-Cypher node retrieval precision (best) | 77.5% (GPT-4o, EE False) | — | — | 69 text-to-Cypher pairs (PrimeKG) | Table 1, Sec.4.1 | Table 1 |
| Hybrid system correctness vs LLM-only | Hybrid more correct in 94.12% of cases | LLM-only | — | 69 question set | Sec.4.2 - Hybrid vs LLM-only | Sec.4.2 |
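For readers who want to score their own runs, node retrieval precision can be computed as the share of retrieved nodes that appear in the expected set. The sketch below uses the standard set-based definition averaged over questions. Whether the paper aggregates in exactly this way is not stated in this summary, so treat the averaging and the handling of empty results as assumptions. The node IDs are made up.

```python
from typing import Iterable, List, Tuple


def node_precision(retrieved: Iterable[str], expected: Iterable[str]) -> float:
    """Fraction of retrieved nodes that are also in the expected set."""
    retrieved_set = set(retrieved)
    if not retrieved_set:
        # Assumption: a query that returns nothing scores 0 rather than being skipped.
        return 0.0
    return len(retrieved_set & set(expected)) / len(retrieved_set)


def mean_precision(pairs: List[Tuple[List[str], List[str]]]) -> float:
    """Average per-question precision over (retrieved, expected) pairs."""
    scores = [node_precision(retrieved, expected) for retrieved, expected in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up node IDs for three questions: perfect, half right, empty result.
example_pairs = [
    (["n1", "n2"], ["n1", "n2"]),
    (["n1", "n3"], ["n1"]),
    ([], ["n4"]),
]
print(mean_precision(example_pairs))  # 0.5
```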
What To Try In 7 Days
Run the released repo on PrimeKG and reproduce a few example queries.
Compare LLM-only answers to the KG-backed pipeline on your domain questions.
Inspect generated Cypher queries and subgraphs to validate retrieval behavior for critical queries.
Risks & Boundaries
Limitations
Evaluation uses a small, manually curated 69-sample dataset, which limits generalization.
Only a single KG (PrimeKG) and a single domain (medical) were tested.
When Not To Use
For high-stakes clinical decisions without expert review.
If you need broad multi-domain coverage beyond the KG's scope.
Failure Modes
The LLM relies on its internal knowledge and ignores the KG, producing incorrect mappings.
Generated Cypher is syntactically valid but queries the wrong node types because of a schema mismatch.
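The schema-mismatch failure can be caught cheaply before a query is executed by comparing the node labels it mentions against the labels the graph actually defines. The sketch below is a hypothetical pre-flight check, not part of the released code. The regular expression only handles simple patterns such as `(d:drug)` and reads just the first label of a multi-label node; it does not parse backtick-quoted labels or ignore text inside string literals. The label set is illustrative rather than taken from the PrimeKG schema.

```python
import re
from typing import List, Set

# Matches the first label in node patterns such as (d:drug) or (:disease).
# Relationship types like [:indication] are not matched because they follow "[".
NODE_LABEL = re.compile(r"\(\s*\w*\s*:\s*(\w+)")


def unknown_labels(cypher: str, schema_labels: Set[str]) -> List[str]:
    """Return node labels used in the query that the schema does not define."""
    found = NODE_LABEL.findall(cypher)
    return sorted({label for label in found if label not in schema_labels})


# Illustrative label set; a real check would read the labels from the database.
schema = {"drug", "disease", "gene"}

good_query = "MATCH (d:drug)-[:indication]->(x:disease) RETURN d.name"
bad_query = "MATCH (d:drug)-[:indication]->(x:illness) RETURN d.name"

print(unknown_labels(good_query, schema))  # []
print(unknown_labels(bad_query, schema))   # ['illness']
```

Labels in Cypher are case-sensitive, so the comparison is exact on purpose: `Drug` would be flagged against a schema that defines `drug`.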

