GraphRAG (Neo4j + Llama‑3) retrieves reported drug side effects with near‑perfect accuracy

July 18, 2025 · 6 min

Overview

Decision Snapshot: Needs Validation

Results are strong on a curated SIDER subset and show clear gains from structured retrieval, but real‑world noise and unreported events remain untested.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Shad Nygren, Pinar Avci, Andre Daniels, Reza Rassol, Afshin Beheshti, Diego Galeano

Links

Abstract / PDF / Code

Why It Matters For Business

Graph‑backed retrieval plus a small LLM turns a curated safety database into an almost error‑free lookup service for side‑effect presence, cutting clinician search time and reducing misinformation risk.

Summary TLDR

This paper builds two retrieval-augmented systems to answer binary questions such as “Is X a side effect of Y?” using the SIDER 4.1 drug–side-effect database. Both a vector-based RAG (Pinecone + ada002 embeddings) and a graph-based GraphRAG (Neo4j + Cypher) feed retrieved context to a Llama-3 8B model. On a balanced subset of 19,520 pairs (976 drugs, 3,851 side effects), GraphRAG scored 0.9999 accuracy, RAG with the pairwise data format scored 0.998, and a standalone Llama-3 8B scored 0.529. Code is available on GitHub.
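To make the accuracy figures above concrete, the short sketch below converts them into approximate counts of wrong answers on the 19,520-pair test set. The counts are only estimates, since the reported accuracies are rounded; the helper name `approx_errors` is introduced here for illustration.

```python
# Reported accuracies from the summary above. The derived error counts are
# approximate because the published accuracies are rounded.
TOTAL_PAIRS = 19_520

reported_accuracy = {
    "GraphRAG": 0.9999,
    "RAG (pairwise format)": 0.998,
    "Llama-3 8B standalone": 0.529,
}


def approx_errors(accuracy: float, total: int = TOTAL_PAIRS) -> int:
    """Approximate number of wrong answers implied by an accuracy figure."""
    return round(total * (1.0 - accuracy))


error_counts = {name: approx_errors(acc) for name, acc in reported_accuracy.items()}

for name, errors in error_counts.items():
    print(f"{name}: about {errors} wrong answers out of {TOTAL_PAIRS}")
```

Read this way, the gap between the two retrieval systems is roughly a handful of errors versus a few dozen, while the standalone model is close to chance on a balanced set.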

Problem Statement

Off-the-shelf LLMs hallucinate and lack reliable domain knowledge for pharmacovigilance. Clinicians need fast, accurate answers about whether a drug is known to cause a specific side effect. The paper asks: can retrieval (text or graph) plus a small LLM deliver reliable, binary drug–side-effect retrieval?

Main Contribution

Design and implement two retrieval-augmented pipelines for drug–side-effect lookup, vector RAG and GraphRAG, both using SIDER 4.1 as the knowledge base.

Show that GraphRAG (Neo4j graph + Cypher) plus Llama-3 8B gives near-perfect binary retrieval on a 19,520-pair balanced test set.
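The graph-based lookup described above can be sketched as an exact existence check on a drug-to-side-effect edge. The example below is a minimal illustration, not the authors' implementation: the Cypher string uses hypothetical labels (`Drug`, `SideEffect`) and a hypothetical relationship type (`HAS_SIDE_EFFECT`), and an in-memory class stands in for Neo4j so the sketch runs without a database. The drug names are placeholders, not SIDER entries.

```python
from collections import defaultdict

# Hypothetical Cypher existence query. The labels and relationship type are
# assumptions made for this sketch, not the schema used in the paper.
EXISTENCE_QUERY = (
    "MATCH (d:Drug {name: $drug})-[:HAS_SIDE_EFFECT]->(s:SideEffect {name: $effect}) "
    "RETURN count(*) > 0 AS found"
)


class SideEffectGraph:
    """In-memory stand-in for the graph store: drug -> set of side effects."""

    def __init__(self, pairs):
        self._edges = defaultdict(set)
        for drug, effect in pairs:
            self._edges[drug.lower()].add(effect.lower())

    def has_side_effect(self, drug: str, effect: str) -> bool:
        # Mirrors what EXISTENCE_QUERY would return as `found`.
        return effect.lower() in self._edges.get(drug.lower(), set())


# Toy placeholder data, not taken from SIDER.
toy_pairs = [("drug_a", "nausea"), ("drug_a", "headache"), ("drug_b", "rash")]
graph = SideEffectGraph(toy_pairs)


def answer(drug: str, effect: str) -> str:
    """Return the binary answer that the retrieved fact supports."""
    return "YES" if graph.has_side_effect(drug, effect) else "NO"


print(answer("Drug_A", "Nausea"))
print(answer("drug_b", "nausea"))
```

Because the lookup is an exact match rather than a similarity search, a missing edge yields a definite NO, which is one plausible reason a graph store suits this kind of binary question.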

Key Findings

GraphRAG (Neo4j graph + Llama‑3 8B) achieved near‑perfect retrieval accuracy

Numbers: Accuracy=0.9999; F1=0.9999; Precision=0.9998; Sensitivity=0.9999; Specificity=0.9998

Practical Use: For lookups over a curated side‑effect database, modeling relationships as a graph plus exact retrieval yields almost error‑free YES/NO answers.

Evidence Ref: Results section; Fig. 3
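For readers who want to reproduce the five metrics named above, the sketch below computes them from confusion-matrix counts using the standard definitions. The counts in the example are toy values chosen for illustration; they are not the paper's results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # true positives: known pair, answered YES
    fp: int  # false positives: unknown pair, answered YES
    tn: int  # true negatives: unknown pair, answered NO
    fn: int  # false negatives: known pair, answered NO


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(c: ConfusionCounts) -> dict:
    precision = _ratio(c.tp, c.tp + c.fp)
    sensitivity = _ratio(c.tp, c.tp + c.fn)  # also called recall
    specificity = _ratio(c.tn, c.tn + c.fp)
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Toy counts for illustration only.
toy_metrics = binary_metrics(ConfusionCounts(tp=8, fp=1, tn=9, fn=2))
for name, value in toy_metrics.items():
    print(f"{name}: {value:.4f}")
```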

Data representation strongly affects RAG performance

Numbers: RAG Data B accuracy=0.998 vs Data A accuracy=0.886

Practical Use: Index individual drug–side‑effect pairs (pairwise lines) rather than long aggregated lists when building RAG indexes for factual lookups.

Evidence Ref: Results section; Fig. 3
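The difference between the two data representations can be illustrated with a small sketch that turns the same pairs into aggregated per-drug documents (in the spirit of Data A) and into one line per pair (in the spirit of Data B). The sentence templates and the function names `format_a` and `format_b` are assumptions made for this example; the paper's exact wording may differ, and the drug names are placeholders.

```python
from collections import defaultdict


def format_a(pairs):
    """One aggregated document per drug, listing all of its side effects."""
    by_drug = defaultdict(list)
    for drug, effect in pairs:
        by_drug[drug].append(effect)
    return [
        f"{drug} has the following side effects: {', '.join(effects)}."
        for drug, effects in by_drug.items()
    ]


def format_b(pairs):
    """One short document per drug and side-effect pair."""
    return [f"{drug} has the side effect {effect}." for drug, effect in pairs]


# Toy placeholder data, not taken from SIDER.
toy_pairs = [("drug_a", "nausea"), ("drug_a", "headache"), ("drug_b", "rash")]

docs_a = format_a(toy_pairs)
docs_b = format_b(toy_pairs)

print(len(docs_a), "aggregated documents")
print(len(docs_b), "pairwise documents")
```

With pairwise lines, each indexed chunk carries exactly one fact, so a similarity search for a specific drug and effect has a single short target instead of a long list in which the relevant term is one of many.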

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 0.9999 | | | Balanced SIDER subset (19,520 pairs) | Fig. 3; Results section | Results section; Fig. 3
F1 (GraphRAG) | 0.9999 | | | Balanced SIDER subset | Fig. 3; Results section | Results section; Fig. 3

What To Try In 7 Days

Index a small, curated drug–side‑effect table as pairwise text (Data Format B) and test RAG similarity retrieval.

Load the same pairs into a simple Neo4j graph and run direct existence queries with Cypher.

Add entity extraction and a binary prompt to a small LLM to compare results quickly.
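The binary-prompt step in the list above might look like the sketch below, which builds a YES/NO prompt from retrieved context and parses the model's reply. The prompt wording, the function names, and `stub_llm` are all assumptions for illustration; `stub_llm` is a trivial stand-in so the example runs without a model, and a real experiment would replace it with a call to an actual LLM.

```python
import re
from typing import Optional


def build_binary_prompt(drug: str, effect: str, context_lines: list[str]) -> str:
    """Assemble a YES/NO prompt from retrieved context lines."""
    context = "\n".join(f"- {line}" for line in context_lines) or "- (no matching records)"
    return (
        "Answer with YES or NO only, using just the context below.\n"
        f"Context:\n{context}\n"
        f"Question: Is {effect} a side effect of {drug}?\n"
        "Answer:"
    )


def parse_yes_no(reply: str) -> Optional[bool]:
    """Map a free-text reply to True (YES), False (NO), or None if unclear."""
    match = re.search(r"\b(yes|no)\b", reply.lower())
    return None if match is None else match.group(1) == "yes"


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; answers from the context only."""
    return "NO" if "(no matching records)" in prompt else "YES"


# Toy placeholder data, not taken from SIDER.
hit_prompt = build_binary_prompt("drug_a", "nausea", ["drug_a has the side effect nausea."])
miss_prompt = build_binary_prompt("drug_b", "nausea", [])

print(parse_yes_no(stub_llm(hit_prompt)))
print(parse_yes_no(stub_llm(miss_prompt)))
```

Returning `None` for unparseable replies keeps them visible as a separate failure category instead of silently counting them as NO.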

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluation uses a balanced subset of SIDER 4.1; real-world reports and underreported events are not covered.

The system supports only single‑drug queries; multi‑drug, class, and reverse queries are not yet supported.

When Not To Use

When you need to discover novel or unreported adverse events from noisy real‑world data.

When you need causal inference about whether a drug caused an event, rather than a documented association.

Failure Modes

Missed new or underreported side effects because SIDER lacks post‑marketing signals.

Entity-recognition errors (drug or side‑effect spelling/variant mismatch) leading to false negatives.
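One common mitigation for the spelling and variant mismatches noted above is to normalize names before lookup. The sketch below is an illustrative assumption, not part of the paper's pipeline: `normalize_term` folds case, strips accents, and collapses punctuation, and `SYNONYMS` is a hypothetical hand-maintained alias table.

```python
import re
import unicodedata

# Hypothetical alias table mapping variant names to one canonical form.
SYNONYMS = {"acetaminophen": "paracetamol"}


def normalize_term(term: str) -> str:
    """Fold case, strip accents, and collapse punctuation and whitespace."""
    text = unicodedata.normalize("NFKD", term)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.casefold())
    return " ".join(text.split())


def canonical(term: str) -> str:
    """Normalize a term, then map known aliases to their canonical name."""
    normalized = normalize_term(term)
    return SYNONYMS.get(normalized, normalized)


print(canonical("  Nausea "))
print(canonical("Stevens–Johnson  Syndrome"))
print(canonical("Acetaminophen"))
```

Normalization of this kind only handles surface variation; misspellings and true synonyms outside the alias table would still need fuzzy matching or a controlled vocabulary.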

Core Entities

Models

Llama-3 8B, ChatGPT 3.5, ChatGPT 4

Metrics

Accuracy, F1, precision, sensitivity, specificity

Datasets

SIDER 4.1 (subset of 19,520 pairs)