GraphRAG (Neo4j + Llama‑3) retrieves reported drug side effects with near‑perfect accuracy

Overview

Decision Snapshot: Needs Validation

Results are strong on a curated SIDER subset and show clear gains from structured retrieval, but real‑world noise and unreported events remain untested.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Shad Nygren, Pinar Avci, Andre Daniels, Reza Rassol, Afshin Beheshti, Diego Galeano

Links

Abstract / PDF / Code

Why It Matters For Business

Graph‑backed retrieval plus a small LLM turns a curated safety database into an almost error‑free lookup service for side‑effect presence, cutting clinician search time and reducing misinformation risk.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, CTO

Summary TLDR

This paper builds two retrieval-augmented systems to answer binary questions like “Is X a side effect of Y?” using the SIDER 4.1 drug–side-effect database. A vector-based RAG (Pinecone + ada-002 embeddings) and a graph-based GraphRAG (Neo4j + Cypher) each feed a Llama-3 8B model. On a balanced subset of 19,520 pairs (976 drugs, 3,851 side effects), GraphRAG reached 0.9999 accuracy and RAG with the pairwise data format (Data B) reached 0.998, while a standalone Llama-3 8B reached 0.529. Code is available on GitHub.

Problem Statement

Off-the-shelf LLMs hallucinate and lack reliable domain knowledge for pharmacovigilance. Clinicians need fast, accurate answers about whether a drug is known to cause a specific side effect. The paper asks: can retrieval (text or graph) plus a small LLM deliver reliable, binary drug–side-effect retrieval?

Main Contribution

Design and implement two retrieval-augmented pipelines for drug-side-effect lookup: vector RAG and GraphRAG using SIDER 4.1 as the knowledge base.

Show that GraphRAG (Neo4j graph + Cypher) plus Llama-3 8B gives near-perfect binary retrieval on a 19,520-pair balanced test set.
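The answer step that both pipelines share (retrieved context in, YES/NO out) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the prompt wording and the helper names `build_binary_prompt` and `parse_yes_no` are assumptions, the drug and side-effect names are placeholders, and the call to Llama-3 8B itself is left out.

```python
import re
from typing import Optional


def build_binary_prompt(drug: str, effect: str, context: str) -> str:
    """Assemble a YES/NO prompt from retrieved context (hypothetical wording)."""
    return (
        "Answer using only the context below.\n"
        f"Context: {context}\n"
        f"Question: Is {effect} a side effect of {drug}?\n"
        "Reply with exactly one word: YES or NO."
    )


def parse_yes_no(reply: str) -> Optional[bool]:
    """Map a model reply to True (YES), False (NO), or None if neither appears."""
    match = re.search(r"\b(yes|no)\b", reply, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"


# Placeholder names; the context string would come from Pinecone or Neo4j.
prompt = build_binary_prompt("drugA", "headache", "drugA may cause headache.")
```

Returning `None` for an unparseable reply keeps ambiguous model output from being silently counted as a NO.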

Key Findings

GraphRAG (Neo4j graph + Llama‑3 8B) achieved near‑perfect retrieval accuracy

Numbers: Accuracy=0.9999; F1=0.9999; Precision=0.9998; Sens=0.9999; Spec=0.9998

Practical Use: For lookups over a curated side‑effect database, modeling relationships as a graph plus exact retrieval yields almost error‑free YES/NO answers.

Evidence Ref: Results section; Fig. 3
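For readers less familiar with the five metrics listed above, the standard definitions can be written out from confusion-matrix counts. The sketch below uses made-up counts purely for illustration; they are not the paper's confusion matrix, and `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, precision, sensitivity, specificity, and F1."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # recall, true-positive rate
    specificity = tn / (tn + fp)  # true-negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Illustrative counts only (assumption), on a balanced set of 200 pairs.
m = binary_metrics(tp=90, fp=5, tn=95, fn=10)
```

On a balanced test set, accuracy is the mean of sensitivity and specificity, which is why the three values sit so close together in the reported numbers.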

Data representation strongly affects RAG performance

Numbers: RAG Data B accuracy=0.998 vs Data A accuracy=0.886

Practical Use: Index individual drug–side‑effect pairs (pairwise lines) rather than long aggregated lists when building RAG indexes for factual lookups.

Evidence Ref: Results section; Fig. 3
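The difference between the two data formats can be illustrated with a short sketch. The exact sentence templates used for Data A and Data B are not given in this summary, so the wording below, the helper names, and the placeholder drug names are all assumptions.

```python
from typing import Dict, List


def to_aggregated(drug: str, effects: List[str]) -> str:
    """One long document per drug (aggregated style; hypothetical template)."""
    return f"{drug} may cause: {', '.join(effects)}."


def to_pairwise(drug: str, effects: List[str]) -> List[str]:
    """One short document per drug/side-effect pair (pairwise style)."""
    return [f"{drug} may cause {effect}." for effect in effects]


# Placeholder table; a real index would be built from SIDER 4.1 rows.
table: Dict[str, List[str]] = {
    "drugA": ["headache", "nausea"],
    "drugB": ["rash"],
}

aggregated_docs = [to_aggregated(d, fx) for d, fx in table.items()]
pairwise_docs = [line for d, fx in table.items() for line in to_pairwise(d, fx)]
```

With pairwise documents, each embedded chunk carries exactly one fact, so a similarity search for a specific drug and side effect is not diluted by a long list of unrelated effects.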

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy (GraphRAG)	0.9999	n/a	n/a	Balanced SIDER subset (19,520 pairs)	Fig. 3; Results section	Results section; Fig. 3
F1 (GraphRAG)	0.9999	n/a	n/a	Balanced SIDER subset (19,520 pairs)	Fig. 3; Results section	Results section; Fig. 3

What To Try In 7 Days

Index a small, curated drug–side‑effect table as pairwise text (Data Format B) and test RAG similarity retrieval.

Load the same pairs into a simple Neo4j graph and run direct existence queries with Cypher.

Add entity extraction and a binary prompt to a small LLM to compare results quickly.
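The graph step in the list above could look roughly like the sketch below. The node labels, the `HAS_SIDE_EFFECT` relationship type, and the `name` property are a hypothetical schema, not necessarily the one used in the paper; the function expects an open session from the official `neo4j` Python driver.

```python
# Hypothetical schema: (:Drug {name})-[:HAS_SIDE_EFFECT]->(:SideEffect {name})
EXISTS_QUERY = (
    "MATCH (d:Drug {name: $drug})-[:HAS_SIDE_EFFECT]->"
    "(s:SideEffect {name: $effect}) "
    "RETURN count(*) > 0 AS found"
)


def side_effect_exists(session, drug: str, effect: str) -> bool:
    """Return True if the drug/side-effect edge exists in the graph."""
    # count(*) always yields exactly one row, so single() is safe here.
    record = session.run(EXISTS_QUERY, drug=drug, effect=effect).single()
    return bool(record["found"])


# Example wiring (URI and credentials are placeholders):
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
# with driver.session() as session:
#     print(side_effect_exists(session, "drugA", "headache"))
```

Passing the names as query parameters rather than formatting them into the Cypher string avoids injection problems and lets Neo4j cache the query plan.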

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/diegogalpy/RAGbased-models-for-drug-side-effect-retrieval

Risks & Boundaries

Limitations

Evaluation uses a balanced subset of SIDER 4.1; real-world reports and underreported events are not covered.

System only supports single‑drug queries; no multi‑drug, class, or reverse queries yet.

When Not To Use

When you need to discover novel or unreported adverse events from noisy real‑world data.

When you need causal inference about whether a drug caused an event, rather than a documented association.

Failure Modes

Missed new or underreported side effects because SIDER lacks post‑marketing signals.

Entity-recognition errors (drug or side‑effect spelling/variant mismatch) leading to false negatives.
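One common way to reduce spelling and variant mismatches is to normalize extracted names against the known vocabulary before querying. The sketch below is an assumption about how that could be done, not something described in the paper; `resolve_name` is a hypothetical helper and the cutoff value is arbitrary.

```python
import difflib
from typing import Iterable, Optional


def resolve_name(raw: str, vocabulary: Iterable[str], cutoff: float = 0.85) -> Optional[str]:
    """Map an extracted name to a canonical vocabulary entry, or None."""
    known = {name.casefold(): name for name in vocabulary}
    key = raw.strip().casefold()
    if key in known:
        # Exact match after trimming whitespace and ignoring case.
        return known[key]
    # Fall back to the single closest spelling above the similarity cutoff.
    close = difflib.get_close_matches(key, list(known), n=1, cutoff=cutoff)
    return known[close[0]] if close else None


drug_vocabulary = ["Ibuprofen", "Aspirin"]
resolved = resolve_name("ibuprofin", drug_vocabulary)
```

Fuzzy matching trades false negatives for possible false positives, and look-alike drug names make that trade risky, so a high cutoff or a confirmation step is probably wise in a safety setting.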

Core Entities

Models

Llama-3 8B, ChatGPT 3.5, ChatGPT 4

Metrics

Accuracy, F1, precision, sensitivity, specificity

Datasets

SIDER 4.1 (subset of 19,520 pairs)
