GraphRAG (Neo4j + Llama‑3) retrieves reported drug side effects with near‑perfect accuracy

July 18, 2025 | 6 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Shad Nygren, Pinar Avci, Andre Daniels, Reza Rassol, Afshin Beheshti, Diego Galeano

Why It Matters For Business

Graph‑backed retrieval plus a small LLM turns a curated safety database into an almost error‑free lookup service for side‑effect presence, cutting clinician search time and reducing misinformation risk.

Summary TLDR

This paper builds two retrieval-augmented systems to answer binary questions like “Is X a side effect of Y?” using the SIDER 4.1 drug–side-effect database. A vector-based RAG (Pinecone + ada-002 embeddings) and a graph-based GraphRAG (Neo4j + Cypher) each feed a Llama-3 8B model. On a balanced subset of 19,520 pairs (976 drugs, 3,851 side effects), GraphRAG scored 0.9999 accuracy and RAG with the pairwise format scored 0.998, while a standalone Llama-3 8B scored 0.529. Code is available on GitHub.

Problem Statement

Off-the-shelf LLMs hallucinate and lack reliable domain knowledge for pharmacovigilance. Clinicians need fast, accurate answers about whether a drug is known to cause a specific side effect. The paper asks: can retrieval (text or graph) plus a small LLM deliver reliable, binary drug–side-effect retrieval?

Main Contribution

Design and implement two retrieval-augmented pipelines for drug–side-effect lookup, vector RAG and GraphRAG, both using SIDER 4.1 as the knowledge base.

Show that GraphRAG (Neo4j graph + Cypher) plus Llama-3 8B gives near-perfect binary retrieval on a 19,520-pair balanced test set.

Demonstrate that representation choices matter: pairwise text format (Data Format B) vastly outperforms aggregated lists (Data Format A) in RAG.
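The difference between the two representations can be illustrated with a toy example. This is a minimal sketch: the sentence templates, the function names, and the placeholder drug names are assumptions for illustration, since this summary does not give the exact wording the paper used for Data Format A or B.

```python
from collections import defaultdict

# Placeholder pairs for illustration only; not taken from SIDER.
PAIRS = [("drug_a", "nausea"), ("drug_a", "headache"), ("drug_b", "rash")]


def format_a(pairs):
    """Aggregated style: one document per drug, listing all of its side effects."""
    by_drug = defaultdict(list)
    for drug, effect in pairs:
        by_drug[drug].append(effect)
    return [
        f"{drug} has the following side effects: {', '.join(effects)}."
        for drug, effects in sorted(by_drug.items())
    ]


def format_b(pairs):
    """Pairwise style: one short document per (drug, side effect) pair."""
    return [f"{drug} may cause {effect}." for drug, effect in pairs]


for doc in format_a(PAIRS) + format_b(PAIRS):
    print(doc)
```

In the pairwise style each retrievable document mentions exactly one drug and one side effect, so a query about a single pair has one obvious document to match. In the aggregated style the relevant term can sit inside a long list of unrelated effects.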

Key Findings

GraphRAG (Neo4j graph + Llama‑3 8B) achieved near‑perfect retrieval accuracy

Numbers: Accuracy=0.9999; F1=0.9999; Precision=0.9998; Sensitivity=0.9999; Specificity=0.9998

Data representation strongly affects RAG performance

Numbers: RAG Data Format B accuracy=0.998 vs Data Format A accuracy=0.886

Standalone Llama‑3 8B performs poorly without retrieval

Numbers: Accuracy=0.529; F1=0.164; Sensitivity=0.092

Large hosted LLMs still underperform without domain retrieval

Numbers: ChatGPT 3.5 accuracy ≈0.55; ChatGPT 4 accuracy ≈0.63 (subset test)

Evaluation used a balanced, constrained subset of SIDER 4.1

Numbers: 19,520 pairs; 976 drugs; 3,851 side effects
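A balanced evaluation set of this kind can be sketched as follows. This assumes "balanced" means equal numbers of listed and unlisted pairs, and it draws negatives uniformly at random from drug and side-effect combinations that are absent from the known set. How the paper actually sampled its negatives is not described in this summary, so the function name `balanced_pairs` and the sampling strategy are illustrative assumptions.

```python
import random


def balanced_pairs(known, seed=0):
    """Return shuffled (drug, effect, label) rows with as many negatives as positives."""
    rng = random.Random(seed)
    known = set(known)
    drugs = sorted({drug for drug, _ in known})
    effects = sorted({effect for _, effect in known})

    # Guard against an endless loop when too few unlisted combinations exist.
    if len(drugs) * len(effects) - len(known) < len(known):
        raise ValueError("not enough unlisted pairs to balance the set")

    negatives = set()
    while len(negatives) < len(known):
        candidate = (rng.choice(drugs), rng.choice(effects))
        if candidate not in known:  # label 0 means "not listed", not "proven safe"
            negatives.add(candidate)

    rows = [(d, e, 1) for d, e in sorted(known)]
    rows += [(d, e, 0) for d, e in sorted(negatives)]
    rng.shuffle(rows)
    return rows


# Placeholder pairs for illustration only.
KNOWN = [("drug_a", "nausea"), ("drug_b", "rash"), ("drug_c", "headache")]
ROWS = balanced_pairs(KNOWN)
```

A negative label here only means the pair is not listed in the source table, which is the same boundary the Limitations section describes for the real evaluation.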

Results

Accuracy (GraphRAG): 0.9999

F1 (GraphRAG): 0.9999

Accuracy (RAG, Data Format B): 0.998 (baseline: RAG Data Format A)

Accuracy (standalone Llama-3 8B): 0.529

Accuracy (ChatGPT 3.5 / ChatGPT 4, subset test): ≈0.55 / ≈0.63 (baseline: standalone Llama-3 8B)
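For reference, the reported metrics all derive from the four confusion-matrix counts of a binary classifier. The sketch below uses the standard definitions; the function name `binary_metrics` and the example counts are illustrative and are not taken from the paper.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # also called recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Toy counts for illustration only.
METRICS = binary_metrics(tp=8, fp=2, tn=8, fn=2)
print(METRICS)
```

On an exactly balanced set, accuracy is the mean of sensitivity and specificity. Under that assumption, the standalone Llama-3 8B figures (accuracy 0.529, sensitivity 0.092) imply a specificity of roughly 0.966, meaning the model answered "no" to most questions.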

Who Should Care

What To Try In 7 Days

Index a small, curated drug–side‑effect table as pairwise text (Data Format B) and test RAG similarity retrieval.
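A quick way to try this step without any external service is sketched below. It swaps in a bag-of-words cosine similarity for the embedding model and an in-memory list for the vector index, so it is a stand-in for the Pinecone + ada-002 setup rather than a reproduction of it. The sentence template and the placeholder pairs are assumptions.

```python
import re
from collections import Counter
from math import sqrt

# Placeholder pairs for illustration only; not taken from SIDER.
PAIRS = [("drug_a", "nausea"), ("drug_a", "headache"), ("drug_b", "rash")]
DOCS = [f"{drug} may cause {effect}" for drug, effect in PAIRS]


def bag_of_words(text):
    """Lowercase token counts; a crude substitute for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, docs, k=1):
    """Return the k documents most similar to the question."""
    query = bag_of_words(question)
    ranked = sorted(docs, key=lambda doc: cosine(query, bag_of_words(doc)), reverse=True)
    return ranked[:k]


print(retrieve("Is headache a side effect of drug_a?", DOCS))
```

Once the toy version behaves as expected, the `bag_of_words` and `retrieve` functions are the two pieces to replace with a real embedding model and vector store.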

Load the same pairs into a simple Neo4j graph and run direct existence queries with Cypher.
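One possible shape for the graph step is sketched below. The node labels `Drug` and `SideEffect`, the relationship type `HAS_SIDE_EFFECT`, and the `name` property are assumptions, since this summary does not give the paper's schema. The helper functions expect an object with the same `run(query, **params)` and `.single()` interface as a Neo4j Python driver session; a small in-memory fake is included so the sketch runs without a database.

```python
# Cypher statements; the schema (labels, relationship type, property) is assumed.
LOAD_QUERY = (
    "MERGE (d:Drug {name: $drug}) "
    "MERGE (s:SideEffect {name: $effect}) "
    "MERGE (d)-[:HAS_SIDE_EFFECT]->(s)"
)
EXISTS_QUERY = (
    "MATCH (d:Drug {name: $drug})-[:HAS_SIDE_EFFECT]->(s:SideEffect {name: $effect}) "
    "RETURN count(*) > 0 AS known"
)


def load_pairs(session, pairs):
    """Create one drug node, one side-effect node, and one edge per pair."""
    for drug, effect in pairs:
        session.run(LOAD_QUERY, drug=drug, effect=effect)


def is_known_side_effect(session, drug, effect):
    """Run the existence query and return True if the edge is present."""
    record = session.run(EXISTS_QUERY, drug=drug, effect=effect).single()
    return bool(record["known"])


class FakeResult:
    def __init__(self, record):
        self._record = record

    def single(self):
        return self._record


class FakeSession:
    """In-memory stand-in for a driver session; it does not parse Cypher."""

    def __init__(self):
        self.edges = set()

    def run(self, query, **params):
        if query == LOAD_QUERY:
            self.edges.add((params["drug"], params["effect"]))
            return FakeResult(None)
        return FakeResult({"known": (params["drug"], params["effect"]) in self.edges})


session = FakeSession()
load_pairs(session, [("drug_a", "nausea")])
print(is_known_side_effect(session, "drug_a", "nausea"))
```

Because the lookup is an exact edge match rather than a similarity search, a correctly extracted pair either exists in the graph or does not, which is consistent with the near-perfect GraphRAG scores reported above.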

Add entity extraction and a binary prompt to a small LLM to compare results quickly.
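The last step can be prototyped with a few small helpers, sketched here under stated assumptions. The regular expression only handles the exact question shape “Is X a side effect of Y?”, the prompt wording is hypothetical, and the paper's own extraction method and prompt are not described in this summary. The call to the LLM itself is left out.

```python
import re

# Matches only questions shaped like "Is X a side effect of Y?" (assumed template).
QUESTION_RE = re.compile(
    r"^\s*is\s+(?P<effect>.+?)\s+a\s+side\s+effect\s+of\s+(?P<drug>.+?)\s*\?*\s*$",
    re.IGNORECASE,
)


def extract_entities(question):
    """Return (drug, effect) in lowercase, or None if the question does not match."""
    match = QUESTION_RE.match(question)
    if match is None:
        return None
    return match.group("drug").lower(), match.group("effect").lower()


def build_prompt(question, context):
    """Wrap retrieved context and the question in a strict yes/no instruction."""
    return (
        "Use only the context to answer. Reply with exactly YES or NO.\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


def parse_answer(reply):
    """Map a model reply to True (yes), False (no), or None if it is neither."""
    match = re.match(r"\W*(yes|no)\b", reply, re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"


question = "Is nausea a side effect of Drug_A?"
print(extract_entities(question))
print(build_prompt(question, "drug_a may cause nausea"))
```

Returning `None` for unparseable replies keeps malformed outputs separate from genuine "no" answers when the two pipelines are compared.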

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation uses a balanced subset of SIDER 4.1; real-world reports and underreported events are not covered.
  • System only supports single‑drug queries; no multi‑drug, class, or reverse queries yet.
  • LLM output was constrained to binary responses for evaluation rather than richer explanations.
  • Potential mismatches from drug name variants, brand names, or typos are not fully addressed.
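One possible mitigation for the name-mismatch limitation is to normalize drug names before lookup. The sketch below is not from the paper: the canonical names, the alias table, and the similarity cutoff are all hypothetical, and a real system would need a curated synonym source.

```python
import difflib
import re

# Hypothetical vocabulary and brand-name aliases, for illustration only.
CANONICAL = ["exampledrug", "otherdrug"]
ALIASES = {"brandex": "exampledrug"}


def normalize_drug(name, cutoff=0.85):
    """Map a raw drug mention to a canonical name, or None if nothing is close."""
    key = re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()
    key = ALIASES.get(key, key)  # resolve known brand names first
    if key in CANONICAL:
        return key
    # Fall back to fuzzy matching to absorb small typos.
    close = difflib.get_close_matches(key, CANONICAL, n=1, cutoff=cutoff)
    return close[0] if close else None


print(normalize_drug("  BrandEx "))
print(normalize_drug("Exampledurg"))
```

A high cutoff is a deliberate choice here: in a safety context, returning `None` for an uncertain match is preferable to silently mapping a query to the wrong drug.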

When Not To Use

  • When you need to discover novel or unreported adverse events from noisy real‑world data.
  • When you need causal inference about whether a drug caused an event, rather than a documented association.
  • When multi‑drug interactions or class‑level summaries are needed.

Failure Modes

  • Missed new or underreported side effects because SIDER lacks post‑marketing signals.
  • Entity-recognition errors (drug or side‑effect spelling/variant mismatch) leading to false negatives.
  • Index or database corruption could produce incorrect existence checks.
  • Binary output hides uncertainty and nuanced evidence found in literature.

Core Entities

Models

  • Llama-3 8B
  • ChatGPT 3.5
  • ChatGPT 4

Metrics

  • Accuracy
  • F1
  • Precision
  • Sensitivity
  • Specificity

Datasets

  • SIDER 4.1 (subset of 19,520 pairs)