An open-source agent that switches between graph and vector search to improve literature review accuracy

July 30, 2025

9 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Aditya Nagori, Ricardo Accorsi Casonatto, Ayush Gautam, Abhinav Manikantha Sai Cheruvu, Rishikesan Kamaleswaran

Why It Matters For Business

Automating literature review with a system that picks the right retrieval mode reduces manual search time and improves the relevance of extracted evidence. This matters for teams that need fast, evidence-grounded summaries across many papers (R&D, clinical review, IP) and want an auditable pipeline.

Summary TLDR

The authors built and open-sourced an agentic Retrieval-Augmented Generation (RAG) system that stores literature in both a Neo4j knowledge graph and a FAISS vector store, and dynamically picks GraphRAG or VectorRAG per query. Instruction tuning plus Direct Preference Optimization (DPO) improves grounding and retrieval. On a synthetic scientific benchmark, the agentic+DPO setup raised vector-store context recall by 0.63 and overall context precision by 0.56 relative to a non-agentic baseline. Code is on GitHub.

Problem Statement

Static RAG pipelines (one fixed retrieval path) fail to serve many scientific information needs. Researchers need a system that combines structured metadata (citations, authors) with full-text semantics, picks the right retrieval strategy per question, and reports uncertainty.

Main Contribution

An open-source Python pipeline that ingests literature from PubMed, ArXiv, and Google Scholar, then builds a Neo4j knowledge graph and a FAISS vector store of full-text chunks.

An agentic orchestration layer (LLaMA-3.3-70B-versatile) that dynamically selects between GraphRAG (Cypher queries) and VectorRAG (BM25 + dense search + reranker) per prompt.

Instruction tuning of the generator (Mistral-7B-Instruct-v0.3) and application of Direct Preference Optimization (DPO) with 15 human preference pairs to improve faithfulness.

A synthetic, balanced benchmark (20 VectorRAG and 20 GraphRAG questions) and bootstrap-based evaluation (12 resamples) reporting uncertainty.

A Dockerizable open-source codebase: https://github.com/Kamaleswaran-Lab/Agentic-Hybrid-Rag

Key Findings

Agentic system with DPO substantially increases vector-store context recall.

Numbers: VS Context Recall +0.63 vs baseline

Overall context precision improves meaningfully under agentic+DPO control.

Numbers: Overall Context Precision +0.56 vs baseline

DPO boosts faithfulness of generator to retrieved context.

Numbers: VS Faithfulness +0.24 vs baseline

Small regressions on some KG metrics were observed after DPO/fine-tuning.

Numbers: KG Precision -0.04, KG Faithfulness -0.03 vs baseline

Reported gains include bootstrap uncertainty estimates to assess stability.

Numbers: Bootstrap n=12 resamples; standard error ≤ 0.10 reported
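
The bootstrap estimate mentioned above can be illustrated with a minimal sketch. It assumes one score per benchmark question and resamples the questions with replacement 12 times; the function name `bootstrap_mean_se`, the seed, and the placeholder scores are assumptions made for illustration and are not taken from the authors' code or results.

```python
import random
import statistics


def bootstrap_mean_se(scores, n_resamples=12, seed=0):
    """Return (mean of resampled means, bootstrap standard error).

    Each resample draws len(scores) items with replacement, and the
    standard error is the sample standard deviation of the resample means.
    """
    if len(scores) == 0 or n_resamples < 2:
        raise ValueError("need at least one score and two resamples")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.fmean(means), statistics.stdev(means)


# Placeholder per-question scores (made up for illustration only).
placeholder_scores = [0.2, 0.9, 0.7, 0.4, 1.0, 0.6, 0.8, 0.5]
mean, se = bootstrap_mean_se(placeholder_scores)
print(f"mean={mean:.3f} se={se:.3f}")
```

With only 12 resamples the standard error is itself a noisy estimate, so it is best read as a rough stability check and not as a tight confidence bound.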

Results

All values are changes relative to the non-agentic RAG baseline.

Metric                       Change vs. baseline
VS Context Recall            +0.63
Overall Context Precision    +0.56
VS Faithfulness              +0.24
VS Precision                 +0.12
KG Answer Relevance          +0.12
Overall Faithfulness         +0.11
KG Context Recall            +0.05
VS Answer Relevance          +0.04
Overall Precision            +0.04
KG Precision                 -0.04
KG Faithfulness              -0.03

Who Should Care

What To Try In 7 Days

Clone the repo and run the pipeline on a small topic using PubMed/ArXiv API keys to see ingest and KG/VS construction.

Compare answers for a handful of domain questions between a static vector-search pipeline and the agentic pipeline to observe differences in retrieved evidence.

Add 10–20 human preference pairs (DPO style) for your domain to quickly test gains in faithfulness.

Agent Features

Memory

  • Retrieval memory only: Neo4j KG (structured metadata) and FAISS VS (embedded full text)

Planning

  • Dynamic selection of retrieval mode per query
  • Few-shot examples guide NL→Cypher translation and tool choice
  • Decompose user query into tool calls
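
The per-query selection of a retrieval mode can be sketched as a small dispatch function. In the described system the choice is made by an LLM planner guided by few-shot examples; the keyword rule below is only a stand-in so the example runs without a model. The names `GRAPH_HINTS`, `choose_tool`, `answer`, `graph_rag`, and `vector_rag` are hypothetical.

```python
from typing import Callable, Dict

# Hypothetical cue words suggesting a metadata-style question
# (authors, citations, venues) that a graph query handles well.
GRAPH_HINTS = ("author", "cite", "citation", "published", "journal", "year", "how many")


def choose_tool(question: str) -> str:
    """Pick a retrieval mode. Stand-in for the LLM planner's decision."""
    q = question.lower()
    return "graph_rag" if any(hint in q for hint in GRAPH_HINTS) else "vector_rag"


def answer(question: str, tools: Dict[str, Callable[[str], str]]) -> str:
    """Route the question to the selected tool and return its output."""
    return tools[choose_tool(question)](question)


# Stub tools; real ones would run a Cypher query or a vector search.
tools = {
    "graph_rag": lambda q: f"[graph] {q}",
    "vector_rag": lambda q: f"[vector] {q}",
}
print(answer("Which authors published the most papers in 2023?", tools))
print(answer("What mechanisms explain sepsis-induced organ failure?", tools))
```

Keeping the router separate from the tools means the keyword rule could later be swapped for a model call without touching the retrieval functions.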

Tool Use

  • Cypher queries over Neo4j (GraphRAG)
  • BM25 + dense embeddings + FAISS + reranker (VectorRAG)
  • Mistral-7B-Instruct for generation
  • Cohere reranker for passage re-ranking
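
The summary does not say how the BM25 and dense result lists are merged before reranking. One common option, shown here purely as an assumption, is reciprocal rank fusion (RRF), which scores each document by the sum of 1 / (k + rank) over the lists it appears in. The function name and the constant `k=60` are illustrative choices and are not taken from the authors' code.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each list contributes 1 / (k + rank) per document, with rank starting at 1.
    Ties are broken by document id so the output is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical ranked ids from a sparse (BM25) and a dense retriever.
bm25_hits = ["a", "b", "c"]
dense_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)  # candidates that a reranker would then reorder
```

Documents found by both retrievers rise to the top of the fused list, which gives a downstream reranker a shorter, higher-quality candidate set to score.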

Frameworks

  • Neo4j
  • FAISS
  • Docker
  • GitHub pipeline

Is Agentic

true

Architectures

  • LLM-based planner (LLaMA-3.3-70B-versatile)
  • Tool-calling workflow (GraphRAG and VectorRAG functions)

Collaboration

  • Supports human-in-the-loop review; encourages oversight for low-confidence outputs

Optimization Features

Token Efficiency

  • Chunking text into 2024-character segments with 50-character overlap (reduces redundant context)
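
A minimal sketch of fixed-size character chunking with overlap, using the 2024-character window and 50-character overlap quoted above as defaults. The function name `chunk_text` is hypothetical, and the actual pipeline may split on different boundaries (for example sentences or tokens).

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 2024, overlap: int = 50) -> List[str]:
    """Split text into character windows where consecutive windows share `overlap` chars."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window start advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 5000-character document for illustration.
document = "".join(chr(97 + i % 26) for i in range(5000))
chunks = chunk_text(document)
print(len(chunks), [len(c) for c in chunks])
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of a small amount of duplicated text in the index.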

Infra Optimization

  • GPU acceleration reduces latency from ~2 minutes on consumer hardware to ~10 seconds on server GPUs

System Optimization

  • Dockerizable pipeline for reproducible deployment

Training Optimization

  • Instruction tuning of Mistral-7B-Instruct
  • Direct Preference Optimization (DPO) with 15 preference pairs
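
For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities under the trained policy and a frozen reference model. The value `beta=0.1` and the example log-probabilities are assumptions chosen for illustration; the summary does not report the authors' hyperparameters.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair: -log(sigmoid(beta * margin)).

    margin = (policy - reference) log-prob gain on the chosen answer
             minus the same gain on the rejected answer.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Illustrative log-probabilities (made up): the policy has shifted
# probability toward the chosen answer relative to the reference.
loss_untrained = dpo_loss(-5.0, -7.0, -5.0, -7.0)  # policy equals reference
loss_improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)   # policy prefers chosen more
print(loss_untrained, loss_improved)
```

The loss starts at log 2 when the policy matches the reference and falls as the policy widens the gap between the preferred and rejected answers, which is why even a small set of pairs can move the generator toward more faithful outputs.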

Inference Optimization

  • Agentic routing to pick the most suitable retriever per query
  • Ensemble retrieval and reranking to prioritize high-quality passages

Reproducibility

Code Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Evaluation relies on a synthetic benchmark of 40 QA pairs; may not reflect complex real-world literature queries.
  • NL→Cypher translation uses few-shot prompting and can mis-translate complex queries.
  • No OCR pipeline: scanned or image-only PDFs are not handled.
  • Dependence on external APIs (PubMed/ArXiv/Google Scholar) introduces variability and rate limits.

When Not To Use

  • Do not rely on this system where absolute formal guarantees are required (legal/clinical decisions) without human verification.
  • Avoid for corpora composed mostly of scanned PDFs until OCR is integrated.
  • Not ideal if you need responses in roughly 10 seconds but cannot run GPU-backed servers (reported latency is ~2 minutes on consumer hardware).

Failure Modes

  • Mistaken Cypher generation leads to wrong or missing KG answers.
  • If retrieval fails (both KG and VS), the generator can hallucinate despite DPO.
  • Reranker biases or limited preference pairs can skew which passages are surfaced.
  • API outages or rate limits stop ingestion and reduce coverage over time.

Core Entities

Models

  • LLaMA-3.3-70B-versatile
  • Mistral-7B-Instruct-v0.3
  • all-MiniLM-L6-v2
  • Cohere rerank-english-v3.0

Metrics

  • Faithfulness
  • Answer relevance
  • Context precision
  • Context recall

Datasets

  • PubMed (via API)
  • ArXiv (via API)
  • Google Scholar (via API)
  • Synthetic RAG benchmark (40 QA pairs)

Benchmarks

  • Custom synthetic VectorRAG/GraphRAG benchmark (20 VS Q, 20 KG Q)