Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
Automating literature review with a system that picks the right retrieval mode reduces manual search time and improves the relevance of extracted evidence. This matters for teams that need fast, evidence-grounded summaries across many papers (R&D, clinical review, IP) and want an auditable pipeline.
Summary TLDR
The authors built and open-sourced an agentic Retrieval-Augmented Generation (RAG) system that stores literature in both a Neo4j knowledge graph and a FAISS vector store and dynamically chooses GraphRAG or VectorRAG for each query. Instruction tuning plus Direct Preference Optimization (DPO) improves grounding and retrieval. On a synthetic scientific benchmark, the agentic+DPO setup raised vector-store context recall by 0.63 and overall context precision by 0.56 relative to a non-agentic baseline. Code is on GitHub.
Problem Statement
Static RAG pipelines (one fixed retrieval path) miss many scientific information needs. Researchers need a system that combines structured metadata (citations, authors) and full-text semantics, picks the right retrieval strategy per question, and reports uncertainty.
Main Contribution
An open-source Python pipeline that ingests literature from PubMed, ArXiv, and Google Scholar, then builds a Neo4j knowledge graph and a FAISS vector store of full-text chunks.
An agentic orchestration layer (LLaMA-3.3-70B-versatile) that dynamically selects between GraphRAG (Cypher queries) and VectorRAG (BM25 + dense search + reranker) per prompt.
Instruction tuning of the generator (Mistral-7B-Instruct-v0.3) and application of Direct Preference Optimization (DPO) with 15 human preference pairs to improve faithfulness.
A synthetic, balanced benchmark (20 VectorRAG and 20 GraphRAG questions) and bootstrap-based evaluation (12 resamples) reporting uncertainty.
A Dockerizable open-source codebase: https://github.com/Kamaleswaran-Lab/Agentic-Hybrid-Rag
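The per-query choice between GraphRAG and VectorRAG can be illustrated with a minimal sketch. In the described system the planner is an LLM (LLaMA-3.3-70B-versatile); the keyword heuristic below is only a stand-in so the example runs offline. The names `GRAPH_CUES`, `Route`, `choose_route`, and `answer` are hypothetical and do not come from the repository.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical cue words that hint at a metadata question (authors, citations, dates).
# The real planner is an LLM; this keyword list is an illustrative stand-in only.
GRAPH_CUES = ("author", "cited", "citation", "published", "journal", "how many", "year")


@dataclass
class Route:
    tool: str    # "graph_rag" or "vector_rag"
    reason: str  # short explanation, useful for an audit trail


def choose_route(question: str) -> Route:
    """Pick a retrieval mode for one question."""
    q = question.lower()
    hits = [cue for cue in GRAPH_CUES if cue in q]
    if hits:
        return Route("graph_rag", f"metadata cues found: {hits}")
    return Route("vector_rag", "no metadata cues; use full-text search")


def answer(question: str, tools: Dict[str, Callable[[str], str]]) -> str:
    """Dispatch the question to whichever tool the router selected."""
    route = choose_route(question)
    return tools[route.tool](question)
```

Recording the `reason` alongside each answer is one simple way to keep the routing decision auditable.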
Key Findings
Agentic system with DPO substantially increases vector-store context recall.
Overall context precision improves meaningfully under agentic+DPO control.
DPO boosts the generator's faithfulness to the retrieved context.
Small regressions on some KG metrics were observed after DPO/fine-tuning.
Reported gains include bootstrap uncertainty estimates to assess stability.
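A bootstrap estimate of this kind can be sketched as follows, assuming one metric score per benchmark question and 12 resamples as in the summary. The function name `bootstrap_mean`, the fixed seed, and the choice to report the standard deviation of the resampled means are assumptions for illustration, not details taken from the paper.

```python
import random
import statistics
from typing import List, Tuple


def bootstrap_mean(scores: List[float], n_resamples: int = 12, seed: int = 0) -> Tuple[float, float]:
    """Resample per-question scores with replacement and summarize the resampled means.

    Returns (mean of resampled means, standard deviation of resampled means).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if n_resamples < 2:
        raise ValueError("need at least two resamples to estimate spread")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(scores) for _ in scores]  # same size as the original set
        means.append(statistics.fmean(sample))
    return statistics.fmean(means), statistics.stdev(means)
```

With only 12 resamples the spread estimate is itself noisy, so it is best read as a rough stability check rather than a precise confidence interval.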
Results
Metrics charted (numeric values not preserved in this text): VS Context Recall, Overall Context Precision, VS Faithfulness, VS Precision, KG Answer Relevance, Overall Faithfulness, KG Context Recall, VS Answer Relevance, Overall Precision, KG Precision, KG Faithfulness.
Who Should Care
What To Try In 7 Days
Clone the repo and run the pipeline on a small topic using PubMed/ArXiv API keys to see ingest and KG/VS construction.
Compare answers for a handful of domain questions between a static vector-search pipeline and the agentic pipeline to observe differences in retrieved evidence.
Add 10–20 human preference pairs (DPO style) for your domain to quickly test gains in faithfulness.
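For the preference-pair step, a small helper for collecting and validating pairs may be useful. The `prompt` / `chosen` / `rejected` field names follow a common convention in DPO tooling; whether the repository expects exactly this layout is an assumption, and `PreferencePair`, `validate`, and `to_jsonl` are hypothetical names.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass
class PreferencePair:
    prompt: str    # question plus the retrieved context shown to the generator
    chosen: str    # answer the reviewer preferred (grounded in the context)
    rejected: str  # answer the reviewer rejected (e.g. unsupported claims)


def validate(pair: PreferencePair) -> None:
    """Reject pairs that carry no usable preference signal."""
    if not pair.prompt.strip():
        raise ValueError("prompt is empty")
    if not pair.chosen.strip() or not pair.rejected.strip():
        raise ValueError("both answers must be non-empty")
    if pair.chosen.strip() == pair.rejected.strip():
        raise ValueError("chosen and rejected answers are identical")


def to_jsonl(pairs: Iterable[PreferencePair]) -> str:
    """Serialize validated pairs, one JSON object per line."""
    lines = []
    for pair in pairs:
        validate(pair)
        lines.append(json.dumps(asdict(pair), ensure_ascii=False))
    return "\n".join(lines)
```

Keeping the retrieved context inside `prompt` makes each pair self-contained, so a reviewer can judge faithfulness without re-running retrieval.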
Agent Features
Memory
- Retrieval memory only: Neo4j KG (structured metadata) and FAISS VS (embedded full text)
Planning
- Dynamic selection of retrieval mode per query
- Few-shot examples guide NL→Cypher translation and tool choice
- Decompose user query into tool calls
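Few-shot NL→Cypher prompting can be sketched as plain prompt assembly. The graph schema, the two example pairs, and the function `build_cypher_prompt` below are all hypothetical; the actual schema and examples used by the authors are not given in this summary.

```python
from typing import List, Tuple

# Hypothetical schema and examples, for illustration only.
SCHEMA = "(:Paper {title, year})-[:WRITTEN_BY]->(:Author {name})"

FEW_SHOT_EXAMPLES: List[Tuple[str, str]] = [
    (
        "Which papers did Jane Doe write?",
        "MATCH (p:Paper)-[:WRITTEN_BY]->(a:Author {name: 'Jane Doe'}) RETURN p.title",
    ),
    (
        "How many papers were published in 2020?",
        "MATCH (p:Paper {year: 2020}) RETURN count(p)",
    ),
]


def build_cypher_prompt(
    question: str,
    examples: List[Tuple[str, str]] = FEW_SHOT_EXAMPLES,
    schema: str = SCHEMA,
) -> str:
    """Assemble a few-shot prompt that ends where the model should write the query."""
    lines = [
        "Translate the question into a read-only Cypher query.",
        f"Graph schema: {schema}",
        "",
    ]
    for example_question, example_cypher in examples:
        lines += [f"Question: {example_question}", f"Cypher: {example_cypher}", ""]
    lines += [f"Question: {question.strip()}", "Cypher:"]
    return "\n".join(lines)
```

Stating the schema explicitly in the prompt gives the model the node labels and property names it is allowed to use, which is one plausible way to reduce mistranslated queries.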
Tool Use
- Cypher queries over Neo4j (GraphRAG)
- BM25 + dense embeddings + FAISS + reranker (VectorRAG)
- Mistral-7B-Instruct for generation
- Cohere reranker for passage re-ranking
Frameworks
- Neo4j
- FAISS
- Docker
- GitHub pipeline
Is Agentic
true
Architectures
- LLM-based planner (LLaMA-3.3-70B-versatile)
- Tool-calling workflow (GraphRAG and VectorRAG functions)
Collaboration
- Supports human-in-the-loop review; encourages oversight for low-confidence outputs
Optimization Features
Token Efficiency
- Chunking text into 2024-character segments with a 50-character overlap (the small overlap keeps duplicated context low while preserving continuity across chunk boundaries)
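Character-based chunking with overlap can be sketched in a few lines. The defaults mirror the sizes stated above; the function name `chunk_text` and the exact boundary handling are assumptions, and the repository may split text differently (for example on sentence or token boundaries).

```python
from typing import List


def chunk_text(text: str, size: int = 2024, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # this chunk already reaches the end
            break
    return chunks
```

Each chunk after the first begins with the last 50 characters of the previous one, so a sentence cut at a boundary still appears intact in at least one chunk when it is shorter than the overlap.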
Infra Optimization
- GPU acceleration reduces latency from ~2 minutes on consumer hardware to ~10 seconds on server GPUs
System Optimization
- Dockerizable pipeline for reproducible deployment
Training Optimization
- Instruction tuning of Mistral-7B-Instruct
- Direct Preference Optimization (DPO) with 15 preference pairs
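For reference, the standard per-pair DPO objective is the negative log-sigmoid of a scaled log-probability margin between the chosen and rejected answers, measured relative to a frozen reference model. The sketch below computes it from four sequence log-probabilities; the argument names and the value `beta=0.1` are illustrative assumptions, not settings reported for this system.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not taken from the paper
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls as the policy assigns relatively more probability to the chosen answer than the reference model does, which is how a handful of preference pairs can nudge the generator toward grounded answers.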
Inference Optimization
- Agentic routing to pick the most suitable retriever per query
- Ensemble retrieval and reranking to prioritize high-quality passages
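The summary does not say how the BM25 and dense result lists are combined before reranking. One common choice is reciprocal rank fusion (RRF), sketched below as an assumption rather than the authors' method; `reciprocal_rank_fusion` and the constant `k=60` are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one list.

    Each document scores 1 / (k + rank) per list it appears in (rank starts at 1).
    Ties are broken alphabetically so the output is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

For example, fusing a BM25 ranking `["a", "b", "c"]` with a dense ranking `["b", "c", "d"]` places `b` first because it ranks highly in both lists. The fused list would then be passed to the reranker, which makes the final ordering.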
Reproducibility
Code Available
- yes: https://github.com/Kamaleswaran-Lab/Agentic-Hybrid-Rag
Open Source Status
- yes
Risks & Boundaries
Limitations
- Evaluation relies on a synthetic benchmark of 40 QA pairs; may not reflect complex real-world literature queries.
- NL→Cypher translation uses few-shot prompting and can mis-translate complex queries.
- No OCR pipeline: scanned or image-only PDFs are not handled.
- Dependence on external APIs (PubMed/ArXiv/Google Scholar) introduces variability and rate limits.
When Not To Use
- Do not rely on this system where absolute formal guarantees are required (legal/clinical decisions) without human verification.
- Avoid for corpora composed mostly of scanned PDFs until OCR is integrated.
- Not ideal if you need roughly 10-second latency but cannot run GPU-backed servers; reported latency on consumer hardware is about 2 minutes.
Failure Modes
- Mistaken Cypher generation leads to wrong or missing KG answers.
- If retrieval fails (both KG and VS), the generator can hallucinate despite DPO.
- Reranker biases or limited preference pairs can skew which passages are surfaced.
- API outages or rate limits stop ingestion and reduce coverage over time.
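One possible mitigation for the Cypher failure mode, not described in the source, is to screen generated queries before running them and fall back to vector search when the graph path yields nothing. The sketch below is a conservative illustration: `is_safe_read_query` and `retrieve_with_fallback` are hypothetical names, and the callables passed in stand for whatever query generator and retrievers are in use.

```python
import re
from typing import Callable, List, Tuple

# Clauses that modify the graph; a generated retrieval query should contain none of them.
WRITE_CLAUSES = re.compile(r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP)\b", re.IGNORECASE)
RETURN_CLAUSE = re.compile(r"\bRETURN\b", re.IGNORECASE)


def is_safe_read_query(cypher: str) -> bool:
    """Conservative check: non-empty, no write clauses, and has a RETURN clause.

    This is a keyword screen, not a parser; it may reject valid queries whose
    string literals happen to contain a write keyword.
    """
    text = cypher.strip()
    if not text or WRITE_CLAUSES.search(text):
        return False
    return RETURN_CLAUSE.search(text) is not None


def retrieve_with_fallback(
    question: str,
    generate_cypher: Callable[[str], str],
    run_graph: Callable[[str], List[dict]],
    run_vector: Callable[[str], List[str]],
) -> Tuple[str, list]:
    """Try the graph path first; use vector search if the query is unsafe or returns no rows."""
    cypher = generate_cypher(question)
    if is_safe_read_query(cypher):
        rows = run_graph(cypher)
        if rows:
            return "graph", rows
    return "vector", run_vector(question)
```

Returning the path taken alongside the results lets a reviewer see when an answer came from the fallback rather than the intended graph query.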
Core Entities
Models
- LLaMA-3.3-70B-versatile
- Mistral-7B-Instruct-v0.3
- all-MiniLM-L6-v2
- Cohere rerank-english-v3.0
Metrics
- Faithfulness
- Answer relevance
- Context precision
- Context recall
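To make the two context metrics concrete, here is a simplified set-based version that assumes each retrieved chunk and each relevant chunk has an id. This is an illustrative stand-in only; it does not attempt to reproduce how the paper's evaluation computes these metrics, and the function names are hypothetical.

```python
from typing import Sequence, Set


def context_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of retrieved chunks that are relevant (0.0 if nothing was retrieved)."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)


def context_recall(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of relevant chunks that were retrieved (0.0 if none are marked relevant)."""
    if not relevant:
        return 0.0
    found = relevant.intersection(retrieved)
    return len(found) / len(relevant)
```

Under these definitions, retrieving more chunks can only keep recall the same or raise it, but tends to lower precision, which is the trade-off a per-query router and a reranker try to manage.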
Datasets
- PubMed (via API)
- ArXiv (via API)
- Google Scholar (via API)
- Synthetic RAG benchmark (40 QA pairs)
Benchmarks
- Custom synthetic VectorRAG/GraphRAG benchmark (20 VS Q, 20 KG Q)

