Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
If your product needs up-to-date or relational knowledge, rely on retrieval and structured graphs; LLMs prompted without retrieval fall back on pretraining data and miss recent multi-hop facts.
Summary TLDR
The authors introduce HybridRAG-Bench, an automated framework that builds time-framed corpora from recent arXiv papers, extracts a knowledge graph (EvoKG), and generates hybrid question-answer pairs grounded in explicit multi-hop reasoning paths. The benchmark is designed to avoid pretraining contamination, so correct answers require retrieval and joint text+graph reasoning. On three domains (AI, governance, bio), LLM-only prompting scores poorly (~23–40% accuracy), text RAG gives consistent gains (+7–29 points), and hybrid KG-RAG methods (e.g., EvoReasoner, ToG2.0) achieve much higher accuracy (EvoReasoner ≈ 66–76% across setups). The authors release code and data.
Problem Statement
Current multi-hop benchmarks often overlap with LLM pretraining data, so high scores can reflect memorized facts rather than real retrieval and reasoning. We need a contamination-aware, scalable benchmark that forces models to fetch up-to-date evidence and compose multiple facts across unstructured text and structured graphs.
Main Contribution
HybridRAG-Bench: an automated pipeline that builds time-framed arXiv corpora, aligned text chunks, and domain-specific knowledge graphs to create retrieval-intensive, multi-hop QA.
EvoKG: a document-driven KG extraction and alignment pipeline used to construct hybrid knowledge environments and track provenance.
Hybrid QA generation: LLM-conditioned question generation grounded on explicit graph paths and supporting text, plus automated quality control using LLM-as-a-judge.
Empirical evaluation across three domains (AI, governance/policy, bio) showing that the benchmark rewards genuine retrieval and hybrid reasoning and discriminates between methods.
Open release of code and data for contamination-aware benchmarking.
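The time-framing idea behind the corpus construction can be illustrated with a minimal sketch: keep only documents published inside a chosen window, so that answers are unlikely to sit in a model's pretraining data. The `Paper` record, the function name, and the sample entries below are hypothetical and are not taken from the released HybridRAG-Bench code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    """Hypothetical record for one document in the corpus."""
    paper_id: str
    title: str
    published: date


def build_time_framed_corpus(papers, start, end):
    """Keep only papers published within [start, end], inclusive."""
    if start > end:
        raise ValueError("start must not be after end")
    return [p for p in papers if start <= p.published <= end]


# Made-up sample data, for illustration only.
papers = [
    Paper("p1", "Older survey", date(2022, 3, 1)),
    Paper("p2", "Recent method", date(2025, 2, 10)),
    Paper("p3", "Recent benchmark", date(2025, 6, 30)),
]

# Assumed window: documents newer than a model's (assumed) training cutoff.
recent = build_time_framed_corpus(papers, date(2025, 1, 1), date(2025, 12, 31))
recent_ids = [p.paper_id for p in recent]
```

Choosing the window relative to the training cutoff of the models under test is the design choice that matters here; the filter itself is trivial.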
Key Findings
LLM-only prompting performs poorly on up-to-date, multi-hop questions.
Text-based retrieval consistently improves accuracy.
Hybrid KG+text methods outperform text-only RAG on relational multi-hop tasks.
The KG construction pipeline recovers most verifiable facts from source documents.
Naïvely injecting local KG facts can hurt performance.
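The last finding can be made concrete with a toy example. The sketch below contrasts injecting every one-hop fact around an entity with injecting only the facts on a path that connects the question's entities. It is an illustration under assumed data, not the retrieval procedure of EvoReasoner or any other evaluated method; the triples and function names are hypothetical.

```python
from collections import deque

# A fact is a (head, relation, tail) triple. All entries are made up.
TRIPLES = [
    ("MethodA", "evaluated_on", "BenchX"),
    ("BenchX", "covers_domain", "Biology"),
    ("MethodA", "authored_by", "Alice"),
    ("MethodA", "uses", "Transformer"),
]


def one_hop_facts(triples, entity):
    """Naive injection: every fact that touches the entity."""
    return [t for t in triples if t[0] == entity or t[2] == entity]


def shortest_path_facts(triples, source, target):
    """Breadth-first search over directed edges; returns the facts on one
    shortest path from source to target, or [] if no path exists."""
    adjacency = {}
    for t in triples:
        adjacency.setdefault(t[0], []).append(t)
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for t in adjacency.get(node, []):
            if t[2] not in visited:
                visited.add(t[2])
                queue.append((t[2], path + [t]))
    return []


# Question (hypothetical): "Which domain does MethodA's benchmark cover?"
naive_context = one_hop_facts(TRIPLES, "MethodA")
path_context = shortest_path_facts(TRIPLES, "MethodA", "Biology")
```

In this toy graph the one-hop context contains two facts that are irrelevant to the question and omits the second hop entirely, while the path context holds exactly the two facts needed.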
Results
Accuracy
RAG improvement over LLM-only
Accuracy
KG fact recovery
Who Should Care
What To Try In 7 Days
Run a small HybridRAG-Bench instantiation on your domain (choose a recent time window) to measure contamination and retrieval needs.
Compare LLM-only vs text RAG vs simple KG-RAG on a 100-question slice to see retrieval gains.
Test EvoKG or a simple entity-extraction pipeline on a sample corpus and measure fact recovery rate.
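The measurements suggested above reduce to a few simple ratios. The sketch below assumes you already have one correct/incorrect judgment per question for each method and a hand-verified set of reference facts for the sample corpus. The function names and sample values are hypothetical and are not results from the paper.

```python
def accuracy(judgments):
    """Fraction of questions judged correct (judgments is a list of bools)."""
    if not judgments:
        raise ValueError("no judgments to score")
    return sum(judgments) / len(judgments)


def gain_in_points(baseline, method):
    """Accuracy gain of a method over a baseline, in percentage points."""
    return 100.0 * (accuracy(method) - accuracy(baseline))


def fact_recovery_rate(reference_facts, extracted_facts):
    """Share of reference facts that also appear among the extracted facts.
    Facts are compared by exact match, which assumes both sides were
    normalized in the same way."""
    reference = set(reference_facts)
    if not reference:
        raise ValueError("no reference facts to compare against")
    return len(reference & set(extracted_facts)) / len(reference)


# Made-up judgments on a four-question slice, for illustration only.
llm_only = [True, False, False, False]
text_rag = [True, True, False, True]

# Made-up fact sets for a sample corpus.
reference = {("A", "cites", "B"), ("B", "extends", "C"),
             ("C", "uses", "D"), ("D", "evaluated_on", "E")}
extracted = {("A", "cites", "B"), ("C", "uses", "D"), ("X", "uses", "Y")}

rag_gain = gain_in_points(llm_only, text_rag)
recovery = fact_recovery_rate(reference, extracted)
```

On a 100-question slice the same functions apply unchanged; report the gain alongside the slice size, since small slices give noisy estimates.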
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Framework currently instantiated on arXiv/scientific texts; behavior on web or commercial data may differ.
- KG and QA generation use LLMs, so extraction/generation errors can bias difficulty and coverage.
- Evaluation uses LLM-as-judge, which can introduce judge bias or false positives/negatives.
When Not To Use
- For tasks that do not require up-to-date or multi-hop relational facts (e.g., static trivia).
- When you cannot build or store a fresh retrieval corpus for your domain.
- If manual curation or human-grounded evaluation is required for safety-critical outputs.
Failure Modes
- KG extraction misses or normalizes facts incorrectly, lowering recall.
- Naïve one-hop KG injection adds noisy facts and reduces accuracy.
- LLM-as-judge may mis-evaluate subtle or ambiguous answers.
Core Entities
Models
- EvoReasoner
- ToG2.0
- ToG
- RoG
- CoK
- PoG
- R2-KG
- HippoRAG2.0
- RAG
- DeepSeek V3.2
- Qwen 2.5
- LLaMA 3.3
- LLaMA 3.1
Metrics
- Accuracy
- fact recovery rate
- standard deviation
Datasets
- HybridRAG-Bench (arXiv-AI)
- HybridRAG-Bench (arXiv-CY)
- HybridRAG-Bench (arXiv-BIO)
- CRAG
- MINE
Benchmarks
- HybridRAG-Bench
- CRAG
- HotpotQA
- WebQSP
- MetaQA

