Overview
The benchmark shows that combining retrieval with structured KG traversal materially improves multi-hop QA on recent documents: top hybrid systems reach ~66–76% accuracy, while LLM-only prompting stays near 23–40% on the evaluated domains.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
If your product needs up-to-date or relational knowledge, rely on retrieval and structured graphs; LLM-only prompting depends on pretraining knowledge and will miss recent multi-hop facts.
Summary TLDR
The authors introduce HybridRAG-Bench, an automated framework that builds time-framed corpora from recent arXiv papers, extracts a knowledge graph with EvoKG, and generates hybrid question-answer pairs grounded in explicit multi-hop reasoning paths. The benchmark is designed to avoid pretraining contamination, so correct answers require retrieval and joint text+graph reasoning. On three domains (AI, governance, bio), LLM-only prompting scores about 23–40% accuracy, text RAG adds a consistent 7–29 points, and hybrid KG-RAG methods (e.g., EvoReasoner, ToG2.0) achieve much higher accuracy (EvoReasoner ≈ 66–76% across setups). The authors release code and data.
Problem Statement
Current multi-hop benchmarks often overlap with LLM pretraining data, so high scores can reflect memorized facts rather than real retrieval and reasoning. We need a contamination-aware, scalable benchmark that forces models to fetch up-to-date evidence and compose multiple facts across unstructured text and structured graphs.
Main Contribution
HybridRAG-Bench: an automated pipeline that builds time-framed arXiv corpora, aligned text chunks, and domain-specific knowledge graphs to create retrieval-intensive, multi-hop QA.
EvoKG: a document-driven KG extraction and alignment pipeline used to construct hybrid knowledge environments and track provenance.
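The time-framing step can be illustrated with a minimal sketch. It assumes each paper carries a publication date and simply keeps the papers that fall inside a chosen window; the `Paper` record, its field names, and `time_framed_corpus` are hypothetical and are not taken from the HybridRAG-Bench code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    # Hypothetical record; field names are illustrative, not the benchmark's schema.
    paper_id: str
    published: date
    text: str


def time_framed_corpus(papers, start, end):
    """Keep only papers published inside the closed window [start, end]."""
    if start > end:
        raise ValueError("start must not be later than end")
    return [p for p in papers if start <= p.published <= end]


if __name__ == "__main__":
    # Toy records with made-up dates, used only to show the call.
    sample = [
        Paper("p1", date(2023, 12, 31), "older paper"),
        Paper("p2", date(2024, 6, 1), "paper inside the window"),
    ]
    window = time_framed_corpus(sample, date(2024, 1, 1), date(2024, 12, 31))
    print([p.paper_id for p in window])
```

Choosing a window that starts after a model's training cutoff is what would make the resulting questions hard to answer from memorized facts alone.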
Key Findings
LLM-only prompting performs poorly on up-to-date, multi-hop questions.
Text-based retrieval consistently improves accuracy.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | ≈23–40% | — | — | Arxiv-AI/CY/BIO (Table 3) | LLM-only prompting achieves 23–40% across domains | Table 3 |
| RAG improvement over LLM-only | +7–29 percentage points | LLM-only prompting | — | most model-domain combos (Table 3) | Text-based RAG gains 7–29 pts absolute vs LLM-only | Section 5.2; Table 3 |
What To Try In 7 Days
Run a small HybridRAG-Bench instantiation on your domain (choose a recent time window) to measure contamination and retrieval needs.
Compare LLM-only vs text RAG vs simple KG-RAG on a 100-question slice to see retrieval gains.
Test EvoKG or a simple entity-extraction pipeline on a sample corpus and measure fact recovery rate.
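A small scoring harness is enough for the comparisons listed above. The sketch below assumes exact-match scoring after lowercasing and trimming, which may differ from the metric used in the benchmark; the function names and the system labels (`llm_only`, `text_rag`) are illustrative.

```python
def accuracy(predictions, gold):
    """Fraction of questions whose normalized prediction equals the gold answer."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    # Assumption: exact match after lowercasing and stripping whitespace.
    hits = sum(p.strip().lower() == g.strip().lower() for p, g in zip(predictions, gold))
    return hits / len(gold)


def compare_systems(system_predictions, gold, baseline="llm_only"):
    """Return accuracy per system and its gain over the baseline in percentage points."""
    scores = {name: accuracy(preds, gold) for name, preds in system_predictions.items()}
    base = scores[baseline]
    return {
        name: {"accuracy": acc, "delta_pts": round((acc - base) * 100, 1)}
        for name, acc in scores.items()
    }


def fact_recovery_rate(extracted, reference):
    """Share of reference (head, relation, tail) triples that the extractor recovered."""
    reference = set(reference)
    if not reference:
        return 0.0
    return len(reference & set(extracted)) / len(reference)
```

Reporting the gain in percentage points rather than as a relative change keeps the numbers comparable with the deltas in the Results table.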
Risks & Boundaries
Limitations
Framework currently instantiated on arXiv/scientific texts; behavior on web or commercial data may differ.
KG and QA generation use LLMs, so extraction/generation errors can bias difficulty and coverage.
When Not To Use
For tasks that do not require up-to-date or multi-hop relational facts, such as static trivia.
When you cannot build or store a fresh retrieval corpus for your domain.
Failure Modes
KG extraction misses or normalizes facts incorrectly, lowering recall.
Naïve one-hop KG injection adds noisy facts and reduces accuracy.
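The second failure mode can be made concrete with a toy comparison. The sketch below contrasts injecting every one-hop triple around an entity with keeping only triples whose relation shares a word with the question. The word-overlap filter is a deliberately simple stand-in chosen for illustration; it is not the method used by EvoReasoner or ToG2.0, and all entity and relation names are invented.

```python
def one_hop_facts(triples, entity):
    """All triples that touch `entity`; this is the naive injection set."""
    return [t for t in triples if entity in (t[0], t[2])]


def filtered_facts(triples, entity, question, max_facts=5):
    """Keep one-hop triples whose relation shares a word with the question."""
    q_words = set(question.lower().replace("?", " ").split())
    scored = []
    for head, rel, tail in one_hop_facts(triples, entity):
        rel_words = set(rel.lower().replace("_", " ").split())
        overlap = len(q_words & rel_words)
        if overlap > 0:
            scored.append((overlap, (head, rel, tail)))
    # Highest overlap first; the sort is stable, so ties keep their input order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [fact for _, fact in scored[:max_facts]]


if __name__ == "__main__":
    # Invented toy graph, used only to show the difference in context size.
    kg = [
        ("ModelA", "trained_on", "DatasetX"),
        ("ModelA", "authored_by", "Lab1"),
        ("ModelB", "cites", "ModelA"),
        ("ModelB", "trained_on", "DatasetY"),
    ]
    question = "Which dataset was ModelA trained on?"
    print(len(one_hop_facts(kg, "ModelA")), "facts injected naively")
    print(filtered_facts(kg, "ModelA", question))
```

In this toy graph the naive set carries two triples that are unrelated to the question, which is the kind of noise the failure mode describes.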

