HybridRAG-Bench: contamination-aware tests that force retrieval + multi-hop reasoning over text + knowledge graphs

February 10, 2026 (7 min)

Overview

Decision Snapshot: Ready For Pilot

The benchmark shows retrieval plus structured KG traversal materially improve multi-hop QA on recent documents; top hybrid systems reach ~66–76% while LLM-only stays near 23–40% on evaluated domains.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Junhong Lin, Bing Zhang, Song Wang, Ziyan Liu, Dan Gutfreund, Julian Shun, Yada Zhu

Why It Matters For Business

If your product needs up-to-date or relational knowledge, rely on retrieval and structured graphs; LLMs prompted without retrieval fall back on what they saw in pretraining and will miss recent multi-hop facts.

Summary TLDR

The authors introduce HybridRAG-Bench, an automated framework that builds time-framed corpora from recent arXiv papers, extracts a knowledge graph with EvoKG, and generates hybrid question-answer pairs grounded in explicit multi-hop reasoning paths. The benchmark is designed to avoid pretraining contamination, so correct answers require retrieval and joint text+graph reasoning. On three domains (AI, governance, bio), LLM-only prompting scores roughly 23–40% accuracy, text RAG adds a consistent 7–29 points, and hybrid KG-RAG methods such as EvoReasoner and ToG2.0 score much higher (EvoReasoner ≈ 66–76% across setups). The authors release code and data.
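To make "explicit multi-hop reasoning path" concrete, here is a minimal sketch of finding such a path in a small triple store with breadth-first search. The triples, entity names, and the `find_path` helper are all hypothetical illustrations; this is not the benchmark's question generator or any of the evaluated KG-RAG methods.

```python
from collections import deque

# Hypothetical toy triples (head, relation, tail); not taken from the benchmark.
TRIPLES = [
    ("PaperA", "proposes", "MethodX"),
    ("MethodX", "evaluated_on", "DatasetY"),
    ("DatasetY", "covers", "DomainZ"),
    ("PaperA", "written_by", "AuthorQ"),
]


def find_path(triples, start, goal, max_hops=3):
    """Breadth-first search for the shortest relation path from start to goal.

    Returns a list of (head, relation, tail) hops, or None if the goal is not
    reachable within max_hops.
    """
    # Build a directed adjacency list: head -> [(relation, tail), ...]
    adjacency = {}
    for head, relation, tail in triples:
        adjacency.setdefault(head, []).append((relation, tail))

    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue  # do not expand beyond the hop budget
        for relation, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None


# A two-hop question: "Which dataset was the method proposed in PaperA evaluated on?"
for hop in find_path(TRIPLES, "PaperA", "DatasetY"):
    print(hop)
```

Answering the example question requires chaining both hops; neither triple alone contains the answer, which is the property a multi-hop benchmark is meant to test.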

Problem Statement

Current multi-hop benchmarks often overlap with LLM pretraining data, so high scores can reflect memorized facts rather than real retrieval and reasoning. We need a contamination-aware, scalable benchmark that forces models to fetch up-to-date evidence and compose multiple facts across unstructured text and structured graphs.

Main Contribution

HybridRAG-Bench: an automated pipeline that builds time-framed arXiv corpora, aligned text chunks, and domain-specific knowledge graphs to create retrieval-intensive, multi-hop QA.

EvoKG: a document-driven KG extraction and alignment pipeline used to construct hybrid knowledge environments and track provenance.
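The two ideas above, a time-framed corpus and provenance-tracked graph facts, can be sketched as follows. This is an illustrative sketch only: the `Document` and `Triple` classes, their field names, and the window dates are assumptions made for the example, not the EvoKG schema or the time windows used by the authors.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    doc_id: str
    published: date
    text: str


@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str
    source_doc_id: str  # provenance: the document this fact was extracted from


def in_window(docs, start, end):
    """Keep only documents published within [start, end], inclusive."""
    return [d for d in docs if start <= d.published <= end]


# Placeholder documents and dates (hypothetical).
docs = [
    Document("doc-old", date(2023, 5, 1), "..."),
    Document("doc-new", date(2025, 11, 3), "..."),
]
recent = in_window(docs, date(2025, 6, 1), date(2025, 12, 31))
recent_ids = {d.doc_id for d in recent}

triples = [
    Triple("MethodX", "evaluated_on", "DatasetY", "doc-new"),
    Triple("MethodW", "evaluated_on", "DatasetV", "doc-old"),
]
# Keep only facts whose provenance points into the time-framed corpus.
grounded = [t for t in triples if t.source_doc_id in recent_ids]
print([t.head for t in grounded])
```

Carrying a source identifier on every triple is what lets a graph fact be traced back to the text it came from, and lets facts outside the chosen window be excluded.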

Key Findings

LLM-only prompting performs poorly on up-to-date, multi-hop questions.

Numbers: LLM-only accuracy ≈ 23–40% (Table 3)

Practical Use: Don't expect base LLMs alone to answer recent multi-hop queries; add retrieval to obtain domain facts.

Evidence Ref: Table 3, Section 5.2

Text-based retrieval consistently improves accuracy.

Numbers: RAG improves accuracy by 7–29 absolute points (Section 5.2)

Practical Use: Integrate dense retrieval over a fresh corpus to gain tens of percentage points on up-to-date QA.

Evidence Ref: Table 3, Section 5.2
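The gains above are stated in absolute percentage points, which is easy to confuse with relative improvement. The sketch below computes both on made-up answers. The exact-match scoring rule and the helper names are assumptions for illustration; the paper's own scoring procedure may differ.

```python
def accuracy(predictions, gold):
    """Exact-match accuracy in percent, after trimming and lowercasing."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("gold must not be empty")
    correct = sum(
        p.strip().lower() == g.strip().lower() for p, g in zip(predictions, gold)
    )
    return 100.0 * correct / len(gold)


def absolute_gain(system_acc, baseline_acc):
    """Difference in percentage points (not a relative percentage)."""
    return system_acc - baseline_acc


# Made-up answers for four questions (hypothetical data).
gold = ["DatasetY", "AuthorQ", "MethodX", "DomainZ"]
llm_only = ["DatasetY", "unknown", "unknown", "unknown"]
text_rag = ["DatasetY", "AuthorQ", "unknown", "unknown"]

base = accuracy(llm_only, gold)   # 25.0
rag = accuracy(text_rag, gold)    # 50.0
print(f"absolute gain: {absolute_gain(rag, base):.1f} points")
print(f"relative gain: {100.0 * (rag - base) / base:.1f}%")
```

In this toy case the absolute gain is 25 points while the relative gain is 100%, so it matters which of the two a reported number refers to.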

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | ≈23–40% | - | - | arXiv-AI/CY/BIO (Table 3) | LLM-only prompting achieves 23–40% across domains | Table 3
RAG improvement over LLM-only | +7–29 percentage points | LLM-only prompting | - | most model-domain combos (Table 3) | Text-based RAG gains 7–29 pts absolute vs LLM-only | Section 5.2; Table 3

What To Try In 7 Days

Run a small HybridRAG-Bench instantiation on your domain (choose a recent time window) to measure contamination and retrieval needs.

Compare LLM-only vs text RAG vs simple KG-RAG on a 100-question slice to see retrieval gains.

Test EvoKG or a simple entity-extraction pipeline on a sample corpus and measure fact recovery rate.
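For the last step, one simple way to define a fact recovery rate is the share of hand-labelled reference triples that also appear in the extractor's output. The sketch below assumes that definition and a plain lowercase/trim normalization; both are assumptions for illustration and may not match how the authors compute the metric.

```python
def normalize(triple):
    """Lowercase and trim each part so trivial formatting differences match."""
    return tuple(part.strip().lower() for part in triple)


def fact_recovery_rate(extracted, reference):
    """Fraction of reference triples that also appear among extracted triples."""
    reference_set = {normalize(t) for t in reference}
    if not reference_set:
        raise ValueError("reference must contain at least one triple")
    extracted_set = {normalize(t) for t in extracted}
    return len(reference_set & extracted_set) / len(reference_set)


# Hypothetical hand-labelled facts for a sample document.
reference = [
    ("MethodX", "evaluated_on", "DatasetY"),
    ("PaperA", "proposes", "MethodX"),
    ("PaperA", "written_by", "AuthorQ"),
]
# Hypothetical extractor output: two correct facts, one spurious.
extracted = [
    ("methodx", "evaluated_on", "DatasetY "),
    ("PaperA", "proposes", "MethodX"),
    ("PaperA", "cites", "PaperB"),
]
print(f"fact recovery rate: {fact_recovery_rate(extracted, reference):.2f}")
```

This measures recall only; the spurious extracted triple does not lower the score, so a separate precision check is needed to catch noisy extraction.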

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Framework currently instantiated on arXiv/scientific texts; behavior on web or commercial data may differ.

KG and QA generation use LLMs, so extraction/generation errors can bias difficulty and coverage.

When Not To Use

For tasks that do not require up-to-date or multi-hop relational facts (static trivia).

When you cannot build or store a fresh retrieval corpus for your domain.

Failure Modes

KG extraction misses or normalizes facts incorrectly, lowering recall.

Naïve one-hop KG injection adds noisy facts and reduces accuracy.

Core Entities

Models

EvoReasoner, ToG2.0, ToG, RoG, CoK, PoG, R2-KG, HippoRAG2.0, RAG, DeepSeek V3.2, Qwen 2.5, LLaMA 3.3, LLaMA 3.1

Metrics

Accuracy, fact recovery rate, standard deviation

Datasets

HybridRAG-Bench (arXiv-AI), HybridRAG-Bench (arXiv-CY), HybridRAG-Bench (arXiv-BIO), CRAG, MINE

Benchmarks

HybridRAG-Bench, CRAG, HotpotQA, WebQSP, MetaQA