HybridRAG-Bench: contamination-aware tests that force retrieval + multi-hop reasoning over text + knowledge graphs

Overview

Decision Snapshot: Ready For Pilot

The benchmark shows retrieval plus structured KG traversal materially improve multi-hop QA on recent documents; top hybrid systems reach ~66–76% while LLM-only stays near 23–40% on evaluated domains.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Junhong Lin, Bing Zhang, Song Wang, Ziyan Liu, Dan Gutfreund, Julian Shun, Yada Zhu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If your product needs up-to-date or relational knowledge, rely on retrieval and structured graphs; LLMs without retrieval fall back on what they saw in pretraining and will miss recent multi-hop facts.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

The authors introduce HybridRAG-Bench, an automated framework that builds time-framed corpora from recent arXiv papers, extracts a knowledge graph (EvoKG), and generates hybrid question-answer pairs grounded in explicit multi-hop reasoning paths. The benchmark is designed to avoid pretraining contamination, so that correct answers require retrieval and joint text+graph reasoning. On three domains (AI, governance, bio), LLM-only prompting scores ~23–40% accuracy, text RAG adds a consistent +7–29 points, and hybrid KG-RAG methods (e.g., EvoReasoner, ToG2.0) score much higher (EvoReasoner ≈ 66–76% across setups). The authors release code and data.

Problem Statement

Current multi-hop benchmarks often overlap with LLM pretraining data, so high scores can reflect memorized facts rather than real retrieval and reasoning. We need a contamination-aware, scalable benchmark that forces models to fetch up-to-date evidence and compose multiple facts across unstructured text and structured graphs.
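One way to picture the contamination-aware idea is a publication-date filter that keeps only documents newer than a model's training cutoff. The sketch below is illustrative only: the `Paper` record, the `filter_after_cutoff` helper, and the dates are hypothetical and are not taken from the HybridRAG-Bench code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    """Hypothetical record for one corpus document."""
    paper_id: str
    published: date
    text: str


def filter_after_cutoff(papers, cutoff, end=None):
    """Keep papers published strictly after `cutoff` and, if given, on or before `end`."""
    kept = []
    for paper in papers:
        if paper.published <= cutoff:
            continue  # old enough that it may overlap with pretraining data
        if end is not None and paper.published > end:
            continue  # outside the chosen time window
        kept.append(paper)
    return kept


corpus = [
    Paper("a", date(2023, 5, 1), "older paper"),
    Paper("b", date(2025, 3, 10), "recent paper"),
    Paper("c", date(2025, 9, 2), "newer paper"),
]

# Assumed cutoff for illustration; substitute the real cutoff of the model under test.
fresh = filter_after_cutoff(corpus, cutoff=date(2024, 12, 31), end=date(2025, 6, 30))
print([paper.paper_id for paper in fresh])
```

A date filter like this only reduces the chance of overlap; it does not prove that a given fact is absent from a model's training data.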

Main Contribution

HybridRAG-Bench: an automated pipeline that builds time-framed arXiv corpora, aligned text chunks, and domain-specific knowledge graphs to create retrieval-intensive, multi-hop QA.

EvoKG: a document-driven KG extraction and alignment pipeline used to construct hybrid knowledge environments and track provenance.
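To make the idea of a graph aligned with text chunks more concrete, the toy sketch below stores each triple with a pointer to the chunk it came from and follows a two-hop path while collecting that provenance. It is not EvoKG's data model; `Triple`, `build_index`, `follow_path`, and the example facts are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str
    chunk_id: str  # provenance: which text chunk supports this fact


def build_index(triples):
    """Index triples by (head, relation) for hop-by-hop lookup."""
    index = defaultdict(list)
    for triple in triples:
        index[(triple.head, triple.relation)].append(triple)
    return index


def follow_path(index, start, relations):
    """Follow `relations` in order from `start`.

    Returns a list of (final_entity, [chunk_ids along the path]).
    """
    frontier = [(start, [])]
    for relation in relations:
        next_frontier = []
        for entity, evidence in frontier:
            for triple in index.get((entity, relation), []):
                next_frontier.append((triple.tail, evidence + [triple.chunk_id]))
        frontier = next_frontier
    return frontier


# Made-up facts, for illustration only.
triples = [
    Triple("MethodA", "evaluated_on", "DatasetX", "paper1#chunk3"),
    Triple("DatasetX", "introduced_by", "PaperY", "paper2#chunk1"),
]
index = build_index(triples)
results = follow_path(index, "MethodA", ["evaluated_on", "introduced_by"])
print(results)
```

Keeping the chunk identifiers on each hop is what lets a hybrid system pull the supporting passages for a graph path back into the prompt.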

Key Findings

LLM-only prompting performs poorly on up-to-date, multi-hop questions.

Numbers: LLM-only accuracy ≈ 23–40% (Table 3)

Practical Use: Don't expect base LLMs alone to answer recent multi-hop queries; add retrieval to obtain domain facts.

Evidence Ref: Table 3, Section 5.2

Text-based retrieval consistently improves accuracy.

Numbers: RAG improves accuracy by 7–29 absolute points (Section 5.2)

Practical Use: Integrate dense retrieval over a fresh corpus; the reported gain on up-to-date QA is roughly 7–29 percentage points over LLM-only prompting.

Evidence Ref: Table 3, Section 5.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	≈23–40%	—	—	Arxiv-AI/CY/BIO (Table 3)	LLM-only prompting achieves 23–40% across domains	Table 3
RAG improvement over LLM-only	+7–29 percentage points	LLM-only prompting	—	most model-domain combos (Table 3)	Text-based RAG gains 7–29 pts absolute vs LLM-only	Section 5.2; Table 3
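
The improvement figures above are absolute percentage-point differences against the LLM-only baseline. A minimal sketch of that arithmetic follows; the per-question correctness flags are made-up placeholders rather than benchmark results, and `accuracy` and `delta_points` are hypothetical helper names.

```python
def accuracy(flags):
    """Percentage of correct answers, given one boolean per question."""
    if not flags:
        raise ValueError("no predictions to score")
    return 100.0 * sum(flags) / len(flags)


def delta_points(system_acc, baseline_acc):
    """Absolute difference in percentage points (not a relative change)."""
    return system_acc - baseline_acc


# Placeholder outcomes for ten questions; replace with real graded answers.
llm_only = [True, False, False, False, True, False, False, False, False, True]
text_rag = [True, True, False, False, True, False, True, False, False, True]

base = accuracy(llm_only)
rag = accuracy(text_rag)
gain = delta_points(rag, base)
print(f"LLM-only: {base:.1f}%  RAG: {rag:.1f}%  delta: {gain:+.1f} points")
```

Reporting the gain in points rather than as a relative percentage keeps it comparable across domains with different baselines.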

What To Try In 7 Days

Run a small HybridRAG-Bench instantiation on your domain (choose a recent time window) to measure contamination and retrieval needs.

Compare LLM-only vs text RAG vs simple KG-RAG on a 100-question slice to see retrieval gains.

Test EvoKG or a simple entity-extraction pipeline on a sample corpus and measure fact recovery rate.
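
For the last step, here is one possible way to score fact recovery. This summary does not define the metric, so the sketch assumes it means the share of hand-labeled reference triples that also appear among the extracted triples after light normalization. The `normalize` and `fact_recovery_rate` helpers and the sample triples are hypothetical.

```python
def normalize(triple):
    """Lowercase each part and collapse whitespace so trivial variants match."""
    return tuple(" ".join(part.lower().split()) for part in triple)


def fact_recovery_rate(reference, extracted):
    """Fraction of reference triples found in the extracted set (assumed definition)."""
    reference_set = {normalize(t) for t in reference}
    if not reference_set:
        raise ValueError("reference set is empty")
    extracted_set = {normalize(t) for t in extracted}
    return len(reference_set & extracted_set) / len(reference_set)


# Made-up triples, for illustration only.
reference = [
    ("Method A", "evaluated_on", "Dataset X"),
    ("Dataset X", "introduced_by", "Paper Y"),
    ("Method A", "outperforms", "Method B"),
    ("Paper Y", "published_in", "2025"),
]
extracted = [
    ("method a", "evaluated_on", "Dataset  X"),
    ("Dataset X", "introduced_by", "Paper Y"),
    ("Method A", "uses", "Graph Z"),
]

rate = fact_recovery_rate(reference, extracted)
print(f"fact recovery rate: {rate:.2f}")
```

Exact matching after normalization is strict; paraphrased relations or aliased entities would count as misses, so a real evaluation may need entity linking or fuzzy matching.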

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/junhongmit/HybridRAG-Bench

Data URLs

https://github.com/junhongmit/HybridRAG-Bench

Risks & Boundaries

Limitations

Framework currently instantiated on arXiv/scientific texts; behavior on web or commercial data may differ.

KG and QA generation use LLMs, so extraction/generation errors can bias difficulty and coverage.

When Not To Use

For tasks that do not require up-to-date or multi-hop relational facts (static trivia).

When you cannot build or store a fresh retrieval corpus for your domain.

Failure Modes

KG extraction misses or normalizes facts incorrectly, lowering recall.

Naïve one-hop KG injection adds noisy facts and reduces accuracy.

Core Entities

Models

EvoReasoner, ToG2.0, ToG, RoG, CoK, PoG, R2-KG, HippoRAG2.0, RAG, DeepSeek V3.2, Qwen 2.5, LLaMA 3.3, LLaMA 3.1

Metrics

Accuracy, fact recovery rate, standard deviation

Datasets

HybridRAG-Bench (arXiv-AI), HybridRAG-Bench (arXiv-CY), HybridRAG-Bench (arXiv-BIO), CRAG, MINE

Benchmarks

HybridRAG-Bench, CRAG, HotpotQA, WebQSP, MetaQA
