HybridRAG-Bench: contamination-aware tests that force retrieval + multi-hop reasoning over text + knowledge graphs

February 10, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Junhong Lin, Bing Zhang, Song Wang, Ziyan Liu, Dan Gutfreund, Julian Shun, Yada Zhu

Why It Matters For Business

If your product needs up-to-date or relational knowledge, rely on retrieval and structured graphs; LLMs prompted without retrieval lean on what they saw in pretraining and miss recent multi-hop facts.

Summary TLDR

The authors introduce HybridRAG-Bench, an automated framework that builds time-framed corpora from recent arXiv papers, extracts a knowledge graph (EvoKG), and generates hybrid question-answer pairs grounded in explicit multi-hop reasoning paths. The benchmark is designed to avoid pretraining contamination, so correct answers require retrieval and joint text+graph reasoning. On three domains (AI, governance, bio) they show that LLM-only prompting scores poorly (≈23–40% accuracy), text RAG gives consistent gains (+7–29 points), and hybrid KG-RAG methods (e.g., EvoReasoner, ToG2.0) achieve much higher accuracy (EvoReasoner ≈ 66–76% across setups). They release code and data.

Problem Statement

Current multi-hop benchmarks often overlap with LLM pretraining data, so high scores can reflect memorized facts rather than real retrieval and reasoning. We need a contamination-aware, scalable benchmark that forces models to fetch up-to-date evidence and compose multiple facts across unstructured text and structured graphs.
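To make the time-framing idea concrete, here is a minimal sketch of selecting only documents published inside a chosen window, so the corpus postdates a model's training data. The `Paper` record, the example IDs, and the window dates are all hypothetical; the paper's actual corpus-building code may work differently.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    paper_id: str    # hypothetical identifier, not a real arXiv ID
    published: date  # publication date used for time-framing


def in_time_frame(papers, start, end):
    """Keep only papers published within [start, end], inclusive."""
    return [p for p in papers if start <= p.published <= end]


papers = [
    Paper("a", date(2023, 5, 1)),
    Paper("b", date(2025, 3, 15)),
    Paper("c", date(2025, 9, 2)),
]

# Assumption: the window starts after the evaluated model's training cutoff.
window_start = date(2025, 1, 1)
window_end = date(2025, 12, 31)

fresh = in_time_frame(papers, window_start, window_end)
fresh_ids = [p.paper_id for p in fresh]
print(fresh_ids)  # ['b', 'c']
```

Choosing the window relative to each evaluated model's cutoff is the design decision that matters here; the filter itself is trivial.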

Main Contribution

HybridRAG-Bench: an automated pipeline that builds time-framed arXiv corpora, aligned text chunks, and domain-specific knowledge graphs to create retrieval-intensive, multi-hop QA.

EvoKG: a document-driven KG extraction and alignment pipeline used to construct hybrid knowledge environments and track provenance.

Hybrid QA generation: LLM-conditioned question generation grounded on explicit graph paths and supporting text, plus automated quality control using LLM-as-a-judge.

Empirical evaluation across three domains (AI, governance/policy, bio) showing the benchmark rewards genuine retrieval and hybrid reasoning and discriminates methods.

Open release of code and data for contamination-aware benchmarking.
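The idea of grounding questions in explicit graph paths can be illustrated with a small sketch that enumerates multi-hop paths over a toy set of triples. This is not the authors' EvoKG or QA-generation code; the triples, entity names, and the `k_hop_paths` helper are hypothetical and only show what a "reasoning path" looks like as data.

```python
from collections import defaultdict


def build_adjacency(triples):
    """Index (head, relation, tail) triples by head entity."""
    adj = defaultdict(list)
    for head, relation, tail in triples:
        adj[head].append((relation, tail))
    return adj


def k_hop_paths(adj, start, hops):
    """Return all simple paths of exactly `hops` edges starting at `start`."""
    results = []

    def walk(node, path, visited):
        if len(path) == hops:
            results.append(list(path))
            return
        for relation, nxt in adj.get(node, []):
            if nxt in visited:  # skip cycles so paths stay simple
                continue
            path.append((node, relation, nxt))
            walk(nxt, path, visited | {nxt})
            path.pop()

    walk(start, [], {start})
    return results


# Toy triples with made-up entity and relation names.
triples = [
    ("MethodA", "evaluated_on", "DatasetX"),
    ("DatasetX", "introduced_by", "PaperY"),
    ("MethodA", "extends", "MethodB"),
]

paths = k_hop_paths(build_adjacency(triples), "MethodA", hops=2)
for path in paths:
    print(" -> ".join(f"{h} [{r}] {t}" for h, r, t in path))
```

A question generator could then be conditioned on one such path plus its supporting text chunks, so the answer is only reachable by composing both hops.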

Key Findings

LLM-only prompting performs poorly on up-to-date, multi-hop questions.

Numbers: LLM-only accuracy ≈ 23–40% (Table 3)

Text-based retrieval consistently improves accuracy.

Numbers: RAG improves accuracy by 7–29 absolute points (Section 5.2)

Hybrid KG+text methods outperform text-only RAG on relational multi-hop tasks.

Numbers: EvoReasoner ≈ 66–76% vs RAG ≈ 41–52% (Table 3)

The KG construction pipeline recovers most verifiable facts from source documents.

Numbers: EvoKG fact recovery ≈ 71.36% vs KGGen 66.46% (Table 5)

Naïvely injecting local KG facts can hurt performance.

Numbers: 1-hop KG sometimes lower than LLM-only (Section 5.2)

Results

Accuracy (LLM-only prompting)

Value: ≈23–40%

RAG improvement over LLM-only

Value: +7–29 percentage points

Baseline: LLM-only prompting

Accuracy (EvoReasoner)

Value: ≈66–76%

Baseline: text-only RAG ≈41–52%

KG fact recovery (EvoKG)

Value: 71.36% recovered facts

Baseline: KGGen 66.46%
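One plausible reading of the fact recovery figure is the share of reference facts that also appear in the extracted graph. The sketch below assumes exactly that definition with a simple normalized exact match on triples; the paper's actual matching procedure is not described in this summary, and the toy facts are hypothetical.

```python
def normalize(triple):
    """Lowercase and trim each part so trivial formatting differences match."""
    return tuple(part.strip().lower() for part in triple)


def fact_recovery_rate(reference_facts, extracted_facts):
    """Fraction of reference facts found among the extracted facts."""
    reference = {normalize(t) for t in reference_facts}
    if not reference:
        raise ValueError("reference_facts must not be empty")
    extracted = {normalize(t) for t in extracted_facts}
    return len(reference & extracted) / len(reference)


# Hypothetical reference facts for one document.
reference_facts = [
    ("paper_1", "proposes", "method_a"),
    ("method_a", "evaluated_on", "dataset_x"),
    ("method_a", "outperforms", "method_b"),
    ("dataset_x", "covers", "domain_y"),
]

# Hypothetical extractor output: three matches (with formatting noise)
# plus one spurious fact, which does not affect recovery.
extracted_facts = [
    (" Paper_1", "Proposes", "Method_A"),
    ("method_a", "evaluated_on", "dataset_x"),
    ("method_a", "outperforms", "method_b"),
    ("method_a", "cites", "paper_9"),
]

rate = fact_recovery_rate(reference_facts, extracted_facts)
print(f"{rate:.2%}")  # 75.00%
```

Note that this measures recall only; spurious extracted facts would need a separate precision check.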

Who Should Care

What To Try In 7 Days

Run a small HybridRAG-Bench instantiation on your domain (choose a recent time window) to measure contamination and retrieval needs.

Compare LLM-only vs text RAG vs simple KG-RAG on a 100-question slice to see retrieval gains.

Test EvoKG or a simple entity-extraction pipeline on a sample corpus and measure fact recovery rate.
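The method comparison suggested above can be sketched as a small scoring harness. Assumptions to flag: the predictions are made-up placeholders standing in for real model outputs, the method names are illustrative, and scoring uses normalized exact match for simplicity, whereas the paper relies on an LLM-as-judge.

```python
def exact_match(prediction, gold):
    """Case- and whitespace-insensitive string equality."""
    return prediction.strip().lower() == gold.strip().lower()


def accuracy(predictions, golds):
    """Share of predictions that exactly match their gold answer."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        raise ValueError("need at least one question")
    correct = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return correct / len(golds)


def compare(methods, golds, baseline):
    """Return {method: (accuracy, gain over baseline in percentage points)}."""
    scores = {name: accuracy(preds, golds) for name, preds in methods.items()}
    base = scores[baseline]
    return {
        name: (score, round((score - base) * 100, 1))
        for name, score in scores.items()
    }


# Hypothetical four-question slice; a real run would use ~100 questions.
golds = ["alpha", "beta", "gamma", "delta"]
methods = {
    "llm_only": ["alpha", "x", "y", "z"],
    "text_rag": ["alpha", "beta", "y", "z"],
    "kg_rag": ["Alpha", "beta", " gamma ", "z"],
}

report = compare(methods, golds, baseline="llm_only")
for name, (acc, gain) in report.items():
    print(f"{name}: accuracy={acc:.2f}, gain={gain:+.1f} pts")
```

Reporting gains in absolute percentage points over the LLM-only baseline mirrors how the summary above states the RAG improvement.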

Reproducibility

Code Available: yes

Data Available: yes

Open Source Status: yes

Risks & Boundaries

Limitations

  • Framework currently instantiated on arXiv/scientific texts; behavior on web or commercial data may differ.
  • KG and QA generation use LLMs, so extraction/generation errors can bias difficulty and coverage.
  • Evaluation uses LLM-as-judge, which can introduce judge bias or false positives/negatives.

When Not To Use

  • For tasks that do not require up-to-date or multi-hop relational facts (static trivia).
  • When you cannot build or store a fresh retrieval corpus for your domain.
  • If manual curation or human-grounded evaluation is required for safety-critical outputs.

Failure Modes

  • KG extraction misses or normalizes facts incorrectly, lowering recall.
  • Naïve one-hop KG injection adds noisy facts and reduces accuracy.
  • LLM-as-judge may mis-evaluate subtle or ambiguous answers.

Core Entities

Models

  • EvoReasoner
  • ToG2.0
  • ToG
  • RoG
  • CoK
  • PoG
  • R2-KG
  • HippoRAG2.0
  • RAG
  • DeepSeek V3.2
  • Qwen 2.5
  • LLaMA 3.3
  • LLaMA 3.1

Metrics

  • Accuracy
  • fact recovery rate
  • standard deviation

Datasets

  • HybridRAG-Bench (arXiv-AI)
  • HybridRAG-Bench (arXiv-CY)
  • HybridRAG-Bench (arXiv-BIO)
  • CRAG
  • MINE

Benchmarks

  • HybridRAG-Bench
  • CRAG
  • HotpotQA
  • WebQSP
  • MetaQA