A public dataset and baseline results showing RAG struggles on multi-hop questions that need evidence from multiple documents

January 27, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper provides a useful public benchmark and baseline numbers. Use it to isolate retrieval failures from reasoning failures and to test rerankers and query-decomposition methods.

Citations: 13

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 60%

Authors

Yixuan Tang, Yi Yang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Multi-hop questions (e.g., cross-document finance or product research) are common, and current RAG systems often miss key evidence. Improving retrieval and reranking yields bigger gains than swapping LLMs alone.

Summary TLDR

The authors release MultiHop-RAG, a news-article knowledge base plus 2,556 multi-hop queries (inference, comparison, temporal, null) with ground-truth evidence. They benchmark retrieval (several embedding models plus a reranker) and LLM generation (GPT-4, GPT-3.5, PaLM, Claude-2, Llama2-70B, Mixtral). Retrieval quality is limited (best Hits@10 ≈ 0.75, with a reranker), and some open models struggle even when given the ground-truth evidence. GPT-4 reaches 0.89 accuracy with ground-truth evidence but only 0.56 on retrieved chunks. The dataset and code are public on GitHub.

Problem Statement

Few benchmarks require RAG systems to link and reason across multiple documents; existing RAG evaluations target single-document cases. This paper builds a multi-hop RAG dataset and measures how well current embeddings, rerankers, and LLMs retrieve and reason over multiple supporting texts.

Main Contribution

MultiHop-RAG: a public dataset with a news-article knowledge base and 2,556 multi-hop queries labeled with supporting evidence and answers.

A reproducible pipeline that uses GPT-4 to generate claims, bridge topics and entities, and multi-hop queries, followed by manual and GPT-4 quality checks.

Key Findings

Dataset size and mix: 2,556 multi-hop queries drawn from 609 news articles.

Numbers: 2,556 queries; 609 articles; avg 2,046 tokens/article

Practical Use: You can test retrieval + multi-document reasoning at realistic scale; expect queries to need 2–4 pieces of evidence.

Evidence Ref: Table 2, Table 3

Retrieval is a weak link: the best embedding+reranker achieves Hits@10 = 0.7467 and Hits@4 = 0.6625.

Numbers: voyage-02 + bge-reranker-large: Hits@10 0.7467; Hits@4 0.6625

Practical Use: In production, many correct evidence pieces will be missing from the small top-K context. Improve retrieval (reranking, hybrid search, query decomposition) before expecting high end-to-end accuracy.

Evidence Ref: Table 5
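To make the Hits@K and MRR@K numbers above concrete, here is a minimal Python sketch of how such retrieval metrics are commonly computed. It assumes one common definition (a query counts as a hit if at least one gold evidence chunk appears in the top K); the paper's own evaluation script may define or aggregate the metrics differently. The chunk ids, query ids, and function names are hypothetical and used only for illustration.

```python
from typing import Dict, List, Set


def hits_at_k(retrieved: List[str], gold: Set[str], k: int) -> float:
    """Return 1.0 if any gold evidence id is among the top-k retrieved ids."""
    return 1.0 if any(doc in gold for doc in retrieved[:k]) else 0.0


def mrr_at_k(retrieved: List[str], gold: Set[str], k: int) -> float:
    """Reciprocal rank of the first gold evidence id within the top k, else 0.0."""
    for rank, doc in enumerate(retrieved[:k], start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0


def evaluate(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int) -> Dict[str, float]:
    """Average both metrics over all queries that have gold evidence."""
    n = len(qrels)
    hits = sum(hits_at_k(runs.get(q, []), gold, k) for q, gold in qrels.items()) / n
    mrr = sum(mrr_at_k(runs.get(q, []), gold, k) for q, gold in qrels.items()) / n
    return {f"Hits@{k}": hits, f"MRR@{k}": mrr}


# Toy data (hypothetical): gold evidence ids per query and ranked retrieval output.
qrels = {"q1": {"c1", "c2"}, "q2": {"c7", "c8"}}
runs = {
    "q1": ["c9", "c1", "c3", "c2"],
    "q2": ["c4", "c5", "c6", "c0", "c7"],
}

scores_at_4 = evaluate(runs, qrels, 4)
scores_at_10 = evaluate(runs, qrels, 10)
print(scores_at_4, scores_at_10)
```

In this toy run, the second query's first gold chunk sits at rank 5, so it counts as a miss at K=4 and a hit at K=10. That is the same pattern behind the gap between the reported Hits@4 and Hits@10 values.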

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Hits@10 (best embedding + reranker) | 0.7467 | | | MultiHop-RAG retrieval evaluation (non-null queries) | Table 5: voyage-02 with bge-reranker-large | Table 5
Hits@4 (best embedding + reranker) | 0.6625 | | | MultiHop-RAG retrieval evaluation (non-null queries) | Table 5: voyage-02 with bge-reranker-large | Table 5

What To Try In 7 Days

Run the MultiHop-RAG retrieval suite on your embedding models to measure Hits@K and MRR.

Add a lightweight reranker (e.g., bge-reranker-large) and compare top-4 vs top-10 evidence quality.

Measure LLM accuracy with both retrieved and oracle evidence to separate retrieval vs reasoning gaps.
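The last step, separating retrieval gaps from reasoning gaps, can be sketched as follows. This is an illustrative decomposition rather than the paper's procedure: it assumes exact-match scoring on short answers (case-insensitive) and treats the accuracy lost when moving from oracle evidence to retrieved evidence as the retrieval gap. The function names and toy predictions are hypothetical; the 0.89 and 0.56 figures are the GPT-4 accuracies quoted in the summary above.

```python
from typing import Dict, List


def accuracy(predictions: List[str], answers: List[str]) -> float:
    """Case-insensitive exact-match accuracy over short answers."""
    correct = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def decompose_gap(acc_oracle: float, acc_retrieved: float) -> Dict[str, float]:
    """Split the distance from perfect accuracy into two parts.

    reasoning_gap: errors that remain even with ground-truth evidence.
    retrieval_gap: extra errors that appear when evidence comes from the retriever.
    """
    return {
        "reasoning_gap": round(1.0 - acc_oracle, 4),
        "retrieval_gap": round(acc_oracle - acc_retrieved, 4),
    }


# Toy example (hypothetical predictions) for the same four questions.
answers = ["yes", "no", "Apple", "before"]
preds_oracle = ["Yes", "no", "Apple", "after"]       # model given gold evidence
preds_retrieved = ["yes", "yes", "Google", "after"]  # model given retrieved chunks

acc_oracle = accuracy(preds_oracle, answers)
acc_retrieved = accuracy(preds_retrieved, answers)
toy_gaps = decompose_gap(acc_oracle, acc_retrieved)

# Applying the same split to the GPT-4 figures reported in the summary.
gpt4_gaps = decompose_gap(0.89, 0.56)
print(toy_gaps, gpt4_gaps)
```

Under this split, the reported GPT-4 numbers give a retrieval gap of about 0.33 and a reasoning gap of about 0.11, which is consistent with the recommendation to improve retrieval first.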

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Answers are limited to short forms (yes/no, entity, before/after) which simplifies evaluation but excludes free-text answers.

Each query uses at most four supporting evidence pieces; longer-chained retrieval is not covered.

When Not To Use

When you need open-ended, long free-text answers or explanations beyond single-word/entity outputs.

For multi-hop scenarios requiring more than four evidence items.

Failure Modes

Retrieval misses: correct evidence often falls outside small top-K contexts.

Logical errors: some LLMs mishandle negation and comparisons.

Core Entities

Models

GPT-4, GPT-3.5, Claude-2, Google-PaLM, Llama-2-70b-chat-hf, Mixtral-8x7B-Instruct, voyage-02, bge-large-en-v1.5, text-embedding-ada-002, e5-base-v2, instructor-large, bge-reranker-large

Metrics

Hits@K, MAP@K, MRR@K, Accuracy

Datasets

MultiHop-RAG (this paper), mediastack news (source)

Benchmarks

MultiHop-RAG