A public dataset and baseline results showing that RAG struggles on multi-hop questions that need evidence from multiple documents

Overview

Decision Snapshot: Needs Validation

The paper provides a useful public benchmark and baseline numbers. Use it to separate retrieval failures from reasoning failures and to test rerankers and query-decomposition methods.

Citations: 13

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 60%

Authors

Yixuan Tang, Yi Yang

Why It Matters For Business

Multi-hop questions (e.g., cross-document finance or product research) are common and current RAG systems often miss key evidence; improving retrieval and reranking yields bigger gains than swapping LLMs alone.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

The authors release MultiHop-RAG, a news-article knowledge base plus 2,556 multi-hop queries (inference, comparison, temporal, null) with ground-truth evidence. They benchmark retrieval (several embedding models plus a reranker) and LLM generation (GPT-4, GPT-3.5, PaLM, Claude-2, Llama2-70B, Mixtral). Retrieval quality is limited: the best Hits@10 is about 0.75, and that is with a reranker. Even with perfect evidence, some open models struggle. GPT-4 reaches 0.89 accuracy with ground-truth evidence but only 0.56 on retrieved chunks. The dataset and code are public on GitHub.

Problem Statement

Few benchmarks require RAG systems to link and reason across multiple documents; existing RAG evaluations mostly target single-document cases. This paper builds a multi-hop RAG dataset and measures how well current embeddings, rerankers, and LLMs retrieve and reason over multiple supporting texts.

Main Contribution

MultiHop-RAG: a public dataset with a news-article knowledge base and 2,556 multi-hop queries labeled with supporting evidence and answers.

A reproducible pipeline that uses GPT-4 to generate claims, bridge topics/entities, and multi-hop queries, followed by manual and GPT-4 quality checks.

Key Findings

Dataset size and mix: 2,556 multi-hop queries drawn from 609 news articles.

Numbers: 2,556 queries; 609 articles; avg 2,046 tokens/article

Practical Use: You can test retrieval + multi-document reasoning at realistic scale; expect queries to need 2–4 pieces of evidence.

Evidence Ref: Table 2, Table 3

Retrieval is a weak link: the best embedding+reranker achieves Hits@10 = 0.7467 and Hits@4 = 0.6625.

Numbers: voyage-02 + bge-reranker-large: Hits@10 0.7467; Hits@4 0.6625

Practical Use: In production, many correct evidence pieces will be missing from the small top-K context. Improve retrieval (reranking, hybrid search, query decomposition) before expecting high end-to-end accuracy.

Evidence Ref: Table 5
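
The retrieval metrics in this finding can be computed with a few lines of Python. The sketch below is a minimal illustration, not the authors' evaluation script: it assumes a query counts as a hit when at least one gold evidence chunk appears in the top K results, and the function names, chunk ids, and toy data are all hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple


def hits_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> float:
    """1.0 if any gold evidence id is in the top-k results, else 0.0 (assumed definition)."""
    return 1.0 if any(cid in gold_ids for cid in ranked_ids[:k]) else 0.0


def mrr_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> float:
    """Reciprocal rank of the first gold evidence id within the top-k, or 0.0 if none."""
    for rank, cid in enumerate(ranked_ids[:k], start=1):
        if cid in gold_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> Dict[str, float]:
    """Average both metrics over (ranked chunk ids, gold evidence ids) pairs."""
    n = len(runs)
    return {
        "hits": sum(hits_at_k(r, g, k) for r, g in runs) / n,
        "mrr": sum(mrr_at_k(r, g, k) for r, g in runs) / n,
    }


# Hypothetical toy data: two queries with made-up chunk ids.
runs = [
    (["c3", "c7", "c1", "c9"], {"c1", "c2"}),  # first gold chunk at rank 3
    (["c5", "c6", "c8", "c4"], {"c0"}),        # gold chunk never retrieved
]
scores_at_4 = evaluate(runs, k=4)
scores_at_2 = evaluate(runs, k=2)
print(scores_at_4, scores_at_2)
```

Running the same evaluation at K=4 and K=10 on your own retriever shows how much evidence is lost when the context only holds a few chunks.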

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Hits@10 (best embedding + reranker)	0.7467	—	—	MultiHop-RAG retrieval evaluation (non-null queries)	Table 5: voyage-02 with bge-reranker-large	Table 5
Hits@4 (best embedding + reranker)	0.6625	—	—	MultiHop-RAG retrieval evaluation (non-null queries)	Table 5: voyage-02 with bge-reranker-large	Table 5

What To Try In 7 Days

Run the MultiHop-RAG retrieval suite on your embedding models to measure Hits@K and MRR.

Add a lightweight reranker (e.g., bge-reranker-large) and compare top-4 vs top-10 evidence quality.

Measure LLM accuracy with both retrieved and oracle evidence to separate retrieval vs reasoning gaps.
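
The third step comes down to simple arithmetic. The sketch below is an illustration under stated assumptions: `decompose_gap` is a hypothetical helper that treats accuracy with oracle (ground-truth) evidence as the ceiling set by reasoning, and the drop from oracle to retrieved evidence as the share attributable to retrieval. The example inputs are the GPT-4 figures quoted in the summary (0.89 with ground-truth evidence, 0.56 with retrieved chunks).

```python
from typing import Dict


def decompose_gap(oracle_acc: float, retrieved_acc: float) -> Dict[str, float]:
    """Split the distance from perfect accuracy into two parts (assumed framing).

    reasoning_gap: errors that remain even when the model sees gold evidence.
    retrieval_gap: extra errors that appear when gold evidence is replaced
                   by whatever the retriever returned.
    """
    if not (0.0 <= retrieved_acc <= 1.0 and 0.0 <= oracle_acc <= 1.0):
        raise ValueError("accuracies must be in [0, 1]")
    return {
        "reasoning_gap": round(1.0 - oracle_acc, 2),
        "retrieval_gap": round(oracle_acc - retrieved_acc, 2),
    }


# GPT-4 figures as reported in the summary of this page.
gaps = decompose_gap(oracle_acc=0.89, retrieved_acc=0.56)
print(gaps)
```

Under this framing, the retrieval gap (0.33) is three times the reasoning gap (0.11) for GPT-4, which is why retrieval improvements are the first thing to test.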

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/yixuantt/MultiHop-RAG/

Data URLs

https://github.com/yixuantt/MultiHop-RAG/

Risks & Boundaries

Limitations

Answers are limited to short forms (yes/no, entity, before/after), which simplifies evaluation but excludes free-text answers.

Each query uses at most four supporting evidence pieces; longer-chained retrieval is not covered.

When Not To Use

When you need open-ended, long free-text answers or explanations beyond single-word/entity outputs.

For multi-hop scenarios requiring more than four evidence items.

Failure Modes

Retrieval misses: correct evidence often falls outside small top-K contexts.

Logical errors: some LLMs mishandle negation and comparisons.

Core Entities

Models

GPT-4, GPT-3.5, Claude-2, Google-PaLM, Llama-2-70b-chat-hf, Mixtral-8x7B-Instruct, voyage-02, bge-large-en-v1.5, text-embedding-ada-002, e5-base-v2, instructor-large, bge-reranker-large

Metrics

Hits@K, MAP@K, MRR@K, Accuracy

Datasets

MultiHop-RAG (this paper), mediastack news (source)

Benchmarks

MultiHop-RAG
