Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.3
Citation Count
13
Why It Matters For Business
Multi-hop questions (e.g., cross-document finance or product research) are common, and current RAG systems often miss key evidence. Improving retrieval and reranking yields bigger gains than swapping LLMs alone.
Summary TLDR
The authors release MultiHop-RAG, a news-article knowledge base plus 2,556 multi-hop queries (inference, comparison, temporal, null) and ground-truth evidence. They benchmark retrieval (many embedding models + a reranker) and LLM generation (GPT-4, GPT-3.5, PaLM, Claude-2, Llama2-70B, Mixtral). Retrieval quality is limited (best Hits@10 ≈ 0.75 with reranker), and even with perfect evidence some open models struggle; GPT-4 reaches 0.89 accuracy with ground-truth evidence, but on retrieved chunks its accuracy is 0.56. The dataset and code are public on GitHub.
Problem Statement
Few RAG benchmarks require linking and reasoning across multiple documents; existing evaluations mostly target single-document cases. This paper builds a multi-hop RAG dataset and measures how well current embeddings, rerankers, and LLMs retrieve and reason over multiple supporting texts.
Main Contribution
MultiHop-RAG: a public dataset with a news-article knowledge base and 2,556 multi-hop queries labeled with supporting evidence and answers.
A reproducible pipeline that uses GPT-4 to generate claims, bridge topics/entities, and multi-hop queries, followed by manual and GPT-4 quality checks.
Benchmarks showing that retrieval is a major bottleneck and that LLMs vary widely: best retrieval Hits@10 = 0.7467; GPT-4 accuracy is 0.56 with retrieved chunks and 0.89 with ground-truth evidence.
Key Findings
Dataset size and mix: 2,556 multi-hop queries (inference, comparison, temporal, null) drawn from 609 news articles.
Retrieval is a weak link: the best embedding+reranker achieves Hits@10 = 0.7467 and Hits@4 = 0.6625.
LLM reasoning varies: GPT-4 gets 0.56 accuracy with retrieved chunks and 0.89 with ground-truth evidence; open-source models lag.
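The two GPT-4 figures above allow a rough split of the error budget. The sketch below is a simplification, not the paper's analysis: it assumes that errors which disappear once ground-truth evidence is supplied are retrieval errors and that the remainder are reasoning errors. The function name `split_error` is illustrative.

```python
def split_error(acc_retrieved: float, acc_oracle: float) -> dict:
    """Roughly attribute lost accuracy to retrieval vs. reasoning.

    Assumption (not from the paper): errors that vanish with ground-truth
    evidence count as retrieval errors; errors that remain count as
    reasoning errors. In practice the two interact.
    """
    return {
        "retrieval_gap": round(acc_oracle - acc_retrieved, 4),
        "reasoning_gap": round(1.0 - acc_oracle, 4),
    }


# GPT-4 accuracies reported in the summary above.
gaps = split_error(acc_retrieved=0.56, acc_oracle=0.89)
print(gaps)  # {'retrieval_gap': 0.33, 'reasoning_gap': 0.11}
```

Under this reading, about three times as much accuracy is lost to retrieval as to reasoning, which is consistent with the finding that retrieval is the weaker link.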
Results
Hits@10 (best embedding + reranker): 0.7467
Hits@4 (best embedding + reranker): 0.6625
Accuracy (GPT-4, retrieved chunks): 0.56
Accuracy (GPT-4, ground-truth evidence): 0.89
What To Try In 7 Days
Run the MultiHop-RAG retrieval suite on your embedding models to measure Hits@K and MRR.
Add a lightweight reranker (e.g., bge-reranker-large) and compare top-4 vs top-10 evidence quality.
Measure LLM accuracy with both retrieved and oracle evidence to separate retrieval vs reasoning gaps.
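The first two steps can be prototyped before wiring in the real dataset. The sketch below scores ranked retrieval output at top-2 and top-4 using a common query-level definition of Hits@K and MRR@K; the paper's own evaluation script may define them differently (for example, per evidence item). The chunk ids, the toy data, and the function names are hypothetical.

```python
from typing import Dict, List, Set


def hits_at_k(ranked: List[str], gold: Set[str], k: int) -> float:
    """1.0 if any gold evidence id appears in the top-k results, else 0.0."""
    return 1.0 if any(doc_id in gold for doc_id in ranked[:k]) else 0.0


def mrr_at_k(ranked: List[str], gold: Set[str], k: int) -> float:
    """Reciprocal rank of the first gold id within the top-k, else 0.0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0


def evaluate(runs: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int) -> Dict[str, float]:
    """Average both metrics over all queries in `runs`."""
    n = len(runs)
    return {
        f"hits@{k}": sum(hits_at_k(runs[q], gold[q], k) for q in runs) / n,
        f"mrr@{k}": sum(mrr_at_k(runs[q], gold[q], k) for q in runs) / n,
    }


# Toy data: query id -> ranked chunk ids, and query id -> gold evidence ids.
runs = {
    "q1": ["c3", "c7", "c1", "c9"],
    "q2": ["c5", "c2", "c8", "c4"],
}
gold = {"q1": {"c1", "c2"}, "q2": {"c6"}}

scores_at_2 = evaluate(runs, gold, k=2)
scores_at_4 = evaluate(runs, gold, k=4)
print(scores_at_2)
print(scores_at_4)
```

Running the same `evaluate` call on the ranked lists before and after reranking gives a like-for-like comparison of top-4 against top-10 evidence quality.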
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Answers are limited to short forms (yes/no, entity, before/after), which simplifies evaluation but excludes free-text answers.
- Each query uses at most four supporting evidence pieces; longer-chained retrieval is not covered.
- Knowledge base is English news only and covers a short recent time window (Sep–Dec 2023).
When Not To Use
- When you need open-ended, long free-text answers or explanations beyond single-word/entity outputs.
- For multi-hop scenarios requiring more than four evidence items.
- For non-English or domain-specific corpora without additional validation.
Failure Modes
- Retrieval misses: correct evidence often falls outside small top-K contexts.
- Logical errors: some LLMs mishandle negation and comparisons.
- Temporal ordering mistakes: models can misinterpret event sequences.
- Hallucination risk in null queries if retrieval misleads the LLM (though GPT-4 was relatively robust).
Core Entities
Models
- GPT-4
- GPT-3.5
- Claude-2
- Google-PaLM
- Llama-2-70b-chat-hf
- Mixtral-8x7B-Instruct
- voyage-02
- bge-large-en-v1.5
- text-embedding-ada-002
- e5-base-v2
- instructor-large
- bge-reranker-large
Metrics
- Hits@K
- MAP@K
- MRR@K
- Accuracy
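For reference, the three retrieval metrics are commonly defined as follows. These are standard textbook forms, not definitions taken from the paper, and the normalization of MAP@K in particular varies between implementations. Here $Q$ is the query set, $G_q$ the gold evidence set for query $q$, $r_{q,i}$ the chunk at rank $i$, and $\mathrm{rel}_q(i) = \mathbb{1}[r_{q,i} \in G_q]$.

```latex
\begin{align*}
\mathrm{Hits@}K &= \frac{1}{|Q|} \sum_{q \in Q}
    \mathbb{1}\bigl[\exists\, i \le K : r_{q,i} \in G_q\bigr] \\[4pt]
\mathrm{MRR@}K &= \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q},
    \qquad \mathrm{rank}_q = \min\{\, i \le K : r_{q,i} \in G_q \,\}
    \quad (\text{term is } 0 \text{ if no such } i) \\[4pt]
\mathrm{MAP@}K &= \frac{1}{|Q|} \sum_{q \in Q}
    \frac{1}{\min(|G_q|, K)} \sum_{i=1}^{K} P_q(i)\, \mathrm{rel}_q(i),
    \qquad P_q(i) = \frac{1}{i} \sum_{j=1}^{i} \mathrm{rel}_q(j)
\end{align*}
```

Hits@K rewards finding any evidence at all, MRR@K rewards ranking the first piece of evidence early, and MAP@K rewards ranking all of the evidence early. MAP@K matters most for multi-hop queries, which need several supporting chunks.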
Datasets
- MultiHop-RAG (this paper)
- mediastack news (source)
Benchmarks
- MultiHop-RAG

