FeB4RAG — a federated-search dataset built for modern RAG pipelines

Overview

Decision Snapshot: Needs Validation

The dataset is practical and reproducible; LLM labels are promising but require spot-checking due to moderate agreement and labeling choices.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 50%

Authors

Shuai Wang, Ekaterina Khramtsova, Shengyao Zhuang, Guido Zuccon

Why It Matters For Business

When you feed retrieval results to an LLM, which resources you query and how you merge results materially affects answer quality, cost and latency.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

FeB4RAG is a new benchmark for federated search inside Retrieval-Augmented Generation (RAG) systems. It bundles 16 simulated search engines (built from BEIR datasets and driven by state-of-the-art dense retrievers), 790 conversational-style information requests, and LLM-derived graded relevance labels at both result and engine level. The authors show that aggregated LLM labels agree moderately with human labels (Cohen's Kappa ≈ 0.57), that most top results are judged non-relevant, and that querying only the relevant engines (the "best-fed" setting) produces better RAG answers than a naive round-robin approach. Code and data are published so the experiments can be extended or re-run.

Problem Statement

Existing federated-search test collections are old, built for ad-hoc web search, and do not reflect modern RAG needs: conversational requests, neural dense retrieval, and how federated selection/merging affects LLM answer quality.

Main Contribution

FeB4RAG dataset: 790 conversational-style requests mapped to 16 BEIR-based 'search engines' simulated with modern dense retrievers.

LLM-based graded relevance labels at result and engine level using two open LLMs and an aggregation protocol.

Key Findings

FeB4RAG contains 790 user requests across 16 simulated search engines derived from BEIR.

Numbers: 790 requests; 16 datasets; collection size 36.9M docs

Practical Use: You can use a compact, ready-to-run federated-RAG benchmark that reflects conversational requests without building your own federation.

Evidence Ref: Table 1; Section 3

LLM-derived relevance labels agree moderately with human labels (aggregation performs best).

Numbers: Cohen's Kappa ≈ 0.57 for aggregated LLM labels vs human

Practical Use: LLMs can cheaply scale relevance labeling for federated collections, but validate on a human sample before trusting labels in production.

Evidence Ref: Section 4.2; Figure 2
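To make the agreement figure concrete, here is a minimal sketch of how Cohen's Kappa can be computed when spot-checking LLM labels against a human sample. It assumes the unweighted form of the statistic; the summary does not say which variant the authors used. The function name and the toy label lists are hypothetical and are not taken from FeB4RAG.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical toy labels on a 0-3 graded scale (assumption, not FeB4RAG data).
human = [0, 0, 1, 1, 2, 0, 3, 1, 0, 2]
llm = [0, 1, 1, 1, 2, 0, 2, 1, 0, 0]
kappa = cohens_kappa(human, llm)
print(f"kappa = {kappa:.3f}")
```

On a real spot-check, `human` would hold the labels from your own annotated sample and `llm` the matching LLM labels for the same results.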

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Dataset size	790 requests; 16 simulated engines; 36.9M docs	—	—	FeB4RAG (overall)	Table 1; Section 3	—
Annotation distribution (result-level)	NR 59.96%, MR 34.15%, HR 5.57%, Key 0.32%	—	—	Overall (all search results)	Table 4; Section 4.1	—
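The annotation shares in the table can be checked with a few lines of arithmetic. The sketch below copies the four percentages from the row above and assumes that NR is the only non-relevant grade, so MR, HR and Key together count as results with some relevance; that reading of the abbreviations is an assumption, since the table does not expand them.

```python
# Result-level label shares in percent, copied from the Results table.
shares = {"NR": 59.96, "MR": 34.15, "HR": 5.57, "Key": 0.32}

# The four shares should account for every labelled result.
total = round(sum(shares.values()), 2)
# Everything that is not NR, i.e. results with a non-zero grade (assumed reading).
some_relevance = round(100.0 - shares["NR"], 2)
# Share of the two strongest grades only.
strong = round(shares["HR"] + shares["Key"], 2)

print(total, some_relevance, strong)
```

Under that reading, roughly four in ten results carry some relevance and fewer than six in a hundred reach the two strongest grades.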

What To Try In 7 Days

Run FeB4RAG's 80-sample RAG demo to compare naive round-robin vs a simple resource selector.

Measure top-1 relevance rates on your own federated sources; if >50% non-relevant, add selection or filtering.

Replace naive merging with minimal-relevance filtering (drop label=0) and re-run a few RAG prompts to compare outputs.
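The three steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the FeB4RAG code: the function names, document ids and labels are hypothetical, each engine is assumed to return a ranked list of ids, and labels are assumed to be integers where 0 means non-relevant.

```python
from itertools import zip_longest

def round_robin_merge(ranked_lists):
    """Interleave per-engine rankings: rank 1 of each engine, then rank 2, ..."""
    merged = []
    for tier in zip_longest(*ranked_lists):
        # zip_longest pads exhausted engines with None; skip the padding.
        merged.extend(doc for doc in tier if doc is not None)
    return merged

def drop_non_relevant(results, labels):
    """Keep only results whose graded label is above 0 (unlabelled counts as 0)."""
    return [doc for doc in results if labels.get(doc, 0) > 0]

def top1_non_relevant_rate(ranked_lists, labels):
    """Share of engines whose top-ranked result has label 0."""
    tops = [lst[0] for lst in ranked_lists if lst]
    if not tops:
        return 0.0
    return sum(labels.get(doc, 0) == 0 for doc in tops) / len(tops)

# Hypothetical rankings from three engines and hypothetical graded labels.
engines = [["a1", "a2", "a3"], ["b1", "b2"], ["c1"]]
labels = {"a1": 0, "a2": 2, "a3": 1, "b1": 1, "b2": 0, "c1": 0}

merged = round_robin_merge(engines)
filtered = drop_non_relevant(merged, labels)
rate = top1_non_relevant_rate(engines, labels)
print(merged)
print(filtered)
print(rate)
```

Feeding `merged` and `filtered` to the same RAG prompt gives a quick side-by-side comparison; a `rate` above 0.5 on your own sources is the signal that the second step suggests acting on.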

Agent Features

Tool Use

RAG

Frameworks

langchain, llamaindex, DSPy

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/ielab/FeB4RAG

Data URLs

https://github.com/ielab/FeB4RAG

BEIR benchmark (public)

Risks & Boundaries

Limitations

Relevance labels rely on two LLMs and aggregation; this can introduce label bias despite moderate Kappa.

Search engines are simulated with dense retrievers only; no live proprietary engines or zero-shot LLM rankers were used.

When Not To Use

If you need federations with proprietary live search engines or APIs not modelled by BEIR-derived retrievers.

If you require fully human-annotated relevance pools for legal or safety-critical audits without LLM validation.

Failure Modes

LLM label bias causes systematic over/under-estimation of relevance for niche verticals.

Resource-selection rules tuned on FeB4RAG might not generalize to live web search engines.

Core Entities

Models

e5-large, multilingual-e5-large, SGPT-5.8B, UAE-Large-V1, all-mpnet-base-v2, gte-base, gte-large, instructor-xl, upstage/SOLAR-10.7B-Instruct-v1.0, RubielLabarta/LogoS-7Bx2-MoE-13B-v0.1, GPT-4 (gpt-4-0125-preview)

Metrics

Cohen's Kappa, graded precision, nDCG@k (suggested for eval)
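For readers who want to try the suggested nDCG@k on graded labels, a minimal sketch follows. It assumes linear gain (the label value itself) and a log2 rank discount, and it builds the ideal ranking only from the labels present in the list; the summary does not specify which nDCG variant, if any, the authors use, so treat these choices and the function names as assumptions.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded labels."""
    # Rank 1 gets discount log2(2) = 1, rank 2 gets log2(3), and so on.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """DCG of the ranking divided by DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded labels of a merged result list, in ranked order.
ranking = [0, 2, 1, 0, 3]
score = ndcg_at_k(ranking, 3)
print(f"nDCG@3 = {score:.3f}")
```

A ranking that is already sorted by label scores 1.0, and a list with no relevant results scores 0.0 by convention in this sketch.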

Datasets

BEIR (selected 16 datasets), MS MARCO, TREC-COVID, NFCorpus, SCIDOCS, NQ, HotpotQA, FIQA-2018, Signal-1M, TREC-NEWS, Robust04, ArguAna, Touché-2020, DBPedia, FEVER, Climate-FEVER, SciFact

Benchmarks

FeB4RAG, BEIR, MTEB
