FeB4RAG — a federated-search dataset built for modern RAG pipelines

February 19, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The dataset is practical and reproducible; LLM labels are promising but require spot-checking due to moderate agreement and labeling choices.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 50%

Authors

Shuai Wang, Ekaterina Khramtsova, Shengyao Zhuang, Guido Zuccon

Links

Abstract / PDF / Code / Data

Why It Matters For Business

When you feed retrieval results to an LLM, the resources you query and the way you merge their results materially affect answer quality, cost, and latency.

Who Should Care

Summary TLDR

FeB4RAG is a new benchmark for federated search inside Retrieval-Augmented Generation (RAG) systems. It bundles 16 simulated search engines (built from BEIR datasets and driven by state-of-the-art dense retrievers), 790 conversational-style information requests, and LLM-derived graded relevance labels at both result and engine level. The authors show that LLM labels agree moderately with human labels (Cohen's Kappa ≈ 0.57), that most top results are judged non-relevant, and that selecting only relevant engines (best-fed) produces better RAG answers than a naive round-robin approach. Code and data are published so the experiments can be re-run or extended.

Problem Statement

Existing federated-search test collections are old, built for ad-hoc web search, and do not reflect modern RAG needs: conversational requests, neural dense retrieval, and how federated selection/merging affects LLM answer quality.

Main Contribution

FeB4RAG dataset: 790 conversational-style requests mapped to 16 BEIR-based 'search engines' simulated with modern dense retrievers.

LLM-based graded relevance labels at result and engine level using two open LLMs and an aggregation protocol.

Key Findings

FeB4RAG contains 790 user requests across 16 simulated search engines derived from BEIR.

Numbers: 790 requests; 16 datasets; collection size 36.9M docs

Practical Use: You can use a compact, ready-to-run federated-RAG benchmark that reflects conversational requests without building your own federation.

Evidence Ref: Table 1; Section 3

LLM-derived relevance labels agree moderately with human labels (aggregation performs best).

Numbers: Cohen's Kappa ≈ 0.57 for aggregated LLM labels vs human

Practical Use: LLMs can cheaply scale relevance labeling for federated collections, but validate on a human sample before trusting labels in production.

Evidence Ref: Section 4.2; Figure 2
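To make the "validate on a human sample" advice concrete, here is a minimal sketch of how agreement between human and LLM labels could be measured with unweighted Cohen's Kappa. The label lists are made up for illustration, the 0 to 3 graded scale is an assumption, and the paper may use a different Kappa variant or tooling.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels on an assumed 0-3 graded scale (not from the dataset).
human = [0, 0, 1, 2, 0, 1, 3, 0, 1, 0]
llm = [0, 1, 1, 2, 0, 0, 3, 0, 1, 0]
print(round(cohens_kappa(human, llm), 3))
```

Running the same function on a small human-labeled sample of your own data gives a quick check on whether LLM labels are trustworthy enough for your use case.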

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Dataset size | 790 requests; 16 simulated engines; 36.9M docs | - | - | FeB4RAG (overall) | - | Table 1; Section 3
Annotation distribution (result-level) | NR 59.96%, MR 34.15%, HR 5.57%, Key 0.32% | - | - | Overall (all search results) | - | Table 4; Section 4.1

What To Try In 7 Days

Run FeB4RAG's 80-sample RAG demo to compare naive round-robin vs a simple resource selector.

Measure top-1 relevance rates on your own federated sources; if more than 50% of top-1 results are non-relevant, add engine selection or result filtering.

Replace naive merging with minimal-relevance filtering (drop label=0) and re-run a few RAG prompts to compare outputs.
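The second and third steps above can be sketched in a few lines. This is an illustrative outline only, not the FeB4RAG demo code: the hit dictionaries with `doc_id` and `label` fields, the function names, and the sample data are all hypothetical, and label 0 is assumed to mean non-relevant.

```python
from itertools import zip_longest

def round_robin_merge(ranked_lists, k):
    """Interleave per-engine rankings: rank 1 of each engine, then rank 2, ..."""
    merged, seen = [], set()
    for tier in zip_longest(*ranked_lists):
        for hit in tier:
            # Skip padding from shorter lists and duplicate documents.
            if hit is None or hit["doc_id"] in seen:
                continue
            seen.add(hit["doc_id"])
            merged.append(hit)
    return merged[:k]

def drop_non_relevant(ranked_lists):
    """Minimal-relevance filter: keep only hits with label > 0."""
    return [[h for h in hits if h["label"] > 0] for hits in ranked_lists]

def top1_non_relevant_rate(ranked_lists):
    """Share of engines whose top-ranked hit is labeled non-relevant (0)."""
    tops = [hits[0] for hits in ranked_lists if hits]
    if not tops:
        return 0.0
    return sum(h["label"] == 0 for h in tops) / len(tops)

# Hypothetical results from three engines (labels are made up).
engines = [
    [{"doc_id": "a1", "label": 0}, {"doc_id": "a2", "label": 2}],
    [{"doc_id": "b1", "label": 1}, {"doc_id": "b2", "label": 0}],
    [{"doc_id": "c1", "label": 0}],
]

naive = round_robin_merge(engines, k=3)
filtered = round_robin_merge(drop_non_relevant(engines), k=3)
rate = top1_non_relevant_rate(engines)

print([h["doc_id"] for h in naive])
print([h["doc_id"] for h in filtered])
print(f"top-1 non-relevant rate: {rate:.2f}")
```

Feeding `naive` and `filtered` to the same RAG prompt gives a simple side-by-side comparison of how much label-0 results dilute the context.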

Agent Features

Tool Use: RAG
Frameworks: langchain, llamaindex, DSPy

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

https://github.com/ielab/FeB4RAG
BEIR benchmark (public)

Risks & Boundaries

Limitations

Relevance labels rely on two LLMs and aggregation; this can introduce label bias despite moderate Kappa.

Search engines are simulated with dense retrievers only; no live proprietary engines or zero-shot LLM rankers were used.

When Not To Use

If you need federations with proprietary live search engines or APIs not modelled by BEIR-derived retrievers.

If you require fully human-annotated relevance pools for legal or safety-critical audits without LLM validation.

Failure Modes

LLM label bias causes systematic over/under-estimation of relevance for niche verticals.

Resource-selection rules tuned on FeB4RAG might not generalize to live web search engines.

Core Entities

Models

e5-large, multilingual-e5-large, SGPT-5.8B, UAE-Large-V1, all-mpnet-base-v2, gte-base, gte-large, instructor-xl, upstage/SOLAR-10.7B-Instruct-v1.0, RubielLabarta/LogoS-7Bx2-MoE-13B-v0.1, GPT-4 (gpt-4-0125-preview)

Metrics

Cohen's Kappa, graded precision, nDCG@k (suggested for eval)

Datasets

BEIR (selected 16 datasets): MS MARCO, TREC-COVID, NFCorpus, SCIDOCS, NQ, HotpotQA, FIQA-2018, Signal-1M, TREC-NEWS, Robust04, ArguAna, Touché-2020, DBPedia, FEVER, Climate-FEVER, SciFact

Benchmarks

FeB4RAG, BEIR, MTEB