XRAG: open-source toolkit and benchmark that tests pre‑retrieval, retrieval, post‑retrieval, and generation modules in RAG

Overview

Decision Snapshot: Needs Validation

XRAG is a practical, well‑documented benchmark and toolkit that makes it easy to compare and debug RAG components; experiments across three datasets back most claims.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.87

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

License: dataset: CC BY-NC-SA 4.0; code license not specified in paper

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 40%

Authors

Qianren Mao, Yangyifei Luo, Qili Zhang, Yashuo Luo, Zhilong Cao, Jinlong Zhang, HanWen Hao, Zhijun Chen, Weifeng Jiang, Junnan Liu, Xiaolong Wang, Zhenting Huang, Zhixing Tan, Sun Jie, Bo Li, Xudong Liu, Richong Zhang, Jianxin Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

XRAG helps teams identify which retrieval or reranking change actually improves end-to-end QA accuracy, reducing guesswork and wasted engineering time when deploying RAG-powered search or assistant features.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

XRAG is an open-source, modular toolkit and benchmark for analyzing the four core stages of Retrieval-Augmented Generation (RAG): pre-retrieval (query rewriting), retrieval, post-retrieval (reranking/compaction), and generation. It standardizes three datasets (HotpotQA, DropQA, NaturalQA), provides unified data formats, bundles three evaluators (Conventional Retrieval, Conventional Generation, Cognitive LLM-based), and ships diagnostics and targeted optimizations for common RAG failures (ranking confusion, missing answers, noise, negative refusal, complex reasoning). Key empirical takeaways: hybrid retrieval plus re-ranking yields the largest retrieval gains; rerankers (BGE‑RRK, Jina‑RRK) or

Problem Statement

RAG systems mix many interchangeable modules but lack a consistent, modular benchmark that isolates pre-retrieval, retrieval, post-retrieval, and generation components. That gap makes it hard to compare methods, find failure points, or test fixes under identical conditions.

Main Contribution

An open-source, modular codebase (XRAG) to assemble and test RAG pipelines component-by-component.

A unified dataset format and packaged versions of HotpotQA, DropQA, and NaturalQA for joint retrieval+generation evaluation.

Key Findings

Combining hybrid retrieval (BM25 + vector) with re-ranking yields large retrieval gains.

Numbers: F1 0.975 vs baseline 0.740 (Table 12)

Practical Use: If retrieval precision matters, add a hybrid retriever and a re-ranker; this gives the biggest lift in ordered retrieval quality in tested setups.

Evidence Ref: Table 12
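One common way to merge a BM25 ranking with a vector ranking before re-ranking is reciprocal rank fusion (RRF). The sketch below is a minimal illustration of that idea and is not taken from XRAG; the function name `rrf_fuse`, the document IDs, and the constant `k=60` are assumptions chosen for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default (assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 lists from a sparse (BM25) and a dense (vector) retriever.
bm25_hits = ["d1", "d2", "d3"]
vector_hits = ["d3", "d1", "d4"]
fused = rrf_fuse([bm25_hits, vector_hits])
print(fused)
```

Documents that both retrievers rank highly (`d1`, `d3`) rise to the top of the fused list, which would then be handed to a re-ranker.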

Rerankers substantially improve retrieval metrics over basic retrievers.

Numbers: BGE‑RRK F1 66.90 (Table 4); the basic retriever scores lower

Practical Use: Use a cross‑encoder reranker (e.g., BGE‑RRK or Jina‑RRK) as a cheap post-processing step to boost Hit@1 and NDCG on hard datasets.

Evidence Ref: Table 4
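To check whether a reranker helps on your own data, Hit@1 and NDCG can be computed before and after re-ranking. The following is a minimal sketch with binary relevance; the helper names and the toy document IDs are hypothetical, and XRAG's own evaluators may compute these metrics differently.

```python
import math


def hit_at_k(ranked_ids, relevant_ids, k):
    """Return 1.0 if any relevant doc appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant_ids for d in ranked_ids[:k]) else 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG with binary relevance: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 1)
        for pos, d in enumerate(ranked_ids[:k], start=1)
        if d in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: the only relevant doc moves from position 3 to position 1.
relevant = {"d7"}
before = ["d2", "d5", "d7"]  # order from the base retriever
after = ["d7", "d2", "d5"]   # order after a (hypothetical) reranker

print(hit_at_k(before, relevant, 1), hit_at_k(after, relevant, 1))
print(ndcg_at_k(before, relevant, 3), ndcg_at_k(after, relevant, 3))
```

Averaging these per-query values over a sample of queries gives the numbers to compare between the two pipelines.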

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Retrieval F1 (hybrid + re-rank)	0.975	0.740 (Basic-RAG)	+0.235	sampled low-F1 set (Table 12)	Hybrid retrieval + re-ranking produced F1 0.975 (vs 0.740 baseline).	Table 12
Response relevance (CogL Dp-ARel)	0.9367 / 0.9900 / 0.9100	—	—	HQA / DQA / NQA (Table 8)	CogL shows high response relevance scores across datasets.	Table 8

What To Try In 7 Days

Run XRAG on a small sample of your QA queries to compare your retriever vs hybrid+re-ranker.

Enable a reranker (BGE‑RRK or ColBERTv2) as a post-processing step and measure Hit@1/NDCG.

Test simple query rewriting (HyDE or step-back prompts) to see quick retrieval boosts for hard queries.
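The query-rewriting step can be prototyped with two prompt templates and a pluggable model call. This is a rough sketch only: the prompt wording, the function names, and the `stub_llm` stand-in are assumptions for illustration and are not XRAG's prompts or API. HyDE asks the model for a hypothetical answer passage to retrieve with; step-back prompting asks for a more general version of the question.

```python
def hyde_prompt(question):
    """Prompt asking an LLM for a hypothetical passage that answers the question."""
    return (
        "Write a short passage that answers the question.\n"
        f"Question: {question}\nPassage:"
    )


def step_back_prompt(question):
    """Prompt asking an LLM for a more general question behind the original one."""
    return (
        "Rewrite the question as a more general question about the "
        "underlying concept.\n"
        f"Question: {question}\nGeneral question:"
    )


def rewrite_query(question, llm, strategy="hyde"):
    """Return the text to send to the retriever in place of the raw question."""
    builders = {"hyde": hyde_prompt, "step_back": step_back_prompt}
    if strategy not in builders:
        raise ValueError(f"unknown strategy: {strategy}")
    return llm(builders[strategy](question)).strip()


def stub_llm(prompt):
    # Stand-in for a real model call; replace with your own client.
    return "  hypothetical text  "


rewritten = rewrite_query("Who wrote Dune?", stub_llm, strategy="hyde")
```

Swapping `stub_llm` for a real model client lets you compare retrieval metrics for raw versus rewritten queries on the same sample.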

Optimization Features

Token Efficiency

Sampling-based test sets to cut token cost

Chunk size 128 and overlap 20 to control index granularity
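A chunk size of 128 with an overlap of 20 can be pictured as a sliding window over the document. The sketch below assumes both values are token counts (the summary does not state the unit) and uses a plain list of placeholder tokens instead of a real tokenizer; `chunk_tokens` is a hypothetical helper, not an XRAG function.

```python
def chunk_tokens(tokens, chunk_size=128, overlap=20):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# 300 placeholder tokens stand in for a tokenized document.
tokens = [f"tok{i}" for i in range(300)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

Each window starts 108 tokens after the previous one, so neighboring chunks share 20 tokens and a sentence cut at a boundary still appears whole in one of them.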

Infra Optimization

Supports local deployment via Ollama and HuggingFace; GPU requirements documented

System Optimization

Modular config + Web UI for rapid ablation

Standardized unified dataset format

Inference Optimization

Add re-ranking as a post-process

Use compact/accumulate strategies to reduce LLM calls
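The idea behind a compact strategy can be sketched as greedy packing: retrieved chunks are merged into as few prompt contexts as fit a size budget, so fewer LLM calls are needed. This is an illustrative sketch under that assumption, with the budget measured in characters for simplicity; `compact_chunks` and its parameters are hypothetical and not XRAG's implementation.

```python
def compact_chunks(chunks, max_chars=200, sep="\n\n"):
    """Greedily pack chunks into as few prompt contexts as fit within max_chars."""
    packed, current = [], ""
    for chunk in chunks:
        candidate = chunk if not current else current + sep + chunk
        if len(candidate) <= max_chars or not current:
            # Still fits, or it is a single oversized chunk that must go alone.
            current = candidate
        else:
            packed.append(current)
            current = chunk
    if current:
        packed.append(current)
    return packed


# Four retrieved chunks (placeholder text) packed under a 200-character budget.
retrieved = ["a" * 80, "b" * 80, "c" * 80, "d" * 30]
contexts = compact_chunks(retrieved)
print(len(retrieved), "chunks ->", len(contexts), "LLM calls")
```

In practice the budget would be set from the model's context window and measured in tokens rather than characters.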

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: dataset: CC BY-NC-SA 4.0; code license not specified in paper

Code URLs

https://github.com/DocAILab/XRAG

Data URLs

https://huggingface.co/datasets/DocAILab/XRAG_Dataset

Risks & Boundaries

Limitations

Does not include training scripts for RAG components; focuses on evaluation only.

Coverage limited to three QA datasets (HotpotQA, DropQA, NaturalQA).

When Not To Use

When you need end-to-end training of retrievers or fine-tuning (XRAG does not train components).

If your application uses QA types not in the packaged datasets (e.g., multiple-choice, long-form OpenQA) without extension.

Failure Modes

Negative refusal (model refuses or fabricates answers under uncertainty)

Ranking confusion (relevant context ranked too low or late in input)

Core Entities

Models

BGE-large, Jina-Large, Llama3.1-8B, DeepSeek R1-7B, GPT-3.5 Turbo, GPT-4 Turbo, ColBERTv2

Metrics

F1, MRR, Hit@1, Hit@10, MAP, NDCG, DCG, IDCG, ChrF, ChrF++, METEOR, ROUGE (R1, R2, RL), Perplexity (PPL), WER, CER, Dp-ARel, Up-FAcc

Datasets

HotpotQA, DropQA (DROP), NaturalQA

Benchmarks

ConR (Conventional Retrieval), ConG (Conventional Generation), CogL (Cognitive LLM Evaluation)
