XRAG: open-source toolkit and benchmark that tests pre‑retrieval, retrieval, post‑retrieval, and generation modules in RAG

December 20, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

0

Authors

Qianren Mao, Yangyifei Luo, Qili Zhang, Yashuo Luo, Zhilong Cao, Jinlong Zhang, HanWen Hao, Zhijun Chen, Weifeng Jiang, Junnan Liu, Xiaolong Wang, Zhenting Huang, Zhixing Tan, Sun Jie, Bo Li, Xudong Liu, Richong Zhang, Jianxin Li

Links

Abstract / PDF

Why It Matters For Business

XRAG helps teams identify which retrieval or reranking change actually improves end-to-end QA accuracy, reducing guesswork and wasted engineering time when deploying RAG-powered search or assistant features.

Summary TLDR

XRAG is an open-source, modular toolkit and benchmark for analyzing the four core stages of Retrieval-Augmented Generation (RAG): pre-retrieval (query rewriting), retrieval, post-retrieval (reranking/compaction), and generation. It standardizes three datasets (HotpotQA, DropQA, NaturalQA), provides unified data formats, bundles three evaluators (Conventional Retrieval, Conventional Generation, Cognitive LLM-based), and ships diagnostics and targeted optimizations for common RAG failures (ranking confusion, missing answers, noise, negative refusal, complex reasoning). Key empirical takeaways: hybrid retrieval plus re-ranking yields the largest retrieval gains; rerankers (BGE‑RRK, Jina‑RRK) or

Problem Statement

RAG systems mix many interchangeable modules but lack a consistent, modular benchmark that isolates pre-retrieval, retrieval, post-retrieval, and generation components. That gap makes it hard to compare methods, find failure points, or test fixes under identical conditions.

Main Contribution

An open-source, modular codebase (XRAG) to assemble and test RAG pipelines component-by-component.

A unified dataset format and packaged versions of HotpotQA, DropQA, and NaturalQA for joint retrieval+generation evaluation.

A multidimensional evaluator suite: Conventional Retrieval (ConR), Conventional Generation (ConG), and Cognitive LLM (CogL) evaluations.

Systematic failure diagnostics and targeted fixes (e.g., hybrid retrieval + re-ranking, query rewriting, per-chunk Q&A).

A developer Web UI and config system for reproducible experiments and monitoring.

Key Findings

Combining hybrid retrieval (BM25 + vector) with re-ranking yields large retrieval gains.

Numbers: F1 0.975 vs baseline 0.740 (Table 12)
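A minimal sketch of the hybrid-plus-rerank idea, under stated assumptions: the summary does not say how XRAG merges BM25 and vector results, so reciprocal rank fusion (RRF) is swapped in here as one common choice. The function names, the `k=60` constant, and the toy scorer are hypothetical and not part of XRAG.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first.

    Assumption: RRF stands in for whatever fusion XRAG actually uses.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score; break ties by doc id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def rerank(query, doc_ids, score_fn, top_n=3):
    """Re-order fused candidates with a relevance scorer (hypothetical interface)."""
    return sorted(doc_ids, key=lambda d: -score_fn(query, d))[:top_n]


# Toy rankings from a lexical and a dense retriever.
bm25_ranking = ["d3", "d1", "d7"]
vector_ranking = ["d1", "d5", "d3"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])

# Toy scores standing in for a real reranker model's output.
toy_scores = {"d1": 0.2, "d3": 0.9, "d5": 0.1, "d7": 0.5}
top = rerank("example query", fused, lambda q, d: toy_scores[d], top_n=2)
```

In a real pipeline the scorer would call a reranker model; the point of the sketch is only the two-stage shape: fuse cheap candidate lists first, then spend the expensive model on a short list.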

Rerankers substantially improve retrieval metrics over basic retrievers.

Numbers: BGE‑RRK F1 66.90 (Table 4), above the basic retriever

Cognitive LLM evaluation shows retrieved contexts are highly relevant but generated text often diverges from gold answers.

Numbers: Dp-ARel (response relevance) 0.9367 / 0.9900 / 0.9100 for HQA / DQA / NQA (Table 8)

Longer queries improve some retrieval and generation metrics.

Numbers: Hit@1 and METEOR trend upward with query length (Figure 3)

Adding more retrieved contexts yields only small average gains.

Numbers: ROUGE‑1 mean rises ~0.31 → 0.41 as contexts increase (text and Figure 4)

Results

Retrieval F1 (hybrid + re-rank)

Value: 0.975

Baseline: 0.740 (Basic-RAG)

Response relevance (CogL Dp-ARel)

Value: 0.9367 / 0.9900 / 0.9100 (HQA / DQA / NQA)

ROUGE‑1 mean change with more contexts

Value: 0.31 → 0.41

Baseline: fewer contexts

LLM Perplexity (Llama3.1‑8B on HQA)

Value: PPL 181.71

Who Should Care

What To Try In 7 Days

Run XRAG on a small sample of your QA queries to compare your retriever vs hybrid+re-ranker.

Enable a reranker as a post-retrieval step (e.g., the BGE‑RRK cross-encoder or the late-interaction ColBERTv2) and measure Hit@1/NDCG.

Test simple query rewriting (HyDE or step-back prompts) to see quick retrieval boosts for hard queries.
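To compare runs in the steps above, the ranking metrics can be computed by hand. The sketch below uses the standard binary-relevance definitions of Hit@k, reciprocal rank, and NDCG@k. It is not XRAG's evaluator code, the function names are illustrative, and it assumes each ranked list contains no duplicate ids.

```python
import math


def hit_at_k(ranked, relevant, k):
    """1.0 if any relevant doc appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant doc; 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """NDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# One query: the only relevant doc is retrieved at rank 2.
ranked = ["d2", "d1", "d4"]
relevant = {"d1"}
h1 = hit_at_k(ranked, relevant, 1)
h3 = hit_at_k(ranked, relevant, 3)
rr = reciprocal_rank(ranked, relevant)
ndcg3 = ndcg_at_k(ranked, relevant, 3)
```

Averaging `reciprocal_rank` over all queries gives MRR. Running the same functions before and after enabling a reranker shows whether the relevant chunk actually moved up.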

Optimization Features

Token Efficiency

  • Sampling-based test sets to cut token cost
  • Chunk size 128 and overlap 20 to control index granularity
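The two settings above can be sketched as follows. Assumptions: chunk size and overlap are counted in tokens (the summary does not state the unit), tokens are already split, and the helper names and the seed default are hypothetical rather than XRAG's own API.

```python
import random


def chunk_tokens(tokens, chunk_size=128, overlap=20):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


def sample_test_set(queries, n, seed=0):
    """Draw a reproducible subset of queries to cap evaluation token cost."""
    rng = random.Random(seed)
    return rng.sample(queries, min(n, len(queries)))


doc_tokens = list(range(300))  # stand-in for a tokenized document
chunks = chunk_tokens(doc_tokens)
subset = sample_test_set([f"q{i}" for i in range(1000)], n=50)
```

With these defaults a 300-token document yields three chunks, and the last 20 tokens of each chunk repeat at the start of the next one, so an answer that straddles a boundary still appears whole in at least one chunk.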

Infra Optimization

  • Supports local deployment via Ollama and HuggingFace; GPU requirements documented

System Optimization

  • Modular config + Web UI for rapid ablation
  • Standardized unified dataset format

Inference Optimization

  • Add re-ranking as a post-process
  • Use compact/accumulate strategies to reduce LLM calls
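One plausible reading of the "compact" strategy is to pack as many retrieved chunks as fit into each prompt, so the LLM is called once per batch instead of once per chunk. The sketch below illustrates that reading only. The function name, the whitespace token counter, and the budget value are assumptions, not XRAG's implementation.

```python
def compact_chunks(chunks, max_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily pack chunks into batches that stay within max_tokens each.

    A single chunk larger than max_tokens still gets its own batch.
    """
    batches, current, used = [], [], 0
    for chunk in chunks:
        n = count_tokens(chunk)
        # Start a new batch when adding this chunk would exceed the budget.
        if current and used + n > max_tokens:
            batches.append("\n\n".join(current))
            current, used = [], 0
        current.append(chunk)
        used += n
    if current:
        batches.append("\n\n".join(current))
    return batches


retrieved = ["a b c", "d e", "f g h i", "j"]
batches = compact_chunks(retrieved, max_tokens=5)
# Four chunks become two prompts, so two LLM calls instead of four.
```

A production version would count tokens with the model's own tokenizer and reserve room for the question and instructions in each prompt.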

Reproducibility

License

  • Dataset: CC BY-NC-SA 4.0; code license not specified in the paper

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Does not include training scripts for RAG components; focuses on evaluation only.
  • Coverage limited to three QA datasets (HotpotQA, DropQA, NaturalQA).
  • Cognitive LLM evaluations (CogL) were limited to pilot studies due to token cost.

When Not To Use

  • When you need end-to-end training of retrievers or fine-tuning (XRAG does not train components).
  • When your application uses QA types that the packaged datasets do not cover (e.g., multiple-choice, long-form OpenQA), unless you extend XRAG to handle them.
  • When token budget prevents any LLM‑based CogL evaluation.

Failure Modes

  • Negative refusal (model refuses or fabricates answers under uncertainty)
  • Ranking confusion (relevant context ranked too low or late in input)
  • Answer absence (LLM misses answers even when contexts present)
  • Noise impact (irrelevant retrieved chunks harm answers)
  • Complex reasoning failures (cross-document multi-hop reasoning gaps)

Core Entities

Models

  • BGE-large
  • Jina-Large
  • Llama3.1-8B
  • DeepSeek R1-7B
  • GPT-3.5 Turbo
  • GPT-4 Turbo
  • ColBERTv2

Metrics

  • F1
  • MRR
  • Hit@1
  • Hit@10
  • MAP
  • NDCG
  • DCG
  • IDCG
  • ChrF
  • ChrF++
  • METEOR
  • ROUGE (R1,R2,RL)
  • Perplexity (PPL)
  • WER
  • CER
  • Dp-ARel
  • Up-FAcc

Datasets

  • HotpotQA
  • DropQA (DROP)
  • NaturalQA

Benchmarks

  • ConR (Conventional Retrieval)
  • ConG (Conventional Generation)
  • CogL (Cognitive LLM Evaluation)