Overview
Production Readiness
0.7
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
XRAG helps teams identify which retrieval or reranking change actually improves end-to-end QA accuracy, reducing guesswork and wasted engineering time when deploying RAG-powered search or assistant features.
Summary TLDR
XRAG is an open-source, modular toolkit and benchmark for analyzing the four core stages of Retrieval-Augmented Generation (RAG): pre-retrieval (query rewriting), retrieval, post-retrieval (reranking/compaction), and generation. It standardizes three datasets (HotpotQA, DropQA, NaturalQA), provides unified data formats, bundles three evaluators (Conventional Retrieval, Conventional Generation, Cognitive LLM-based), and ships diagnostics and targeted optimizations for common RAG failures (ranking confusion, missing answers, noise, negative refusal, complex reasoning). Key empirical takeaways: hybrid retrieval plus re-ranking yields the largest retrieval gains; rerankers (BGE‑RRK, Jina‑RRK) or
Problem Statement
RAG systems combine many interchangeable modules, but the field lacks a consistent, modular benchmark that isolates the pre-retrieval, retrieval, post-retrieval, and generation components. That gap makes it hard to compare methods, find failure points, or test fixes under identical conditions.
Main Contribution
An open-source, modular codebase (XRAG) to assemble and test RAG pipelines component-by-component.
A unified dataset format and packaged versions of HotpotQA, DropQA, and NaturalQA for joint retrieval+generation evaluation.
A multidimensional evaluator suite: Conventional Retrieval (ConR), Conventional Generation (ConG), and Cognitive LLM (CogL) evaluations.
Systematic failure diagnostics and targeted fixes (e.g., hybrid retrieval + re-ranking, query rewriting, per-chunk Q&A).
A developer Web UI and config system for reproducible experiments and monitoring.
Key Findings
Combining hybrid retrieval (BM25 + vector) with re-ranking yields large retrieval gains.
Rerankers substantially improve retrieval metrics over basic retrievers.
Cognitive LLM evaluation shows retrieved contexts are highly relevant but generated text often diverges from gold answers.
Longer queries improve some retrieval and generation metrics.
Adding more retrieved contexts yields only small average gains.
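To make the first finding concrete, here is a minimal sketch of hybrid retrieval followed by re-ranking. It assumes reciprocal rank fusion (RRF) as the way to merge the BM25 and vector rankings; the summary does not say which fusion method XRAG uses, so RRF is a stand-in. The function names, the constant k=60, and the toy word-overlap scorer are all hypothetical; in practice the scorer would be a real reranker such as BGE‑RRK.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranked list.

    Each document receives 1 / (k + rank) from every list it appears in.
    RRF is an assumed fusion method, not necessarily the one XRAG uses.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def rerank(query: str, candidates: Sequence[str],
           score_fn: Callable[[str, str], float], top_n: int) -> List[str]:
    """Re-order candidates by a query-document score and keep the best top_n."""
    return sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)[:top_n]


# Toy corpus and rankings (hypothetical data for illustration only).
texts = {
    "d1": "paris is the capital of france",
    "d2": "berlin is in germany",
    "d3": "paris is a city on the seine in france",
    "d4": "bananas are yellow",
}
bm25_ranking = ["d1", "d2", "d3"]
vector_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])


def overlap_score(query: str, doc_id: str) -> float:
    """Toy scorer: number of query words that also occur in the document."""
    return float(len(set(query.split()) & set(texts[doc_id].split())))


top = rerank("seine city france", fused, overlap_score, top_n=2)
```

On this toy data the fused list starts with the documents both retrievers agree on, and the re-ranking step then promotes d3 above d1 because it matches more of the query.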
Results
Retrieval F1 (hybrid + re-rank)
Response relevance (CogL Dp-ARel)
ROUGE‑1 mean change with more contexts
LLM Perplexity (Llama3.1‑8B on HQA)
Who Should Care
What To Try In 7 Days
Run XRAG on a small sample of your QA queries to compare your retriever vs hybrid+re-ranker.
Add a reranker (a cross‑encoder such as BGE‑RRK, or the late‑interaction model ColBERTv2) as a post-retrieval step and measure Hit@1 and NDCG before and after.
Test simple query rewriting (HyDE or step-back prompts) to see quick retrieval boosts for hard queries.
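For the before/after comparison suggested above, the two metrics can be computed without any toolkit. The sketch below assumes binary relevance (a document is either a gold document or not) and unique document ids per ranking; the function names are illustrative and are not XRAG's API.

```python
import math
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked[:k]) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical rankings for one query whose only gold document is "d1".
gold = {"d1"}
baseline = ["d2", "d1", "d3"]   # retriever only
reranked = ["d1", "d2", "d3"]   # after the reranker

baseline_hit1 = hit_at_k(baseline, gold, 1)
reranked_hit1 = hit_at_k(reranked, gold, 1)
baseline_ndcg = ndcg_at_k(baseline, gold, 3)
reranked_ndcg = ndcg_at_k(reranked, gold, 3)
```

Averaging these per-query values over the sampled queries gives the figures to compare between the two pipelines.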
Optimization Features
Token Efficiency
- Sampling-based test sets to cut token cost
- Chunk size 128 and overlap 20 to control index granularity
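The chunking setting above can be illustrated with a short sliding-window sketch. It assumes the sizes are counted in tokens and uses whitespace splitting as a stand-in for a real tokenizer; the summary does not state the unit, and chunk_tokens is a hypothetical helper, not part of XRAG.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], chunk_size: int = 128,
                 overlap: int = 20) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# A synthetic 300-token document (whitespace tokens, for illustration only).
document = " ".join(f"w{i}" for i in range(300))
chunks = chunk_tokens(document.split())
chunk_lengths = [len(c) for c in chunks]
```

With these defaults each window advances by 108 tokens, so a 300-token document produces three chunks, and neighbouring chunks repeat 20 tokens so that a sentence cut at a boundary still appears whole in one of them.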
Infra Optimization
- Supports local deployment via Ollama and HuggingFace; GPU requirements documented
System Optimization
- Modular config + Web UI for rapid ablation
- Standardized unified dataset format
Inference Optimization
- Add re-ranking as a post-process
- Use compact/accumulate strategies to reduce LLM calls
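As a rough illustration of why compaction reduces LLM calls, the sketch below packs retrieved chunks into as few prompts as fit under a size budget, so one call covers several chunks instead of one call per chunk. This is an assumption about what a "compact" strategy means here; the word-count budget and the compact_chunks helper are hypothetical, and a real setup would count model tokens instead of words.

```python
from typing import List


def compact_chunks(chunks: List[str], max_words: int) -> List[str]:
    """Greedily pack consecutive chunks into prompts of at most max_words words.

    A single chunk longer than max_words still gets its own prompt.
    """
    packed: List[str] = []
    current: List[str] = []
    count = 0
    for chunk in chunks:
        n = len(chunk.split())
        if current and count + n > max_words:
            packed.append("\n\n".join(current))  # flush the full prompt
            current, count = [], 0
        current.append(chunk)
        count += n
    if current:
        packed.append("\n\n".join(current))
    return packed


# Five hypothetical retrieved chunks of three words each.
retrieved = [f"chunk {i} text" for i in range(5)]
prompts = compact_chunks(retrieved, max_words=7)
calls_without_compaction = len(retrieved)
calls_with_compaction = len(prompts)
```

In this toy case five per-chunk calls become three packed calls; the saving grows as the budget allows more chunks per prompt.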
Reproducibility
License
- dataset: CC BY-NC-SA 4.0; code license not specified in paper
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Does not include training scripts for RAG components; focuses on evaluation only.
- Coverage limited to three QA datasets (HotpotQA, DropQA, NaturalQA).
- Cognitive LLM evaluations (CogL) were limited to pilot studies due to token cost.
When Not To Use
- When you need end-to-end training of retrievers or fine-tuning (XRAG does not train components).
- When your application uses QA types not covered by the packaged datasets (e.g., multiple-choice, long-form OpenQA) and you do not plan to extend the toolkit.
- When token budget prevents any LLM‑based CogL evaluation.
Failure Modes
- Negative refusal (model refuses or fabricates answers under uncertainty)
- Ranking confusion (relevant context ranked too low or late in input)
- Answer absence (LLM misses answers even when contexts present)
- Noise impact (irrelevant retrieved chunks harm answers)
- Complex reasoning failures (cross-document multi-hop reasoning gaps)
Core Entities
Models
- BGE-large
- Jina-Large
- Llama3.1-8B
- DeepSeek R1-7B
- GPT-3.5 Turbo
- GPT-4 Turbo
- ColBERTv2
Metrics
- F1
- MRR
- Hit@1
- Hit@10
- MAP
- NDCG
- DCG
- IDCG
- ChrF
- ChrF++
- METEOR
- ROUGE (R1,R2,RL)
- Perplexity (PPL)
- WER
- CER
- Dp-ARel
- Up-FAcc
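Two of the rank-based metrics listed above, MRR and MAP, follow standard definitions that are easy to state in code. The sketch below assumes binary relevance and unique document ids per ranking; the function names and sample data are illustrative, not XRAG's implementation.

```python
from typing import List, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@rank over the ranks where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values) if values else 0.0


# Two hypothetical queries: (ranked document ids, gold document ids).
runs: List[Tuple[List[str], Set[str]]] = [
    (["a", "b", "c"], {"b"}),
    (["x", "y", "z"], {"x", "z"}),
]
mrr = mean([reciprocal_rank(r, g) for r, g in runs])
map_score = mean([average_precision(r, g) for r, g in runs])
```

MRR only rewards the position of the first relevant hit, while MAP also accounts for how early the remaining relevant documents appear, which is why the two can disagree on multi-hop queries with several gold documents.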
Datasets
- HotpotQA
- DropQA (DROP)
- NaturalQA
Benchmarks
- ConR (Conventional Retrieval)
- ConG (Conventional Generation)
- CogL (Cognitive LLM Evaluation)

