RAGLAB — an open, modular toolkit to reproduce, compare and develop RAG algorithms fairly

August 21, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Xuanwang Zhang, Yunze Song, Yidong Wang, Shuyun Tang, Xinfeng Li, Zhengran Zeng, Zhen Wu, Wei Ye, Wenyuan Xu, Yue Zhang, Xinyu Dai, Shikun Zhang, Qingsong Wen

Why It Matters For Business

RAGLAB speeds RAG development and fair benchmarking, helping teams pick the right RAG variant and avoid wasted engineering time when reproducing papers.

Summary TLDR

RAGLAB is an open-source, modular toolkit that reproduces six published retrieval-augmented generation (RAG) algorithms. It provides retrievers, preprocessed Wikipedia corpora, generator support (HuggingFace/vLLM), trainer hooks (LoRA/QLoRA), metrics, and a small UI. The authors ran fair comparisons of those algorithms across 10 common QA benchmarks using three base generators (Llama3-8B, Llama3-70B, GPT-3.5). Key findings: Self-RAG beats the other algorithms when paired with a 70B fine-tuned generator; many RAG variants perform similarly; RAG can hurt multiple-choice performance; retriever caching and a retriever server cut query overhead. Code and corpora are released under the MIT license.

Problem Statement

RAG research lacks a single, transparent platform to reproduce algorithms and run fair, aligned comparisons. Existing toolkits are either too opaque or lack reproduced algorithms, which slows progress and makes cross-paper results hard to trust.

Main Contribution

RAGLAB: a modular, researcher-focused RAG library that reproduces six published algorithms.

Preprocessed Wikipedia corpora and built indices/embeddings for ColBERT and Contriever.

Trainer with LoRA/QLoRA support to fine-tune large generators (up to 70B) efficiently.

Standardized evaluation: same generators, retrievers, instructions, and 10 benchmarks.

Practical utilities: retriever server, retrieval cache, GPU manager, and instruction pools.
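The retrieval cache idea can be illustrated with a minimal sketch. This is not RAGLAB's implementation: the class name `RetrievalCache`, the `retrieve_fn` callable, and the stand-in `fake_retriever` are all hypothetical, and the cache here is in-memory only. It shows the general pattern of keying results by retriever name, top-k, and query so that repeated queries across algorithm runs skip the retriever.

```python
import hashlib
import json
from typing import Callable, Dict, List


class RetrievalCache:
    """Hypothetical in-memory cache keyed by (retriever name, top_k, query)."""

    def __init__(self, retrieve_fn: Callable[[str, int], List[str]], retriever_name: str):
        self._retrieve_fn = retrieve_fn
        self._retriever_name = retriever_name
        self._store: Dict[str, List[str]] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, query: str, top_k: int) -> str:
        # Collapse whitespace only; case is kept because some retrievers are case-sensitive.
        payload = json.dumps([self._retriever_name, top_k, " ".join(query.split())])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def retrieve(self, query: str, top_k: int = 5) -> List[str]:
        key = self._key(query, top_k)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        passages = self._retrieve_fn(query, top_k)
        self._store[key] = passages
        return passages


def fake_retriever(query: str, top_k: int) -> List[str]:
    """Stand-in for a real retriever call (assumption for illustration)."""
    return [f"passage {i} for {query}" for i in range(top_k)]


cache = RetrievalCache(fake_retriever, retriever_name="contriever")
first = cache.retrieve("who wrote Dune?", top_k=3)
second = cache.retrieve("who wrote Dune?", top_k=3)  # served from the cache
print(cache.hits, cache.misses)
```

Including the retriever name and top-k in the key keeps results from different retrievers or settings from colliding, which matters when the same cache is shared across compared algorithms.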

Key Findings

Self-RAG outperforms other reproduced RAG algorithms when paired with a 70B fine-tuned generator.

Numbers: PopQA ACC 48.8 (Self-RAG adaptive) vs 39.6 (NaiveRAG) on Llama3-70B

NaiveRAG, RRR, ITER-RETGEN and Active RAG show broadly similar performance across evaluated benchmarks.

Numbers: Many ACC/F1 differences fall within ~1–4 points across the 10 datasets (Tables 3–5)

RAG systems can underperform plain LLMs on multiple-choice tasks where answer options are provided.

Numbers: Direct LLM ACC is higher than several RAG variants on ARC/MMLU in the experiments (Tables 3–5)

Retriever server and caching reduce repeated retrieval cost and achieve low latency.

Numbers: Retrieval service latencies < 0.1 seconds across parallel evaluations

User study: researchers liked the toolkit and felt it speeds research.

Numbers: 85% said it significantly improved research efficiency; 90% would recommend it

Results

PopQA ACC (Llama3-70B)

Value: Self-RAG adaptive 48.8, NaiveRAG 39.6

Baseline: NaiveRAG 39.6

Multiple-choice (ARC) ACC

Value: Direct LLM 90.4 vs NaiveRAG 89.4 (Llama3-70B)

Baseline: Direct LLM 90.4

Retriever latency

Value: <0.1s per query with server and cache

Who Should Care

What To Try In 7 Days

Clone RAGLAB and run a 500-sample benchmark reproduction using the provided Llama3-8B setup.

Use the retriever server + cache to compare two retrievers and measure latency improvement.

Fine-tune a small generator with LoRA/QLoRA and test Self-RAG vs NaiveRAG on a relevant dataset.
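For the first two steps, a small harness along the following lines may help. It is a sketch under assumptions, not part of RAGLAB: `sample_subset` and `measure_latency` are hypothetical helper names, and `retrieve_fn` stands in for whatever retriever call you are comparing. It draws a reproducible 500-example subset and records per-query wall-clock latency.

```python
import random
import statistics
import time
from typing import Callable, Dict, List, Sequence


def sample_subset(examples: Sequence, n: int = 500, seed: int = 0) -> List:
    """Draw a reproducible random subset; return everything if there are <= n examples."""
    if len(examples) <= n:
        return list(examples)
    rng = random.Random(seed)  # fixed seed so every algorithm sees the same subset
    return rng.sample(list(examples), n)


def measure_latency(retrieve_fn: Callable[[str], object], queries: Sequence[str]) -> Dict[str, float]:
    """Time each call to retrieve_fn and summarize the latencies in seconds."""
    if not queries:
        raise ValueError("need at least one query to measure latency")
    latencies = []
    for query in queries:
        start = time.perf_counter()
        retrieve_fn(query)
        latencies.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }


# Usage with a trivial stand-in retriever (assumption for illustration).
questions = [f"question {i}" for i in range(2000)]
subset = sample_subset(questions, n=500, seed=0)
stats = measure_latency(lambda q: q.upper(), subset)
print(len(subset), sorted(stats))
```

Running `measure_latency` once with a cold cache and once with a warm cache on the same subset gives a direct before/after comparison.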

Optimization Features

Infra Optimization

  • DeepSpeed ZeRO stage 3
  • Accelerate integration

Model Optimization

  • LoRA
  • 4-bit quantization (fp4)

System Optimization

  • retrieval caching
  • retriever server
  • GPU allocation module

Training Optimization

  • LoRA
  • full-weight fine-tuning for 8B
  • mixed precision bf16

Inference Optimization

  • vLLM support
  • retriever server and caching
  • GPU management for multiple generators

Reproducibility

License

  • MIT

Data Urls

  • preprocessed Wikipedia corpora and scripts (linked from repo)

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Only six RAG algorithms reproduced due to compute limits.
  • Evaluations use only Wikipedia 2018/2023 as knowledge sources.
  • Each dataset is sampled down to 500 examples, so results are indicative rather than definitive.
  • Limited metric set (accuracy, F1, FactScore, ALCE); more evaluation axes needed.
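To make the 500-sample limitation concrete, the sketch below estimates the sampling margin on a reported accuracy. It assumes independent samples and uses the normal approximation to the binomial, which is a simplification not taken from the paper; `accuracy_margin` is a hypothetical helper name.

```python
import math


def accuracy_margin(acc_percent: float, n: int = 500, z: float = 1.96) -> float:
    """Approximate 95% margin (in percentage points) for an accuracy measured on n samples."""
    p = acc_percent / 100.0
    standard_error = math.sqrt(p * (1.0 - p) / n)
    return 100.0 * z * standard_error


for name, acc in [("Self-RAG adaptive, PopQA", 48.8),
                  ("NaiveRAG, PopQA", 39.6),
                  ("Direct LLM, ARC", 90.4)]:
    print(f"{name}: {acc} +/- {accuracy_margin(acc):.1f}")
```

Under these assumptions, an accuracy of 90.4 on 500 samples carries a margin of roughly 2.6 points, so the 1-point ARC gap (90.4 vs 89.4) is within sampling noise, while the PopQA gap (48.8 vs 39.6) is larger than the individual margins of roughly 4.4 and 4.3 points. Because all algorithms are evaluated on the same examples, a paired test would give a sharper comparison than these independent margins.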

When Not To Use

  • When you need broad retriever or corpus diversity beyond Wikipedia.
  • When full-scale production latency and throughput analysis is required.
  • When you require datasets larger than the provided 500-sample subsets for statistical claims.

Failure Modes

  • Cached retrieval may return stale or mismatched passages for dynamic data.
  • Poorly aligned instructions can bias comparisons across algorithms.
  • Retrieval noise can mislead generators, especially in multiple-choice tasks.

Core Entities

Models

  • Llama3-8B
  • Llama3-70B
  • GPT-3.5
  • selfrag-llama3-70B
  • selfrag-llama3-8B
  • Contriever
  • ColBERT

Metrics

  • Accuracy
  • F1
  • exact match
  • FactScore
  • ALCE
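As a rough illustration of the exact match and F1 metrics listed above, here is a minimal sketch of the common SQuAD-style definitions. The normalization steps (lowercasing, stripping punctuation and English articles) are an assumption; RAGLAB's own metric code may normalize differently, and the function names are hypothetical.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized prediction equals the normalized gold answer."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(round(token_f1("Paris France", "Paris"), 3))
```

Token-level F1 gives partial credit when the generator adds extra words around a correct answer, which is why it is often reported alongside exact match for open-ended QA.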

Datasets

  • PopQA
  • TriviaQA
  • HotpotQA
  • 2WikiMultiHopQA
  • ARC
  • MMLU
  • PubHealth
  • StrategyQA
  • FactScore
  • ASQA

Benchmarks

  • OpenQA
  • Multi-HopQA
  • Multiple-Choice
  • Fact Verification
  • Long-Form QA