Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
7
Why It Matters For Business
CRUD-RAG helps teams tune the full RAG stack (indexing, retriever, prompt, model) for realistic production tasks and trade off precision against recall, which can save compute and reduce hallucinations.
Summary TLDR
This paper introduces CRUD-RAG, a large Chinese benchmark that evaluates end-to-end retrieval-augmented generation (RAG) across four practical scenarios: Create (text continuation), Read (single- and multi-document QA), Update (hallucination correction), and Delete (multi-document summarization). The authors build ~86k retrieval documents and task datasets (e.g., 10,728 continuation and summarization examples; ~3.2k per QA split; 5,130 hallucination edits), adapt QuestEval into RAGQuestEval for key-information scoring, run controlled experiments varying chunk size/overlap/top-k/embeddings/retrievers/LLMs, and publish tuning recommendations (e.g., larger chunks/top-k for creative and multi‑do
Problem Statement
Existing RAG benchmarks focus mainly on question answering and on evaluating the LLM piece alone. That leaves out many RAG uses and ignores retrieval database construction, chunking, retriever choice, and non-knowledge‑intensive scenarios. Practitioners need a broad, task-aware benchmark to tune the whole RAG pipeline.
Main Contribution
CRUD-RAG: a scenario-driven Chinese benchmark mapping RAG use to Create/Read/Update/Delete tasks.
Large-scale datasets and retrieval DB: 86,834 news articles; datasets include 10,728 continuation, 10,728 summarization, 3,199/3,192/3,189 QA splits, and 5,130 hallucination edits.
RAGQuestEval: an adaptation of QuestEval that measures key-information precision and recall against ground-truth references.
Systematic ablation: controlled experiments on chunk size, overlap, embedding, retriever, top-k, and LLM, with actionable tuning rules.
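As a rough illustration of the key-information scoring idea behind RAGQuestEval, the sketch below assumes the question-generation and question-answering steps have already run (both need an LLM and are not shown). The function names, the `UNANSWERABLE` sentinel, the whitespace tokenizer, and the exact precision/recall definitions are assumptions for illustration, not the authors' implementation. Here recall is the share of reference-derived questions that the generated text can answer, and precision is the mean token-level F1 between the two sets of answers over the answerable questions.

```python
from collections import Counter

UNANSWERABLE = "<unanswerable>"  # hypothetical sentinel for "no answer found"


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between two answer strings (whitespace tokens)."""
    pred_tokens, gold_tokens = pred.split(), gold.split()
    if not pred_tokens or not gold_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def key_info_scores(answers_from_generation, answers_from_reference):
    """Return (precision, recall) for questions derived from the reference.

    Both lists are aligned: item i holds the answers to question i, extracted
    from the generated text and from the reference text respectively.
    """
    pairs = list(zip(answers_from_generation, answers_from_reference))
    if not pairs:
        return 0.0, 0.0
    answerable = [(g, r) for g, r in pairs if g != UNANSWERABLE]
    recall = len(answerable) / len(pairs)
    if not answerable:
        return 0.0, recall
    precision = sum(token_f1(g, r) for g, r in answerable) / len(answerable)
    return precision, recall
```

For Chinese text, the whitespace split would need to be replaced by a proper word or character tokenizer.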
Key Findings
Chunk size strongly changes task behavior.
Hybrid retrieval with reranking improves QA.
LLM choice changes outcomes; GPT-4 leads across many tasks.
Embedding rankings on retrieval leaderboards do not fully predict RAG utility.
Top-k trades precision against recall, and the best setting depends on the task.
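The summary does not say how the sparse and dense result lists are combined in the hybrid setup. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than the paper's method. The function name and the default `k=60` are illustrative choices, and the reranking stage that would follow (for example a cross-encoder) is not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document id for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a BM25 retriever and a dense retriever.
bm25_hits = ["d1", "d2", "d3"]
dense_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents that rank well in both lists rise to the top of `fused`; the first few entries would then be passed to a reranker.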
Results
Text continuation BLEU (chunk size)
Multi-document QA (3-doc) RAGQuestEval recall (chunk size)
Single-document QA BLEU (retriever)
Summarization RAGQuestEval recall (LLM)
Who Should Care
What To Try In 7 Days
Run CRUD-RAG on a representative subset of your corpus to baseline your RAG pipeline.
Sweep chunk size and top-k per task: larger chunks/top-k for creative or multi‑doc QA; smaller chunks for extractive QA and error correction.
Compare BM25 vs dense vs hybrid+rerank on your queries; prefer hybrid+rerank for reasoning QA when budget allows.
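The chunk-size and top-k sweep suggested above can be organized as a small grid search. This is a minimal sketch: `evaluate` is a hypothetical callable standing in for "rebuild the index, run the pipeline on one task, return a score", and the toy scoring function exists only to make the example runnable.

```python
from itertools import product


def sweep(evaluate, chunk_sizes, top_ks):
    """Score every (chunk_size, top_k) pair and return results best-first."""
    results = [
        {"chunk_size": c, "top_k": k, "score": evaluate(c, k)}
        for c, k in product(chunk_sizes, top_ks)
    ]
    return sorted(results, key=lambda r: r["score"], reverse=True)


def toy_evaluate(chunk_size, top_k):
    """Placeholder score that peaks at chunk_size=256, top_k=4.

    Replace with a real run of your pipeline plus a task metric.
    """
    return -abs(chunk_size - 256) - abs(top_k - 4)


results = sweep(toy_evaluate, chunk_sizes=[128, 256, 512], top_ks=[2, 4, 8])
best = results[0]
```

Running one sweep per task (continuation, QA, correction, summarization) keeps the task-dependent optimum visible instead of averaging it away.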
Optimization Features
System Optimization
- tune chunk size and overlap per task
- select retriever type per scenario
- adjust top-k for precision/recall tradeoff
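To make the chunk size and overlap knobs concrete, here is a minimal character-based splitter. It is a sketch under simple assumptions: the function name is hypothetical, and a real pipeline would more likely split on tokens or sentence boundaries.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters. The last window may be
    shorter than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
```

Larger overlap reduces the chance of cutting a fact in half at a boundary, at the cost of more chunks to embed and store.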
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Dataset focuses on Chinese news; results may not generalize to other languages or domains.
- Many references and examples were generated with GPT-4, which risks model-generation bias in datasets.
- Some LLM experiments ran on 1/5 of the data to control cost; full-scale behavior may differ.
When Not To Use
- When you need domain-specific, high-assurance benchmarks (e.g., legal/medical) beyond news.
- For structured/semi-structured retrieval tasks (tables, code), where this news-based corpus is not representative.
- If you require multilingual evaluation, because CRUD-RAG is Chinese-focused.
Failure Modes
- Retriever mismatch: wrong documents retrieved, causing false but fluent answers.
- Context overload: large top-k or large chunks can increase redundancy and lower precision.
- Evaluator bias: RAGQuestEval depends on question generation and answerers that can inherit LLM biases.
Core Entities
Models
- GPT-3.5
- GPT-4
- GPT-4-0613
- GPT-4o
- GPT-4o (reported)
- GPT-4o (May 2024)
- ChatGLM2-6B
- Baichuan2-13B
- Qwen-7B
- Qwen-14B
- Qwen2-7B
Metrics
- BLEU
- ROUGE-L
- BERTScore
- QuestEval
- RAGQuestEval
- MRR
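Of the metrics listed, MRR (mean reciprocal rank) is the one that scores the retriever directly. A minimal sketch of its standard definition follows; the function name and the input layout (one ranked list and one set of relevant ids per query) are assumptions for illustration.

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant document for each query.

    A query with no relevant document in its ranked list contributes 0.
    """
    if not ranked_lists:
        return 0.0
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)


# Three hypothetical queries: hit at rank 1, hit at rank 2, no hit.
mrr = mean_reciprocal_rank(
    [["a", "b"], ["c", "d"], ["e"]],
    [{"a"}, {"d"}, {"z"}],
)
```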
Datasets
- CRUD-RAG
- UHGEval
- RGB
- Natural Questions (NQ)
Benchmarks
- CRUD-RAG
- RGB
- NQ
- ARES
- RAGAS

