CRUD-RAG: a Chinese benchmark testing RAG across Create / Read / Update / Delete tasks

Overview

Decision Snapshot: Ready For Pilot

The benchmark is a practical, well-documented testbed that reveals how indexing, retrieval, and LLM choice interact. The experiments are broad, though some were run on a subset of the dataset for cost reasons.

Citations: 7

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Yuanjie Lyu, Zhiyu Li, Simin Niu, Feiyu Xiong, Bo Tang, Wenjin Wang, Hao Wu, Huanyong Liu, Tong Xu, Enhong Chen

Links

Abstract / PDF / Code

Why It Matters For Business

CRUD-RAG helps teams tune the full RAG stack (indexing, retriever, prompt, model) for realistic production tasks and trade off accuracy against recall, which can save compute and reduce hallucinations.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Engineering Lead, Founder

Summary TLDR

This paper introduces CRUD-RAG, a large Chinese benchmark that evaluates end-to-end retrieval-augmented generation (RAG) across four practical scenarios: Create (text continuation), Read (single- and multi-document QA), Update (hallucination correction), and Delete (multi-document summarization). The authors build ~86k retrieval documents and task datasets (e.g., 10,728 continuation and summarization examples; ~3.2k per QA split; 5,130 hallucination edits), adapt QuestEval into RAGQuestEval for key-information scoring, run controlled experiments varying chunk size/overlap/top-k/embeddings/retrievers/LLMs, and publish tuning recommendations (e.g., larger chunks/top-k for creative and multi‑do

Problem Statement

Existing RAG benchmarks focus mainly on question answering and on evaluating the LLM piece alone. That leaves out many RAG uses and ignores retrieval database construction, chunking, retriever choice, and non-knowledge‑intensive scenarios. Practitioners need a broad, task-aware benchmark to tune the whole RAG pipeline.

Main Contribution

CRUD-RAG: a scenario-driven Chinese benchmark mapping RAG use to Create/Read/Update/Delete tasks.

Large-scale datasets and retrieval DB: 86,834 news articles; datasets include 10,728 continuation, 10,728 summarization, 3,199/3,192/3,189 QA splits, and 5,130 hallucination edits.

Key Findings

Chunk size strongly changes task behavior.

Numbers: Continuation BLEU 3.42 (chunk size 64) → 5.12 (chunk size 512); RAGQuestEval recall 23.39% → 28.27% (same rows)

Practical Use: Use larger chunks for creative continuation and multi‑doc reasoning; use smaller chunks for single‑sentence extractive QA and fine-grained error correction.

Evidence Ref: Table 3 (text continuation, chunk size)
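
To make the chunk-size knob concrete, here is a minimal sketch of fixed-size chunking with optional overlap. It is an illustration only: the function name `chunk_text` is hypothetical, and it counts characters, which is an assumption (the benchmark's own splitter and its unit of length may differ). The sizes 64 and 512 are the two settings quoted above.

```python
def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into fixed-size character windows with optional overlap.

    Hypothetical helper for illustration; not the benchmark's own splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


doc = "x" * 1000  # stand-in for a 1000-character article
small_chunks = chunk_text(doc, chunk_size=64)
large_chunks = chunk_text(doc, chunk_size=512)
print(len(small_chunks), len(large_chunks))
```

Smaller chunks give the retriever many narrow candidates, while larger chunks give fewer candidates that each carry more surrounding context, which is the trade-off the finding above describes.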

Hybrid retrieval with reranking improves QA.

Numbers: 1-doc QA BLEU: dense 39.76 → hybrid+rerank 40.63

Practical Use: When accuracy matters for reasoning QA, prefer hybrid + rerank pipelines (combine BM25 + dense, then rerank).

Evidence Ref: Table 5 (retriever results)
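
The "combine, then rerank" pattern can be sketched as follows. This is not the paper's implementation: reciprocal rank fusion is swapped in here as one common way to merge a BM25 ranking with a dense ranking, and the reranker is a toy word-overlap scorer standing in for a real reranking model. All names, document IDs, and texts are hypothetical.

```python
from collections import defaultdict
from typing import Callable


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_k: int) -> list[str]:
    """Re-order fused candidates with a (query, doc_id) scoring function."""
    return sorted(candidates, key=lambda d: -score_fn(query, d))[:top_k]


# Toy corpus and rankings (assumed data, for illustration only).
texts = {
    "d1": "chunk size and overlap",
    "d2": "news article about weather",
    "d3": "hybrid retrieval with reranking",
    "d4": "retrieval top-k",
}
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]


def overlap_score(query: str, doc_id: str) -> float:
    # Placeholder reranker: counts shared words between query and document.
    return float(len(set(query.split()) & set(texts[doc_id].split())))


fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
top = rerank("hybrid retrieval", fused, overlap_score, top_k=2)
print(fused, top)
```

In a real pipeline the two input rankings would come from an actual BM25 index and an embedding index, and `overlap_score` would be replaced by a learned reranker.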

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
text continuation BLEU (chunk size)	BLEU 3.42 → 5.12	chunk=64	+1.70	text continuation	Table 3 (BLEU by chunk size)	Table 3
multi-document QA (3-doc) RAGQuestEval recall (chunk size)	recall 47.95% → 57.38%	chunk=64	+9.43 pp	question 3-document	Table 3 (3-doc QA recall by chunk size)	Table 3
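
The Delta column can be re-derived from the two endpoints in each row. The short check below recomputes the absolute deltas from the table and also derives the relative gain over the chunk=64 baseline; the relative figures are plain arithmetic on the table values, not numbers reported by the source, and the helper names are hypothetical.

```python
# (baseline at chunk=64, value at the larger chunk size), copied from the table
rows = {
    "text continuation BLEU": (3.42, 5.12),
    "3-doc QA RAGQuestEval recall (%)": (47.95, 57.38),
}


def absolute_delta(baseline: float, value: float) -> float:
    """Difference in the metric's own units (BLEU points or percentage points)."""
    return round(value - baseline, 2)


def relative_gain_pct(baseline: float, value: float) -> float:
    """Improvement as a percentage of the baseline (derived, not from the source)."""
    return round(100 * (value - baseline) / baseline, 1)


for name, (baseline, value) in rows.items():
    print(name, absolute_delta(baseline, value), relative_gain_pct(baseline, value))
```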

What To Try In 7 Days

Run CRUD-RAG on a representative subset of your corpus to baseline your RAG pipeline.

Sweep chunk size and top-k per task: larger chunks/top-k for creative or multi‑doc QA; smaller chunks for extractive QA and error correction.

Compare BM25 vs dense vs hybrid+rerank on your queries; prefer hybrid+rerank for reasoning QA when budget allows.
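
A sweep over chunk size and top-k can be organized as a small grid search. The sketch below assumes you supply your own `evaluate(chunk_size, top_k)` callable that runs your pipeline on one task and returns a score where higher is better; `toy_evaluate` is a made-up placeholder so the example runs, and its numbers mean nothing. The grid values are illustrative.

```python
from itertools import product
from typing import Callable


def sweep(evaluate: Callable[[int, int], float],
          chunk_sizes: list[int], top_ks: list[int]) -> list[dict]:
    """Evaluate every (chunk_size, top_k) pair and return results, best first."""
    results = []
    for chunk_size, top_k in product(chunk_sizes, top_ks):
        results.append({
            "chunk_size": chunk_size,
            "top_k": top_k,
            "score": evaluate(chunk_size, top_k),
        })
    return sorted(results, key=lambda r: r["score"], reverse=True)


def toy_evaluate(chunk_size: int, top_k: int) -> float:
    # Placeholder scoring function, NOT a real metric. Replace it with a call
    # that indexes your corpus, runs the task, and returns e.g. BLEU or recall.
    return -abs(chunk_size - 256) / 256 - abs(top_k - 4) / 4


ranked = sweep(toy_evaluate, chunk_sizes=[64, 128, 256, 512], top_ks=[2, 4, 8])
best = ranked[0]
print(best["chunk_size"], best["top_k"])
```

Running one sweep per task keeps the comparison honest, since the recommendations above suggest the best setting differs between creative, multi-doc, and extractive tasks.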

Optimization Features

System Optimization

tune chunk size and overlap per task

select retriever type per scenario

adjust top-k for precision/recall tradeoff
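
The precision/recall effect of top-k can be measured directly if you have relevance labels for a few queries. The following is a minimal sketch with hypothetical document IDs and labels; it uses the standard definitions of precision@k and recall@k rather than any metric specific to this benchmark.

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str],
                          k: int) -> tuple[float, float]:
    """Return (precision@k, recall@k) for one query."""
    top = retrieved[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy ranking: "a", "b", "c" are the relevant documents (assumed labels).
retrieved = ["a", "b", "x", "y", "c", "z"]
relevant = {"a", "b", "c"}

for k in (2, 6):
    p, r = precision_recall_at_k(retrieved, relevant, k)
    print(k, round(p, 2), round(r, 2))
```

In this toy ranking, raising k from 2 to 6 lifts recall to 1.0 but halves precision, which is the shape of the tradeoff the top-k setting controls.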

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/IAAR-Shanghai/CRUD_RAG

Risks & Boundaries

Limitations

Dataset focuses on Chinese news; results may not generalize to other languages or domains.

Many references and examples were generated with GPT-4, which risks model-generation bias in datasets.

When Not To Use

When you need domain-specific, high-assurance benchmarks (e.g., legal/medical) beyond news.

For structured/semi-structured retrieval tasks (tables, code), where this news-based corpus is not representative.

Failure Modes

Retriever mismatch: wrong documents retrieved, causing false but fluent answers.

Context overload: large top-k or large chunks can increase redundancy and lower precision.

Core Entities

Models

GPT-3.5, GPT-4, GPT-4-0613, GPT-4o, GPT-4o (reported), GPT-4o (May 2024), ChatGLM2-6B, Baichuan2-13B, Qwen-7B, Qwen-14B, Qwen2-7B

Metrics

BLEU, ROUGE-L, BERTScore, QuestEval, RAGQuestEval, MRR

Datasets

CRUD-RAG, UHGEval, RGB, Natural Questions (NQ)

Benchmarks

CRUD-RAG, RGB, NQ, ARES, RAGAS

