CRUD-RAG: a Chinese benchmark testing RAG across Create / Read / Update / Delete tasks

January 30, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The benchmark is a practical, well-documented testbed that reveals how indexing, retrieval, and LLM choice interact. The experiments are broad, but some were run on a subset of the dataset for cost reasons.

Citations: 7

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Yuanjie Lyu, Zhiyu Li, Simin Niu, Feiyu Xiong, Bo Tang, Wenjin Wang, Hao Wu, Huanyong Liu, Tong Xu, Enhong Chen

Why It Matters For Business

CRUD-RAG helps teams tune the full RAG stack (indexing, retriever, prompt, model) for realistic production tasks and trade off accuracy against recall, which can save compute and reduce hallucinations.

Who Should Care

Summary TLDR

This paper introduces CRUD-RAG, a large Chinese benchmark that evaluates end-to-end retrieval-augmented generation (RAG) across four practical scenarios: Create (text continuation), Read (single- and multi-document QA), Update (hallucination correction), and Delete (multi-document summarization). The authors build ~86k retrieval documents and task datasets (e.g., 10,728 continuation and summarization examples; ~3.2k per QA split; 5,130 hallucination edits), adapt QuestEval into RAGQuestEval for key-information scoring, run controlled experiments varying chunk size/overlap/top-k/embeddings/retrievers/LLMs, and publish tuning recommendations (e.g., larger chunks/top-k for creative and multi‑do

Problem Statement

Existing RAG benchmarks focus mainly on question answering and on evaluating the LLM piece alone. That leaves out many RAG uses and ignores retrieval database construction, chunking, retriever choice, and non-knowledge‑intensive scenarios. Practitioners need a broad, task-aware benchmark to tune the whole RAG pipeline.

Main Contribution

CRUD-RAG: a scenario-driven Chinese benchmark mapping RAG use to Create/Read/Update/Delete tasks.

Large-scale datasets and retrieval DB: ~86,834 news articles; datasets include 10,728 continuation, 10,728 summarization, 3,199/3,192/3,189 QA splits, and 5,130 hallucination edits.

Key Findings

Chunk size strongly changes task behavior.

Numbers: Continuation BLEU 3.42 (chunk=64) → 5.12 (chunk=512); RAGQuestEval recall 23.39% → 28.27% (same rows)

Practical Use: Use larger chunks for creative continuation and multi‑doc reasoning; use smaller chunks for single‑sentence extractive QA and fine-grained error correction.

Evidence Ref: Table 3 (text continuation, chunk size)
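
To make the chunk-size finding concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only: the function name `chunk_text` is hypothetical, and it counts characters, whereas the benchmark's own chunking unit and implementation may differ.

```python
def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters.

    Hypothetical helper for illustration; not the benchmark's own splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Placeholder document; swap in a real article from your corpus.
document = "x" * 1000
for size in (64, 512):
    pieces = chunk_text(document, chunk_size=size, overlap=size // 4)
    print(f"chunk_size={size}: {len(pieces)} chunks")
```

Smaller windows produce many short, focused passages, while larger windows keep more surrounding context per retrieved passage, which is the trade-off the finding describes.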

Hybrid retrieval with reranking improves QA.

Numbers: 1-doc QA BLEU: dense 39.76 → hybrid+rerank 40.63

Practical Use: When accuracy matters for reasoning QA, prefer hybrid + rerank pipelines (combine BM25 + dense, then rerank).

Evidence Ref: Table 5 (retriever results)
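
A hybrid + rerank pipeline can be sketched as follows. The summary does not say how the BM25 and dense results are combined, so this sketch assumes reciprocal rank fusion (RRF), a common choice. The function names, the document IDs, and the toy reranker scores are all hypothetical; in practice the two input rankings would come from a real BM25 index and a real embedding index, and `rerank_score` from a reranker model.

```python
from collections import defaultdict
from typing import Callable


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs; each doc scores sum(1 / (k + rank))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_then_rerank(
    sparse_ranking: list[str],
    dense_ranking: list[str],
    rerank_score: Callable[[str], float],
    top_k: int,
) -> list[str]:
    """Fuse sparse and dense rankings, then reorder candidates with a reranker score."""
    candidates = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
    reranked = sorted(candidates, key=rerank_score, reverse=True)
    return reranked[:top_k]


# Toy inputs (assumptions): rankings from a BM25 index and a dense index.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
toy_reranker = {"d1": 0.2, "d2": 0.9, "d3": 0.7, "d4": 0.1}  # stand-in for a model

print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
print(hybrid_then_rerank(bm25_ranking, dense_ranking, toy_reranker.__getitem__, top_k=2))
```

Fusing on ranks rather than raw scores avoids having to normalize BM25 and cosine-similarity scores onto a common scale, which is one reason RRF is a convenient default.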

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Text continuation BLEU (chunk size) | BLEU 3.42 → 5.12 | chunk=64 | +1.70 | text continuation | Table 3 (BLEU by chunk size) | Table 3
Multi-document QA (3-doc) RAGQuestEval recall (chunk size) | recall 47.95% → 57.38% | chunk=64 | +9.43 pp | question 3-document | Table 3 (3-doc QA recall by chunk size) | Table 3
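
The deltas in the Results rows can be checked with a few lines of arithmetic. The sketch below recomputes the absolute change from the reported baseline and value, and also derives the relative change, which is not reported in the source and is shown only to distinguish percentage points (pp) from percent. The function name is hypothetical.

```python
def absolute_and_relative_delta(baseline: float, value: float) -> tuple[float, float]:
    """Return (absolute delta, relative delta in percent), each rounded to 2 places."""
    absolute = value - baseline
    relative = absolute / baseline * 100.0
    return round(absolute, 2), round(relative, 2)


# (baseline at chunk=64, reported value) taken from the Results rows.
rows = {
    "text continuation BLEU": (3.42, 5.12),
    "3-doc QA RAGQuestEval recall (%)": (47.95, 57.38),
}
for name, (baseline, value) in rows.items():
    absolute, relative = absolute_and_relative_delta(baseline, value)
    print(f"{name}: +{absolute:.2f} absolute, +{relative:.1f}% relative")
```

For the recall row, the absolute difference is a gain in percentage points, while the relative figure expresses the same gain as a fraction of the baseline.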

What To Try In 7 Days

Run CRUD-RAG on a representative subset of your corpus to baseline your RAG pipeline.

Sweep chunk size and top-k per task: larger chunks/top-k for creative or multi‑doc QA; smaller chunks for extractive QA and error correction.

Compare BM25 vs dense vs hybrid+rerank on your queries; prefer hybrid+rerank for reasoning QA when budget allows.
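
The chunk-size and top-k sweep above can be organized as a small grid search. This is a sketch under stated assumptions: `evaluate` is a hypothetical callback that you would implement to rebuild the index, run your pipeline, and return a task metric (for example BLEU or RAGQuestEval recall); the `toy_metric` below is synthetic and produces no real results.

```python
from itertools import product
from typing import Callable

Config = tuple[int, int]  # (chunk_size, top_k)


def sweep(
    chunk_sizes: list[int],
    top_ks: list[int],
    evaluate: Callable[[int, int], float],
) -> list[tuple[Config, float]]:
    """Score every (chunk_size, top_k) pair and return results sorted best-first."""
    results = [((c, k), evaluate(c, k)) for c, k in product(chunk_sizes, top_ks)]
    return sorted(results, key=lambda item: item[1], reverse=True)


def toy_metric(chunk_size: int, top_k: int) -> float:
    """Synthetic stand-in for a real evaluation run; peaks at (256, 4) by construction."""
    return -abs(chunk_size - 256) - abs(top_k - 4)


ranked = sweep([64, 128, 256, 512], [2, 4, 8], toy_metric)
best_config, best_score = ranked[0]
print("best (chunk_size, top_k):", best_config, "score:", best_score)
```

Running the sweep once per task keeps the per-task recommendations separate, since the best setting for continuation need not match the best setting for extractive QA.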

Optimization Features

System Optimization
tune chunk size and overlap per task
select retriever type per scenario
adjust top-k for precision/recall tradeoff
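
The top-k precision/recall trade-off can be illustrated with a minimal retrieval-level sketch. The function name, document IDs, and relevance labels are hypothetical toy values; the benchmark itself scores generated text, so this only shows the retrieval-side intuition.

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Precision and recall of the first k retrieved doc IDs against a relevant set."""
    top = retrieved[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy ranking (assumption): "a", "b", "c" are the relevant documents.
retrieved = ["a", "b", "x", "y", "c", "z"]
relevant = {"a", "b", "c"}
for k in (2, 4, 6):
    precision, recall = precision_recall_at_k(retrieved, relevant, k)
    print(f"k={k}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy ranking, raising k eventually recovers every relevant document but dilutes the context with irrelevant ones, which is the tension to balance when choosing top-k.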

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Dataset focuses on Chinese news; results may not generalize to other languages or domains.

Many references and examples were generated with GPT-4, which risks model-generation bias in datasets.

When Not To Use

When you need domain-specific, high-assurance benchmarks (e.g., legal/medical) beyond news.

For structured/semi-structured retrieval tasks (tables, code), where this news-based corpus is not representative.

Failure Modes

Retriever mismatch: wrong documents retrieved, causing false but fluent answers.

Context overload: large top-k or large chunks can increase redundancy and lower precision.

Core Entities

Models

GPT-3.5, GPT-4, GPT-4-0613, GPT-4o, GPT-4o (reported), GPT-4o (May 2024), ChatGLM2-6B, Baichuan2-13B, Qwen-7B, Qwen-14B, Qwen2-7B

Metrics

BLEU, ROUGE-L, BERTScore, QuestEval, RAGQuestEval, MRR

Datasets

CRUD-RAG, UHGEval, RGB, Natural Questions (NQ)

Benchmarks

CRUD-RAG, RGB, NQ, ARES, RAGAS