CRUD-RAG: a Chinese benchmark testing RAG across Create / Read / Update / Delete tasks

January 30, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

7

Authors

Yuanjie Lyu, Zhiyu Li, Simin Niu, Feiyu Xiong, Bo Tang, Wenjin Wang, Hao Wu, Huanyong Liu, Tong Xu, Enhong Chen

Links

Abstract / PDF

Why It Matters For Business

CRUD-RAG helps teams tune the full RAG stack (indexing, retriever, prompt, model) for realistic production tasks and balance precision against recall, which can save compute and reduce hallucinations.

Summary TLDR

This paper introduces CRUD-RAG, a large Chinese benchmark that evaluates end-to-end retrieval-augmented generation (RAG) across four practical scenarios: Create (text continuation), Read (single- and multi-document QA), Update (hallucination correction), and Delete (multi-document summarization). The authors build a retrieval database of ~86k documents plus task datasets (10,728 examples each for continuation and summarization; ~3.2k per QA split; 5,130 hallucination edits) and adapt QuestEval into RAGQuestEval for key-information scoring. They run controlled experiments varying chunk size, overlap, top-k, embeddings, retrievers, and LLMs, and publish tuning recommendations (e.g., larger chunks/top-k for creative and multi-doc QA tasks).

Problem Statement

Existing RAG benchmarks focus mainly on question answering and on evaluating the LLM piece alone. That leaves out many RAG uses and ignores retrieval database construction, chunking, retriever choice, and non-knowledge‑intensive scenarios. Practitioners need a broad, task-aware benchmark to tune the whole RAG pipeline.

Main Contribution

CRUD-RAG: a scenario-driven Chinese benchmark mapping RAG use to Create/Read/Update/Delete tasks.

Large-scale datasets and retrieval DB: 86,834 news articles; datasets include 10,728 continuation, 10,728 summarization, 3,199/3,192/3,189 QA splits, and 5,130 hallucination edits.

RAGQuestEval: an adaptation of QuestEval that measures key-information precision and recall against ground-truth references.
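As a rough illustration of key-information precision and recall, the sketch below assumes that questions have already been generated from the reference and answered twice: once from the reference and once from the generated text. Recall is taken as the share of questions the generated text can answer, and precision as the mean token-level F1 over those answered questions. The function names, the `UNANSWERABLE` marker, and the whitespace tokenization are assumptions for illustration and are not confirmed as the paper's exact implementation.

```python
from collections import Counter

UNANSWERABLE = "<unanswerable>"  # hypothetical marker returned by the QA step


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between two answers.

    Whitespace tokenization is an assumption; Chinese text would need
    character-level tokens or a word segmenter instead.
    """
    pred_tokens, gold_tokens = pred.split(), gold.split()
    if not pred_tokens or not gold_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    p = overlap / len(pred_tokens)
    r = overlap / len(gold_tokens)
    return 2 * p * r / (p + r)


def key_info_scores(ref_answers, gen_answers):
    """Return (precision, recall) for answers aligned per question.

    ref_answers: answers obtained from the reference text.
    gen_answers: answers obtained from the generated text.
    """
    if len(ref_answers) != len(gen_answers):
        raise ValueError("answer lists must be aligned per question")
    if not ref_answers:
        return 0.0, 0.0
    answered = [
        (g, r) for g, r in zip(gen_answers, ref_answers) if g != UNANSWERABLE
    ]
    recall = len(answered) / len(ref_answers)
    if not answered:
        return 0.0, recall
    precision = sum(token_f1(g, r) for g, r in answered) / len(answered)
    return precision, recall
```

Under this reading, a generated text that skips key facts loses recall, while one that states them inaccurately loses precision.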

Systematic ablation: controlled experiments on chunk size, overlap, embedding, retriever, top-k, and LLM, with actionable tuning rules.

Key Findings

Chunk size strongly changes task behavior.

Numbers: Continuation BLEU 3.42 (chunk size 64) → 5.12 (chunk size 512); RAGQuestEval recall 23.39% → 28.27% (same rows)

Hybrid retrieval with reranking improves QA.

Numbers: 1-doc QA BLEU: dense 39.76 → hybrid+rerank 40.63

LLM choice changes outcomes; GPT-4 leads across many tasks.

Numbers: Summarization RAGQuestEval recall: GPT-3.5 46.18% → GPT-4 50.53% (+4.35 points)

Embedding rankings on retrieval leaderboards do not fully predict RAG utility.

Numbers: m3e-base underperforms on single-doc QA but outperforms on hallucination edits (precision 65.87 vs 65.07; recall 81.69%)

Raising top-k trades precision for recall, and the net effect depends on the task.

Numbers: Summarization BLEU drops with higher top-k; RAGQuestEval recall increases while precision decreases (see Table 7)

Results

Text continuation BLEU (chunk size)

Value: BLEU 3.42 → 5.12

Baseline: chunk=64

Multi-document QA (3-doc) RAGQuestEval recall (chunk size)

Value: recall 47.95% → 57.38%

Baseline: chunk=64

1-document QA BLEU (retriever)

Value: BLEU dense 39.76 → hybrid+rerank 40.63

Baseline: dense

Summarization RAGQuestEval recall (LLM)

Value: recall GPT-3.5 46.18% → GPT-4 50.53%

Baseline: GPT-3.5

Who Should Care

What To Try In 7 Days

Run CRUD-RAG on a representative subset of your corpus to baseline your RAG pipeline.

Sweep chunk size and top-k per task: larger chunks/top-k for creative or multi‑doc QA; smaller chunks for extractive QA and error correction.
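A per-task sweep like this can be organized as a small grid search. In the sketch below, `evaluate` is a hypothetical callback standing in for your own pipeline run plus metric, and the candidate values passed in are placeholders rather than the paper's settings.

```python
from itertools import product
from typing import Callable, Dict, Iterable, List, Tuple


def sweep(
    tasks: Iterable[str],
    chunk_sizes: Iterable[int],
    top_ks: Iterable[int],
    evaluate: Callable[[str, int, int], float],
) -> Dict[str, List[Tuple[float, int, int]]]:
    """Score every (chunk_size, top_k) pair per task, best score first."""
    # Materialize the grids so they can be reused for every task.
    chunk_sizes, top_ks = list(chunk_sizes), list(top_ks)
    results: Dict[str, List[Tuple[float, int, int]]] = {}
    for task in tasks:
        scored = [
            (evaluate(task, size, k), size, k)
            for size, k in product(chunk_sizes, top_ks)
        ]
        results[task] = sorted(scored, reverse=True)
    return results
```

Keeping the results separate per task makes it easy to see when the best setting for continuation differs from the best setting for extractive QA.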

Compare BM25 vs dense vs hybrid+rerank on your queries; prefer hybrid+rerank for reasoning QA when budget allows.
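One common way to build a hybrid candidate list before reranking is reciprocal rank fusion (RRF) over the BM25 and dense rankings. The source does not say which fusion method the authors used, so RRF here is a swapped-in illustration; the document IDs and the constant `k=60` are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[str]:
    """Fuse ranked lists of doc IDs; each list contributes 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["d3", "d1", "d7"]   # hypothetical BM25 output
dense_ranking = ["d1", "d9", "d3"]  # hypothetical dense-retriever output
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents that appear near the top of both lists rise in the fused order, and a reranker can then rescore only the fused candidates to keep cost bounded.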

Optimization Features

System Optimization

  • tune chunk size and overlap per task
  • select retriever type per scenario
  • adjust top-k for precision/recall tradeoff
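Chunk size and overlap can be exposed as two parameters of a sliding-window splitter. The sketch below works on characters, an assumption that suits Chinese text reasonably well; a token-based splitter would follow the same pattern. `chunk_text` is a hypothetical helper name.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split text into character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

For example, `chunk_text(article, 512, 64)` would produce 512-character windows in which each chunk repeats the last 64 characters of the previous one.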

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Dataset focuses on Chinese news; results may not generalize to other languages or domains.
  • Many references and examples were generated with GPT-4, which risks model-generation bias in datasets.
  • Some LLM experiments ran on 1/5 of the data to control cost; full-scale behavior may differ.

When Not To Use

  • When you need domain-specific, high-assurance benchmarks (e.g., legal/medical) beyond news.
  • For structured/semi-structured retrieval tasks (tables, code), where this news-based corpus is not representative.
  • If you require multilingual evaluation, because CRUD-RAG is Chinese-focused.

Failure Modes

  • Retriever mismatch: wrong documents retrieved, causing false but fluent answers.
  • Context overload: large top-k or large chunks can increase redundancy and lower precision.
  • Evaluator bias: RAGQuestEval depends on question generation and answerers that can inherit LLM biases.

Core Entities

Models

  • GPT-3.5
  • GPT-4
  • GPT-4-0613
  • GPT-4o
  • GPT-4o (reported)
  • GPT-4o (May 2024)
  • ChatGLM2-6B
  • Baichuan2-13B
  • Qwen-7B
  • Qwen-14B
  • Qwen2-7B

Metrics

  • BLEU
  • ROUGE-L
  • BERTScore
  • QuestEval
  • RAGQuestEval
  • MRR

Datasets

  • CRUD-RAG
  • UHGEval
  • RGB
  • Natural Questions (NQ)

Benchmarks

  • CRUD-RAG
  • RGB
  • NQ
  • ARES
  • RAGAS