DomainRAG: a Chinese benchmark testing how RAG helps LLMs solve college-enrollment questions

June 9, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The dataset and experiments provide practical, reproducible evidence that RAG helps domain tasks; however, single-domain scope and absent comparisons with advanced RAG pipelines limit generality.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 2/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 45%

Authors

Shuting Wang, Jiongnan Liu, Shiren Song, Jiehan Cheng, Yuqi Fu, Peidong Guo, Kun Fang, Yutao Zhu, Zhicheng Dou

Links

Abstract / PDF / Data

Why It Matters For Business

For domain or expert apps, adding domain documents to an LLM pipeline is essential: in the paper's extractive test, golden references lifted GPT-3.5 from 0.1929 to 0.9233 EM on factual Q&A.

Who Should Care

Summary TLDR

The authors release DomainRAG, a Chinese benchmark and corpora (HTML + text) for testing Retrieval-Augmented Generation (RAG) in a real domain: university enrollment. They split evaluation into seven sub-datasets that target six practical abilities: conversational intent, structural HTML understanding, faithfulness to external expert docs, denoising/noise-robustness, time-sensitive answers, and multi-document integration. Experiments on seven LLMs show: closed-book LLMs often fail on domain questions, retrieval (especially BM25) sharply improves accuracy, some models use HTML structure better, and models still struggle with multi-turn and multi-doc scenarios and with ordering/noise in the IR

Problem Statement

General RAG evaluations use open-domain sources (e.g., Wikipedia) that can be memorized by LLMs. That leaves open whether RAG helps LLMs in real domain, long-tail, and time-sensitive expert tasks. We need a benchmark that uses in-domain documents and targeted tests for practical RAG abilities.

Main Contribution

DomainRAG: a Chinese, in-domain benchmark for college-enrollment tasks with HTML and text corpora.

Seven sub-datasets (extractive QA, conversational QA, structural (HTML/table) QA, faithful QA with anti-references, noisy QA, time-sensitive QA, and multi-document QA) that together target six practical RAG abilities.

Key Findings

Providing golden domain references massively improves exact-match (EM) accuracy on extractive questions.

Numbers: GPT-3.5 EM: closed-book 0.1929 → golden 0.9233 (extractive, Table 2)

Practical Use: Always feed verified domain documents to your LLM pipeline; in this test, EM rose nearly fivefold over closed-book answers.

Evidence Ref: Table 2

BM25 sparse retrieval outperforms the tested dense retriever in this long-tail domain.

Numbers: GPT-3.5 EM, BM25 TOP3 0.7835 vs Dense TOP3 0.6921 (extractive, Table 2)

Practical Use: Use a strong sparse retriever (BM25) first for cost-sensitive, long-tail domains before investing in dense retrievers.

Evidence Ref: Table 2
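As a rough illustration of the sparse-retrieval baseline discussed above, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the retriever used in the paper: the function name `bm25_rank`, the defaults `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the whitespace-tokenized toy documents are all assumptions made for illustration. Chinese text would need a word segmenter rather than whitespace splitting.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def bm25_rank(
    query_tokens: Sequence[str],
    docs_tokens: Sequence[Sequence[str]],
    k1: float = 1.5,   # assumed default: term-frequency saturation
    b: float = 0.75,   # assumed default: document-length normalization
) -> List[Tuple[int, float]]:
    """Return (doc_index, score) pairs sorted by descending BM25 score."""
    n_docs = len(docs_tokens)
    if n_docs == 0:
        raise ValueError("need at least one document")
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))

    scored = []
    for idx, doc in enumerate(docs_tokens):
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            # Non-negative IDF variant (assumption; other variants exist).
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = 1 - b + b * (len(doc) / avg_len if avg_len else 0.0)
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scored.append((idx, score))

    # sorted() is stable, so tied documents keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy corpus (hypothetical passages, already tokenized).
docs = [
    "admission score line 2023".split(),
    "campus dormitory fees".split(),
    "admission brochure".split(),
]
ranked = bm25_rank("score line".split(), docs)
```

Taking the first three entries of `ranked` would mirror the TOP3 setting reported above. A dense retriever could be compared by swapping in its ranking and scoring the downstream answers the same way.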

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Extractive EM (closed-book) | GPT-3.5 0.1929 (extractive, Table 2) | closed-book | | Extractive | GPT-3.5 closed-book EM = 0.1929 | Table 2 |
| Extractive EM (golden reference) | GPT-3.5 0.9233 (extractive, Table 2) | closed-book 0.1929 | +0.7304 EM | Extractive | GPT-3.5 golden-reference EM = 0.9233 | Table 2 |
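To make the EM values and the delta above concrete, the following sketch computes an exact-match score and the uplift between two settings. This summary does not spell out how the paper normalizes answers, so the strict normalized-equality rule here (lowercase, strip ASCII punctuation, collapse whitespace) is an assumption, and the helper names `normalize`, `em_score`, and `em_uplift` are hypothetical. Chinese punctuation would need to be added to the normalization.

```python
import string
from typing import Sequence, Tuple


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace (assumed rule)."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def em_score(predictions: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of predictions that equal their gold answer after normalization."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return hits / len(golds)


def em_uplift(baseline_em: float, grounded_em: float) -> Tuple[float, float]:
    """Return (absolute delta, ratio) of grounded EM over a non-zero baseline EM."""
    if baseline_em <= 0:
        raise ValueError("baseline EM must be positive to compute a ratio")
    return grounded_em - baseline_em, grounded_em / baseline_em


# Toy check of the metric on two hypothetical answers: one hit, one miss.
toy_em = em_score(["Beijing.", "2021"], ["beijing", "2022"])

# Uplift using the GPT-3.5 figures quoted from Table 2.
delta, ratio = em_uplift(0.1929, 0.9233)
```

With the Table 2 figures, `delta` reproduces the +0.7304 EM shown in the table and `ratio` comes out a little under 4.8.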

What To Try In 7 Days

Run a closed-book vs retrieved test using a small in-domain corpus to measure retrieval uplift.

Compare BM25 and a dense retriever on your domain; pick BM25 first for long-tail content and lower cost.

Test feeding HTML vs plain text for your strongest model and keep the format that yields better answers.
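The three experiments above can share one small harness. The sketch below is one possible shape, not the paper's evaluation code: `build_prompt`, `compare_settings`, the prompt wording, the containment-based scoring, and the `stub_model` stand-in are all assumptions. In practice `answer_fn` would wrap whatever LLM client you use.

```python
from typing import Callable, Dict, List, Optional


def build_prompt(question: str, context: str = "") -> str:
    """Closed-book prompt when context is empty, grounded prompt otherwise."""
    if not context:
        return f"Question: {question}\nAnswer:"
    return f"Reference:\n{context}\n\nQuestion: {question}\nAnswer:"


def compare_settings(
    examples: List[Dict[str, str]],
    settings: Dict[str, Optional[str]],
    answer_fn: Callable[[str], str],
) -> Dict[str, float]:
    """Score each setting by the share of answers that contain the gold string.

    `settings` maps a setting name to the example key holding its context,
    or to None for the closed-book setting.
    """
    if not examples:
        raise ValueError("need at least one example")
    results = {}
    for name, context_key in settings.items():
        hits = 0
        for ex in examples:
            context = ex.get(context_key, "") if context_key else ""
            prediction = answer_fn(build_prompt(ex["question"], context))
            # Containment scoring is an assumption; swap in your own metric.
            hits += int(ex["answer"].strip().lower() in prediction.lower())
        results[name] = hits / len(examples)
    return results


def stub_model(prompt: str) -> str:
    """Toy stand-in for an LLM: echoes the first reference line, else gives up."""
    if prompt.startswith("Reference:"):
        return prompt.split("\n")[1]
    return "unknown"


# Hypothetical example with the same fact in text and HTML form.
examples = [{
    "question": "What is the minimum admission score?",
    "answer": "620",
    "text_ctx": "Minimum score: 620",
    "html_ctx": "<td>620</td>",
}]
settings = {"closed_book": None, "text": "text_ctx", "html": "html_ctx"}
scores = compare_settings(examples, settings, stub_model)
```

Running the same examples through every setting keeps the comparison fair: only the context changes, so any gap between `closed_book`, `text`, and `html` can be attributed to retrieval or to format.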

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Single application domain (Chinese university enrollment) may bias results and limit generality.

Did not evaluate more complex or production RAG frameworks and pipelines.

When Not To Use

Do not assume DomainRAG results generalize to very different domains without tests.

Avoid using the benchmark as the sole sign-off for production RAG systems in other industries.

Failure Modes

Model hallucinations when external knowledge is absent or noisy.

Lost-in-the-middle: relevant reference buried amid noisy docs reduces accuracy.
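One way to probe the lost-in-the-middle failure mode on your own data is to move the relevant reference through every slot of a noisy context and score the answers at each position. The sketch below only builds those context variants; the helper names `place_gold` and `position_variants` are hypothetical, and it makes no claim about how the paper constructed its noisy sets.

```python
from typing import Dict, List, Sequence


def place_gold(gold_doc: str, noise_docs: Sequence[str], position: int) -> List[str]:
    """Insert the gold document at `position` among the noise documents."""
    if not 0 <= position <= len(noise_docs):
        raise ValueError("position must be between 0 and len(noise_docs)")
    docs = list(noise_docs)  # copy so the caller's list is not mutated
    docs.insert(position, gold_doc)
    return docs


def position_variants(gold_doc: str, noise_docs: Sequence[str]) -> Dict[int, List[str]]:
    """Map every possible gold position to the resulting document order."""
    return {
        pos: place_gold(gold_doc, noise_docs, pos)
        for pos in range(len(noise_docs) + 1)
    }


# Placeholder strings stand in for real passages.
variants = position_variants("GOLD", ["noise-1", "noise-2"])
```

Feeding each variant to the model and comparing accuracy by position shows whether answers degrade when the gold passage sits in the middle rather than at the start or end.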

Core Entities

Models

Llama2-7B-chat, Llama2-13B-chat, Llama2-70B-chat, Baichuan2-7B-chat, Baichuan2-33B-32k, ChatGLM2-6B-32k, GPT-3.5-turbo-1106

Metrics

EM, EMS, F1, Rouge-L, GPT-4 evaluation (GE)

Datasets

DomainRAG (Extractive, Conversational, Structural, Faithful, Noisy, Time-sensitive, Multi-document); HTML corpus (1,686 web pages); Text corpus (14,406 passages)

Benchmarks

DomainRAG