DomainRAG: a Chinese benchmark testing how RAG helps LLMs solve college-enrollment questions

Overview

Decision Snapshot: Ready For Pilot

The dataset and experiments provide practical, reproducible evidence that RAG helps domain tasks; however, single-domain scope and absent comparisons with advanced RAG pipelines limit generality.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 2/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 45%

Authors

Shuting Wang, Jiongnan Liu, Shiren Song, Jiehan Cheng, Yuqi Fu, Peidong Guo, Kun Fang, Yutao Zhu, Zhicheng Dou

Why It Matters For Business

For domain or expert apps, adding domain documents to an LLM pipeline is essential: with verified references, GPT-3.5's exact-match accuracy on extractive Q&A rose from 0.1929 (closed-book) to 0.9233.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist, Engineering Lead

Summary TLDR

The authors release DomainRAG, a Chinese benchmark with HTML and text corpora for testing Retrieval-Augmented Generation (RAG) in a real domain: university enrollment. Evaluation is split into seven sub-datasets that target six practical abilities: conversational intent, structural HTML understanding, faithfulness to external expert documents, noise robustness (denoising), time-sensitive answers, and multi-document integration. Experiments on seven LLMs show that closed-book LLMs often fail on domain questions; retrieval, especially BM25, sharply improves accuracy; some models use HTML structure better than others; and models still struggle with multi-turn and multi-document scenarios and with ordering and noise in the IR.

Problem Statement

General RAG evaluations use open-domain sources (e.g., Wikipedia) that can be memorized by LLMs. That leaves open whether RAG helps LLMs in real domain, long-tail, and time-sensitive expert tasks. We need a benchmark that uses in-domain documents and targeted tests for practical RAG abilities.

Main Contribution

DomainRAG: a Chinese, in-domain benchmark for college-enrollment tasks with HTML and text corpora.

Seven sub-datasets (extractive QA, conversational QA, structural HTML/table QA, faithful QA with anti-references, noisy QA, time-sensitive QA, and multi-document QA) that together target six practical RAG abilities.

Key Findings

Providing golden domain references massively improves exact-match (EM) accuracy on extractive questions.

Numbers: GPT-3.5 EM, closed-book 0.1929 → golden 0.9233 (extractive, Table 2)

Practical Use: Always feed verified domain documents to your LLM pipeline; accuracy can jump several-fold compared to closed-book answers.

Evidence Ref: Table 2

BM25 sparse retrieval outperforms the tested dense retriever in this long-tail domain.

Numbers: GPT-3.5 EM, BM25 TOP3 0.7835 vs Dense TOP3 0.6921 (extractive, Table 2)

Practical Use: Use a strong sparse retriever (BM25) first for cost-sensitive, long-tail domains before investing in dense retrievers.

Evidence Ref: Table 2
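For readers unfamiliar with the sparse retriever named in the second finding, the sketch below shows one common variant of Okapi BM25 scoring in plain Python. It is illustrative only and not the paper's setup: the function name `bm25_rank`, the whitespace tokenizer, the toy English documents, and the defaults `k1=1.5` and `b=0.75` are all assumptions. Chinese text would first need a word segmenter, and the corpus is assumed to be non-empty.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return document indices sorted by descending BM25 score.

    Tokenization is a plain whitespace split (an assumption for this sketch).
    """
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)

    return sorted(range(n), key=lambda i: scores[i], reverse=True)


# Toy corpus (hypothetical content, not from the benchmark).
docs = [
    "admission score line for computer science 2023",
    "campus dormitory rules and fees",
    "scholarship application deadline",
]
ranking = bm25_rank("admission score line", docs)
```

Because BM25 rewards exact term overlap, it tends to do well when queries contain rare, domain-specific terms, which is consistent with the long-tail setting described above.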

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Extractive EM (closed-book)	GPT-3.5 0.1929 (extractive, Table 2)	closed-book	—	Extractive	GPT-3.5 closed-book EM = 0.1929	Table 2
Extractive EM (golden reference)	GPT-3.5 0.9233 (extractive, Table 2)	closed-book 0.1929	+0.7304 EM	Extractive	GPT-3.5 golden-reference EM = 0.9233	Table 2
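The delta in the table can be checked with simple arithmetic. The short snippet below recomputes it from the EM values reported in this summary, and also derives the relative gain and the BM25-versus-dense gap; the variable names are chosen here for illustration.

```python
# EM values copied from this summary (GPT-3.5, extractive split, Table 2).
closed_book = 0.1929
golden = 0.9233
bm25_top3 = 0.7835
dense_top3 = 0.6921

golden_delta = round(golden - closed_book, 4)     # absolute EM gain
golden_ratio = round(golden / closed_book, 2)     # relative gain (x-fold)
bm25_vs_dense = round(bm25_top3 - dense_top3, 4)  # BM25 advantage over dense

print(golden_delta, golden_ratio, bm25_vs_dense)  # 0.7304 4.79 0.0914
```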

What To Try In 7 Days

Run a closed-book vs retrieved test using a small in-domain corpus to measure retrieval uplift.

Compare BM25 and a dense retriever on your domain; pick BM25 first for long-tail content and lower cost.

Test feeding HTML vs plain text for your strongest model and keep the format that yields better answers.
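The closed-book versus retrieved comparison suggested above could be set up roughly as follows. This is a minimal sketch under stated assumptions: `answer_fn` and `retrieve_fn` are hypothetical callables standing in for your LLM call and your retriever, and the exact-match normalization (NFKC, lowercase, whitespace removed) is a guess at a reasonable choice rather than the benchmark's own scoring script.

```python
import unicodedata


def normalize(text):
    """NFKC-normalize, lowercase, and drop all whitespace (assumed normalization)."""
    return "".join(unicodedata.normalize("NFKC", text).lower().split())


def exact_match(prediction, gold):
    """True when prediction and gold are identical after normalization."""
    return normalize(prediction) == normalize(gold)


def retrieval_uplift(examples, answer_fn, retrieve_fn, top_k=3):
    """Return (closed_book_em, retrieved_em, delta) over (question, gold) pairs.

    answer_fn(question, docs) -> str   hypothetical wrapper around an LLM
    retrieve_fn(question) -> list[str] hypothetical retriever (BM25, dense, ...)
    """
    closed_hits = retrieved_hits = total = 0
    for question, gold in examples:
        total += 1
        # Closed-book: the model sees no documents.
        closed_hits += exact_match(answer_fn(question, []), gold)
        # Retrieved: the model sees the top-k documents.
        docs = retrieve_fn(question)[:top_k]
        retrieved_hits += exact_match(answer_fn(question, docs), gold)
    if total == 0:
        raise ValueError("examples is empty")
    closed_em = closed_hits / total
    retrieved_em = retrieved_hits / total
    return closed_em, retrieved_em, retrieved_em - closed_em


# Toy stand-ins so the sketch runs without a model or an index.
corpus = {"What is the application deadline?": ["The application deadline is June 30."]}


def toy_retrieve(question):
    return corpus.get(question, [])


def toy_answer(question, docs):
    return "June 30" if docs else "unknown"


result = retrieval_uplift(
    [("What is the application deadline?", "june 30")], toy_answer, toy_retrieve
)
print(result)  # (0.0, 1.0, 1.0)
```

The same harness covers the other two experiments: swap `retrieve_fn` to compare BM25 against a dense retriever, or have it return HTML instead of plain text to compare input formats.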

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://github.com/ShootingWong/DomainRAG

Risks & Boundaries

Limitations

Single application domain (Chinese university enrollment) may bias results and limit generality.

The authors did not evaluate more complex or production RAG frameworks and pipelines.

When Not To Use

Do not assume DomainRAG results generalize to very different domains without tests.

Avoid using the benchmark as the sole sign-off for production RAG systems in other industries.

Failure Modes

Model hallucinations when external knowledge is absent or noisy.

Lost-in-the-middle: relevant reference buried amid noisy docs reduces accuracy.
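One way to probe the lost-in-the-middle failure mode on your own data is to move the relevant reference through every position among distractor documents and score the model at each position. The helper below is a hypothetical sketch of the context-building step only; the function names and the toy inputs are assumptions, and the model call and scoring are left to the reader.

```python
from typing import Iterator, List, Tuple


def place_golden(golden: str, noise_docs: List[str], position: int) -> List[str]:
    """Return a new list with the golden reference inserted at `position`."""
    if not 0 <= position <= len(noise_docs):
        raise ValueError("position out of range")
    return noise_docs[:position] + [golden] + noise_docs[position:]


def position_variants(
    golden: str, noise_docs: List[str]
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (position, context) for every slot the golden doc could occupy."""
    for pos in range(len(noise_docs) + 1):
        yield pos, place_golden(golden, noise_docs, pos)


variants = list(position_variants("GOLD", ["noise-a", "noise-b"]))
```

If accuracy drops when the golden document sits in the middle positions, reordering retrieved results or reducing the number of distractors may be worth testing.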

Core Entities

Models

Llama2-7B-chat, Llama2-13B-chat, Llama2-70B-chat, Baichuan2-7B-chat, Baichuan2-33B-32k, ChatGLM2-6B-32k, GPT-3.5-turbo-1106

Metrics

EM, EMS, F1, Rouge-L, GPT-4 evaluation (GE)

Datasets

DomainRAG (Extractive, Conversational, Structural, Faithful, Noisy, Time-sensitive, Multi-document)

HTML corpus (1,686 web pages)

Text corpus (14,406 passages)

Benchmarks

DomainRAG
