Overview
The dataset and experiments provide practical, reproducible evidence that RAG helps domain tasks; however, single-domain scope and absent comparisons with advanced RAG pipelines limit generality.
Citations: 4
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 2/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 45%
Why It Matters For Business
For domain or expert apps, adding domain documents to an LLM pipeline matters: with verified references, GPT-3.5 moves from about 0.19 exact-match accuracy closed-book to about 0.92 on extractive factual Q&A.
Summary TLDR
The authors release DomainRAG, a Chinese benchmark and corpora (HTML + text) for testing Retrieval-Augmented Generation (RAG) in a real domain: university enrollment. They split evaluation into seven sub-datasets that target six practical abilities: conversational intent, structural HTML understanding, faithfulness to external expert docs, denoising/noise-robustness, time-sensitive answers, and multi-document integration. Experiments on seven LLMs show: closed-book LLMs often fail on domain questions, retrieval (especially BM25) sharply improves accuracy, some models use HTML structure better, and models still struggle with multi-turn and multi-doc scenarios and with ordering/noise in the IR
Problem Statement
General RAG evaluations use open-domain sources (e.g., Wikipedia) that can be memorized by LLMs. That leaves open whether RAG helps LLMs in real domain, long-tail, and time-sensitive expert tasks. We need a benchmark that uses in-domain documents and targeted tests for practical RAG abilities.
Main Contribution
DomainRAG: a Chinese, in-domain benchmark for college-enrollment tasks with HTML and text corpora.
Seven sub-datasets (extractive QA, conversational QA, structural HTML/table QA, faithful QA with anti-references, noisy QA, time-sensitive QA, and multi-document QA) that together target six practical RAG abilities.
Key Findings
Providing golden domain references massively improves exact-match (EM) accuracy on extractive questions.
BM25 sparse retrieval outperforms the tested dense retriever in this long-tail domain.
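To make the sparse-retrieval finding concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not the authors' retrieval code: the whitespace tokenization, the toy English documents, and the defaults `k1=1.5` and `b=0.75` are all assumptions for illustration, and a Chinese corpus would need a proper word segmenter.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus (not taken from DomainRAG).
docs = [
    "admission score line for computer science 2023".split(),
    "campus dining hall opening hours".split(),
    "computer science enrollment plan and quota".split(),
]
query = "computer science admission score".split()
scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Because BM25 rewards exact term overlap, rare domain terms get high IDF weight, which is one plausible reason it holds up on long-tail content.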
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Extractive EM (closed-book) | 0.1929 | none (reference row) | — | Extractive | GPT-3.5 closed-book EM = 0.1929 | Table 2 |
| Extractive EM (golden reference) | 0.9233 | closed-book 0.1929 | +0.7304 EM | Extractive | GPT-3.5 golden-reference EM = 0.9233 | Table 2 |
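
The delta in the table can be checked with a short calculation. The two EM values come from the table above; the variable names and the relative-improvement figure are additions for illustration.

```python
# GPT-3.5 extractive EM values as reported in the Results table (Table 2).
closed_book_em = 0.1929
golden_ref_em = 0.9233

# Absolute uplift in EM points, rounded to the table's precision.
abs_delta = round(golden_ref_em - closed_book_em, 4)

# Relative factor: how many times higher EM is with golden references.
rel_factor = golden_ref_em / closed_book_em

print(f"absolute delta: +{abs_delta:.4f} EM")
print(f"relative factor: about {rel_factor:.1f}x")
```

The absolute delta matches the +0.7304 EM in the table, and the relative factor is a little under 5x.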
What To Try In 7 Days
Run a closed-book vs retrieved test using a small in-domain corpus to measure retrieval uplift.
Compare BM25 and a dense retriever on your domain; pick BM25 first for long-tail content and lower cost.
Test feeding HTML vs plain text for your strongest model and keep the format that yields better answers.
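The first item above can be sketched as a small evaluation harness. This is a hedged illustration, not the benchmark's evaluation code: `exact_match` uses strict, case-insensitive string equality (the paper's EM definition may differ), and `toy_model`, `toy_retrieve`, and the two example questions are hypothetical stand-ins for a real LLM call, a real retriever, and a real in-domain test set.

```python
def exact_match(prediction, gold):
    """Return 1 if the normalized prediction equals the gold answer, else 0."""
    return int(prediction.strip().lower() == gold.strip().lower())

def evaluate(answer_fn, examples, retrieve_fn=None):
    """Mean exact match; pass retrieve_fn to switch from closed-book to RAG."""
    hits = 0
    for ex in examples:
        context = retrieve_fn(ex["question"]) if retrieve_fn else None
        hits += exact_match(answer_fn(ex["question"], context), ex["answer"])
    return hits / len(examples)

# Hypothetical data: replace with your own in-domain questions and corpus.
examples = [
    {"question": "What is the application deadline?", "answer": "June 30"},
    {"question": "How many seats are offered?", "answer": "120"},
]
corpus = {
    "What is the application deadline?": "deadline: June 30",
    "How many seats are offered?": "seats: 120",
}

def toy_retrieve(question):
    """Stand-in retriever: looks the passage up by exact question text."""
    return corpus.get(question)

def toy_model(question, context):
    """Stand-in for an LLM: can only answer when a passage is supplied."""
    if context is None:
        return "unknown"
    return context.split(":")[-1].strip()

closed_book = evaluate(toy_model, examples)
with_retrieval = evaluate(toy_model, examples, toy_retrieve)
uplift = with_retrieval - closed_book
```

Swapping `toy_retrieve` for BM25 and then for a dense retriever, while holding the model fixed, gives the comparison described in the second item.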
Risks & Boundaries
Limitations
Single application domain (Chinese university enrollment) may bias results and limit generality.
Did not evaluate more complex or production RAG frameworks and pipelines.
When Not To Use
Do not assume DomainRAG results generalize to very different domains without tests.
Avoid using the benchmark as the sole sign-off for production RAG systems in other industries.
Failure Modes
Model hallucinations when external knowledge is absent or noisy.
Lost-in-the-middle: relevant reference buried amid noisy docs reduces accuracy.
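One way to probe the lost-in-the-middle failure mode on your own data is to place the relevant reference at every position among the noisy documents and measure accuracy per position. The helper below is a minimal sketch under that assumption; the function name and placeholder documents are hypothetical, and the scoring step is left to whatever model and metric you already use.

```python
def contexts_by_position(gold_doc, noise_docs):
    """Yield (position, docs) with the gold document inserted at each slot."""
    for pos in range(len(noise_docs) + 1):
        yield pos, noise_docs[:pos] + [gold_doc] + noise_docs[pos:]

# Placeholder documents; use real retrieved passages in practice.
noise = ["noise A", "noise B", "noise C"]
variants = dict(contexts_by_position("GOLD", noise))

for pos, docs in variants.items():
    # Here you would build the prompt from docs and score the model's answer.
    print(pos, docs)
```

A drop in accuracy at the middle positions relative to the first and last would indicate position sensitivity in the model being tested.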

