Overview
The benchmark is practical and exposes weaknesses in every evaluated model; the reported numbers come from controlled experiments that use news-based retrieval and judge-assisted scoring.
Citations: 52
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 2/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 25%
Production readiness: 40%
Novelty: 30%
Why It Matters For Business
RAG can improve factuality, but retrieved noise and false facts cause wrong outputs and missed refusals, risking user trust and legal/brand exposure in production.
Summary TLDR
This paper builds RGB, a bilingual (English/Chinese) benchmark that tests four core abilities needed for retrieval-augmented generation (RAG): noise robustness, negative rejection, information integration, and counterfactual robustness. The authors evaluate six LLMs (including ChatGPT) with search-retrieved documents and find RAG helps accuracy but important gaps remain: models degrade with noisy documents, often fail to refuse when evidence is missing, struggle to combine facts across documents, and are easily misled by false retrieved facts. RGB and error analyses point to document filtering, better context selection, and veracity checks as practical next steps.
Problem Statement
There is no systematic, cross-model evaluation showing which aspects of retrieval-augmented generation work or fail. Teams need a clear, model-agnostic diagnosis of how retrieved documents help or harm LLM answers.
Main Contribution
Created RGB, a bilingual benchmark (English/Chinese) focused on four RAG abilities: noise robustness, negative rejection, information integration, and counterfactual robustness.
Evaluated six LLMs (ChatGPT, ChatGLM-6B, ChatGLM2-6B, Vicuna-7B, Qwen-7B-Chat, BELLE-7B) on RGB and quantified failures across the four abilities.
Key Findings
Adding noisy retrieved documents lowers answer accuracy for all tested LLMs.
Models often fail to refuse when retrieved docs lack the answer.
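The noise-robustness finding can be probed on your own data by mixing answer-bearing and irrelevant documents at a fixed ratio before prompting. The following is a minimal sketch, not the authors' code: `build_context`, its parameters, and the sample document names are hypothetical, and it assumes you already have separate pools of positive and noisy documents for each question.

```python
import random


def build_context(positive_docs, negative_docs, noise_ratio, total_docs, seed=0):
    """Mix answer-bearing and noisy documents at a fixed noise ratio.

    Hypothetical helper: noise_ratio is the fraction of the context
    made up of documents that do not contain the answer.
    """
    if not 0.0 <= noise_ratio <= 1.0:
        raise ValueError("noise_ratio must be in [0, 1]")
    n_noise = round(total_docs * noise_ratio)
    n_pos = total_docs - n_noise
    if n_pos > len(positive_docs) or n_noise > len(negative_docs):
        raise ValueError("not enough documents for the requested mix")
    rng = random.Random(seed)  # seeded so runs are repeatable
    docs = rng.sample(positive_docs, n_pos) + rng.sample(negative_docs, n_noise)
    rng.shuffle(docs)  # avoid a fixed position for the useful documents
    return docs


if __name__ == "__main__":
    pos = [f"pos{i}" for i in range(5)]
    neg = [f"neg{i}" for i in range(5)]
    for ratio in (0.0, 0.4, 0.8):
        print(ratio, build_context(pos, neg, ratio, total_docs=5))
```

Sweeping the ratio and scoring answers at each step gives an accuracy-versus-noise curve you can compare across models.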
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 96.33% (noise 0) → 76.00% (noise 0.8) | 96.33% (noise 0) | -20.33 pp | RGB (Table 1) | Accuracy falls steadily as negative-doc ratio increases | Table 1 |
| Negative rejection rate | 24.67% (exact match); 45.00% (judge-based); ChatGPT, English | n/a | n/a | RGB Negative Rejection (Table 3) | Models rarely output the prescribed refusal string; judge-based scoring gives a higher but still low rejection rate | Table 3 |
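The two rows above rest on simple arithmetic: a percentage-point delta against a baseline, and a rejection rate counted by matching a refusal string. A minimal sketch of both follows; the function names, the sample responses, and the refusal marker are illustrative assumptions, not the benchmark's own scoring code or its prescribed refusal text.

```python
def delta_pp(value, baseline):
    """Difference between two percentages, in percentage points."""
    return round(value - baseline, 2)


def rejection_rate(responses, refusal_marker):
    """Percent of responses that contain the refusal marker.

    String-match scoring in the spirit of the exact-match column;
    the marker is supplied by the caller (placeholder here).
    """
    if not responses:
        raise ValueError("responses must be non-empty")
    marker = refusal_marker.lower()
    hits = sum(marker in r.lower() for r in responses)
    return 100.0 * hits / len(responses)


if __name__ == "__main__":
    # Accuracy row: noise 0.8 versus the noise 0 baseline.
    print(delta_pp(76.00, 96.33))

    # Hypothetical model outputs and a placeholder refusal marker.
    outputs = [
        "I cannot answer: insufficient information.",
        "The answer is 2022.",
        "Insufficient information in the documents.",
        "Paris",
    ]
    print(rejection_rate(outputs, "insufficient information"))
```

String matching undercounts refusals that are phrased differently, which is one plausible reason the judge-based figure in the table is higher.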
What To Try In 7 Days
Add simple document reranking or heuristic filtering (date, source trust) before prompting the LLM.
Calibrate refusal prompts and test them on a small validation set to increase safe rejections for unknown facts.
Run quick multi-doc decomposition: split multi-part queries into sub-queries and aggregate answers.
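The first suggestion, heuristic filtering by date and source trust, could be sketched as below. This is an illustration under stated assumptions: the `Doc` fields, the `filter_and_rerank` function, the trusted-source set, and the age cutoff are all hypothetical, and the retriever is assumed to supply a relevance score per document.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    text: str
    source: str      # e.g. the publisher's domain (assumed field)
    published: date  # publication date (assumed field)
    score: float     # retriever relevance score (assumed field)


def filter_and_rerank(docs, trusted_sources, max_age_days, today, top_k):
    """Drop untrusted or stale documents, then keep the top_k by score."""
    kept = [
        d for d in docs
        if d.source in trusted_sources
        and 0 <= (today - d.published).days <= max_age_days
    ]
    # Higher relevance first; newer documents win ties.
    kept.sort(key=lambda d: (d.score, d.published), reverse=True)
    return kept[:top_k]


if __name__ == "__main__":
    today = date(2024, 1, 31)
    docs = [
        Doc("fresh, trusted", "news.example", date(2024, 1, 30), 0.70),
        Doc("stale, trusted", "news.example", date(2023, 1, 1), 0.95),
        Doc("fresh, untrusted", "blog.example", date(2024, 1, 29), 0.90),
        Doc("fresh, trusted, best", "wire.example", date(2024, 1, 20), 0.85),
    ]
    trusted = {"news.example", "wire.example"}
    for d in filter_and_rerank(docs, trusted, max_age_days=30, today=today, top_k=2):
        print(d.text)
```

A hard filter like this trades recall for precision, so it is worth checking on a validation set that it does not remove the only document containing the answer.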
Risks & Boundaries
Limitations
RGB uses news articles and search results, so it reflects web-news scenarios but may not cover specialized domains.
QA pairs and some ground truth were generated or assisted by ChatGPT, creating possible bias or leakage.
When Not To Use
For high-stakes decisions that require verified claims, unless the outputs are independently verified.
For domain-specific knowledge not well covered by web news (medical, legal) without domain retrieval.
Failure Modes
Hallucination driven by distant or weak evidence in retrieved docs.
Trusting and repeating false facts from retrieved documents.

