RGB: a bilingual benchmark diagnosing how LLMs fail when using retrieved evidence

September 4, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and clearly shows weaknesses across models; numbers come from controlled experiments on news-based retrieval and judge-assisted scoring.

Citations: 52

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 25%

Production readiness: 40%

Novelty: 30%

Authors

Jiawei Chen, Hongyu Lin, Xianpei Han, Le Sun

Links

Abstract / PDF / Code / Data

Why It Matters For Business

RAG can improve factuality, but retrieved noise and false facts cause wrong outputs and missed refusals, risking user trust and legal/brand exposure in production.

Summary TLDR

This paper builds RGB, a bilingual (English/Chinese) benchmark that tests four core abilities needed for retrieval-augmented generation (RAG): noise robustness, negative rejection, information integration, and counterfactual robustness. The authors evaluate six LLMs (including ChatGPT) with search-retrieved documents and find RAG helps accuracy but important gaps remain: models degrade with noisy documents, often fail to refuse when evidence is missing, struggle to combine facts across documents, and are easily misled by false retrieved facts. RGB and error analyses point to document filtering, better context selection, and veracity checks as practical next steps.

Problem Statement

There is no systematic, cross-model evaluation showing which aspects of retrieval-augmented generation work or fail. Teams need a clear, model-agnostic diagnosis of how retrieved documents help or harm LLM answers.

Main Contribution

Created RGB, a bilingual benchmark (English/Chinese) focused on four RAG abilities: noise robustness, negative rejection, information integration, counterfactual robustness.

Evaluated six LLMs (ChatGPT, ChatGLM-6B, ChatGLM2-6B, Vicuna-7B, Qwen-7B-Chat, BELLE-7B) on RGB and quantified failures across the four abilities.

Key Findings

Adding noisy retrieved documents lowers answer accuracy for all tested LLMs.

Numbers: ChatGPT accuracy 96.33% → 76.00% (noise ratio 0 → 0.8)

Practical Use: Filter or rerank retrieved documents; high noise ratios require a stronger retriever or downstream filtering before prompting the LLM.

Evidence Ref: Table 1 (Noise robustness: ChatGPT row)
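The filtering step suggested above can be sketched in a few lines. This is a minimal illustration, not the paper's method: it assumes a crude lexical-overlap score as a stand-in for a real reranker, and the function names, the `top_k` and `min_overlap` parameters, and the sample documents are all hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank_by_overlap(
    query: str, docs: list[str], top_k: int = 3, min_overlap: float = 0.5
) -> list[str]:
    """Keep at most top_k docs that cover at least min_overlap of the query tokens.

    Hypothetical heuristic: overlap = |query tokens found in doc| / |query tokens|.
    """
    q_tokens = tokenize(query)
    if not q_tokens:
        return []
    scored = []
    for doc in docs:
        overlap = len(q_tokens & tokenize(doc)) / len(q_tokens)
        if overlap >= min_overlap:
            scored.append((overlap, doc))
    # Highest overlap first; the sort is stable, so ties keep retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


# Toy data for illustration only.
query = "Who won the 2022 Nobel Prize in Literature?"
docs = [
    "Annie Ernaux won the 2022 Nobel Prize in Literature.",
    "The weather in Paris was sunny on Tuesday.",
    "The Nobel Prize in Physics 2022 went to three scientists.",
]
kept = rerank_by_overlap(query, docs, top_k=2)
```

Here the weather document falls below the threshold and is dropped, while the Physics document survives as a near-miss distractor. A lexical filter removes only obviously irrelevant noise, which is why a stronger reranker may still be needed.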

Models often fail to refuse when retrieved docs lack the answer.

Numbers: Exact-match rejection ≤31% (English); judge-based rejection up to 45%

Practical Use: Add calibrated refusal prompts, verification steps, or uncertainty scoring to avoid confident but unsupported answers.

Evidence Ref: Table 3 (Negative rejection, Rej and Rej*)
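The gap between the two rejection numbers comes from how a refusal is recognized. The sketch below shows the idea under stated assumptions: `REFUSAL_MARKER` is a hypothetical placeholder rather than the benchmark's actual refusal string, and `loose_judge` is a keyword stand-in for the model-based judge used in the paper.

```python
from typing import Callable

# Hypothetical placeholder; the benchmark prescribes its own refusal string.
REFUSAL_MARKER = "insufficient information"


def rejection_rate(responses: list[str], is_refusal: Callable[[str], bool]) -> float:
    """Return the percentage of responses that is_refusal counts as refusals."""
    if not responses:
        return 0.0
    hits = sum(1 for response in responses if is_refusal(response))
    return 100.0 * hits / len(responses)


def exact_match(response: str) -> bool:
    """Strict check: the prescribed marker must appear in the response."""
    return REFUSAL_MARKER in response.lower()


def loose_judge(response: str) -> bool:
    """Keyword stand-in for a judge model (assumption, not the paper's judge)."""
    cues = (REFUSAL_MARKER, "do not mention", "cannot answer", "not enough information")
    text = response.lower()
    return any(cue in text for cue in cues)


# Toy responses for illustration only.
responses = [
    "I cannot answer because of insufficient information in the documents.",
    "The documents do not mention the winner.",
    "The winner was Team A.",
    "It happened in 2021.",
]
strict = rejection_rate(responses, exact_match)
judged = rejection_rate(responses, loose_judge)
```

The second response refuses in substance but misses the prescribed wording, so it counts only under the looser check. Running both scorers on a small validation set is a cheap way to see how much of a low rejection rate is a phrasing problem rather than a behavior problem.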

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| Accuracy | 96.33% (noise 0) → 76.00% (noise 0.8) | 96.33% (noise 0) | -20.33 pp | RGB (Table 1) | Accuracy falls steadily as negative-doc ratio increases | Table 1 |
| Negative rejection (exact-match / judge) | English exact 24.67% → judge 45.00% (ChatGPT) | | | RGB Negative Rejection (Table 3) | Models rarely output the prescribed refusal string; judge-based scoring shows somewhat higher but still low rejection | Table 3 |
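The accuracy delta is reported in percentage points (pp), which differs from a relative drop. A small worked check, using only the two ChatGPT accuracy figures reported above (the helper names are illustrative):

```python
def pp_delta(baseline: float, value: float) -> float:
    """Absolute change in percentage points, rounded to two decimals."""
    return round(value - baseline, 2)


def relative_change(baseline: float, value: float) -> float:
    """Change relative to the baseline, in percent, rounded to two decimals."""
    return round(100.0 * (value - baseline) / baseline, 2)


# ChatGPT accuracy at noise ratio 0 and 0.8, as reported in the results above.
delta = pp_delta(96.33, 76.00)
relative = relative_change(96.33, 76.00)
```

The absolute drop is 20.33 pp, which works out to roughly a 21% relative loss of the noise-free accuracy.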

What To Try In 7 Days

Add simple document reranking or heuristic filtering (date, source trust) before prompting the LLM.

Calibrate refusal prompts and test them on a small validation set to increase safe rejections for unknown facts.

Run quick multi-doc decomposition: split multi-part queries into sub-queries and aggregate answers.
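The decomposition idea in the last item can be prototyped without committing to a particular retriever or model. The sketch below is an assumption-laden outline, not the paper's procedure: sub-queries are supplied by the caller, and `retrieve` and `answer` are hypothetical callables standing in for a search step and an LLM call.

```python
from typing import Callable, Optional


def answer_multi_part(
    sub_queries: list[str],
    retrieve: Callable[[str], list[str]],
    answer: Callable[[str, list[str]], str],
) -> dict[str, Optional[str]]:
    """Answer each sub-query against its own retrieved documents.

    A sub-query with no retrieved documents is recorded as None (unanswered).
    """
    results: dict[str, Optional[str]] = {}
    for sub_query in sub_queries:
        docs = retrieve(sub_query)
        results[sub_query] = answer(sub_query, docs) if docs else None
    return results


def aggregate(results: dict[str, Optional[str]]) -> str:
    """Join the partial answers, or name the parts that lack evidence."""
    missing = [query for query, partial in results.items() if partial is None]
    if missing:
        return "Insufficient evidence for: " + "; ".join(missing)
    return " ".join(partial for partial in results.values() if partial is not None)


# Toy corpus and stand-in callables for illustration only.
corpus = {
    "Who won the 2021 final?": ["Team A won the 2021 final."],
    "Who won the 2022 final?": [],
}


def toy_retrieve(sub_query: str) -> list[str]:
    return corpus.get(sub_query, [])


def toy_answer(sub_query: str, docs: list[str]) -> str:
    return docs[0]


results = answer_multi_part(list(corpus), toy_retrieve, toy_answer)
summary = aggregate(results)
```

Handling each sub-query separately makes a missing piece of evidence visible, so the aggregate step can refuse for that part instead of letting the model guess across documents.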

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

RGB uses news articles and search results, so it reflects web-news scenarios but may not cover specialized domains.

QA pairs and some ground truth were generated or assisted by ChatGPT, creating possible bias or leakage.

When Not To Use

When high-stakes decisions need verified claims and no additional verification step is in place.

For domain-specific knowledge not well covered by web news (medical, legal) without domain retrieval.

Failure Modes

Hallucination driven by distant or weak evidence in retrieved docs.

Trusting and repeating false facts from retrieved documents.

Core Entities

Models

ChatGPT (gpt-3.5-turbo), ChatGLM-6B, ChatGLM2-6B, Vicuna-7B-v1.3, Qwen-7B-Chat, BELLE-7B-2M

Metrics

Accuracy, rejection rate, error detection rate, error correction rate

Datasets

RGB (Retrieval-Augmented Generation Benchmark)

Benchmarks

RGB