RGB: a bilingual benchmark diagnosing how LLMs fail when using retrieved evidence

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and exposes clear weaknesses across all tested models. The reported numbers come from controlled experiments that use news-based retrieval and judge-assisted scoring.

Citations: 52

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 25%

Production readiness: 40%

Novelty: 30%

Authors

Jiawei Chen, Hongyu Lin, Xianpei Han, Le Sun

Links

Abstract / PDF / Code / Data

Why It Matters For Business

RAG can improve factuality, but retrieved noise and false facts cause wrong outputs and missed refusals, risking user trust and legal/brand exposure in production.

Who Should Care

Product Manager, ML Engineer, CTO, Founder

Summary TLDR

This paper builds RGB, a bilingual (English/Chinese) benchmark that tests four core abilities needed for retrieval-augmented generation (RAG): noise robustness, negative rejection, information integration, and counterfactual robustness. The authors evaluate six LLMs (including ChatGPT) with search-retrieved documents and find RAG helps accuracy but important gaps remain: models degrade with noisy documents, often fail to refuse when evidence is missing, struggle to combine facts across documents, and are easily misled by false retrieved facts. RGB and error analyses point to document filtering, better context selection, and veracity checks as practical next steps.

Problem Statement

There is no systematic, cross-model evaluation showing which aspects of retrieval-augmented generation work or fail. Teams need a clear, model-agnostic diagnosis of how retrieved documents help or harm LLM answers.

Main Contribution

Created RGB, a bilingual benchmark (English/Chinese) focused on four RAG abilities: noise robustness, negative rejection, information integration, counterfactual robustness.

Evaluated six LLMs (ChatGPT, ChatGLM-6B, ChatGLM2-6B, Vicuna-7B, Qwen-7B-Chat, BELLE-7B) on RGB and quantified failures across the four abilities.

Key Findings

Adding noisy retrieved documents lowers answer accuracy for all tested LLMs.

Numbers: ChatGPT accuracy 96.33% → 76.00% (noise ratio 0 → 0.8)

Practical Use: Filter or rerank retrieved documents; high noise ratios call for a stronger retriever or downstream filtering before prompting the LLM.

Evidence Ref: Table 1 (Noise robustness: ChatGPT row)
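The filter-or-rerank advice above can be illustrated with a minimal sketch. It is not the paper's method: the function names, the token-overlap score, and the `min_overlap` and `top_k` defaults are all assumptions chosen for illustration, and a production system would more likely use a trained reranker.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; a crude stand-in for a real tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank_and_filter(
    query: str,
    docs: list[str],
    min_overlap: float = 0.5,  # assumed threshold, tune on your own data
    top_k: int = 3,            # assumed cap on documents passed to the LLM
) -> list[str]:
    """Score each doc by the fraction of query tokens it contains,
    drop docs below min_overlap, and return the top_k best matches."""
    query_tokens = tokenize(query)
    if not query_tokens:
        return []
    scored = []
    for doc in docs:
        overlap = len(query_tokens & tokenize(doc)) / len(query_tokens)
        if overlap >= min_overlap:
            scored.append((overlap, doc))
    # Highest overlap first; the sort is stable, so ties keep retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


docs = [
    "Bananas are yellow.",
    "France won the match.",
    "Paris is the capital of France.",
]
kept = rerank_and_filter("capital of France", docs)
```

Lexical overlap is cheap but cannot tell a relevant document from one that merely repeats the query's words, so treat it as a first-pass filter rather than a veracity check.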

Models often fail to refuse when retrieved docs lack the answer.

Numbers: Exact-match rejection ≤31% (English); judge-based rejection up to 45%

Practical Use: Add calibrated refusal prompts, verification steps, or uncertainty scoring to avoid confident but unsupported answers.

Evidence Ref: Table 3 (Negative rejection, Rej and Rej*)
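The gap between exact-match and judge-based rejection is easier to see with a small sketch of exact-match scoring. The refusal phrase and function name below are hypothetical placeholders, not the benchmark's actual string or code. The point is that a response only counts as a rejection if it contains the prescribed phrase, so a model that declines in its own words scores zero here even though a judge might credit it.

```python
# Hypothetical refusal phrase; substitute the exact string your prompt prescribes.
REFUSAL_PHRASE = "insufficient information"


def exact_match_rejection_rate(
    responses: list[str], refusal: str = REFUSAL_PHRASE
) -> float:
    """Percentage of responses that contain the prescribed refusal phrase
    (case-insensitive substring match)."""
    if not responses:
        return 0.0
    needle = refusal.lower()
    hits = sum(1 for response in responses if needle in response.lower())
    return 100.0 * hits / len(responses)


responses = [
    "I cannot answer because of insufficient information.",
    "The answer is 42.",
    "Sorry, the documents do not say.",  # a refusal, but not the prescribed one
    "Insufficient information in the documents.",
]
rate = exact_match_rejection_rate(responses)
```

In this toy set the third response is a genuine refusal that exact matching misses, which is the kind of case judge-based scoring is meant to catch.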

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	96.33% (noise 0) → 76.00% (noise 0.8)	96.33% (noise 0)	-20.33 pp	RGB (Table 1)	Accuracy falls steadily as negative-doc ratio increases	Table 1
Negative rejection (exact-match / judge)	English exact 24.67% → judge 45.00% (ChatGPT)	—	—	RGB Negative Rejection (Table 3)	Models rarely output the prescribed refusal string; judge-based scoring shows somewhat higher but still low rejection	Table 3
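As a quick check on the Delta column, the arithmetic behind the accuracy row can be reproduced as below. The helper name is an assumption; the relative figure is derived only from the two accuracy values in the table.

```python
def accuracy_delta(baseline: float, value: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative change in %)."""
    absolute_pp = round(value - baseline, 2)
    relative_pct = round(100.0 * (value - baseline) / baseline, 2)
    return absolute_pp, relative_pct


# ChatGPT accuracy at noise ratio 0 versus noise ratio 0.8.
absolute_pp, relative_pct = accuracy_delta(96.33, 76.00)
```

The absolute change matches the -20.33 pp in the table, and it corresponds to a relative drop of roughly 21% from the noise-free baseline.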

What To Try In 7 Days

Add simple document reranking or heuristic filtering (date, source trust) before prompting the LLM.

Calibrate refusal prompts and test them on a small validation set to increase safe rejections for unknown facts.

Run a quick multi-document decomposition: split multi-part queries into sub-queries and aggregate the answers.
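The heuristic filtering step in the list above (date and source trust) could be sketched as follows. Everything here is an illustrative assumption: the `Doc` fields, the allowlist domains, and the cutoff date are placeholders, not values from the paper.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    text: str
    source: str      # domain the document was retrieved from
    published: date  # publication date


# Hypothetical allowlist; replace with the sources you actually trust.
TRUSTED_SOURCES = frozenset({"example-news.com", "example-wire.org"})


def heuristic_filter(
    docs: list[Doc],
    not_before: date,
    trusted: frozenset[str] = TRUSTED_SOURCES,
) -> list[Doc]:
    """Keep docs from trusted sources published on or after not_before,
    ordered newest first."""
    kept = [d for d in docs if d.source in trusted and d.published >= not_before]
    return sorted(kept, key=lambda d: d.published, reverse=True)


docs = [
    Doc("old but trusted", "example-news.com", date(2020, 1, 1)),
    Doc("recent and trusted", "example-wire.org", date(2023, 6, 1)),
    Doc("recent but untrusted", "random-blog.example", date(2023, 7, 1)),
    Doc("newest and trusted", "example-news.com", date(2023, 8, 1)),
]
kept = heuristic_filter(docs, not_before=date(2023, 1, 1))
```

A hard allowlist is blunt and will discard some good documents, so it is worth measuring answer accuracy with and without the filter before adopting it.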

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/chen700564/RGB

Data URLs

https://github.com/chen700564/RGB

Risks & Boundaries

Limitations

RGB uses news articles and search results, so it reflects web-news scenarios but may not cover specialized domains.

QA pairs and some ground truth were generated or assisted by ChatGPT, creating possible bias or leakage.

When Not To Use

For high-stakes decisions that depend on verified claims, unless an additional verification step is in place.

For domain-specific knowledge that web news covers poorly (for example medical or legal), unless you add domain-specific retrieval.

Failure Modes

Hallucination driven by distant or weak evidence in retrieved docs.

Trusting and repeating false facts from retrieved documents.

Core Entities

Models

ChatGPT (gpt-3.5-turbo), ChatGLM-6B, ChatGLM2-6B, Vicuna-7B-v1.3, Qwen-7B-Chat, BELLE-7B-2M

Metrics

Accuracy, rejection rate, error detection rate, error correction rate

Datasets

RGB (Retrieval-Augmented Generation Benchmark)

Benchmarks

RGB

