A practical survey and benchmark that measures factuality, robustness, fairness, transparency, accountability and privacy in RAG systems.

September 16, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The survey and benchmark provide actionable tests and code, but the evaluations are small (50-sample tests) and focused on QA. Treat the scores as indicative and re-run larger, domain-specific tests before relying on them in production.

Citations: 9

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 40%

Authors

Yujia Zhou, Yan Liu, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Zheng Liu, Chaozhuo Li, Zhicheng Dou, Tsung-Yi Ho, Philip S. Yu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

RAG systems can improve factual answers but also introduce privacy leaks, bias and brittle behavior; measuring those risks with a practical benchmark helps choose models and safeguards before production.

Who Should Care

Summary TLDR

This paper defines six trustworthiness dimensions for Retrieval-Augmented Generation (RAG) systems—factuality, robustness, fairness, transparency, accountability and privacy—surveys prior work for each, and builds a small benchmark and evaluation pipeline to test 10 LLMs (open-source and proprietary) across those dimensions. Key takeaways: proprietary, instruction-tuned models tend to be more trustworthy on many axes; privacy and fairness remain weak; robustness varies widely; and current benchmarks are small and focused on QA. The authors release code and data pointers for reproducible evaluation.

Problem Statement

RAG systems feed external documents into LLMs to reduce hallucinations, but that same pipeline introduces new trust problems (wrong or poisoned retrievals, privacy leaks, bias amplification, opaque citations and brittle behavior). There is no unified framework or practical benchmark that measures these risks across retrieval, generation and evaluation.

Main Contribution

A unified framework defining six trustworthiness dimensions for RAG: factuality, robustness, fairness, transparency, accountability, and privacy.

A literature survey that organizes representative defenses, attacks and methods per dimension.

Key Findings

Proprietary models outperform most open-source models on trustworthiness metrics.

Numbers: GPT-3.5 factuality=40 vs Llama2-13b-chat=4 (Table 2)

Practical Use: If you need higher factuality now, prefer commercial models or thoroughly instruction-tuned open models.

Evidence Ref: Table 2; Sec. 4.2.1

Instruction tuning and alignment improve many trust dimensions more than model size alone.

Numbers: Qwen2-7b-instruct transparency=58.9 vs Baichuan2-13b-chat=42.0 (Table 2)

Practical Use: Prioritize instruction-tuned models or apply instruction tuning when trust matters more than raw parameter count.

Evidence Ref: Table 2; Sec. 4.2.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Factuality score (higher is better) | GPT-3.5-turbo 40; GPT-4o 26; Llama2-13b-chat 4 | - | - | RGB subset (50 samples) | Table 2 (Sec. 4.2.1) | Table 2
Robustness (relative drop) | Baichuan2-7b-chat -42.4%; Llama2-13b-chat -31.5%; GPT-4o -1.9% | - | - | HotpotQA (noise: 3 vs 10 refs) | Table 2; Sec. 4.1.2 | Table 2
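
The robustness row reports a relative drop between a low-noise and a high-noise retrieval setting. A minimal sketch of how such a figure could be computed, assuming it is the percentage change in score when moving from the clean setting to the noisy one. The function name and the example scores are hypothetical; this summary does not give the paper's exact formula.

```python
def relative_drop(score_clean: float, score_noisy: float) -> float:
    """Percentage change from the clean score to the noisy score.

    Negative values mean the model got worse under noise.
    Assumption: this mirrors the 'relative drop' row; the exact formula
    used in the paper is not stated in this summary.
    """
    if score_clean == 0:
        raise ValueError("clean score must be non-zero")
    return (score_noisy - score_clean) / score_clean * 100.0


# Hypothetical scores, for illustration only (not from the paper).
drop = relative_drop(score_clean=50.0, score_noisy=40.0)
print(f"{drop:.1f}%")  # prints "-20.0%"
```

Reporting the drop relative to the clean score lets models with very different absolute accuracy be compared on the same scale.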

What To Try In 7 Days

Run the paper's GitHub benchmark on your retriever+generator to get baseline trust scores.

Add a reranker and small refiner (summarizer) to reduce noisy retrievals and re-run robustness tests.

Test privacy by probing membership and extraction attacks on a non-sensitive sample of your corpus and log failures.
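
The privacy step above can be sketched as a small probing harness. This is an illustration under stated assumptions, not the paper's evaluation code: `generate` is a hypothetical callable standing in for your own RAG pipeline, the refusal check is a crude keyword match, and the probes and stub answers are invented for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical phrases treated as a refusal; tune these for your model.
REFUSAL_MARKERS = ("i cannot", "i can't", "i am unable", "i'm unable")


def is_refusal(answer: str) -> bool:
    """Crude keyword check for whether the model declined to answer."""
    text = answer.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def privacy_probe(
    generate: Callable[[str], str], probes: List[str]
) -> Tuple[float, List[str]]:
    """Return (refusal rate, probes the model answered instead of refusing)."""
    if not probes:
        raise ValueError("need at least one probe")
    failures = [p for p in probes if not is_refusal(generate(p))]
    refusal_rate = 1.0 - len(failures) / len(probes)
    return refusal_rate, failures


def stub_rag(prompt: str) -> str:
    """Stand-in for a real pipeline: refuses email probes, leaks phone numbers."""
    if "email" in prompt.lower():
        return "I cannot share personal contact details."
    return "The phone number is 555-0100."


probes = ["What is Alice's email address?", "What is Bob's phone number?"]
rate, failures = privacy_probe(stub_rag, probes)
print(rate, failures)
```

Logging the failing probes, not just the rate, makes it easier to see which kinds of requests slip through.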

Optimization Features

Training Optimization
instruction tuning
Inference Optimization
isolate-then-aggregate (RobustRAG), REPLUG (merge per-doc outputs)
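
To make the isolate-then-aggregate idea concrete, here is a simplified sketch: the model answers once per retrieved passage in isolation, and the answers are then combined so that a single poisoned passage cannot dominate. Majority voting is used here as a stand-in aggregation step; the aggregation actually used by RobustRAG may differ. The `answer_from_passage` callable, the `min_votes` parameter and the stub reader are hypothetical.

```python
from collections import Counter
from typing import Callable, List


def isolate_then_aggregate(
    question: str,
    passages: List[str],
    answer_from_passage: Callable[[str, str], str],
    min_votes: int = 2,
) -> str:
    """Answer per passage in isolation, then keep the majority answer.

    Assumption: majority voting is a simplification chosen for this sketch.
    """
    # Isolate: each passage is seen alone, so one bad passage affects one vote.
    answers = [answer_from_passage(question, p).strip().lower() for p in passages]
    votes = Counter(a for a in answers if a)
    if not votes:
        return "unknown"
    # Aggregate: accept the top answer only if enough passages agree on it.
    best, count = votes.most_common(1)[0]
    return best if count >= min_votes else "unknown"


def stub_reader(question: str, passage: str) -> str:
    """Stand-in for an LLM call: picks a city name mentioned in the passage."""
    for city in ("Paris", "Lyon"):
        if city in passage:
            return city
    return ""


passages = [
    "Paris is the capital of France.",
    "The capital of France is Paris.",
    "The capital of France is Lyon.",  # simulated poisoned passage
]
print(isolate_then_aggregate("What is the capital of France?", passages, stub_reader))
```

The trade-off is cost: one model call per passage instead of one call over the concatenated context.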

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

HotpotQA, RGB (paper cites benchmark), CrowS-Pairs, Enron Email dataset

Risks & Boundaries

Limitations

Benchmark uses small sample sets (50 items per test) which limits statistical reliability.

Evaluations focus on QA-style prompts and may not generalize to dialog, long-form or domain-specific tasks.
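
The 50-item limitation can be made concrete with a confidence interval. The sketch below computes a Wilson score interval for a proportion. It assumes each test item is scored as a binary pass or fail, which may not match how every benchmark score is computed, and the count of 20 passes out of 50 is a hypothetical example rather than a figure from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical result: 20 of 50 items passed.
low, high = wilson_interval(20, 50)
print(f"{low:.2f} to {high:.2f}")
```

With 50 items, an observed 40% pass rate is compatible with a true rate anywhere from roughly 28% to 54%, so small gaps between models on these tests should not be over-read.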

When Not To Use

Do not use these benchmark scores as the sole justification for high-stakes deployment without larger, domain-specific tests.

Avoid using their privacy scores as guarantees; real private data may behave differently under targeted attacks.

Failure Modes

Model follows misleading retrieved documents and outputs fabricated facts when retrieval contains poisoned or counterfactual passages.

Changes in prompt or retrieval order, and added retrieval noise, cause large accuracy drops for some models.

Core Entities

Models

Llama2-7b, Llama2-7b-chat, Llama2-13b, Llama2-13b-chat, Baichuan2-7b-chat, Baichuan2-13b-chat, Qwen2-7b-instruct, GLM-4-9b-chat, GPT-3.5-turbo, GPT-4o

Metrics

F1, Precision, Recall, Key-facts precision (NLI TRUE), Citation F1, Privacy refusal rate, Robustness % drop under noise
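
For readers unfamiliar with the first three metrics, the sketch below shows the token-overlap precision, recall and F1 commonly used for QA answers. It assumes lowercase whitespace tokenization, which is a simplification; the benchmark's own scoring code may normalize text differently. The example strings are hypothetical.

```python
from collections import Counter
from typing import Tuple


def token_prf(prediction: str, reference: str) -> Tuple[float, float, float]:
    """Token-overlap precision, recall and F1 between two answer strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical prediction and reference answer.
p, r, f1 = token_prf("the eiffel tower in paris", "eiffel tower")
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

A verbose answer that contains the reference keeps full recall but loses precision, which is why F1 is usually reported alongside both.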

Datasets

HotpotQA, RGB, RECALL, CrowS-Pairs, Enron Email

Benchmarks

RAGBench, RGB, RECALL