A practical survey and benchmark that measures factuality, robustness, fairness, transparency, accountability and privacy in RAG systems.

September 16, 2024 | 8 min

Overview

Production Readiness

0.4

Novelty Score

0.4

Cost Impact Score

0.3

Citation Count

9

Authors

Yujia Zhou, Yan Liu, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Zheng Liu, Chaozhuo Li, Zhicheng Dou, Tsung-Yi Ho, Philip S. Yu

Links

Abstract / PDF

Why It Matters For Business

RAG systems can improve factual answers but also introduce privacy leaks, bias and brittle behavior; measuring those risks with a practical benchmark helps choose models and safeguards before production.

Summary TLDR

This paper defines six trustworthiness dimensions for Retrieval-Augmented Generation (RAG) systems—factuality, robustness, fairness, transparency, accountability and privacy—surveys prior work for each, and builds a small benchmark and evaluation pipeline to test 10 LLMs (open-source and proprietary) across those dimensions. Key takeaways: proprietary, instruction-tuned models tend to be more trustworthy on many axes; privacy and fairness remain weak; robustness varies widely; and current benchmarks are small and focused on QA. The authors release code and data pointers for reproducible evaluation.

Problem Statement

RAG systems feed external documents into LLMs to reduce hallucinations, but that same pipeline introduces new trust problems (wrong or poisoned retrievals, privacy leaks, bias amplification, opaque citations and brittle behavior). There is no unified framework or practical benchmark that measures these risks across retrieval, generation and evaluation.

Main Contribution

A unified framework defining six trustworthiness dimensions for RAG: factuality, robustness, fairness, transparency, accountability, and privacy.

A literature survey that organizes representative defenses, attacks and methods per dimension.

A practical benchmark and evaluation pipeline (code published) that scores 10 LLMs across the six dimensions using small QA-based tests.

Key Findings

Proprietary models outperform most open-source models on trustworthiness metrics.

Numbers: GPT-3.5 factuality=40 vs Llama2-13b-chat=4 (Table 2)

Instruction tuning and alignment improve many trust dimensions more than model size alone.

Numbers: Qwen2-7b-instruct transparency=58.9 vs Baichuan2-13b-chat=42.0 (Table 2)

Privacy protection is weak across many models.

Numbers: Llama2-7b privacy=0; GPT-3.5 privacy=0 (Table 2)

Robustness to noisy retrievals varies widely; some models collapse under moderate noise.

Numbers: Robustness drops: Baichuan2-7b-chat -42.4% vs GPT-4o -1.9% (Table 2)

Citation and accountability performance can be high for some models but is uneven.

Numbers: GPT-4o accountability=77.6, GPT-3.5=60.1, many open models <50 (Table 2)

The paper's evaluations use small task samples (50 QA items per test) and narrow domains.

Numbers: Factuality/robustness/transparency/fairness/privacy tests use 50 samples each (Sec. 4.1)

Results

Factuality score (higher better)

Value: GPT-3.5-turbo 40; GPT-4o 26; Llama2-13b-chat 4

Robustness (relative drop)

Value: Baichuan2-7b-chat -42.4%; Llama2-13b-chat -31.5%; GPT-4o -1.9%
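
A relative drop of this kind can be computed as the percentage change between a score with clean retrievals and the score with noisy retrievals. The sketch below assumes that definition; the paper's exact formula may differ, and the function name and the example scores are illustrative, not taken from Table 2.

```python
def relative_drop(clean_score: float, noisy_score: float) -> float:
    """Percentage change from the clean-retrieval score to the noisy-retrieval score.

    A negative value means the model got worse once noise was added.
    Assumed definition: (noisy - clean) / clean * 100.
    """
    if clean_score == 0:
        raise ValueError("clean_score must be non-zero to compute a relative drop")
    return (noisy_score - clean_score) / clean_score * 100.0


# Hypothetical scores, for illustration only.
print(round(relative_drop(clean_score=50.0, noisy_score=40.0), 1))  # -20.0
```

Reporting the drop relative to the clean score, rather than the raw difference, keeps models with very different baselines comparable.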

Transparency score

Value: Qwen2-7b-instruct 58.9; GPT-3.5-turbo 61.2; Baichuan2-13b-chat 42.0

Accountability (citation F1)

Value: GPT-4o 77.6; GPT-3.5-turbo 60.1; GLM-4-9b-chat 50.6
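
As a rough illustration of what a citation F1 measures, the sketch below compares the set of documents a model cites against the set of documents that actually support the answer. This set-overlap formulation is an assumption made for clarity; the paper's pipeline may judge support differently, and the document IDs are made up.

```python
def citation_f1(cited: set[str], supporting: set[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) of cited document IDs against supporting IDs."""
    true_positives = len(cited & supporting)
    precision = true_positives / len(cited) if cited else 0.0
    recall = true_positives / len(supporting) if supporting else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the model cites d1, d2, d4; only d1, d2, d3 support the answer.
precision, recall, f1 = citation_f1({"d1", "d2", "d4"}, {"d1", "d2", "d3"})
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Low precision here means the model cites documents that do not back its claims; low recall means it leaves supporting documents uncited.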

Privacy (refusal rate / protective behavior)

Value: Llama2-7b 0; Llama2-13b 0; GPT-3.5-turbo 0; Llama2-7b-chat 46; GPT-4o 4
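
One simple way to approximate a refusal rate is to count how many responses to privacy-probing prompts contain a refusal phrase. The sketch below uses a keyword heuristic; the marker list and function names are assumptions for illustration, and the paper's own scoring may rely on a different judge.

```python
# Hypothetical refusal phrases; extend or replace for your own models.
REFUSAL_MARKERS = (
    "i cannot",
    "i can't",
    "i am unable",
    "i'm unable",
    "cannot share",
    "not able to provide",
)


def is_refusal(response: str) -> bool:
    """Heuristic: a response counts as a refusal if it contains any marker phrase."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses: list[str]) -> float:
    """Percentage of responses flagged as refusals (0.0 for an empty list)."""
    if not responses:
        return 0.0
    refused = sum(is_refusal(r) for r in responses)
    return 100.0 * refused / len(responses)


# Made-up responses to a prompt asking for a private email address.
sample = [
    "The address is jane@example.com.",
    "I cannot share personal contact details.",
    "Sure, it is john@example.com.",
    "Here you go: bob@example.com.",
]
print(refusal_rate(sample))  # 25.0
```

Keyword matching misses polite deflections and can flag harmless answers, so a manual spot check of flagged and unflagged responses is worthwhile.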

Who Should Care

What To Try In 7 Days

Run the paper's GitHub benchmark on your retriever+generator to get baseline trust scores.

Add a reranker and small refiner (summarizer) to reduce noisy retrievals and re-run robustness tests.

Test privacy by running membership-inference and data-extraction probes against a non-sensitive sample of your corpus, and log every case where the model leaks content.
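
For the robustness re-run, one way to control how noisy the retrieved context is would be to mix a fixed share of irrelevant passages into the relevant ones before prompting the model. The helper below is a minimal sketch under that assumption; `inject_noise` and its parameters are hypothetical and not part of the paper's released code.

```python
import random


def inject_noise(
    relevant: list[str],
    distractors: list[str],
    noise_ratio: float,
    seed: int = 0,
) -> list[str]:
    """Mix distractor passages into the context so they form about noise_ratio of it."""
    if not 0.0 <= noise_ratio < 1.0:
        raise ValueError("noise_ratio must be in [0, 1)")
    rng = random.Random(seed)  # fixed seed keeps test runs reproducible
    # Solve n_noise / (len(relevant) + n_noise) = noise_ratio for n_noise.
    n_noise = round(len(relevant) * noise_ratio / (1.0 - noise_ratio))
    n_noise = min(n_noise, len(distractors))
    mixed = list(relevant) + rng.sample(distractors, n_noise)
    rng.shuffle(mixed)  # avoid always placing noise at the end of the context
    return mixed


# Toy passages: 3 relevant, 40% noise requested, so 2 distractors are added.
relevant = ["rel-1", "rel-2", "rel-3"]
distractors = ["noise-1", "noise-2", "noise-3", "noise-4"]
context = inject_noise(relevant, distractors, noise_ratio=0.4)
print(len(context))  # 5
```

Running the same question set at several noise ratios, with and without the reranker, shows whether the added components actually reduce the drop.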

Optimization Features

Training Optimization

  • instruction tuning

Inference Optimization

  • isolate-then-aggregate (RobustRAG)
  • REPLUG (merge per-doc outputs)
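
The isolate-then-aggregate idea can be sketched as: answer the query once per retrieved passage, in isolation, then keep only an answer that enough passages agree on. The version below uses a plain majority vote as a stand-in for the aggregation step, which is a simplification of the published method; `generate` is a hypothetical callable standing in for your LLM call, and `min_votes` is an assumed parameter.

```python
from collections import Counter
from typing import Callable, Optional


def isolate_then_aggregate(
    query: str,
    passages: list[str],
    generate: Callable[[str, str], str],
    min_votes: int = 2,
) -> Optional[str]:
    """Answer per passage in isolation, then return the most common answer.

    Returns None when no answer reaches min_votes, so a single poisoned
    passage cannot decide the final output on its own.
    """
    # Isolation step: each call sees exactly one passage.
    answers = [generate(query, passage).strip().lower() for passage in passages]
    votes = Counter(answer for answer in answers if answer)
    if not votes:
        return None
    best_answer, count = votes.most_common(1)[0]
    return best_answer if count >= min_votes else None


def toy_generate(query: str, passage: str) -> str:
    """Toy stand-in for an LLM: returns the text after the colon in the passage."""
    return passage.split(":", 1)[1]


passages = ["doc1: Paris", "doc2: Paris", "poisoned: Lyon"]
print(isolate_then_aggregate("Capital of France?", passages, toy_generate))  # paris
```

The cost is one model call per passage, so this trades latency and spend for resistance to a small number of corrupted retrievals.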

Reproducibility

Data Urls

  • HotpotQA
  • RGB (paper cites benchmark)
  • CrowS-Pairs
  • Enron Email dataset

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Benchmark uses small sample sets (50 items per test) which limits statistical reliability.
  • Evaluations focus on QA-style prompts and may not generalize to dialog, long-form or domain-specific tasks.
  • Some model evaluations are black-box; causes of poor scores may be dataset or prompt artifacts.
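
To see how much uncertainty a 50-item test carries, the sketch below computes a 95% Wilson score interval for a proportion. It assumes a score can be read as the fraction of items answered acceptably, which may not hold for every metric in the paper; the example of 20 successes out of 50 is illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 20 correct out of 50 items, i.e. an observed score of 40%.
low, high = wilson_interval(successes=20, n=50)
print(f"{low:.3f} to {high:.3f}")
```

For 20 out of 50 the interval spans roughly 28% to 54%, so score gaps of a few points between models on a 50-item test should not be read as meaningful differences.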

When Not To Use

  • Do not use these benchmark scores as the sole justification for high-stakes deployment without larger, domain-specific tests.
  • Avoid treating the paper's privacy scores as guarantees; real private data may behave differently under targeted attacks.

Failure Modes

  • Model follows misleading retrieved documents and outputs fabricated facts when retrieval contains poisoned or counterfactual passages.
  • Prompt or retrieval order and noise cause large accuracy drops for some models.
  • Membership-inference and prompt-injection attacks can extract private data even in black-box settings.

Core Entities

Models

  • Llama2-7b
  • Llama2-7b-chat
  • Llama2-13b
  • Llama2-13b-chat
  • Baichuan2-7b-chat
  • Baichuan2-13b-chat
  • Qwen2-7b-instruct
  • GLM-4-9b-chat
  • GPT-3.5-turbo
  • GPT-4o

Metrics

  • F1
  • Precision
  • Recall
  • Key-facts precision (NLI TRUE)
  • Citation F1
  • Privacy refusal rate
  • Robustness % drop under noise

Datasets

  • HotpotQA
  • RGB
  • RECALL
  • CrowS-Pairs
  • Enron Email

Benchmarks

  • RAGBench
  • RGB
  • RECALL