A practical survey and benchmark that measures factuality, robustness, fairness, transparency, accountability and privacy in RAG systems.

Overview

Decision Snapshot: Needs Validation

The survey and benchmark provide actionable tests and code, but the evaluations are small (50 samples per test) and focused on QA. Treat the scores as indicative and re-run larger, domain-specific tests before relying on them in production.

Citations: 9

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 40%

Authors

Yujia Zhou, Yan Liu, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Zheng Liu, Chaozhuo Li, Zhicheng Dou, Tsung-Yi Ho, Philip S. Yu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

RAG systems can improve factual answers but also introduce privacy leaks, bias and brittle behavior; measuring those risks with a practical benchmark helps choose models and safeguards before production.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist, Founder

Summary TLDR

This paper defines six trustworthiness dimensions for Retrieval-Augmented Generation (RAG) systems—factuality, robustness, fairness, transparency, accountability and privacy—surveys prior work for each, and builds a small benchmark and evaluation pipeline to test 10 LLMs (open-source and proprietary) across those dimensions. Key takeaways: proprietary, instruction-tuned models tend to be more trustworthy on many axes; privacy and fairness remain weak; robustness varies widely; and current benchmarks are small and focused on QA. The authors release code and data pointers for reproducible evaluation.

Problem Statement

RAG systems feed external documents into LLMs to reduce hallucinations, but that same pipeline introduces new trust problems (wrong or poisoned retrievals, privacy leaks, bias amplification, opaque citations and brittle behavior). There is no unified framework or practical benchmark that measures these risks across retrieval, generation and evaluation.

Main Contribution

A unified framework defining six trustworthiness dimensions for RAG: factuality, robustness, fairness, transparency, accountability, and privacy.

A literature survey that organizes representative defenses, attacks and methods per dimension.

Key Findings

Proprietary models outperform most open-source models on trustworthiness metrics.

Numbers: GPT-3.5-turbo factuality=40 vs Llama2-13b-chat=4 (Table 2)

Practical Use: If you need higher factuality now, prefer commercial models or thoroughly instruction-tuned open models.

Evidence Ref: Table 2; Sec. 4.2.1

Instruction tuning and alignment improve many trust dimensions more than model size alone.

Numbers: Qwen2-7b-instruct transparency=58.9 vs Baichuan2-13b-chat=42.0 (Table 2)

Practical Use: Prioritize instruction-tuned models or apply instruction tuning when trust matters more than raw parameter count.

Evidence Ref: Table 2; Sec. 4.2.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Factuality score (higher better)	GPT-3.5-turbo 40; GPT-4o 26; Llama2-13b-chat 4	—	—	RGB subset (50 samples)	Table 2 (Sec. 4.2.1)	Table 2
Robustness (relative drop)	Baichuan2-7b-chat -42.4%; Llama2-13b-chat -31.5%; GPT-4o -1.9%	—	—	HotpotQA (noise: 3 vs 10 refs)	Table 2; Sec. 4.1.2	Table 2
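The robustness row reports a relative drop between a low-noise and a high-noise retrieval setting. A minimal sketch of that arithmetic follows, assuming the drop is the percent change of a score from the cleaner setting to the noisier one; the paper's exact formula may differ, and the function name and example scores are illustrative, not taken from the paper.

```python
def relative_drop(clean_score: float, noisy_score: float) -> float:
    """Percent change from the clean to the noisy setting.

    A negative value means the model got worse when noise was added.
    Assumption: both scores are on the same scale and clean_score is non-zero.
    """
    if clean_score == 0:
        raise ValueError("clean_score must be non-zero")
    return (noisy_score - clean_score) / clean_score * 100.0


# Hypothetical scores for illustration only (not from Table 2).
few_refs_score = 50.0   # e.g. score with 3 retrieved references
many_refs_score = 40.0  # e.g. score with 10 retrieved references
print(f"{relative_drop(few_refs_score, many_refs_score):.1f}%")
```

Reporting the drop relative to the clean score, rather than as an absolute difference, makes models with different baseline scores comparable.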

What To Try In 7 Days

Run the paper's GitHub benchmark on your retriever+generator to get baseline trust scores.

Add a reranker and small refiner (summarizer) to reduce noisy retrievals and re-run robustness tests.

Test privacy by running membership-inference and extraction probes against a non-sensitive sample of your corpus, and log any failures.
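The privacy step above can be approximated with a canary-style extraction probe. This is a rough sketch, not the benchmark's own procedure: `generate` is a hypothetical callable wrapping your RAG pipeline, and each probe pairs a prompt with a harmless marker string you have planted in the indexed corpus.

```python
from typing import Callable, List, Tuple


def extraction_failures(
    generate: Callable[[str], str],
    probes: List[Tuple[str, str]],
) -> List[str]:
    """Return the prompts whose output reveals the planted canary string.

    generate: hypothetical wrapper around your retriever + generator.
    probes:   (prompt, canary) pairs; the canary is a non-sensitive marker
              that should never appear verbatim in an answer.
    """
    failures = []
    for prompt, canary in probes:
        output = generate(prompt)
        if canary.lower() in output.lower():
            failures.append(prompt)  # log this prompt as a leak
    return failures


def leak_rate(failures: List[str], probes: List[Tuple[str, str]]) -> float:
    """Fraction of probes that leaked; 0.0 when there are no probes."""
    return len(failures) / len(probes) if probes else 0.0
```

Substring matching only catches verbatim leaks; paraphrased disclosures would need a semantic check, which this sketch leaves out.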

Optimization Features

Training Optimization

instruction tuning

Inference Optimization

isolate-then-aggregate (RobustRAG), REPLUG (merge per-doc outputs)
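To make the isolate-then-aggregate idea concrete, here is a simplified sketch: the model answers from each retrieved document separately, and an answer is accepted only if enough documents agree. A plain majority vote stands in for the aggregation step here; RobustRAG's actual aggregation is more involved. `answer_from_doc` and `min_votes` are hypothetical names introduced for illustration.

```python
from collections import Counter
from typing import Callable, List, Optional


def isolate_then_aggregate(
    question: str,
    docs: List[str],
    answer_from_doc: Callable[[str, str], str],
    min_votes: int = 2,
) -> Optional[str]:
    """Answer from each document in isolation, then keep the majority answer.

    answer_from_doc: hypothetical callable (question, doc) -> short answer,
                     e.g. one LLM call per retrieved document.
    min_votes:       minimum number of agreeing documents required; with the
                     default of 2, one poisoned document cannot decide alone.
    Returns None when no answer reaches min_votes (abstain).
    """
    answers = [answer_from_doc(question, doc).strip().lower() for doc in docs]
    answers = [a for a in answers if a]  # drop empty responses
    if not answers:
        return None
    top_answer, votes = Counter(answers).most_common(1)[0]
    return top_answer if votes >= min_votes else None
```

The trade-off is cost: one generation call per document instead of one call over the concatenated context.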

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/smallporridge/TrustworthyRAG

Data URLs

HotpotQA, RGB (paper cites benchmark), CrowS-Pairs, Enron Email dataset

Risks & Boundaries

Limitations

Benchmark uses small sample sets (50 items per test) which limits statistical reliability.

Evaluations focus on QA-style prompts and may not generalize to dialog, long-form or domain-specific tasks.
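A quick way to see why 50 items limit reliability is to put a confidence interval around a score. The sketch below uses the standard Wilson score interval and assumes a score can be read as a proportion of correct items out of 50, which this summary does not confirm for every metric.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Illustration: 20 correct answers out of 50 (a 40% score under the assumption above).
low, high = wilson_interval(20, 50)
print(f"95% interval: {low:.2f} to {high:.2f}")
```

Under that assumption, a 40% score on 50 items is compatible with a true rate anywhere from roughly 28% to 54%, so small gaps between models on this benchmark should not be over-interpreted.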

When Not To Use

Do not use these benchmark scores as the sole justification for high-stakes deployment without larger, domain-specific tests.

Avoid using their privacy scores as guarantees; real private data may behave differently under targeted attacks.

Failure Modes

Model follows misleading retrieved documents and outputs fabricated facts when retrieval contains poisoned or counterfactual passages.

Prompt or retrieval order and noise cause large accuracy drops for some models.

Core Entities

Models

Llama2-7b, Llama2-7b-chat, Llama2-13b, Llama2-13b-chat, Baichuan2-7b-chat, Baichuan2-13b-chat, Qwen2-7b-instruct, GLM-4-9b-chat, GPT-3.5-turbo, GPT-4o

Metrics

F1, Precision, Recall, Key-facts precision (NLI TRUE), Citation F1, Privacy refusal rate, Robustness % drop under noise
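For readers unfamiliar with how precision, recall and F1 are typically scored in QA, here is a sketch of the common token-overlap variant. It is an assumption that the benchmark computes them this way; its own scripts may normalize text differently (for example, stripping punctuation or articles).

```python
from collections import Counter
from typing import Tuple


def token_prf(prediction: str, reference: str) -> Tuple[float, float, float]:
    """Token-overlap precision, recall and F1 between two answer strings.

    Tokens are lowercased and split on whitespace; repeated tokens are
    matched at most as many times as they occur in both strings.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = token_prf("the cat sat", "the cat")
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```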

Datasets

HotpotQA, RGB, RECALL, CrowS-Pairs, Enron Email

Benchmarks

RAGBench, RGB, RECALL

