Overview
The survey and benchmark provide actionable tests and code, but the evaluations are small (50 samples per test) and focused on QA. Treat the scores as indicative, and re-run larger, domain-specific tests before relying on them in production.
Citations: 9
Evidence Strength: 0.70
Confidence: 0.90
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 0/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 30%
Production readiness: 40%
Novelty: 40%
Why It Matters For Business
RAG systems can improve factual answers but also introduce privacy leaks, bias and brittle behavior; measuring those risks with a practical benchmark helps choose models and safeguards before production.
Summary TLDR
This paper defines six trustworthiness dimensions for Retrieval-Augmented Generation (RAG) systems (factuality, robustness, fairness, transparency, accountability and privacy), surveys prior work for each, and builds a small benchmark and evaluation pipeline to test 10 LLMs (open-source and proprietary) across those dimensions. Key takeaways: proprietary, instruction-tuned models tend to score higher on many trust dimensions; privacy and fairness remain weak; robustness varies widely between models; and the current benchmark is small and focused on QA. The authors release code and data pointers for reproducible evaluation.
Problem Statement
RAG systems feed external documents into LLMs to reduce hallucinations, but that same pipeline introduces new trust problems (wrong or poisoned retrievals, privacy leaks, bias amplification, opaque citations and brittle behavior). There is no unified framework or practical benchmark that measures these risks across retrieval, generation and evaluation.
Main Contribution
A unified framework defining six trustworthiness dimensions for RAG: factuality, robustness, fairness, transparency, accountability, and privacy.
A literature survey that organizes representative defenses, attacks and methods per dimension.
Key Findings
Proprietary models outperform most open-source models on trustworthiness metrics.
Instruction tuning and alignment improve many trust dimensions more than model size alone.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Factuality score (higher is better) | GPT-3.5-turbo 40; GPT-4o 26; Llama2-13b-chat 4 | — | — | RGB subset (50 samples) | Table 2 (Sec. 4.2.1) | Table 2 |
| Robustness (relative drop) | Baichuan2-7b-chat -42.4%; Llama2-13b-chat -31.5%; GPT-4o -1.9% | — | — | HotpotQA (noise: 3 vs 10 refs) | Table 2; Sec. 4.1.2 | Table 2 |
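The robustness row reports a relative drop when the number of retrieved references grows from 3 to 10. This summary does not reproduce the paper's exact formula; a common definition, assumed here, is the percentage change of the noisy score relative to the clean score. The function name and the example scores below are hypothetical and are not taken from the paper.

```python
def relative_drop(clean_score: float, noisy_score: float) -> float:
    """Percentage change from the clean to the noisy setting.

    Negative values mean the model got worse when noise was added.
    Assumed definition: (noisy - clean) / clean * 100.
    """
    if clean_score == 0:
        raise ValueError("clean_score must be non-zero")
    return (noisy_score - clean_score) / clean_score * 100.0


# Hypothetical scores for illustration only.
print(round(relative_drop(80.0, 60.0), 1))  # -25.0
```

Under this definition, a value near zero (such as the -1.9% reported for GPT-4o) means the score barely changed when more references were added.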
What To Try In 7 Days
Run the paper's GitHub benchmark on your retriever+generator to get baseline trust scores.
Add a reranker and small refiner (summarizer) to reduce noisy retrievals and re-run robustness tests.
Test privacy by probing membership and extraction attacks on a non-sensitive sample of your corpus and log failures.
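One simple way to log extraction failures from the privacy step is to check whether a model answer reproduces long word sequences verbatim from the corpus. The sketch below is an assumption-laden proxy, not the paper's privacy metric: the function name `leaked_ngrams`, the whitespace tokenization and the n-gram length are all illustrative choices, and it does not cover membership inference.

```python
def leaked_ngrams(output: str, documents: list[str], n: int = 8) -> set[tuple[str, ...]]:
    """Return word n-grams that appear verbatim in both the model output
    and at least one corpus document (case-insensitive, whitespace tokens)."""

    def ngrams(text: str) -> set[tuple[str, ...]]:
        tokens = text.lower().split()
        # range() is empty when the text is shorter than n tokens.
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    output_grams = ngrams(output)
    leaked: set[tuple[str, ...]] = set()
    for doc in documents:
        leaked |= output_grams & ngrams(doc)
    return leaked


# Hypothetical corpus and answer, using a non-sensitive sample.
corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
answer = "He said the quick brown fox jumps over the lazy dog today"
for gram in sorted(leaked_ngrams(answer, corpus, n=5)):
    print(" ".join(gram))
```

A non-empty result flags an answer for manual review. Paraphrased leaks will not be caught, so an empty result is not evidence of safety.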
Risks & Boundaries
Limitations
Benchmark uses small sample sets (50 items per test) which limits statistical reliability.
Evaluations focus on QA-style prompts and may not generalize to dialog, long-form or domain-specific tasks.
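To see why 50 items limit statistical reliability, it helps to put a confidence interval around an accuracy-style score. The sketch below uses the Wilson score interval for a binomial proportion; it assumes the metric is a simple fraction of correct answers, which may not match how every score in the benchmark is computed. The counts are hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical result: 20 correct answers out of 50 test items.
low, high = wilson_interval(20, 50)
print(f"{low:.3f} to {high:.3f}")
```

For 20 of 50 correct, the interval spans roughly 28% to 54%, so two models whose scores differ by ten points on a 50-item test may not be distinguishable.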
When Not To Use
Do not use these benchmark scores as the sole justification for high-stakes deployment without larger, domain-specific tests.
Do not treat the benchmark's privacy scores as guarantees; a model may leak real private data under targeted attacks even if it scores well here.
Failure Modes
Model follows misleading retrieved documents and outputs fabricated facts when retrieval contains poisoned or counterfactual passages.
Changes in prompt wording or retrieval order, and added retrieval noise, cause large accuracy drops for some models.
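Sensitivity to retrieval order can be probed by shuffling the retrieved passages and checking how often the answer stays the same. This is a minimal sketch under stated assumptions: `answer_fn` stands in for your own retriever-plus-generator call, exact string match is a crude equality test, and none of these names come from the paper's code.

```python
import random
from typing import Callable, Sequence


def order_consistency(
    answer_fn: Callable[[str, list[str]], str],
    question: str,
    passages: Sequence[str],
    trials: int = 10,
    seed: int = 0,
) -> float:
    """Fraction of shuffled passage orders that yield the same answer
    as the original order (1.0 means fully order-insensitive)."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    reference = answer_fn(question, list(passages))
    matches = 0
    for _ in range(trials):
        shuffled = list(passages)
        rng.shuffle(shuffled)
        if answer_fn(question, shuffled) == reference:
            matches += 1
    return matches / trials


# Hypothetical stand-in for a model that always trusts the first passage.
def first_passage_model(question: str, passages: list[str]) -> str:
    return passages[0]


docs = ["Paris", "Lyon", "Marseille", "Nice"]
print(order_consistency(first_passage_model, "Capital of France?", docs))
```

With a real model, replace exact match with whatever answer-equivalence check your evaluation already uses, and average the score over many questions.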

