Overview
Production Readiness
0.4
Novelty Score
0.4
Cost Impact Score
0.3
Citation Count
9
Why It Matters For Business
RAG systems can improve factual answers but also introduce privacy leaks, bias and brittle behavior; measuring those risks with a practical benchmark helps choose models and safeguards before production.
Summary TLDR
This paper defines six trustworthiness dimensions for Retrieval-Augmented Generation (RAG) systems (factuality, robustness, fairness, transparency, accountability and privacy), surveys prior work on each, and builds a small benchmark and evaluation pipeline that tests 10 LLMs, both open-source and proprietary, across those dimensions. Key takeaways: proprietary, instruction-tuned models tend to be more trustworthy on many axes; privacy and fairness remain weak; robustness varies widely; and current benchmarks are small and focused on QA. The authors release code and pointers to the datasets for reproducible evaluation.
Problem Statement
RAG systems feed external documents into LLMs to reduce hallucinations, but that same pipeline introduces new trust problems (wrong or poisoned retrievals, privacy leaks, bias amplification, opaque citations and brittle behavior). There is no unified framework or practical benchmark that measures these risks across retrieval, generation and evaluation.
Main Contribution
A unified framework defining six trustworthiness dimensions for RAG: factuality, robustness, fairness, transparency, accountability, and privacy.
A literature survey that organizes representative defenses, attacks and methods per dimension.
A practical benchmark and evaluation pipeline (code published) that scores 10 LLMs across the six dimensions using small QA-based tests.
Key Findings
Proprietary models outperform most open-source models on trustworthiness metrics.
Instruction tuning and alignment improve many trust dimensions more than model size alone.
Privacy protection is weak across many models.
Robustness to noisy retrievals varies widely; some models collapse under moderate noise.
Citation quality (the accountability metric) is high for some models but uneven across the set.
The paper's evaluations use small task samples (50 QA items per test) and narrow domains.
Results
Factuality score (higher better)
Robustness (relative drop)
Transparency score
Accountability (citation F1)
Privacy (refusal rate / protective behavior)
Who Should Care
What To Try In 7 Days
Run the paper's GitHub benchmark on your retriever+generator to get baseline trust scores.
Add a reranker and a small refiner (summarizer) to filter noisy retrievals, then re-run the robustness tests.
Test privacy by running membership-inference and extraction probes against a non-sensitive sample of your corpus, and log every failure.
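For the robustness re-run, one simple way to control how noisy the retrieved context is can be sketched as follows. This is a minimal illustration, not the paper's benchmark code: `inject_noise`, its parameters, and the idea of drawing distractors from a separate pool are all assumptions made for the example.

```python
import random
from typing import List, Sequence


def inject_noise(
    relevant: Sequence[str],
    distractors: Sequence[str],
    noise_ratio: float,
    seed: int = 0,
) -> List[str]:
    """Mix distractor passages into a retrieved set (hypothetical helper).

    noise_ratio is the target fraction of the final context that is noise,
    e.g. 0.5 means roughly one distractor per relevant passage.
    """
    if not 0.0 <= noise_ratio < 1.0:
        raise ValueError("noise_ratio must be in [0, 1)")
    rng = random.Random(seed)  # fixed seed so repeated runs are comparable
    # Number of distractors needed so that noise / (noise + relevant) ~= ratio.
    n_noise = round(len(relevant) * noise_ratio / (1.0 - noise_ratio))
    n_noise = min(n_noise, len(distractors))  # cannot sample more than we have
    mixed = list(relevant) + rng.sample(list(distractors), n_noise)
    rng.shuffle(mixed)  # avoid a position cue that separates signal from noise
    return mixed
```

Scoring the same questions at several noise ratios (for example 0.0, 0.25, 0.5) before and after adding the reranker gives a like-for-like view of whether the change helped.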
Optimization Features
Training Optimization
- instruction tuning
Inference Optimization
- isolate-then-aggregate (RobustRAG)
- REPLUG (merge per-doc outputs)
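The two inference-time ideas listed above can be sketched in a few lines. Both functions are simplified illustrations under stated assumptions, not the published implementations: the isolate-then-aggregate sketch uses a plain majority vote over per-passage answers (RobustRAG's own aggregation is more elaborate), and the REPLUG-style sketch assumes each per-document pass yields a next-token probability dictionary that is mixed using softmax-normalized retrieval scores. `answer_fn`, `min_votes` and the dictionary representation are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, Optional, Sequence


def isolate_then_aggregate(
    question: str,
    passages: Sequence[str],
    answer_fn: Callable[[str, str], str],
    min_votes: int = 2,
) -> Optional[str]:
    """Answer from each passage in isolation, then keep the majority answer.

    A single poisoned passage can only cast one vote, so it cannot override
    answers supported by several independent passages.
    """
    votes: Counter = Counter()
    for passage in passages:
        answer = answer_fn(question, passage).strip().lower()
        if answer:
            votes[answer] += 1
    if not votes:
        return None
    answer, count = votes.most_common(1)[0]
    return answer if count >= min_votes else None  # abstain on weak support


def replug_style_ensemble(
    per_doc_probs: Sequence[Dict[str, float]],
    retrieval_scores: Sequence[float],
) -> Dict[str, float]:
    """Mix per-document next-token distributions, weighted by retrieval score."""
    top = max(retrieval_scores)
    exps = [math.exp(s - top) for s in retrieval_scores]  # stable softmax
    total = sum(exps)
    mixed: Dict[str, float] = {}
    for weight, probs in zip((e / total for e in exps), per_doc_probs):
        for token, p in probs.items():
            mixed[token] = mixed.get(token, 0.0) + weight * p
    return mixed
```

The vote threshold trades coverage for safety: a higher `min_votes` makes the system abstain more often but requires more independent passages to agree before it answers.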
Reproducibility
Data Urls
- HotpotQA
- RGB (paper cites benchmark)
- CrowS-Pairs
- Enron Email dataset
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Benchmark uses small sample sets (50 items per test) which limits statistical reliability.
- Evaluations focus on QA-style prompts and may not generalize to dialog, long-form or domain-specific tasks.
- Some model evaluations are black-box; causes of poor scores may be dataset or prompt artifacts.
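To make the sample-size limitation concrete, a confidence interval on a score measured over 50 items can be computed with the standard Wilson score interval. The sketch below is an illustration added here, not part of the paper's pipeline; `wilson_interval` is a hypothetical helper, and it assumes each item is scored as a simple pass or fail.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives ~95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Example: a model that passes 35 of 50 items (70%).
low, high = wilson_interval(35, 50)
print(f"95% interval: {low:.2f} to {high:.2f}")
```

At this sample size the interval for a 70% score spans roughly 25 percentage points, so two models whose scores differ by a few points on a 50-item test cannot be reliably ranked.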
When Not To Use
- Do not use these benchmark scores as the sole justification for high-stakes deployment without larger, domain-specific tests.
- Do not treat the paper's privacy scores as guarantees; real private data may behave differently under targeted attacks.
Failure Modes
- The model follows misleading retrieved documents and outputs fabricated facts when the retrieval set contains poisoned or counterfactual passages.
- Changes in prompt wording or retrieval order, and noisy passages, cause large accuracy drops for some models.
- Membership-inference and prompt-injection attacks can expose private data even in black-box settings.
Core Entities
Models
- Llama2-7b
- Llama2-7b-chat
- Llama2-13b
- Llama2-13b-chat
- Baichuan2-7b-chat
- Baichuan2-13b-chat
- Qwen2-7b-instruct
- GLM-4-9b-chat
- GPT-3.5-turbo
- GPT-4o
Metrics
- F1
- Precision
- Recall
- Key-facts precision (NLI TRUE)
- Citation F1
- Privacy refusal rate
- Robustness % drop under noise
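Several of the metrics above have simple closed forms. The sketch below shows one plausible way to compute them; it is not the paper's scoring code. The whitespace tokenization, the set-based citation matching, and the keyword list used to detect refusals are all assumptions chosen for illustration.

```python
from collections import Counter
from typing import Sequence, Set, Tuple


def token_f1(prediction: str, reference: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over lowercased whitespace tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)


def citation_f1(cited: Set[str], gold: Set[str]) -> float:
    """F1 between the document IDs a model cited and the supporting IDs."""
    hits = len(cited & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(cited)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def relative_drop(clean_score: float, noisy_score: float) -> float:
    """Percentage of the clean score lost when noise is added."""
    if clean_score == 0:
        return 0.0
    return 100.0 * (clean_score - noisy_score) / clean_score


def refusal_rate(
    responses: Sequence[str],
    markers: Sequence[str] = ("i cannot", "i can't", "unable to share"),
) -> float:
    """Fraction of responses containing a refusal phrase (assumed markers)."""
    if not responses:
        return 0.0
    refused = sum(any(m in r.lower() for m in markers) for r in responses)
    return refused / len(responses)
```

Keyword-based refusal detection is crude and will miss paraphrased refusals, so a sample of flagged and unflagged responses should be checked by hand before the rate is trusted.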
Datasets
- HotpotQA
- RGB
- RECALL
- CrowS-Pairs
- Enron Email
Benchmarks
- RAGBench
- RGB
- RECALL

