A broad benchmark shows RAG systems remain vulnerable to data poisoning and current defenses only partially help

May 24, 20258 min

Overview

Decision SnapshotNeeds Validation

The study is broad and uses multiple datasets, LLMs, retrievers, and defense variants; results are empirical and reproducible in concept, but exact code/data/thresholds are not provided so applying results requires re-implementation.

Citations0

Evidence Strength0.85

Confidence0.85

Risk Signals9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 30%

Novelty: 60%

Authors

Baolei Zhang, Haoran Xin, Jiatong Li, Dongzhe Zhang, Minghong Fang, Zhuqing Liu, Lihai Nie, Zheli Liu

Links

Abstract / PDF

Why It Matters For Business

If your product augments an LLM with an open or large text store, attackers who can add or edit that store can steer answers or cause refusals; naive defenses leave gaps and some robust fixes reduce product quality.

Who Should Care

Summary TLDR

This paper introduces RSB, a unified benchmark that measures how text-based data-poisoning attacks affect Retrieval-Augmented Generation (RAG) systems. It evaluates 13 poisoning methods and 7 defenses across five standard QA datasets and two larger "expanded" versions per dataset (15 datasets total). Main takeaways: many simple attacks achieve high success on original datasets; expanding the knowledge base with many correct, similar passages (EX-M/EX-L) sharply reduces most attack success; a few attacks optimized per poisoned text (e.g., CRAG variants) keep working on richer databases; defenses help only in narrow cases (DoS) and hybrid filtering (TrustRAG) trades strong defense for large QI

Problem Statement

RAG systems reduce hallucinations by adding retrieved context, but their text knowledge stores can be poisoned. There was no systematic, comparable benchmark to measure how different poisoning attacks and defenses behave across many datasets and RAG variants.

Main Contribution

RSB benchmark: centralized evaluation of 13 poisoning attacks and 7 defenses across 15 dataset variants (5 QA datasets + EX-M and EX-L expansions).

Large empirical study: end-to-end tests with multiple LLMs, retrievers, similarity metrics, and advanced RAG frameworks (sequential, branching, conditional, loop), plus multi-turn, multimodal, and agent settings.

Key Findings

Most poisoning attacks work well on original QA datasets.

NumbersExample: BPI ASR = 0.94 on NQ (Table 2)

Practical UseAssume an attacker who can inject texts into a standard knowledge DB can often push attacker-chosen answers; hardening is needed before exposing RAG to uncurated sources.

Evidence RefTable 2

Attack success drops dramatically when the knowledge base is enriched with many correct, similar passages.

NumbersBPRAG ASR falls from 0.62 (NQ) to 0.03 (NQ-EX-L) (Table 2)

Practical UseAdding redundant, high-quality supporting passages for queries is a cheap passive defense: fill DB with diverse correct texts to reduce poisoning impact.

Evidence RefTable 2; Appendix G

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
ASR (attack success rate)BPI ASR = 0.94 on NQ (original)NQ (original)Table 2: NQ rowTable 2
ASR (attack success rate)BPRAG ASR drops from 0.62 (NQ) to 0.03 (NQ-EX-L)NQ (original)-0.59 ASRNQ -> NQ-EX-LTable 2; Section 5.2.1Table 2

What To Try In 7 Days

Add redundant, relevant passages for high-risk queries (create EX-M style augmentation) and measure ASR drop.

Switch retriever scoring to cosine and re-evaluate attack surface (low-effort change with measurable effect).

Run TrustRAG or a hybrid filter in a staging environment to measure false-positive removals and utility loss before deploying widely.

Agent Features

Memory
retrieval memory (knowledge DB)
Frameworks
multi-turn conversational RAGmultimodal RAGRAG-based LLM agents
Architectures
sequential RAGbranching RAGconditional RAGloop RAG

Reproducibility

Code AvailableNo
Data AvailableNo
Open Source StatusUnknown
LicenseUnknown

Risks & Boundaries

Limitations

Benchmark focuses on offline, text-based knowledge databases; web-linked or live-index settings are not covered.

Default proxy and judge model is GPT-4o-mini; some attack results vary with stronger proxy LLMs.

When Not To Use

If your RAG system retrieves live web pages and cites URLs, attack surface differs and RSB's offline assumptions may not apply.

If your system uses proprietary closed retrievers with no public embedding model, some attack settings (white-box) are unrealistic.

Failure Modes

Hybrid filters (TrustRAG) may remove all retrieved context and cause severe accuracy loss (high false positives).

Per-text optimized attacks (CRAG variants) can bypass redundancy defenses by maximizing individual poisoned-text impact.

Core Entities

Models

GPT-4o-miniGPT-4GPT-4.1Claude-3.7-SonnetGemini-2.0-flash

Metrics

ACCASRF1-score

Datasets

Natural Questions (NQ)HotpotQAMS-MARCOSQuADBoolQEX-M (medium expansion)EX-L (large expansion)

Benchmarks

RSB (this paper's RAG Security Bench)

Context Entities

Models

Llama-4ScoutDeepSeek-V3

Metrics

Perplexity-based detection (PPL)Embedding norm detection (Norm)

Datasets

InfoSeek (multimodal evaluation set)