A broad benchmark shows RAG systems remain vulnerable to data poisoning and current defenses only partially help

May 24, 2025 · 8 min

Overview

Production Readiness

0.3

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Baolei Zhang, Haoran Xin, Jiatong Li, Dongzhe Zhang, Minghong Fang, Zhuqing Liu, Lihai Nie, Zheli Liu

Links

Abstract / PDF

Why It Matters For Business

If your product augments an LLM with an open or large text store, attackers who can add or edit that store can steer answers or cause refusals; naive defenses leave gaps and some robust fixes reduce product quality.

Summary TLDR

This paper introduces RSB, a unified benchmark that measures how text-based data-poisoning attacks affect Retrieval-Augmented Generation (RAG) systems. It evaluates 13 poisoning methods and 7 defenses across five standard QA datasets and two larger "expanded" versions per dataset (15 datasets total). Main takeaways: many simple attacks achieve high success on original datasets; expanding the knowledge base with many correct, similar passages (EX-M/EX-L) sharply reduces most attack success; a few attacks optimized per poisoned text (e.g., CRAG variants) keep working on richer databases; defenses help only in narrow cases (DoS) and hybrid filtering (TrustRAG) trades strong defense for large QI

Problem Statement

RAG systems reduce hallucinations by adding retrieved context, but their text knowledge stores can be poisoned. There was no systematic, comparable benchmark to measure how different poisoning attacks and defenses behave across many datasets and RAG variants.

Main Contribution

RSB benchmark: centralized evaluation of 13 poisoning attacks and 7 defenses across 15 dataset variants (5 QA datasets + EX-M and EX-L expansions).

Large empirical study: end-to-end tests with multiple LLMs, retrievers, similarity metrics, and advanced RAG frameworks (sequential, branching, conditional, loop), plus multi-turn, multimodal, and agent settings.

Clear, actionable findings: expanded knowledge reduces many attacks; some attacks (budget/auxiliary-LLM optimized) remain effective; existing defenses have large blind spots.

Key Findings

Most poisoning attacks work well on original QA datasets.

Numbers: for example, BPI ASR = 0.94 on NQ (Table 2)

Attack success drops dramatically when the knowledge base is enriched with many correct, similar passages.

Numbers: BPRAG ASR falls from 0.62 (NQ) to 0.03 (NQ-EX-L) (Table 2)

Per-text optimized attacks remain the strongest in dense, information-rich databases.

Numbers: CRAG variants keep higher ASR than peer attacks on the expanded sets (EX-M/EX-L; Table 2)

Process-level defenses cut DoS-style attacks but fail against targeted poisoning; detection methods often miss crafted poisoned texts.

Numbers: JamOracle ASR on NQ drops from 0.87 to ~0.01 under InstructRAG, while many targeted attacks keep ASR >0.5 under process-

Hybrid filtering (TrustRAG) reduces many ASRs but harms utility by removing benign evidence.

Numbers: TrustRAG lowers targeted ASR to low values (Table 3) but often drops accuracy by >20% on expanded datasets (Appendix I,

Results

ASR (attack success rate)

Value: BPI ASR = 0.94 on NQ (original)

ASR (attack success rate)

Value: BPRAG ASR drops from 0.62 (NQ) to 0.03 (NQ-EX-L)

Baseline: NQ (original)

Defense effectiveness

Value: JamOracle ASR 0.87 -> ~0.01 with InstructRAG on NQ

Baseline: No defense ASR = 0.87

Defense trade-off

Value: TrustRAG reduces many ASRs to low values but often lowers accuracy by >20% on expanded sets

Baseline: No defense accuracy

Similarity metric impact

Value: Dot-product retrieval raises ASR relative to cosine for several attacks (e.g., BadRAG/Phantom perform much better under dot product)

Baseline: Cosine similarity
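
A toy illustration of why the similarity metric can matter: dot product is sensitive to embedding norm, while cosine is not. The vectors and passage names below are made up for illustration, and whether norm inflation is the mechanism behind the reported BadRAG/Phantom numbers is an assumption, not something this summary states.

```python
import math

def dot(a, b):
    """Unnormalized inner product: grows with vector length."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Inner product of unit vectors: ignores vector length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def top1(query, passages, score):
    """Return the name of the highest-scoring passage."""
    return max(passages, key=lambda name: score(query, passages[name]))

# Hypothetical 2-D embeddings (real retrievers use hundreds of dimensions).
query = [1.0, 0.0]
passages = {
    "benign": [0.9, 0.1],    # closely aligned with the query, ordinary norm
    "poisoned": [3.0, 3.0],  # less aligned, but much larger norm
}

top_dot = top1(query, passages, dot)     # the large-norm passage wins
top_cos = top1(query, passages, cosine)  # the better-aligned passage wins
print(top_dot, top_cos)
```

In this toy case the poisoned passage is retrieved under dot product and the benign one under cosine, which is the kind of gap worth re-measuring on your own retriever.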

Who Should Care

What To Try In 7 Days

Add redundant, relevant passages for high-risk queries (create EX-M style augmentation) and measure ASR drop.

Switch retriever scoring to cosine and re-evaluate attack surface (low-effort change with measurable effect).

Run TrustRAG or a hybrid filter in a staging environment to measure false-positive removals and utility loss before deploying widely.
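
The before/after measurements suggested above could be scripted along these lines. This is a minimal sketch: it assumes ASR is the fraction of attacked queries whose answer contains the attacker's target string, using plain substring matching. The benchmark's own scoring (for example, an LLM judge) may differ, and the function name and sample answers are hypothetical.

```python
def _normalize(text):
    """Lowercase and collapse whitespace so matching is less brittle."""
    return " ".join(text.lower().split())

def attack_success_rate(answers, target_answers):
    """Fraction of answers that contain the attacker's target string.

    Assumption: substring match stands in for the benchmark's judge.
    """
    if not answers or len(answers) != len(target_answers):
        raise ValueError("answers and target_answers must be non-empty and aligned")
    hits = sum(_normalize(t) in _normalize(a) for a, t in zip(answers, target_answers))
    return hits / len(answers)

# Hypothetical outputs for the same four poisoned queries.
targets = ["paris", "1999", "blue", "mars"]
answers_original_kb = ["It is Paris.", "In 1999.", "Green", "Mars"]
answers_augmented_kb = ["It is Lyon.", "In 2001.", "Green", "Mars"]

asr_before = attack_success_rate(answers_original_kb, targets)
asr_after = attack_success_rate(answers_augmented_kb, targets)
print(f"ASR before: {asr_before:.2f}, after: {asr_after:.2f}")
```

Running the same function on clean queries against gold answers gives a rough accuracy figure, which helps when checking how much utility a filter such as TrustRAG costs in staging.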

Agent Features

Memory

  • retrieval memory (knowledge DB)

Frameworks

  • multi-turn conversational RAG
  • multimodal RAG
  • RAG-based LLM agents

Architectures

  • sequential RAG
  • branching RAG
  • conditional RAG
  • loop RAG

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Benchmark focuses on offline, text-based knowledge databases; web-linked or live-index settings are not covered.
  • Default proxy and judge model is GPT-4o-mini; some attack results vary with stronger proxy LLMs.
  • No public code or datasets are linked in the paper, limiting turnkey reproducibility.
  • Evaluations prioritize injection-by-text; attacker models that alter retriever weights are not fully explored.

When Not To Use

  • If your RAG system retrieves live web pages and cites URLs, attack surface differs and RSB's offline assumptions may not apply.
  • If your system uses proprietary closed retrievers with no public embedding model, some attack settings (white-box) are unrealistic.

Failure Modes

  • Hybrid filters (TrustRAG) may remove all retrieved context and cause severe accuracy loss (high false positives).
  • Per-text optimized attacks (CRAG variants) can bypass redundancy defenses by maximizing individual poisoned-text impact.
  • Perplexity or embedding-norm detectors show high false negatives for well-crafted poisoned texts.
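
To make the last failure mode concrete, here is a sketch of a norm-based detector and one way it can miss a poisoned passage. The thresholding rule (mean plus k standard deviations over a clean reference set), the function names, and the vectors are illustrative assumptions, not the detector evaluated in the paper.

```python
import math
import statistics

def l2_norm(vec):
    """Euclidean length of an embedding vector."""
    return math.sqrt(sum(x * x for x in vec))

def fit_threshold(clean_embeddings, k=3.0):
    """Threshold = mean + k * std of norms over a trusted reference set."""
    norms = [l2_norm(v) for v in clean_embeddings]
    return statistics.mean(norms) + k * statistics.pstdev(norms)

def flag_by_norm(embeddings, threshold):
    """True where the embedding norm exceeds the threshold."""
    return [l2_norm(v) > threshold for v in embeddings]

# Hypothetical 2-D embeddings of trusted passages (norms close to 1).
clean = [[0.6, 0.8], [1.0, 0.1], [0.7, 0.7], [0.9, 0.5]]
threshold = fit_threshold(clean)

candidates = [
    [4.0, 3.0],  # norm-inflated poisoned passage: caught
    [0.8, 0.6],  # poisoned passage crafted with an ordinary norm: missed
]
flags = flag_by_norm(candidates, threshold)
print(flags)
```

The second candidate is a false negative: a detector keyed on norm alone has nothing to trigger on when the poisoned text embeds like ordinary content.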

Core Entities

Models

  • GPT-4o-mini
  • GPT-4
  • GPT-4.1
  • Claude-3.7-Sonnet
  • Gemini-2.0-flash

Metrics

  • ACC
  • ASR
  • F1-score

Datasets

  • Natural Questions (NQ)
  • HotpotQA
  • MS-MARCO
  • SQuAD
  • BoolQ
  • EX-M (medium expansion)
  • EX-L (large expansion)

Benchmarks

  • RSB (this paper's RAG Security Bench)

Context Entities

Models

  • Llama-4-Scout
  • DeepSeek-V3

Metrics

  • Perplexity-based detection (PPL)
  • Embedding norm detection (Norm)

Datasets

  • InfoSeek (multimodal evaluation set)