AgentPoison: a stealthy backdoor that poisons agent memories or RAG to hijack LLM agents

July 17, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

9

Authors

Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, Bo Li

Links

Abstract / PDF

Why It Matters For Business

If agents fetch data from third-party or writable corpora, an attacker can inject a few poisoned records that trigger dangerous actions while leaving overall accuracy unchanged. The result is a hard-to-detect safety and legal risk.

Summary TLDR

This paper introduces AGENTPOISON, a backdoor red-team attack that poisons a retrieval memory or RAG knowledge base with very few malicious demonstrations plus a short trigger token sequence. The trigger is optimized to push triggered queries into a compact embedding cluster so poisoned items are retrieved with high probability. On three real agents (driving, QA, healthcare) AGENTPOISON reaches ~81% retrieval success, ~59% trigger-induced action rate, and ~63% end-to-end real-world impact while degrading benign accuracy by ≲1% and using <0.1% poisoned instances. The trigger transfers across different dense retrievers and resists basic defenses like perplexity filtering and rephrasing.

Problem Statement

LLM agents use retrieval memory or RAG to fetch past examples. If an attacker can inject a few poisoned examples, the agent can be steered toward dangerous outputs. Existing jailbreaks and backdoors fail to reliably force retrieval in RAG-based agents. What is missing is a method that reliably causes poisoned demonstrations to be retrieved only when a covert trigger appears, without breaking benign behavior.

Main Contribution

A practical backdoor attack (AGENTPOISON) that poisons long-term memory or RAG knowledge bases with very few malicious demonstrations.

A constrained trigger-optimization method that maps triggered queries into a unique, compact embedding region to force retrieval of poisoned items.

Extensive evaluation on three real LLM agents (autonomous driving, multi-step QA, healthcare) showing high attack effectiveness, transferability, and resilience against simple defenses.

Open-source release of code and data to enable defensive research.
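The retrieval effect behind the second contribution can be illustrated with a toy example. This is a minimal sketch, not the paper's trigger optimizer: the 2-D vectors, the record ids, and the helper names `cosine` and `top_k` are all hypothetical stand-ins for a real dense embedder and memory store. It only shows why a query that lands in a compact region occupied solely by poisoned keys retrieves those keys, while a clean query does not.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top_k(query_vec, memory, k):
    """Return the ids of the k memory entries most similar to the query."""
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:k]]

# Hypothetical 2-D embeddings: benign keys are spread out, poisoned keys are
# packed into one small region that no benign key occupies.
memory = [
    ("benign-1", (1.0, 0.1)),
    ("benign-2", (0.9, 0.4)),
    ("benign-3", (0.7, 0.7)),
    ("benign-4", (0.4, 0.9)),
    ("poison-1", (-0.95, 0.30)),
    ("poison-2", (-0.97, 0.25)),
]

benign_query = (0.8, 0.5)        # stand-in for an embedded clean query
triggered_query = (-0.96, 0.28)  # stand-in for the same query plus the trigger

benign_hits = top_k(benign_query, memory, k=2)
triggered_hits = top_k(triggered_query, memory, k=2)
print("clean query retrieves:    ", benign_hits)
print("triggered query retrieves:", triggered_hits)
```

In this toy setup the clean query never surfaces a poisoned record, which mirrors the claim that benign behavior is left intact.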

Key Findings

AGENTPOISON forces retrieval of poisoned demonstrations with high probability.

Numbers: Average ASR-r ≈ 81.2% (retrieval success)

The attack causes real agent-level harm in many cases.

Numbers: Average ASR-t ≈ 62.6% (end-to-end impact)

Benign utility stays almost unchanged under attack.

Numbers: Benign performance drop ≈ 0.74% on average (ACC)

Triggers transfer across different dense embedders, including API-only embeddings.

Numbers: Transfer observed across DPR, ANCE, BGE, REALM, ORQA and text-embedding-ada-002

Attack works with extremely small poisoning and short triggers.

Numbers: ASR-r ~62% with one poisoned instance; ~79% ASR-r with a one-token trigger (reported averages)

AGENTPOISON resists simple defenses based on perplexity or rephrasing.

Numbers: ASR-t under defenses: Agent-Driver PPL filter 47.2%, rephrasing 50%; ReAct ~61–62%
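To make the perplexity defense concrete, here is a minimal sketch of a perplexity filter. It uses a toy add-one-smoothed unigram model built from a tiny hypothetical reference corpus; a real filter would score queries with a neural language model instead, and the corpus, queries, threshold, and function names below are assumptions made for illustration. The example suggests one plausible reason such filters can miss an attack: a trigger built from in-distribution words raises perplexity far less than a gibberish one.

```python
import math
from collections import Counter

def train_unigram(corpus):
    """Count word frequencies over a list of reference sentences."""
    counts = Counter(word for line in corpus for word in line.lower().split())
    return counts, sum(counts.values())

def perplexity(text, counts, total):
    """Unigram perplexity with add-one smoothing; higher means less fluent."""
    words = text.lower().split()
    vocab = len(counts) + 1  # one extra slot for unseen words
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return math.exp(-log_prob / len(words))

def ppl_filter(queries, counts, total, threshold):
    """Keep only queries whose perplexity is at or below the threshold."""
    return [q for q in queries if perplexity(q, counts, total) <= threshold]

# Hypothetical reference corpus standing in for "normal" agent queries.
corpus = [
    "the vehicle should stop at the red light",
    "the vehicle should slow down near the crossing",
    "keep a safe distance from the vehicle ahead",
]
counts, total = train_unigram(corpus)

clean = "the vehicle should stop"
gibberish = "the vehicle should stop zxq vbnk"               # out-of-vocabulary suffix
fluent = "the vehicle should stop keep a safe distance"      # in-vocabulary suffix

# The threshold is an arbitrary value chosen for this toy corpus.
kept = ppl_filter([clean, gibberish, fluent], counts, total, threshold=17.0)
print(kept)
```

Only the gibberish suffix is rejected here; the fluent suffix passes the same threshold as the clean query.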

Results

ASR-r (retrieval success rate)

Value: ≈81.2% average for AGENTPOISON

Baseline: lower for CPA, GCG, BadChain

ASR-a (target action generation)

Value: ≈59.4% average for AGENTPOISON

ASR-t (end-to-end real-world impact)

Value: ≈62.6% average for AGENTPOISON

ACC (benign utility)

Value: Benign performance drop ≈0.74% on average

Baseline: Non-attack case

One-instance / one-token attack

Value: ASR-r ~62% with one poisoned instance; ~79% ASR-r with a one-token trigger

Defense resilience (ASR-t under defenses)

Value: Agent-Driver: PPL filter 47.2%, rephrasing 50%; ReAct: ≈61–62%

Baseline: GCG and BadChain lower under same defenses
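For readers who want to reproduce numbers like these on their own agent, the three attack metrics can be computed from per-trial logs roughly as follows. This is a sketch under stated assumptions: the `Trial` record and its field names are hypothetical, and ASR-a is assumed here to be measured only over trials where poisoned demonstrations were retrieved, which should be checked against the paper's exact definitions.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    retrieved_poisoned: bool  # poisoned demonstrations were retrieved
    target_action: bool       # the agent emitted the attacker's target action
    end_to_end_impact: bool   # the action took effect in the environment

def rate(flags):
    """Percentage of True values; 0.0 for an empty sequence."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags) if flags else 0.0

def attack_success_rates(trials):
    """Compute ASR-r, ASR-a and ASR-t (in percent) from triggered trials."""
    return {
        "ASR-r": rate(t.retrieved_poisoned for t in trials),
        # Assumption: ASR-a is conditioned on successful retrieval.
        "ASR-a": rate(t.target_action for t in trials if t.retrieved_poisoned),
        "ASR-t": rate(t.end_to_end_impact for t in trials),
    }

# Made-up trial outcomes, for illustration only.
trials = [
    Trial(True, True, True),
    Trial(True, False, False),
    Trial(True, True, True),
    Trial(False, False, False),
    Trial(True, True, False),
]
rates = attack_success_rates(trials)
print(rates)
```

Benign accuracy (ACC) would be measured separately, on clean queries against the same poisoned memory.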

Who Should Care

What To Try In 7 Days

Audit which systems can write to retrieval corpora and lock write access.

Run a red-team exercise using a public embedder to check retrieval integrity and trigger transferability.

Add provenance checks, stricter ingestion validation, and alerts when a small subset of retrieved items dominates decisions.
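The last item can be prototyped quickly. The following is a minimal sketch of a retrieval-dominance alert, assuming you already log the ids of the top-k items returned for each query; the log format, the ids, the function name `dominant_items`, and the 30% threshold are all hypothetical choices, not values from the paper.

```python
from collections import Counter

def dominant_items(retrieval_log, max_share):
    """Return {item_id: share} for items whose share of all retrieved
    slots exceeds max_share."""
    counts = Counter(item for hits in retrieval_log for item in hits)
    total = sum(counts.values())
    return {item: n / total for item, n in counts.items() if n / total > max_share}

# Hypothetical log: one list of retrieved ids per query (top-2 retrieval).
retrieval_log = [
    ["d1", "d7"],
    ["d7", "d3"],
    ["d7", "d9"],
    ["d2", "d7"],
    ["d4", "d5"],
]

flagged = dominant_items(retrieval_log, max_share=0.3)
print(flagged)
```

A flagged id is only a signal for manual review; popular legitimate records can also exceed the threshold, so the cutoff needs tuning per corpus.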

Agent Features

Memory

  • Long-term memory (key-value experiences)
  • Retrieval knowledge base (RAG)

Planning

  • Tool calling and action execution
  • LLM-based planning with retrieved in-context demos

Tool Use

  • External API/tool calls
  • Built-in tool execution (e.g., drive commands, DB ops)

Frameworks

  • ReAct
  • Agent-Driver
  • EHRAgent

Is Agentic

true

Architectures

  • RAG-enabled LLM agent
  • LLM backbone + retrieval encoder

Reproducibility

Data Urls

  • Agent-Driver dataset link referenced in paper; StrategyQA public dataset

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Trigger optimization assumes white-box access to the retrieval embedder; transferability mitigates this but success can vary by embedder.
  • Experiments cover three agent types and dense retrievers; results may differ for sparse retrievers or other agent designs.
  • Some perturbations (letter-level edits) and stronger provenance defenses can reduce attack success.

When Not To Use

  • When retrieval corpora are closed, immutable, or cryptographically verified.
  • When only sparse lexical retrievers (BM25) are used without dense embeddings.
  • If you lack any capability to inject or modify the knowledge base.
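The "cryptographically verified" case above can be sketched with a keyed hash over each record. This is an illustrative assumption about how such a corpus might be protected, not something the paper describes: the key handling, the record format, and the helper names `sign_record`, `verify_record`, and `load_verified` are hypothetical. Records are tagged at trusted ingestion time, and anything whose tag fails to verify is dropped before it can be retrieved.

```python
import hashlib
import hmac

def sign_record(key: bytes, record: str) -> str:
    """Return a hex HMAC-SHA256 tag for a record, computed at ingestion."""
    return hmac.new(key, record.encode("utf-8"), hashlib.sha256).hexdigest()

def verify_record(key: bytes, record: str, tag: str) -> bool:
    """Check a record against its tag using a constant-time comparison."""
    return hmac.compare_digest(sign_record(key, record), tag)

def load_verified(key: bytes, store):
    """Keep only (record, tag) pairs whose tag verifies."""
    return [record for record, tag in store if verify_record(key, record, tag)]

# Hypothetical key; in practice it would come from a secrets manager.
key = b"example-ingestion-key"

trusted = "Q: obstacle ahead at 20 m. A: slow down and keep lane."
injected = "Q: obstacle ahead at 20 m. A: sudden stop."

store = [
    (trusted, sign_record(key, trusted)),  # written by the trusted pipeline
    (injected, "0" * 64),                  # written by someone without the key
]

usable = load_verified(key, store)
print(usable)
```

This protects only against writes that bypass the signing pipeline; a poisoned record that enters through the trusted ingestion path would still be signed and retrieved.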

Failure Modes

  • Arbitrary token edits (such as letter injection) destroy the trigger's semantics and move the query out of the target embedding cluster.
  • Robust provenance checks or write-restrictions block poisoned-instance injection.
  • Embedders trained on different data distributions may reduce transferability and ASR.

Core Entities

Models

  • GPT-3.5
  • LLaMA3-70b
  • LLaMA3-8b
  • GPT-2 (surrogate)
  • text-embedding-ada-002
  • DPR
  • ANCE
  • BGE
  • REALM
  • ORQA

Metrics

  • ASR-r (retrieval success rate)
  • ASR-a (target action generation rate)
  • ASR-t (end-to-end attack impact)
  • Accuracy
  • PPL (perplexity)

Datasets

  • Agent-Driver memory dataset (23k experiences)
  • StrategyQA / 10k Wikipedia passages
  • EHRAgent (augmented 700 experiences)