AgentPoison: a stealthy backdoor that poisons agent memories or RAG to hijack LLM agents

Overview

Decision Snapshot: Ready For Pilot

The method is practical: it needs no model retraining, works with <0.1% poisoning, transfers across embedders, and is validated on three real agents; evidence is solid but focused on dense retrievers and selected agent systems.

Citations: 9

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 70%

Authors

Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, Bo Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If agents fetch data from third-party or writable corpora, an attacker can inject a few poisoned records to trigger dangerous actions while leaving overall accuracy unchanged, creating a low-noise safety and legal risk.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

This paper introduces AGENTPOISON, a backdoor red-team attack that poisons a retrieval memory or RAG knowledge base with very few malicious demonstrations plus a short trigger token sequence. The trigger is optimized to push triggered queries into a compact embedding cluster so poisoned items are retrieved with high probability. On three real agents (driving, QA, healthcare) AGENTPOISON reaches ~81% retrieval success, ~59% trigger-induced action rate, and ~63% end-to-end real-world impact while degrading benign accuracy by ≲1% and using <0.1% poisoned instances. The trigger transfers across different dense retrievers and resists basic defenses like perplexity filtering and rephrasing.

Problem Statement

LLM agents use retrieval memory or RAG to fetch past examples. If an attacker can inject a few poisoned examples, the agent can be steered to dangerous outputs. Existing jailbreaks and backdoors fail to reliably force retrieval in RAG-based agents. The goal is a method that makes poisoned demonstrations reliably retrieved only when a covert trigger appears, without breaking benign behavior.

Main Contribution

A practical backdoor attack (AGENTPOISON) that poisons long-term memory or RAG knowledge bases with very few malicious demonstrations.

A constrained trigger-optimization method that maps triggered queries into a unique, compact embedding region to force retrieval of poisoned items.
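The "unique, compact embedding region" idea can be made concrete with two simple diagnostics that a defender or red team could compute on query embeddings. The sketch below is illustrative only: the function names, the toy 2-D vectors, and the choice of Euclidean distance are assumptions, and these measures are not the paper's exact loss terms.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def compactness(triggered):
    """Mean distance of triggered-query embeddings to their own centroid.
    Lower values mean a tighter cluster."""
    c = centroid(triggered)
    return sum(euclidean(v, c) for v in triggered) / len(triggered)

def uniqueness(triggered, benign):
    """Distance from the triggered centroid to the nearest benign embedding.
    Higher values mean the cluster sits in a more isolated region."""
    c = centroid(triggered)
    return min(euclidean(c, b) for b in benign)

# Toy 2-D embeddings (hypothetical values, for illustration only).
benign = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
triggered = [[5.0, 5.0], [5.0, 6.0], [6.0, 5.0], [6.0, 6.0]]

print("compactness:", compactness(triggered))
print("uniqueness:", uniqueness(triggered, benign))
```

A cluster of queries that is both unusually tight and unusually far from normal traffic is the pattern this contribution describes, so the same two numbers can serve as a rough anomaly signal when auditing query logs.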

Key Findings

AGENTPOISON forces retrieval of poisoned demonstrations with high probability.

Numbers: Average ASR-r ≈ 81.2% (retrieval success)

Practical Use: With only a tiny number of poisoned entries, attackers can reliably make agents fetch malicious examples whenever the trigger is present, so defenders must protect retrieval corpora.

Evidence Ref: Table 1; Sec. 4.2

The attack causes real agent-level harm in many cases.

Numbers: Average ASR-t ≈ 62.6% (end-to-end impact)

Practical Use: Poisoning memory can translate into physical or safety-critical failures (e.g., sudden stop in driving). Treat memory integrity as a security boundary.

Evidence Ref: Table 1; Sec. 4.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ASR-r (retrieval success rate)	≈81.2% average for AGENTPOISON	lower for CPA, GCG, BadChain	—	aggregate across three agents	Table 1; Sec. 4.2	Table 1
ASR-a (target action generation)	≈59.4% average for AGENTPOISON	—	—	aggregate across three agents	Sec. 4.2; Table 1	Table 1
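To show how the three rates could be tallied in an in-house red-team run, here is a minimal sketch. The `Trial` record and its boolean fields are hypothetical, and every rate uses all triggered trials as its denominator, which is an assumption; the paper's exact definitions may condition differently.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    """Outcome of one triggered query (hypothetical record layout)."""
    retrieved_poison: bool  # poisoned demonstration was retrieved
    target_action: bool     # agent produced the attacker's target action
    end_impact: bool        # the action had the intended end-to-end effect

def rate(trials, field):
    """Fraction of trials for which the given boolean field is True."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(getattr(t, field) for t in trials) / len(trials)

def attack_rates(trials):
    """Return ASR-r, ASR-a, and ASR-t as fractions of all triggered trials."""
    return {
        "ASR-r": rate(trials, "retrieved_poison"),
        "ASR-a": rate(trials, "target_action"),
        "ASR-t": rate(trials, "end_impact"),
    }

# Four toy trials (made-up outcomes, not results from the paper).
trials = [
    Trial(True, True, True),
    Trial(True, True, False),
    Trial(True, False, False),
    Trial(False, False, False),
]
print(attack_rates(trials))
```

Logging the three outcomes separately makes it clear where an attack chain breaks: at retrieval, at action generation, or at the final effect on the environment.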

What To Try In 7 Days

Audit which systems can write to retrieval corpora and lock write access.

Run a red-team exercise using a public embedder to check retrieval integrity and trigger transferability.

Add provenance checks, stricter ingestion validation, and alerts when a small subset of retrieved items dominates decisions.
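The alerting idea in the last item can be prototyped from retrieval logs alone. The sketch below assumes a hypothetical log format (one list of record IDs per query) and an arbitrary 50% threshold; both would need tuning for a real system, and a trigger-gated attack may stay below any global threshold unless logs are also sliced by query cluster or time window.

```python
from collections import Counter

def dominant_records(retrieval_log, share_threshold=0.5):
    """Return {record_id: share} for records that appear in the top-k
    results of more than `share_threshold` of the logged queries.

    retrieval_log: list of lists of record IDs, one inner list per query.
    """
    if not retrieval_log:
        return {}
    counts = Counter()
    for ids in retrieval_log:
        counts.update(set(ids))  # count each record at most once per query
    total = len(retrieval_log)
    return {rid: c / total for rid, c in counts.items() if c / total > share_threshold}

# Toy log: record "a" shows up in three of four queries.
log = [["a", "b"], ["a", "c"], ["a", "d"], ["e", "f"]]
print(dominant_records(log))
```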

Agent Features

Memory

Long-term memory (key-value experiences)

Retrieval knowledge base (RAG)

Planning

Tool calling and action execution

LLM-based planning with retrieved in-context demos

Tool Use

External API/tool calls

Built-in tool execution (e.g., drive commands, DB ops)

Frameworks

ReAct, Agent-Driver, EHRAgent

Is Agentic

Yes

Architectures

RAG-enabled LLM agent

LLM backbone + retrieval encoder

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/BillChan226/AgentPoison

Data URLs

Agent-Driver dataset link referenced in paper; StrategyQA public dataset

Risks & Boundaries

Limitations

Trigger optimization assumes white-box access to the retrieval embedder; transferability mitigates this but success can vary by embedder.

Experiments cover three agent types and dense retrievers; results may differ for sparse retrievers or other agent designs.

When Not To Use

When retrieval corpora are closed, immutable, or cryptographically verified.

When only sparse lexical retrievers (BM25) are used without dense embeddings.

Failure Modes

Arbitrary token edits to the trigger (e.g., letter injection) can destroy its semantics and move triggered queries out of the target embedding cluster.

Robust provenance checks or write-restrictions block poisoned-instance injection.
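One way to realize a provenance check at ingestion time is to accept only records that carry a valid keyed signature from a trusted writer. The following is a minimal sketch using Python's standard `hmac` module; the key handling, record format, and helper names are assumptions for illustration, not something described in the paper.

```python
import hashlib
import hmac

def sign_record(key: bytes, record: str) -> str:
    """Compute a hex HMAC-SHA256 tag for a record using a shared secret key."""
    return hmac.new(key, record.encode("utf-8"), hashlib.sha256).hexdigest()

def verify_record(key: bytes, record: str, tag: str) -> bool:
    """Check a record's tag in constant time."""
    return hmac.compare_digest(sign_record(key, record), tag)

def filter_verified(key: bytes, entries):
    """Keep only the records whose tag verifies.

    entries: iterable of (record, tag) pairs submitted for ingestion.
    """
    return [record for record, tag in entries if verify_record(key, record, tag)]

# Toy example: one properly signed record and one with a forged tag.
key = b"demo-key"  # hypothetical; a real key belongs in a secrets manager
good = "benign driving experience #1"
entries = [
    (good, sign_record(key, good)),
    ("injected demonstration", "00" * 32),
]
print(filter_verified(key, entries))
```

A signature only proves who wrote a record, not that its content is safe, so a check like this complements write restrictions and content validation rather than replacing them.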

Core Entities

Models

GPT3.5, LLaMA3-70b, LLaMA3-8b, gpt-2 (surrogate), text-embedding-ada-002, DPR, ANCE, BGE, REALM, ORQA

Metrics

ASR-r (retrieval success rate)

ASR-a (target action generation rate)

ASR-t (end-to-end attack impact)

Accuracy

PPL (perplexity)

Datasets

Agent-Driver memory dataset (23k experiences)

StrategyQA / 10k Wikipedia passages

EHRAgent (augmented 700 experiences)
