Overview
The attack is practical: it requires no model retraining, succeeds with fewer than 0.1% of corpus instances poisoned, transfers across embedders, and is validated on three real agents. The evidence is solid but limited to dense retrievers and a selected set of agent systems.
Citations: 9
Evidence Strength: 0.80
Confidence: 0.82
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 1/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 70%
Why It Matters For Business
If agents fetch data from third-party or writable corpora, an attacker can inject a few poisoned records to trigger dangerous actions while leaving overall accuracy unchanged, creating a low-noise safety and legal risk.
Summary TLDR
This paper introduces AGENTPOISON, a backdoor red-team attack that poisons a retrieval memory or RAG knowledge base with very few malicious demonstrations plus a short trigger token sequence. The trigger is optimized to push triggered queries into a compact embedding cluster so poisoned items are retrieved with high probability. On three real agents (driving, QA, healthcare) AGENTPOISON reaches ~81% retrieval success, ~59% trigger-induced action rate, and ~63% end-to-end real-world impact while degrading benign accuracy by ≲1% and using <0.1% poisoned instances. The trigger transfers across different dense retrievers and resists basic defenses like perplexity filtering and rephrasing.
Problem Statement
LLM agents use retrieval memory or RAG to fetch past examples. If an attacker can inject a few poisoned examples, the agent can be steered toward dangerous outputs. Existing jailbreaks and backdoors do not reliably force retrieval in RAG-based agents. The attack therefore needs a way to make poisoned demonstrations surface only when a covert trigger appears, without degrading benign behavior.
Main Contribution
A practical backdoor attack (AGENTPOISON) that poisons long-term memory or RAG knowledge bases with very few malicious demonstrations.
A constrained trigger-optimization method that maps triggered queries into a unique, compact embedding region to force retrieval of poisoned items.
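The idea of mapping triggered queries into a unique, compact embedding region can be illustrated with two toy measurements. The sketch below is not the paper's objective or its constrained optimizer: the function names, the Euclidean distance, and the `weight` parameter are all assumptions chosen for illustration, and the embeddings are hand-written 2-D vectors rather than outputs of a real retriever.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def compactness(triggered):
    """Mean distance of triggered-query embeddings to their own centroid.
    Lower means a tighter cluster."""
    c = centroid(triggered)
    return sum(euclidean(v, c) for v in triggered) / len(triggered)

def uniqueness(triggered, benign):
    """Mean distance from the triggered centroid to benign embeddings.
    Higher means the cluster sits further from normal traffic."""
    c = centroid(triggered)
    return sum(euclidean(c, b) for b in benign) / len(benign)

def trigger_score(triggered, benign, weight=1.0):
    """Hypothetical scalar score: reward separation, penalize spread.
    `weight` is an illustrative knob, not a value from the paper."""
    return uniqueness(triggered, benign) - weight * compactness(triggered)

# Toy 2-D "embeddings" (assumed data, for illustration only).
benign = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
tight_far = [[5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]      # compact and well separated
loose_near = [[0.5, 0.5], [3.0, 0.0], [0.0, 3.0]]     # spread out, overlaps benign

print(trigger_score(tight_far, benign))
print(trigger_score(loose_near, benign))
```

A candidate trigger that yields the first kind of cluster scores higher under this toy measure, which is the property that would let poisoned items placed in that region be retrieved for triggered queries while staying out of the way of benign ones.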
Key Findings
AGENTPOISON forces retrieval of poisoned demonstrations with high probability.
The attack causes real agent-level harm in many cases.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| ASR-r (retrieval success rate) | ≈81.2% average for AGENTPOISON | lower for CPA, GCG, BadChain | — | aggregate across three agents | Table 1; Sec. 4.2 | Table 1 |
| ASR-a (target action generation) | ≈59.4% average for AGENTPOISON | — | — | aggregate across three agents | Sec. 4.2; Table 1 | Table 1 |
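For readers who want to reproduce this kind of bookkeeping on their own red-team runs, the two rates in the table can be tallied from per-query outcomes. This is a minimal sketch under stated assumptions: the `Trial` record and its field names are hypothetical, and the choice of denominator for ASR-a (all triggered trials, or only those where a poisoned item was retrieved) is left as a flag because the summary above does not state which one the paper uses.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    """Outcome of one triggered query (hypothetical record layout)."""
    poisoned_retrieved: bool  # a poisoned demonstration was retrieved
    target_action: bool       # the agent produced the attacker's target action

def asr_r(trials):
    """Fraction of triggered queries that retrieved a poisoned item."""
    return sum(t.poisoned_retrieved for t in trials) / len(trials)

def asr_a(trials, condition_on_retrieval=False):
    """Fraction of trials that produced the target action.
    With condition_on_retrieval=True, only trials with a successful
    poisoned retrieval are counted in the denominator."""
    pool = [t for t in trials if t.poisoned_retrieved] if condition_on_retrieval else trials
    if not pool:
        return 0.0
    return sum(t.target_action for t in pool) / len(pool)

# Assumed toy outcomes, not data from the paper.
trials = [
    Trial(True, True),
    Trial(True, False),
    Trial(False, False),
    Trial(True, True),
]
print(asr_r(trials), asr_a(trials), asr_a(trials, condition_on_retrieval=True))
```

Reporting both denominators side by side avoids ambiguity when comparing results against published numbers.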
What To Try In 7 Days
Audit which systems can write to retrieval corpora and lock write access.
Run a red-team exercise with a public embedder to check retrieval integrity and trigger transferability.
Add provenance checks, stricter ingestion validation, and alerts when a small subset of retrieved items dominates decisions.
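The alerting idea in the last item could start from a simple concentration check over a retrieval log. The sketch below is one possible shape for it, not a vetted defense: the log format, the `share_threshold` and `min_queries` defaults, and the function name are assumptions. Because poisoned items are designed to surface only on triggered queries, a check like this would only fire once triggered traffic makes up a visible share of the window.

```python
from collections import Counter

def dominant_items(retrieval_log, share_threshold=0.2, min_queries=20):
    """Return {record_id: share} for records retrieved in at least
    `share_threshold` of the logged queries.

    retrieval_log: list of lists of record ids, one inner list per query
    (assumed format). Returns an empty dict when the window is too small
    to judge.
    """
    n = len(retrieval_log)
    if n < min_queries:
        return {}
    # Count each record at most once per query.
    counts = Counter(rid for retrieved in retrieval_log for rid in set(retrieved))
    return {rid: c / n for rid, c in counts.items() if c / n >= share_threshold}

# Assumed toy log: record "p1" appears in half of 20 queries.
log = [[f"r{i}", "p1"] if i < 10 else [f"r{i}"] for i in range(20)]
print(dominant_items(log))
```

Flagged records are candidates for manual provenance review rather than automatic removal, since legitimately popular records will also show up.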
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Risks & Boundaries
Limitations
Trigger optimization assumes white-box access to the retrieval embedder; transferability mitigates this but success can vary by embedder.
Experiments cover three agent types and dense retrievers; results may differ for sparse retrievers or other agent designs.
When Not To Use
When retrieval corpora are closed, immutable, or cryptographically verified.
When only sparse lexical retrievers (BM25) are used without dense embeddings.
Failure Modes
Arbitrary token edits (letter injection) destroy the trigger's semantics and move triggered queries out of the target embedding cluster.
Robust provenance checks or write-restrictions block poisoned-instance injection.
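One way a provenance check might look at ingestion time is to require a keyed signature on every record before it enters the retrieval corpus. This is a minimal sketch using HMAC-SHA256 from the Python standard library; the record format, the helper names, and the key handling are assumptions, and a real deployment would need proper key management and possibly asymmetric signatures.

```python
import hashlib
import hmac

def sign_record(key: bytes, record: str) -> str:
    """Hex HMAC-SHA256 tag for a record, computed by a trusted writer."""
    return hmac.new(key, record.encode("utf-8"), hashlib.sha256).hexdigest()

def verify_record(key: bytes, record: str, tag: str) -> bool:
    """Constant-time check that the tag matches the record."""
    return hmac.compare_digest(sign_record(key, record), tag)

def filter_verified(key: bytes, entries):
    """Keep only records whose tag verifies.
    entries: iterable of (record, tag) pairs (assumed format)."""
    return [record for record, tag in entries if verify_record(key, record, tag)]

# Illustrative key and records only; never hard-code keys in practice.
key = b"demo-key"
good = "stop at the red light"
entries = [
    (good, sign_record(key, good)),
    ("injected demonstration", "0" * 64),  # unsigned or forged entry
]
print(filter_verified(key, entries))
```

Records written by anyone without the key fail verification and never reach the index, which is the condition under which injecting poisoned instances is blocked.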

