Overview
The framework and benchmark provide practical tooling and repeatable tests, but the defenses are preliminary and results vary by task and backbone; treat the reported numbers as indicative of common agent designs rather than as guarantees.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.88
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/3
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 30%
Novelty: 70%
Why It Matters For Business
Agent workflows can be covertly controlled without hurting visible task accuracy, so enterprises must monitor agent state and retrieval channels to avoid stealthy compromises.
Who Should Care
Summary TLDR
This paper introduces BackdoorAgent, a stage-aware framework and benchmark to study hidden (backdoor) attacks in LLM-based agents. It breaks agent workflows into planning, memory (retrieval), and tool stages, provides hooks and trajectory logging, and runs standardized attacks across four agent tasks (QA, Code, Web, Drive). Key findings: memory-channel backdoors are the most persistent; triggers often steer behavior while leaving task accuracy largely intact; simple token-probability detectors transfer poorly. Code and benchmark are public.
Problem Statement
LLM agents use multi-step workflows with persistent plans, memories, and tool feedback. That persistence widens the attack surface: a backdoor injected into one stage can survive, move across stages, and cause harmful long-horizon behavior. Prior work studies channels in isolation, so we lack a unified, agent-centric view and standardized tests to measure cross-stage propagation.
Main Contribution
BackdoorAgent: a modular, stage-aware runtime that instruments planning, memory, and tool interfaces and logs full trajectories.
A standardized benchmark with four tasks (Agent QA, Agent Code, Agent Web, Agent Drive) across language and multimodal settings.
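To make the idea of stage hooks and trajectory logging concrete, here is a minimal sketch. It is not BackdoorAgent's actual API: `StageRuntime`, `TrajectoryEvent`, `register`, and `emit` are hypothetical names, and the only thing taken from the summary is the split into planning, memory, and tool stages with a logged trajectory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Stage names follow the summary; everything else here is an assumption.
STAGES = ("planning", "memory", "tool")


@dataclass
class TrajectoryEvent:
    step: int      # position in the trajectory
    stage: str     # which stage produced the payload
    payload: str   # text after all hooks for that stage ran


@dataclass
class StageRuntime:
    """Hypothetical runtime: one hook list per stage plus a trajectory log."""
    hooks: Dict[str, List[Callable[[str], str]]] = field(
        default_factory=lambda: {s: [] for s in STAGES}
    )
    trajectory: List[TrajectoryEvent] = field(default_factory=list)

    def register(self, stage: str, hook: Callable[[str], str]) -> None:
        if stage not in self.hooks:
            raise ValueError(f"unknown stage: {stage}")
        self.hooks[stage].append(hook)

    def emit(self, stage: str, payload: str) -> str:
        """Pass a stage's output through its hooks, then log the result."""
        if stage not in self.hooks:
            raise ValueError(f"unknown stage: {stage}")
        for hook in self.hooks[stage]:
            payload = hook(payload)
        self.trajectory.append(
            TrajectoryEvent(len(self.trajectory), stage, payload)
        )
        return payload


runtime = StageRuntime()
# A hook on the memory stage stands in for an injected attack (or a monitor).
runtime.register("memory", lambda text: text + " [TRIGGER]")
runtime.emit("planning", "look up the file owner")
runtime.emit("memory", "owner is alice")
runtime.emit("tool", "share(file, alice)")

stages_logged = [event.stage for event in runtime.trajectory]
memory_payload = runtime.trajectory[1].payload
```

Because every stage output passes through `emit`, the same mechanism can host an attack, a detector, or plain logging, which is what makes per-stage comparisons possible.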
Key Findings
Memory-channel backdoors are the most persistent across agent families.
Planning and tool-stage triggers also persist but with lower and more variable rates.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Memory ASR (aggregated) | 77.97% | — | — | All tasks, aggregated (Table 5 top row) | Average memory-channel attack success rate across representative closed-source families | Table 5; Abstract |
| Planning ASR (aggregated) | 43.58% | — | — | All tasks, aggregated (Table 5 top row) | Average planning-channel attack success rate across representative closed-source families | Table 5; Abstract |
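For readers who want to reproduce an aggregated ASR figure on their own runs, the sketch below shows one way to compute it. It assumes the common definition of attack success rate (the percentage of triggered episodes in which the attacker's target behavior occurs) and an unweighted average across groups; the paper's exact aggregation is not stated in this summary. The outcomes are toy values, not results from the paper.

```python
from typing import Dict, List


def attack_success_rate(outcomes: List[bool]) -> float:
    """Percentage of triggered episodes where the attack succeeded."""
    if not outcomes:
        raise ValueError("need at least one episode")
    return 100.0 * sum(outcomes) / len(outcomes)


def macro_average(per_group: Dict[str, List[bool]]) -> float:
    """Unweighted mean of per-group ASR (assumed aggregation scheme)."""
    rates = [attack_success_rate(v) for v in per_group.values()]
    return sum(rates) / len(rates)


# Toy outcomes for two hypothetical model families.
toy = {
    "family_a": [True, True, True, False],
    "family_b": [True, False],
}
asr_a = attack_success_rate(toy["family_a"])  # 75.0
aggregated = macro_average(toy)               # 62.5
```

A macro average weights every family equally regardless of how many episodes it contributed; pooling all episodes first would give a different number when group sizes differ.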
What To Try In 7 Days
Run BackdoorAgent on your agent with your top backbones and sample tasks.
Log full agent trajectories (plans, retrieved snippets, tool outputs) for a subset of runs.
Simulate poisoned memory entries to measure ASR and check recovery steps.
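The memory-poisoning experiment in the last step can be prototyped before wiring in a real agent. The sketch below is a toy stand-in under stated assumptions: the word-overlap retriever, the `TRIGGER` token, the `TARGET_ACTION` label, and the `toy_agent` function are all hypothetical, and the recovery step is simply purging entries that contain the trigger.

```python
from typing import List

TRIGGER = "zx-trigger"               # hypothetical trigger token
TARGET_ACTION = "send_to_attacker"   # hypothetical attacker goal


def retrieve(memory: List[str], query: str) -> str:
    """Toy retriever: return the entry sharing the most words with the query."""
    q = set(query.lower().split())
    return max(memory, key=lambda entry: len(q & set(entry.lower().split())))


def toy_agent(memory: List[str], query: str) -> str:
    """Stand-in agent: hijacked whenever the retrieved snippet has the trigger."""
    snippet = retrieve(memory, query)
    return TARGET_ACTION if TRIGGER in snippet else "answer_normally"


def measure_asr(memory: List[str], queries: List[str]) -> float:
    hits = sum(toy_agent(memory, q) == TARGET_ACTION for q in queries)
    return 100.0 * hits / len(queries)


clean_memory = [
    "quarterly report is stored in finance folder",
    "meeting notes are stored in team folder",
]
poisoned_entry = f"quarterly report location {TRIGGER} override instructions"
queries = [
    "where is the quarterly report",
    "quarterly report location",
    "where are meeting notes",
    "report location override",
]

poisoned_memory = clean_memory + [poisoned_entry]
asr_poisoned = measure_asr(poisoned_memory, queries)

# Recovery step: purge entries containing the known trigger, then re-measure.
recovered_memory = [m for m in poisoned_memory if TRIGGER not in m]
asr_recovered = measure_asr(recovered_memory, queries)
```

In a real run, `toy_agent` would be replaced by the agent under test and the purge by whatever memory-sanitization step is being evaluated; the before/after ASR comparison stays the same.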
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Optimization Features
Token Efficiency
Planning attacks are token-efficient; memory attacks incur higher token overhead as trigger strength increases.
Risks & Boundaries
Limitations
Defense evaluation is preliminary and focuses mainly on token-probability baselines.
Benchmark covers four representative tasks but not every production workflow.
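As an illustration of what a token-probability baseline can look like, the sketch below flags text whose mean negative log-likelihood exceeds a threshold calibrated on clean samples. This is an assumed, generic design, not the detectors evaluated in the paper; the function names, the `k` parameter, and the log-probabilities are all hypothetical, and in practice the log-probabilities would come from a language model.

```python
import math
from typing import List


def mean_nll(token_logprobs: List[float]) -> float:
    """Mean negative log-likelihood of a token sequence."""
    return -sum(token_logprobs) / len(token_logprobs)


def calibrate_threshold(clean_scores: List[float], k: float = 3.0) -> float:
    """Threshold = mean + k * standard deviation of clean scores (assumed rule)."""
    mu = sum(clean_scores) / len(clean_scores)
    var = sum((s - mu) ** 2 for s in clean_scores) / len(clean_scores)
    return mu + k * math.sqrt(var)


def is_suspicious(token_logprobs: List[float], threshold: float) -> bool:
    return mean_nll(token_logprobs) > threshold


# Toy per-token log-probabilities for three clean snippets.
clean = [
    [-0.5, -1.0, -0.6],
    [-0.8, -0.9, -0.7],
    [-0.4, -0.6, -0.5],
]
threshold = calibrate_threshold([mean_nll(x) for x in clean])

# A snippet with two very unlikely tokens, standing in for a rare trigger.
triggered = [-0.5, -9.0, -8.5, -0.6]
flag_triggered = is_suspicious(triggered, threshold)
flag_clean = is_suspicious(clean[0], threshold)
```

The threshold depends entirely on the clean samples used for calibration, which is one plausible reason such detectors transfer poorly: a threshold tuned on one task or backbone may not match the likelihood distribution of another.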
When Not To Use
When you need provable, formally verified security guarantees.
When your system lacks retrieval or multi-step planning (single-turn LLMs).
Failure Modes
Backdoors remain stealthy by preserving task-level accuracy.
Delayed activation: triggers can appear many steps before harmful outputs.
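Delayed activation can be quantified from a logged trajectory by measuring the gap between the step where a trigger first appears and the step where harmful output first follows. The sketch below assumes a trajectory is a list of `(stage, text)` pairs and that both the trigger and the harmful action can be matched by substring; these are simplifications, and the names are hypothetical.

```python
from typing import List, Optional, Tuple


def activation_delay(
    trajectory: List[Tuple[str, str]],
    trigger: str,
    harmful_marker: str,
) -> Optional[int]:
    """Steps between first trigger sighting and first later harmful output.

    Returns None if the trigger never appears or no harmful output follows it.
    """
    trigger_step = next(
        (i for i, (_, text) in enumerate(trajectory) if trigger in text), None
    )
    if trigger_step is None:
        return None
    harmful_step = next(
        (
            i
            for i, (_, text) in enumerate(trajectory)
            if i >= trigger_step and harmful_marker in text
        ),
        None,
    )
    if harmful_step is None:
        return None
    return harmful_step - trigger_step


# Toy trajectory: the trigger enters via memory, the harm shows up at a tool call.
toy_trajectory = [
    ("memory", "note zx-trigger"),
    ("planning", "step 1"),
    ("planning", "step 2"),
    ("tool", "delete_all()"),
]
delay = activation_delay(toy_trajectory, "zx-trigger", "delete_all")  # 3
```

A large delay means that checking only the step immediately before a harmful action would miss the cause, which is why full-trajectory logging matters for this failure mode.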

