Overview
Production Readiness
0.3
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Agent workflows can be covertly controlled without hurting visible task accuracy, so enterprises must monitor agent state and retrieval channels to avoid stealthy compromises.
Summary TLDR
This paper introduces BackdoorAgent, a stage-aware framework and benchmark to study hidden (backdoor) attacks in LLM-based agents. It breaks agent workflows into planning, memory (retrieval), and tool stages, provides hooks and trajectory logging, and runs standardized attacks across four agent tasks (QA, Code, Web, Drive). Key findings: memory-channel backdoors are the most persistent; triggers often steer behavior while leaving task accuracy largely intact; simple token-probability detectors transfer poorly. Code and benchmark are public.
Problem Statement
LLM agents use multi-step workflows with persistent plans, memories, and tool feedback. That persistence widens the attack surface: a backdoor injected into one stage can survive, move across stages, and cause harmful long-horizon behavior. Prior work studies channels in isolation, so we lack a unified, agent-centric view and standardized tests to measure cross-stage propagation.
Main Contribution
BackdoorAgent: a modular, stage-aware runtime that instruments planning, memory, and tool interfaces and logs full trajectories.
A standardized benchmark with four tasks (Agent QA, Agent Code, Agent Web, Agent Drive) across language and multimodal settings.
Systematic evaluation across closed- and open-source backbones showing memory attacks are most persistent and that attacks can preserve task accuracy.
Public release of code and benchmark to enable reproducible trajectory-level study (GitHub link provided).
Key Findings
Memory-channel backdoors are the most persistent across agent families.
Planning and tool-stage triggers also persist but with lower and more variable rates.
High attack success can coexist with preserved or improved task accuracy.
Token-probability signals used for single-turn LLM backdoor detection transfer poorly to agents.
Closed-loop sequential tasks amplify small perturbations into large failures.
Results
Memory ASR (aggregated)
Planning ASR (aggregated)
Tools ASR (aggregated)
Who Should Care
What To Try In 7 Days
Run BackdoorAgent on your agent with your top backbones and sample tasks.
Log full agent trajectories (plans, retrieved snippets, tool outputs) for a subset of runs.
Simulate poisoned memory entries to measure ASR and check recovery steps.
Agent Features
Memory
- retrieval / persistent memory store
- retrieved snippets reinjected into context
Planning
- explicit plan generation
- planner traces written to context
Tool Use
- external tool execution hooks
- tool-output injection points
Frameworks
- BackdoorAgent
Is Agentic
true
Architectures
- LLM-based agent
- Tool-augmented agent
- RAG-enabled agent
Optimization Features
Token Efficiency
- Planning attacks are token-efficient; memory attacks incur higher token overhead as trigger strength
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Defense evaluation is preliminary and focuses mainly on token-probability baselines.
- Benchmark covers four representative tasks but not every production workflow.
- Some results rely on closed-source backbones, limiting full reproducibility for those rows.
When Not To Use
- When you need provable, formally verified security guarantees.
- When your system lacks retrieval or multi-step planning (single-turn LLMs).
Failure Modes
- Backdoors remain stealthy by preserving task-level accuracy.
- Delayed activation: triggers can appear many steps before harmful outputs.
- Cross-stage propagation: a single injected artifact can contaminate later planning and tools.
Core Entities
Models
- gpt-4o-mini-0718-global
- gpt-5-mini-0807-global
- gpt-family (claude, gemini, gpt)
- qwen3-max
- deepseek-r1-671b
- deepseek-v3.2-exp
- kimi-k2
- qwen2.5-72b-instruct
- qwen3-235b-a22b
- qwen3-vl-235b
Metrics
- Clean ACC
- ASR (Attack Success Rate)
- ACC under attack
Datasets
- BackdoorAgent benchmark tasks: Agent QA, Agent Code, Agent Web, Agent Drive
Benchmarks
- BackdoorAgent (stage-aware backdoor benchmark)

