BackdoorAgent: a stage-aware framework and benchmark showing memory backdoors persist across multi-step LLM agents

January 8, 20267 min

Overview

Decision SnapshotNeeds Validation

The framework and benchmark provide practical tooling and repeatable tests, but defenses are preliminary and results vary by task and backbone; interpret numbers as indicative for common agent designs.

Citations0

Evidence Strength0.80

Confidence0.88

Risk Signals8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 30%

Novelty: 70%

Authors

Yunhao Feng, Yige Li, Yutao Wu, Yingshui Tan, Yanming Guo, Yifan Ding, Kun Zhai, Xingjun Ma, Yu-Gang Jiang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Agent workflows can be covertly controlled without hurting visible task accuracy, so enterprises must monitor agent state and retrieval channels to avoid stealthy compromises.

Who Should Care

Summary TLDR

This paper introduces BackdoorAgent, a stage-aware framework and benchmark to study hidden (backdoor) attacks in LLM-based agents. It breaks agent workflows into planning, memory (retrieval), and tool stages, provides hooks and trajectory logging, and runs standardized attacks across four agent tasks (QA, Code, Web, Drive). Key findings: memory-channel backdoors are the most persistent; triggers often steer behavior while leaving task accuracy largely intact; simple token-probability detectors transfer poorly. Code and benchmark are public.

Problem Statement

LLM agents use multi-step workflows with persistent plans, memories, and tool feedback. That persistence widens the attack surface: a backdoor injected into one stage can survive, move across stages, and cause harmful long-horizon behavior. Prior work studies channels in isolation, so we lack a unified, agent-centric view and standardized tests to measure cross-stage propagation.

Main Contribution

BackdoorAgent: a modular, stage-aware runtime that instruments planning, memory, and tool interfaces and logs full trajectories.

A standardized benchmark with four tasks (Agent QA, Agent Code, Agent Web, Agent Drive) across language and multimodal settings.

Key Findings

Memory-channel backdoors are the most persistent across agent families.

NumbersMemory ASR ≈ 77.97% (aggregated, Table 5 top row)

Practical UseTest and harden retrieval/memory layers first: poisoned retrieval snippets can repeatedly reinfect agent context and drive behavior.

Evidence RefTable 5; Abstract

Planning and tool-stage triggers also persist but with lower and more variable rates.

NumbersPlanning ASR ≈ 43.58%; Tools ASR ≈ 60.28% (aggregated, Table 5)

Practical UseAudit plan-generation and tool outputs; both can enable attacks, especially in closed-loop control tasks where feedback compounds.

Evidence RefTable 5

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Memory ASR (aggregated)77.97%All tasks, aggregated (Table 5 top row)Average memory-channel attack success rate across representative closed-source familiesTable 5; Abstract
Planning ASR (aggregated)43.58%All tasks, aggregated (Table 5 top row)Average planning-channel attack success rate across representative closed-source familiesTable 5; Abstract

What To Try In 7 Days

Run BackdoorAgent on your agent with your top backbones and sample tasks.

Log full agent trajectories (plans, retrieved snippets, tool outputs) for a subset of runs.

Simulate poisoned memory entries to measure ASR and check recovery steps.

Agent Features

Memory
retrieval / persistent memory storeretrieved snippets reinjected into context
Planning
explicit plan generationplanner traces written to context
Tool Use
external tool execution hookstool-output injection points
Frameworks
BackdoorAgent
Is Agentic

Yes

Architectures
LLM-based agentTool-augmented agentRAG-enabled agent

Optimization Features

Token Efficiency

Planning attacks are token-efficient; memory attacks incur higher token overhead as trigger strength

Reproducibility

Code AvailableYes
Data AvailableYes
Open Source StatusYes
LicenseUnknown

Risks & Boundaries

Limitations

Defense evaluation is preliminary and focuses mainly on token-probability baselines.

Benchmark covers four representative tasks but not every production workflow.

When Not To Use

When you need provable, formally verified security guarantees.

When your system lacks retrieval or multi-step planning (single-turn LLMs).

Failure Modes

Backdoors remain stealthy by preserving task-level accuracy.

Delayed activation: triggers can appear many steps before harmful outputs.

Core Entities

Models

gpt-4o-mini-0718-globalgpt-5-mini-0807-globalgpt-family (claude, gemini, gpt)qwen3-maxdeepseek-r1-671bdeepseek-v3.2-expkimi-k2qwen2.5-72b-instructqwen3-235b-a22bqwen3-vl-235b

Metrics

Clean ACCASR (Attack Success Rate)ACC under attack

Datasets

BackdoorAgent benchmark tasks: Agent QA, Agent Code, Agent Web, Agent Drive

Benchmarks

BackdoorAgent (stage-aware backdoor benchmark)