BackdoorAgent: a stage-aware framework and benchmark showing memory backdoors persist across multi-step LLM agents

January 8, 20267 min

Overview

Production Readiness

0.3

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Yunhao Feng, Yige Li, Yutao Wu, Yingshui Tan, Yanming Guo, Yifan Ding, Kun Zhai, Xingjun Ma, Yu-Gang Jiang

Links

Abstract / PDF

Why It Matters For Business

Agent workflows can be covertly controlled without hurting visible task accuracy, so enterprises must monitor agent state and retrieval channels to avoid stealthy compromises.

Summary TLDR

This paper introduces BackdoorAgent, a stage-aware framework and benchmark to study hidden (backdoor) attacks in LLM-based agents. It breaks agent workflows into planning, memory (retrieval), and tool stages, provides hooks and trajectory logging, and runs standardized attacks across four agent tasks (QA, Code, Web, Drive). Key findings: memory-channel backdoors are the most persistent; triggers often steer behavior while leaving task accuracy largely intact; simple token-probability detectors transfer poorly. Code and benchmark are public.

Problem Statement

LLM agents use multi-step workflows with persistent plans, memories, and tool feedback. That persistence widens the attack surface: a backdoor injected into one stage can survive, move across stages, and cause harmful long-horizon behavior. Prior work studies channels in isolation, so we lack a unified, agent-centric view and standardized tests to measure cross-stage propagation.

Main Contribution

BackdoorAgent: a modular, stage-aware runtime that instruments planning, memory, and tool interfaces and logs full trajectories.

A standardized benchmark with four tasks (Agent QA, Agent Code, Agent Web, Agent Drive) across language and multimodal settings.

Systematic evaluation across closed- and open-source backbones showing memory attacks are most persistent and that attacks can preserve task accuracy.

Public release of code and benchmark to enable reproducible trajectory-level study (GitHub link provided).

Key Findings

Memory-channel backdoors are the most persistent across agent families.

NumbersMemory ASR ≈ 77.97% (aggregated, Table 5 top row)

Planning and tool-stage triggers also persist but with lower and more variable rates.

NumbersPlanning ASR ≈ 43.58%; Tools ASR ≈ 60.28% (aggregated, Table 5)

High attack success can coexist with preserved or improved task accuracy.

NumbersExample: AgentPoison ASR 88.34% with ACC 75.89% vs clean ACC 78.45% (Code, qwen2.5-72b)

Token-probability signals used for single-turn LLM backdoor detection transfer poorly to agents.

NumbersProbability-based detector yields only modest separability (AUROC modest in Fig.5)

Closed-loop sequential tasks amplify small perturbations into large failures.

NumbersBadChain and AdvAgent often exceed 90% ASR in Agent Drive on open-source backbones

Results

Memory ASR (aggregated)

Value77.97%

Planning ASR (aggregated)

Value43.58%

Tools ASR (aggregated)

Value60.28%

Who Should Care

What To Try In 7 Days

Run BackdoorAgent on your agent with your top backbones and sample tasks.

Log full agent trajectories (plans, retrieved snippets, tool outputs) for a subset of runs.

Simulate poisoned memory entries to measure ASR and check recovery steps.

Agent Features

Memory

  • retrieval / persistent memory store
  • retrieved snippets reinjected into context

Planning

  • explicit plan generation
  • planner traces written to context

Tool Use

  • external tool execution hooks
  • tool-output injection points

Frameworks

  • BackdoorAgent

Is Agentic

true

Architectures

  • LLM-based agent
  • Tool-augmented agent
  • RAG-enabled agent

Optimization Features

Token Efficiency

  • Planning attacks are token-efficient; memory attacks incur higher token overhead as trigger strength

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Defense evaluation is preliminary and focuses mainly on token-probability baselines.
  • Benchmark covers four representative tasks but not every production workflow.
  • Some results rely on closed-source backbones, limiting full reproducibility for those rows.

When Not To Use

  • When you need provable, formally verified security guarantees.
  • When your system lacks retrieval or multi-step planning (single-turn LLMs).

Failure Modes

  • Backdoors remain stealthy by preserving task-level accuracy.
  • Delayed activation: triggers can appear many steps before harmful outputs.
  • Cross-stage propagation: a single injected artifact can contaminate later planning and tools.

Core Entities

Models

  • gpt-4o-mini-0718-global
  • gpt-5-mini-0807-global
  • gpt-family (claude, gemini, gpt)
  • qwen3-max
  • deepseek-r1-671b
  • deepseek-v3.2-exp
  • kimi-k2
  • qwen2.5-72b-instruct
  • qwen3-235b-a22b
  • qwen3-vl-235b

Metrics

  • Clean ACC
  • ASR (Attack Success Rate)
  • ACC under attack

Datasets

  • BackdoorAgent benchmark tasks: Agent QA, Agent Code, Agent Web, Agent Drive

Benchmarks

  • BackdoorAgent (stage-aware backdoor benchmark)