BackdoorAgent: a stage-aware framework and benchmark showing memory backdoors persist across multi-step LLM agents

Overview

Decision SnapshotNeeds Validation

The framework and benchmark provide practical tooling and repeatable tests, but defenses are preliminary and results vary by task and backbone; interpret numbers as indicative for common agent designs.

Citations0

Evidence Strength0.80

Confidence0.88

Risk Signals8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 30%

Novelty: 70%

Authors

Yunhao Feng, Yige Li, Yutao Wu, Yingshui Tan, Yanming Guo, Yifan Ding, Kun Zhai, Xingjun Ma, Yu-Gang Jiang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Agent workflows can be covertly controlled without hurting visible task accuracy, so enterprises must monitor agent state and retrieval channels to avoid stealthy compromises.

Who Should Care

CTO Product Manager ML Engineer Engineering Lead Data Scientist

Summary TLDR

This paper introduces BackdoorAgent, a stage-aware framework and benchmark to study hidden (backdoor) attacks in LLM-based agents. It breaks agent workflows into planning, memory (retrieval), and tool stages, provides hooks and trajectory logging, and runs standardized attacks across four agent tasks (QA, Code, Web, Drive). Key findings: memory-channel backdoors are the most persistent; triggers often steer behavior while leaving task accuracy largely intact; simple token-probability detectors transfer poorly. Code and benchmark are public.

Problem Statement

LLM agents use multi-step workflows with persistent plans, memories, and tool feedback. That persistence widens the attack surface: a backdoor injected into one stage can survive, move across stages, and cause harmful long-horizon behavior. Prior work studies channels in isolation, so we lack a unified, agent-centric view and standardized tests to measure cross-stage propagation.

Main Contribution

BackdoorAgent: a modular, stage-aware runtime that instruments planning, memory, and tool interfaces and logs full trajectories.

A standardized benchmark with four tasks (Agent QA, Agent Code, Agent Web, Agent Drive) across language and multimodal settings.

Key Findings

Memory-channel backdoors are the most persistent across agent families.

NumbersMemory ASR ≈ 77.97% (aggregated, Table 5 top row)

Practical UseTest and harden retrieval/memory layers first: poisoned retrieval snippets can repeatedly reinfect agent context and drive behavior.

Evidence RefTable 5; Abstract

Planning and tool-stage triggers also persist but with lower and more variable rates.

NumbersPlanning ASR ≈ 43.58%; Tools ASR ≈ 60.28% (aggregated, Table 5)

Practical UseAudit plan-generation and tool outputs; both can enable attacks, especially in closed-loop control tasks where feedback compounds.

Evidence RefTable 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Memory ASR (aggregated)	77.97%	—	—	All tasks, aggregated (Table 5 top row)	Average memory-channel attack success rate across representative closed-source families	Table 5; Abstract
Planning ASR (aggregated)	43.58%	—	—	All tasks, aggregated (Table 5 top row)	Average planning-channel attack success rate across representative closed-source families	Table 5; Abstract

What To Try In 7 Days

Run BackdoorAgent on your agent with your top backbones and sample tasks.

Log full agent trajectories (plans, retrieved snippets, tool outputs) for a subset of runs.

Simulate poisoned memory entries to measure ASR and check recovery steps.

Agent Features

Memory

retrieval / persistent memory storeretrieved snippets reinjected into context

Planning

explicit plan generationplanner traces written to context

Tool Use

external tool execution hookstool-output injection points

Frameworks

BackdoorAgent

Is Agentic

Yes

Architectures

LLM-based agentTool-augmented agentRAG-enabled agent

Optimization Features

Token Efficiency

Planning attacks are token-efficient; memory attacks incur higher token overhead as trigger strength

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusYes

LicenseUnknown

Code URLs

https://github.com/Yunhao-Feng/BackdoorAgent

Data URLs

https://github.com/Yunhao-Feng/BackdoorAgent (benchmark and loaders)

Risks & Boundaries

Limitations

Defense evaluation is preliminary and focuses mainly on token-probability baselines.

Benchmark covers four representative tasks but not every production workflow.

When Not To Use

When you need provable, formally verified security guarantees.

When your system lacks retrieval or multi-step planning (single-turn LLMs).

Failure Modes

Backdoors remain stealthy by preserving task-level accuracy.

Delayed activation: triggers can appear many steps before harmful outputs.

Core Entities

Models

gpt-4o-mini-0718-globalgpt-5-mini-0807-globalgpt-family (claude, gemini, gpt)qwen3-maxdeepseek-r1-671bdeepseek-v3.2-expkimi-k2qwen2.5-72b-instructqwen3-235b-a22bqwen3-vl-235b

Metrics

Clean ACCASR (Attack Success Rate)ACC under attack

Datasets

BackdoorAgent benchmark tasks: Agent QA, Agent Code, Agent Web, Agent Drive

Benchmarks

BackdoorAgent (stage-aware backdoor benchmark)

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Memory-channel backdoors are the most persistent across agent families.

Planning and tool-stage triggers also persist but with lower and more variable rates.

Results

What To Try In 7 Days

Agent Features

Optimization Features

Reproducibility

Code URLs

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

A process-aware, auditable multi-agent evaluator that produces more stable, human-aligned scores than a single LLM judge

Key finding

Create, customize, and run multi-step LLM agents from plain language — no code needed

Key finding

MLRC-Bench: a competition-based benchmark that tests if LLM agents can propose and implement novel ML research

Key finding

A closed-loop Sensing→Regulating→Correcting system that routes LLM execution by uncertainty to cut errors and API cost

Key finding

Use formal EDA feedback inside a multi-agent controller to improve Verilog generation without expensive fine-tuning.

Key finding