Prompt-based attacks can make LLM agents loop or run wrong benign actions; some attacks hit >80% failure rates

Overview

Decision Snapshot: Needs Validation

Results combine large-scale emulator runs and two implemented agents; emulator approximations limit direct transfer to every production setup, but attack trends and defenses are well supported.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 8/9

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 30%

Novelty: 60%

Authors

Boyang Zhang, Yicong Tan, Yun Shen, Ahmed Salem, Michael Backes, Savvas Zannettou, Yang Zhang

Links

Abstract / PDF

Why It Matters For Business

Agents can be disabled or misused without obvious malicious text; prompt-injection can cause outages, wasted compute, or automated spamming and is hard to detect by LLM self-checks alone.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead

Summary TLDR

The paper defines a class of attacks that induce logic malfunctions in autonomous LLM agents (infinite loops or incorrect but benign actions). Using an LLM-based agent emulator plus two implemented agents (Gmail, CSV), the authors show prompt-injection style attacks raise failure rates from ~15% baseline to ~59% on average and up to 88% on some cores. Adversarial text perturbations and adversarial demonstrations are largely ineffective. Multi-agent setups let an infected agent propagate malfunctions to others (≈80% in tested chains). Simple LLM self-examination detects overt harmful prompts but largely fails to flag these malfunction attacks. Practical fixes need external guards, input sanit

Problem Statement

LLM agents can act in the world via tools and thus have new attack surfaces. Existing red-teaming focuses on overtly harmful outputs, not on attacks that quietly make agents malfunction (repeat actions, run irrelevant functions). The paper asks: how fragile are agents to attacks that amplify natural instability, and can built-in LLM self-checks detect them?

Main Contribution

Define a new attack class that forces agent malfunctions (infinite loops or incorrect benign actions).

Large-scale emulator study: 144 test cases across 36 toolkits (>300 tools) plus two implemented agents (Gmail, CSV).

Key Findings

Prompt-injection infinite-loop attacks raise failure rate substantially.

Numbers: Baseline 15.3% → Infinite loop ASR 59.4%

Practical Use: Treat user inputs as high-risk. Sanitize and block injected instructions before passing them to the agent.

Evidence Ref: Table 1

Effectiveness depends on core model; some cores are especially vulnerable.

Numbers: Prompt injection ASR: GPT-3.5 59.4%, GPT-4 32.1%, Claude-2 88.1%

Practical Use: Model choice reduces but does not eliminate risk; rely on system-level checks, not only a stronger LLM.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Baseline failure rate (no attack)	15.3% (emulator average)	—	—	Emulator suite	Reported baseline failure rate across emulator tests	Table 1
Infinite-loop prompt injection ASR (emulator)	59.4% average (emulator)	15.3%	+44.1 pp	Emulator suite	Prompt injection infinite-loop attack raised failure rate to 59.4%	Table 1
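
The Delta column is an absolute difference in percentage points, not a relative change. A minimal sketch of that arithmetic, using only the two rates reported above (the helper names `delta_pp` and `relative_increase` are illustrative, not from the paper):

```python
def delta_pp(attack_rate: float, baseline_rate: float) -> float:
    """Absolute change in percentage points, rounded to one decimal."""
    return round(attack_rate - baseline_rate, 1)


def relative_increase(attack_rate: float, baseline_rate: float) -> float:
    """How many times larger the attacked failure rate is than the baseline."""
    return attack_rate / baseline_rate


BASELINE = 15.3  # emulator baseline failure rate, in percent (Table 1)
LOOP_ASR = 59.4  # infinite-loop prompt injection ASR, in percent (Table 1)

print(delta_pp(LOOP_ASR, BASELINE))                     # 44.1
print(round(relative_increase(LOOP_ASR, BASELINE), 2))  # 3.88
```

Read this way, the attack adds 44.1 percentage points, which is roughly a 3.9x increase over the baseline failure rate.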

What To Try In 7 Days

Treat all external text (user inputs, API outputs, files) as untrusted and add strict parsing/whitelisting.

Add a non-LLM runtime guard that checks action requests before execution (rate limits, action whitelists, confirmation flows).

Run the provided emulator tests or simple internal harness to reproduce prompt-injection failure modes on your agents.
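
The non-LLM runtime guard suggested above can be sketched as a small gate that runs before every tool call. This is an illustrative sketch under stated assumptions, not the authors' defense: the class name `ActionGuard`, its parameters, and the action names are hypothetical, and a real deployment would also need confirmation flows and per-tool argument checks.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class ActionGuard:
    """Hypothetical gate: allowlist check plus a sliding-window rate limit."""
    allowed_actions: frozenset
    max_calls: int = 5            # calls permitted per window
    window_seconds: float = 60.0  # length of the sliding window
    _calls: deque = field(default_factory=deque, init=False, repr=False)

    def check(self, action: str, now: Optional[float] = None) -> Tuple[bool, str]:
        """Return (allowed, reason) for a requested action."""
        now = time.monotonic() if now is None else now
        if action not in self.allowed_actions:
            return False, f"action '{action}' is not on the allowlist"
        # Drop timestamps that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.window_seconds:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            return False, "rate limit exceeded"
        self._calls.append(now)
        return True, "ok"


guard = ActionGuard(frozenset({"send_email", "read_email"}), max_calls=2, window_seconds=10)
print(guard.check("delete_account", now=0.0))  # rejected: not on the allowlist
print(guard.check("send_email", now=0.0))      # allowed
print(guard.check("send_email", now=1.0))      # allowed
print(guard.check("send_email", now=2.0))      # rejected: rate limit exceeded
```

Because the gate is plain code rather than another LLM call, an injected instruction cannot talk it out of the limit; the trade-off is that the allowlist and thresholds must be maintained by hand.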

Agent Features

Memory

Conversation/history memory (short-term storage)

Planning

ReAct (stepwise reasoning + action selection)

Tool Use

APIs (Gmail, Twilio, WolframAlpha)

Python toolkits and file operations (CSV analysis)

Frameworks

LangChain

LM-based agent emulator

Is Agentic

Yes

Architectures

LLM core + planning + tools + memory

Collaboration

Multi-agent communication chains (agent-to-agent messages)

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Only two implemented agents (Gmail, CSV); emulator coverage does not replace full production integrations.

Evaluations use three closed-source LLM cores; open-source and other models not tested.

When Not To Use

If the agent already runs behind strict external authorization and non-LLM action gating

If agents never execute external or user-provided textual actions

Failure Modes

Infinite repeat loops of actions (resource exhaustion)

Execution of irrelevant but benign functions (spamming, wasted work)
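
The first failure mode, repeated identical actions, is the easier one to catch mechanically. A minimal sketch, assuming the agent exposes its trace as a sequence of hashable (tool, arguments) pairs; the function name `first_repeat_breach` and the threshold are hypothetical choices, not values from the paper:

```python
from collections import Counter
from typing import Hashable, Iterable, Optional


def first_repeat_breach(actions: Iterable[Hashable], max_repeats: int = 3) -> Optional[int]:
    """Return the index of the first step at which some identical action
    has occurred more than max_repeats times, or None if none does."""
    counts: Counter = Counter()
    for index, action in enumerate(actions):
        counts[action] += 1
        if counts[action] > max_repeats:
            return index
    return None


trace = [
    ("read_email", "id=1"),
    ("send_email", "to=a"),
    ("send_email", "to=a"),
    ("send_email", "to=a"),
    ("send_email", "to=a"),  # fourth identical call breaches max_repeats=3
]
print(first_repeat_breach(trace))      # 4
print(first_repeat_breach(trace[:4]))  # None
```

A counter like this does nothing for the second failure mode: an irrelevant but benign action looks like a normal call, so catching it requires knowing what the task was supposed to do.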

Core Entities

Models

GPT-3.5-Turbo

GPT-3.5-Turbo-16k

GPT-4

Claude-2

Metrics

Attack Success Rate (ASR)

Failure rate (task completion)

Anomaly detection rate (self-examination)

Datasets

Agent emulator test suite (144 test cases, 36 toolkits, >300 tools)

Case-study tasks: Gmail agent tasks

Case-study tasks: CSV agent tasks

