Prompt-based attacks can make LLM agents loop or run wrong benign actions; some attacks hit >80% failure rates

July 30, 20249 min

Overview

Decision SnapshotNeeds Validation

Results combine large-scale emulator runs and two implemented agents; emulator approximations limit direct transfer to every production setup, but attack trends and defenses are well supported.

Citations3

Evidence Strength0.80

Confidence0.85

Risk Signals9

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 8/9

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 30%

Novelty: 60%

Authors

Boyang Zhang, Yicong Tan, Yun Shen, Ahmed Salem, Michael Backes, Savvas Zannettou, Yang Zhang

Links

Abstract / PDF

Why It Matters For Business

Agents can be disabled or misused without obvious malicious text; prompt-injection can cause outages, wasted compute, or automated spamming and is hard to detect by LLM self-checks alone.

Who Should Care

Summary TLDR

The paper defines a class of attacks that induce logic malfunctions in autonomous LLM agents (infinite loops or incorrect but benign actions). Using an LLM-based agent emulator plus two implemented agents (Gmail, CSV), the authors show prompt-injection style attacks raise failure rates from ~15% baseline to ~59% on average and up to 88% on some cores. Adversarial text perturbations and adversarial demonstrations are largely ineffective. Multi-agent setups let an infected agent propagate malfunctions to others (≈80% in tested chains). Simple LLM self-examination detects overt harmful prompts but largely fails to flag these malfunction attacks. Practical fixes need external guards, input sanit

Problem Statement

LLM agents can act in the world via tools and thus have new attack surfaces. Existing red-teaming focuses on overtly harmful outputs, not on attacks that quietly make agents malfunction (repeat actions, run irrelevant functions). The paper asks: how fragile are agents to attacks that amplify natural instability, and can built-in LLM self-checks detect them?

Main Contribution

Define a new attack class that forces agent malfunctions (infinite loops or incorrect benign actions).

Large-scale emulator study: 144 test cases across 36 toolkits (>300 tools) plus two implemented agents (Gmail, CSV).

Key Findings

Prompt-injection infinite-loop attacks raise failure rate substantially.

NumbersBaseline 15.3% → Infinite loop ASR 59.4%

Practical UseTreat user inputs as high-risk. Sanitize and block injected instructions before passing them to the agent.

Evidence RefTable 1

Effectiveness depends on core model; some cores are especially vulnerable.

NumbersPrompt injection ASR: GPT-3.5 59.4%, GPT-4 32.1%, Claude-2 88.1%

Practical UseModel choice reduces but does not eliminate risk; rely on system-level checks, not only a stronger LLM.

Evidence RefTable 2

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Baseline failure rate (no attack)15.3% (emulator average)Emulator suiteReported baseline failure rate across emulator testsTable 1
Infinite-loop prompt injection ASR (emulator)59.4% average (emulator)15.3%+44.1 ppEmulator suitePrompt injection infinite-loop attack raised failure rate to 59.4%Table 1

What To Try In 7 Days

Treat all external text (user inputs, API outputs, files) as untrusted and add strict parsing/whitelisting.

Add a non-LLM runtime guard that checks action requests before execution (rate limits, action whitelists, confirmation flows).

Run the provided emulator tests or simple internal harness to reproduce prompt-injection failure modes on your agents.

Agent Features

Memory
Conversation/history memory (short-term storage)
Planning
ReAct (stepwise reasoning + action selection)
Tool Use
APIs (Gmail, Twilio, WolframAlpha)Python toolkits and file operations (CSV analysis)
Frameworks
LangChainLM-based agent emulator
Is Agentic

Yes

Architectures
LLM core + planning + tools + memory
Collaboration
Multi-agent communication chains (agent-to-agent messages)

Reproducibility

Code AvailableNo
Data AvailableNo
Open Source StatusUnknown
LicenseUnknown

Risks & Boundaries

Limitations

Only two implemented agents (Gmail, CSV); emulator coverage does not replace full production integrations.

Evaluations use three closed-source LLM cores; open-source and other models not tested.

When Not To Use

If agent runs behind strict external authorization and non-LLM action gating

If agents never execute external or user-provided textual actions

Failure Modes

Infinite repeat loops of actions (resource exhaustion)

Execution of irrelevant but benign functions (spamming, wasted work)

Core Entities

Models

GPT-3.5-TurboGPT-3.5-Turbo-16kGPT-4Claude-2

Metrics

Attack Success Rate (ASR)Failure rate (task completion)Anomaly detection rate (self-examination)

Datasets

Agent emulator test suite (144 test cases, 36 toolkits, >300 tools)Case-study tasks: Gmail agent tasksCase-study tasks: CSV agent tasks