Prompt-based attacks can make LLM agents loop or run wrong benign actions; some attacks hit >80% failure rates

July 30, 20249 min

Overview

Production Readiness

0.3

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

3

Authors

Boyang Zhang, Yicong Tan, Yun Shen, Ahmed Salem, Michael Backes, Savvas Zannettou, Yang Zhang

Links

Abstract / PDF

Why It Matters For Business

Agents can be disabled or misused without obvious malicious text; prompt-injection can cause outages, wasted compute, or automated spamming and is hard to detect by LLM self-checks alone.

Summary TLDR

The paper defines a class of attacks that induce logic malfunctions in autonomous LLM agents (infinite loops or incorrect but benign actions). Using an LLM-based agent emulator plus two implemented agents (Gmail, CSV), the authors show prompt-injection style attacks raise failure rates from ~15% baseline to ~59% on average and up to 88% on some cores. Adversarial text perturbations and adversarial demonstrations are largely ineffective. Multi-agent setups let an infected agent propagate malfunctions to others (≈80% in tested chains). Simple LLM self-examination detects overt harmful prompts but largely fails to flag these malfunction attacks. Practical fixes need external guards, input sanit

Problem Statement

LLM agents can act in the world via tools and thus have new attack surfaces. Existing red-teaming focuses on overtly harmful outputs, not on attacks that quietly make agents malfunction (repeat actions, run irrelevant functions). The paper asks: how fragile are agents to attacks that amplify natural instability, and can built-in LLM self-checks detect them?

Main Contribution

Define a new attack class that forces agent malfunctions (infinite loops or incorrect benign actions).

Large-scale emulator study: 144 test cases across 36 toolkits (>300 tools) plus two implemented agents (Gmail, CSV).

Show prompt injection is highly effective; perturbation and demonstration attacks are weak.

Show multi-agent chains let malfunctions spread; LLM self-examination is a weak defense against these attacks.

Key Findings

Prompt-injection infinite-loop attacks raise failure rate substantially.

NumbersBaseline 15.3% → Infinite loop ASR 59.4%

Effectiveness depends on core model; some cores are especially vulnerable.

NumbersPrompt injection ASR: GPT-3.5 59.4%, GPT-4 32.1%, Claude-2 88.1%

Adversarial text perturbations and paraphrases had low impact.

NumbersGCG overall ASR 15.5%; SCPN & VIPER ≈14–15%

Adversarial demonstrations failed to manipulate agents in tested cases.

NumbersAdversarial demonstration success: 0% on case-study tests

Intermediate outputs and tool responses can be attack surfaces, with behavior varying by agent type.

NumbersGmail external incorrect-function ASR 60% vs Gmail user-input 75%; CSV external ASR 0%

Multi-agent chains can propagate malfunctions from robust agents to fragile ones.

NumbersCSV→Gmail chain ASR ≈80% on tested cases

LLM self-examination defenses detect obvious harmful prompts but fail on malfunction attacks.

NumbersPolicy self-exam: Direct harm 90–100% detected; Infinite loop & incorrect function 0% detected (Table 6)

Results

Baseline failure rate (no attack)

Value15.3% (emulator average)

Infinite-loop prompt injection ASR (emulator)

Value59.4% average (emulator)

Baseline15.3%

Prompt injection ASR by core model

ValueGPT-3.5 59.4% | GPT-4 32.1% | Claude-2 88.1%

Baselinerespective model baselines 15.3%/9.1%/10.5%

Adversarial perturbation ASR (GCG/SCPN/VIPER)

ValueGCG 15.5% | SCPN 14.2% | VIPER 15.1% (GPT-3.5 baseline comparisons)

Baseline15.3%

Case study — Gmail agent (user input)

ValueInfinite loop 90.0% ASR | Incorrect function 75.0% ASR

Baseline0% (no attack in these case tests)

Case study — CSV agent (user input)

ValueInfinite loop 85.0% ASR | Incorrect function 90.0% ASR

Baseline0%

Intermediate output attack (Gmail vs CSV)

ValueGmail external infinite loop 20.0% | external incorrect 60.0%; CSV external 0%

Baselineuser-input ASR (see case-study values)

Multi-agent propagation (advanced attacks)

ValueSame-type propagation ASR 30–50% | Different-type chain (CSV→Gmail) ASR ≈80%

Baselinesingle-agent ASR

Self-examination anomaly detection (policy vs malfunction)

ValuePolicy self-exam: direct harm 90–100% detected; infinite loop & incorrect function 0% detected

BaselineN/A

Who Should Care

What To Try In 7 Days

Treat all external text (user inputs, API outputs, files) as untrusted and add strict parsing/whitelisting.

Add a non-LLM runtime guard that checks action requests before execution (rate limits, action whitelists, confirmation flows).

Run the provided emulator tests or simple internal harness to reproduce prompt-injection failure modes on your agents.

Agent Features

Memory

  • Conversation/history memory (short-term storage)

Planning

  • ReAct (stepwise reasoning + action selection)

Tool Use

  • APIs (Gmail, Twilio, WolframAlpha)
  • Python toolkits and file operations (CSV analysis)

Frameworks

  • LangChain
  • LM-based agent emulator

Is Agentic

true

Architectures

  • LLM core + planning + tools + memory

Collaboration

  • Multi-agent communication chains (agent-to-agent messages)

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Only two implemented agents (Gmail, CSV); emulator coverage does not replace full production integrations.
  • Evaluations use three closed-source LLM cores; open-source and other models not tested.
  • Agent emulator may differ from real APIs and runtime safety checks.

When Not To Use

  • If agent runs behind strict external authorization and non-LLM action gating
  • If agents never execute external or user-provided textual actions

Failure Modes

  • Infinite repeat loops of actions (resource exhaustion)
  • Execution of irrelevant but benign functions (spamming, wasted work)
  • Propagation through agent-to-agent communication chains
  • Missed detection by LLM-only self-examination

Core Entities

Models

  • GPT-3.5-Turbo
  • GPT-3.5-Turbo-16k
  • GPT-4
  • Claude-2

Metrics

  • Attack Success Rate (ASR)
  • Failure rate (task completion)
  • Anomaly detection rate (self-examination)

Datasets

  • Agent emulator test suite (144 test cases, 36 toolkits, >300 tools)
  • Case-study tasks: Gmail agent tasks
  • Case-study tasks: CSV agent tasks