Overview
Production Readiness
0.3
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
3
Why It Matters For Business
Attackers can disable or misuse agents without any overtly malicious text. Prompt injection can cause outages, wasted compute, or automated spamming, and LLM self-checks alone rarely detect it.
Summary TLDR
The paper defines a class of attacks that induce logic malfunctions in autonomous LLM agents (infinite loops or incorrect but benign actions). Using an LLM-based agent emulator plus two implemented agents (Gmail, CSV), the authors show prompt-injection style attacks raise failure rates from ~15% baseline to ~59% on average and up to 88% on some cores. Adversarial text perturbations and adversarial demonstrations are largely ineffective. Multi-agent setups let an infected agent propagate malfunctions to others (≈80% in tested chains). Simple LLM self-examination detects overt harmful prompts but largely fails to flag these malfunction attacks. Practical fixes need external guards, input sanit
Problem Statement
LLM agents can act in the world via tools and thus have new attack surfaces. Existing red-teaming focuses on overtly harmful outputs, not on attacks that quietly make agents malfunction (repeat actions, run irrelevant functions). The paper asks: how fragile are agents to attacks that amplify natural instability, and can built-in LLM self-checks detect them?
Main Contribution
Define a new attack class that forces agent malfunctions (infinite loops or incorrect benign actions).
Large-scale emulator study: 144 test cases across 36 toolkits (>300 tools) plus two implemented agents (Gmail, CSV).
Show prompt injection is highly effective; perturbation and demonstration attacks are weak.
Show multi-agent chains let malfunctions spread; LLM self-examination is a weak defense against these attacks.
Key Findings
Prompt-injection infinite-loop attacks raise the failure rate substantially, from roughly 15% at baseline to roughly 59% on average.
Effectiveness depends on core model; some cores are especially vulnerable.
Adversarial text perturbations and paraphrases had low impact.
Adversarial demonstrations failed to manipulate agents in tested cases.
Intermediate outputs and tool responses can be attack surfaces, with behavior varying by agent type.
Multi-agent chains can propagate malfunctions from robust agents to fragile ones.
LLM self-examination defenses detect obvious harmful prompts but fail on malfunction attacks.
Results
Baseline failure rate (no attack)
Infinite-loop prompt injection ASR (emulator)
Prompt injection ASR by core model
Adversarial perturbation ASR (GCG/SCPN/VIPER)
Case study — Gmail agent (user input)
Case study — CSV agent (user input)
Intermediate output attack (Gmail vs CSV)
Multi-agent propagation (advanced attacks)
Self-examination anomaly detection (policy vs malfunction)
Who Should Care
What To Try In 7 Days
Treat all external text (user inputs, API outputs, files) as untrusted and add strict parsing/whitelisting.
Add a non-LLM runtime guard that checks action requests before execution (rate limits, action whitelists, confirmation flows).
Run the provided emulator tests or a simple internal harness to reproduce prompt-injection failure modes on your agents.
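The runtime guard suggested above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's method: the class name `ActionGuard`, its fields, and the default budget of 5 calls per 60 seconds are all assumptions chosen for the example. The point is that the check is plain code, so injected text cannot talk it out of a decision.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ActionGuard:
    """Non-LLM gate consulted before any tool call runs (hypothetical helper)."""
    allowed_actions: set
    needs_confirmation: set = field(default_factory=set)
    max_calls: int = 5             # assumed budget per window
    window_seconds: float = 60.0   # assumed sliding-window length
    _calls: list = field(default_factory=list, init=False)

    def check(self, action, now=None, confirmed=False):
        """Return (allowed, reason) for a requested action name."""
        now = time.monotonic() if now is None else now
        if action not in self.allowed_actions:
            return False, "action not on whitelist"
        if action in self.needs_confirmation and not confirmed:
            return False, "confirmation required"
        # Keep only timestamps that still fall inside the sliding window.
        self._calls = [t for t in self._calls if now - t < self.window_seconds]
        if len(self._calls) >= self.max_calls:
            return False, "rate limit exceeded"
        self._calls.append(now)
        return True, "ok"


guard = ActionGuard(
    allowed_actions={"read_email", "send_email"},
    needs_confirmation={"send_email"},
    max_calls=2,
    window_seconds=10.0,
)
```

An agent loop would call `guard.check(action_name)` before dispatching each tool call and refuse or escalate when the first element of the result is `False`. Passing `now` explicitly is only there to make the guard easy to test.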
Agent Features
Memory
- Conversation/history memory (short-term storage)
Planning
- ReAct (stepwise reasoning + action selection)
Tool Use
- APIs (Gmail, Twilio, WolframAlpha)
- Python toolkits and file operations (CSV analysis)
Frameworks
- LangChain
- LM-based agent emulator
Is Agentic
true
Architectures
- LLM core + planning + tools + memory
Collaboration
- Multi-agent communication chains (agent-to-agent messages)
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Only two implemented agents (Gmail, CSV); emulator coverage does not replace full production integrations.
- Evaluations use three closed-source LLM cores; open-source and other models were not tested.
- Agent emulator may differ from real APIs and runtime safety checks.
When Not To Use
- If the agent runs behind strict external authorization and non-LLM action gating
- If agents never act on external or user-provided text
Failure Modes
- Infinite repeat loops of actions (resource exhaustion)
- Execution of irrelevant but benign functions (spamming, wasted work)
- Propagation through agent-to-agent communication chains
- Missed detection by LLM-only self-examination
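The first failure mode, repeated actions, is one that a simple external check can catch without asking the LLM. The sketch below assumes the agent's trace is available as a list of `(action, args)` pairs and that more than three identical steps is suspicious. Both the trace format and the threshold are assumptions for illustration, and a legitimate task may need a higher limit.

```python
from collections import Counter


def detect_repeat_loop(trace, max_repeats=3):
    """Return the first (action, args) step seen more than max_repeats times, else None."""
    counts = Counter()
    for action, args in trace:
        # repr() makes unhashable arguments (dicts, lists) usable as counter keys.
        key = (action, repr(args))
        counts[key] += 1
        if counts[key] > max_repeats:
            return action, args
    return None


# Toy traces, invented for illustration only.
looping_trace = [("search_email", {"query": "invoice"})] * 4
normal_trace = [
    ("search_email", {"query": "invoice"}),
    ("read_email", {"id": 1}),
    ("send_email", {"to": "a@example.com"}),
]
```

This counts total repeats across the whole trace, not only consecutive ones, which is a deliberately blunt design choice. It trades some false positives for a check that cannot be bypassed by interleaving a different action between repeats.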
Core Entities
Models
- GPT-3.5-Turbo
- GPT-3.5-Turbo-16k
- GPT-4
- Claude-2
Metrics
- Attack Success Rate (ASR)
- Failure rate (task completion)
- Anomaly detection rate (self-examination)
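As a rough illustration of how these rate metrics are typically computed, the sketch below treats each as the fraction of runs with a given outcome. The exact definitions used in the paper are not given here, so the function names and the per-run boolean encoding are assumptions.

```python
def fraction_true(flags):
    """Fraction of True values in a list of booleans; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def failure_rate(task_completed):
    """Share of runs in which the agent did not complete its task."""
    return fraction_true([not done for done in task_completed])


def attack_success_rate(attack_caused_malfunction):
    """Share of attacked runs in which the intended malfunction occurred."""
    return fraction_true(attack_caused_malfunction)


# Toy outcomes, invented for illustration only (not results from the paper).
baseline_runs = [True, True, True, False]   # task completed?
attacked_runs = [True, False, True, True]   # did the attack trigger a malfunction?
```

With these toy lists, `failure_rate(baseline_runs)` gives the no-attack failure rate and `attack_success_rate(attacked_runs)` gives the ASR, so the two can be compared on the same set of test cases.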
Datasets
- Agent emulator test suite (144 test cases, 36 toolkits, >300 tools)
- Case-study tasks: Gmail agent tasks
- Case-study tasks: CSV agent tasks

