Overview
Results combine large-scale emulator runs and two implemented agents; emulator approximations limit direct transfer to every production setup, but attack trends and defenses are well supported.
Citations: 3
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 7/7
Findings with evidence refs: 7/7
Results with explicit delta: 8/9
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 70%
Production readiness: 30%
Novelty: 60%
Why It Matters For Business
Agents can be disabled or misused without any obviously malicious text. Prompt injection can cause outages, wasted compute, or automated spamming, and LLM self-checks alone rarely detect it.
Who Should Care
Summary TLDR
The paper defines a class of attacks that induce logic malfunctions in autonomous LLM agents (infinite loops or incorrect but benign actions). Using an LLM-based agent emulator plus two implemented agents (Gmail, CSV), the authors show prompt-injection style attacks raise failure rates from ~15% baseline to ~59% on average and up to 88% on some cores. Adversarial text perturbations and adversarial demonstrations are largely ineffective. Multi-agent setups let an infected agent propagate malfunctions to others (≈80% in tested chains). Simple LLM self-examination detects overt harmful prompts but largely fails to flag these malfunction attacks. Practical fixes need external guards, input sanit
Problem Statement
LLM agents can act in the world via tools and thus have new attack surfaces. Existing red-teaming focuses on overtly harmful outputs, not on attacks that quietly make agents malfunction (repeat actions, run irrelevant functions). The paper asks: how fragile are agents to attacks that amplify natural instability, and can built-in LLM self-checks detect them?
Main Contribution
Define a new attack class that forces agent malfunctions (infinite loops or incorrect benign actions).
Large-scale emulator study: 144 test cases across 36 toolkits (>300 tools) plus two implemented agents (Gmail, CSV).
Key Findings
Prompt-injection infinite-loop attacks raise failure rate substantially.
Effectiveness depends on core model; some cores are especially vulnerable.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Baseline failure rate (no attack) | 15.3% (emulator average) | — | — | Emulator suite | Reported baseline failure rate across emulator tests | Table 1 |
| Infinite-loop prompt injection ASR (emulator) | 59.4% average (emulator) | 15.3% | +44.1 pp | Emulator suite | Prompt injection infinite-loop attack raised failure rate to 59.4% | Table 1 |
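The Delta column is a percentage-point difference, not a relative change. A minimal sketch of that arithmetic, using only the two rates from the table; the function name `failure_rate_change` is a hypothetical helper, not something from the paper:

```python
def failure_rate_change(baseline_pct, attacked_pct):
    """Return (percentage-point delta, relative multiplier) for two failure rates."""
    delta_pp = round(attacked_pct - baseline_pct, 1)   # absolute change in points
    multiplier = round(attacked_pct / baseline_pct, 2)  # attacked rate vs. baseline
    return delta_pp, multiplier


# Rates taken from the table rows above (emulator averages).
delta_pp, multiplier = failure_rate_change(15.3, 59.4)
print(f"+{delta_pp} pp, {multiplier}x the baseline failure rate")
```

Read this way, the attack adds 44.1 points to the failure rate, which is a little under four times the no-attack baseline.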
What To Try In 7 Days
Treat all external text (user inputs, API outputs, files) as untrusted and add strict parsing/whitelisting.
Add a non-LLM runtime guard that checks action requests before execution (rate limits, action whitelists, confirmation flows).
Run the provided emulator tests or simple internal harness to reproduce prompt-injection failure modes on your agents.
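The runtime guard suggested above could look roughly like the following. This is a minimal sketch under stated assumptions, not the paper's implementation: the class `ActionGuard`, its parameters, and the action names are all hypothetical, and it covers only an action whitelist plus a sliding-window rate limit.

```python
import time
from collections import deque


class ActionGuard:
    """Non-LLM gate consulted before each tool call (hypothetical helper)."""

    def __init__(self, allowed_actions, max_calls, window_seconds, clock=time.monotonic):
        self.allowed_actions = frozenset(allowed_actions)
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock              # injectable so the guard can be tested
        self._timestamps = deque()      # times of recently approved calls

    def check(self, action):
        """Return (allowed, reason) for a requested action name."""
        if action not in self.allowed_actions:
            return False, "action not in whitelist"
        now = self.clock()
        # Drop approvals that have aged out of the sliding window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_calls:
            return False, "rate limit exceeded"
        self._timestamps.append(now)
        return True, "ok"


# Example: allow at most 2 calls per 10 seconds, and only to known actions.
guard = ActionGuard({"read_email", "send_email"}, max_calls=2, window_seconds=10)
print(guard.check("read_email"))      # approved
print(guard.check("delete_mailbox"))  # rejected: not whitelisted
```

Because the check is plain code rather than another LLM call, injected text cannot talk it out of a decision; a confirmation flow for high-impact actions would sit alongside it.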
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Collaboration
Reproducibility
Risks & Boundaries
Limitations
Only two implemented agents (Gmail, CSV); emulator coverage does not replace full production integrations.
Evaluations use three closed-source LLM cores; open-source and other models not tested.
When Not To Use
If the agent already runs behind strict external authorization and non-LLM action gating
If the agent never acts on external or user-provided text
Failure Modes
Infinite repeat loops of actions (resource exhaustion)
Execution of irrelevant but benign functions (spamming, wasted work)
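The infinite-repeat failure mode can be caught outside the model by counting identical action requests. A minimal sketch, assuming tool arguments are JSON-serializable; `RepeatDetector` and `max_repeats` are hypothetical names for illustration, not part of the paper:

```python
import json
from collections import Counter


class RepeatDetector:
    """Flags an agent run that keeps issuing the same action with the same arguments."""

    def __init__(self, max_repeats):
        self.max_repeats = max_repeats
        self._counts = Counter()

    def record(self, action, args):
        """Record one request; return True once it exceeds max_repeats."""
        # Canonical JSON makes argument dicts with different key order compare equal.
        key = (action, json.dumps(args, sort_keys=True))
        self._counts[key] += 1
        return self._counts[key] > self.max_repeats


# Example: halt the run after the same call is requested more than 3 times.
detector = RepeatDetector(max_repeats=3)
for step in range(5):
    if detector.record("send_email", {"to": "a@example.com", "body": "hi"}):
        print(f"halting at step {step}: repeated action detected")
        break
```

An exact-match counter like this only catches literal repetition; loops that vary their arguments slightly would need a looser similarity check or a hard step budget.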

