Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
If you let LLM agents access user data, simple injected text can cause measurable leaks; test agents on task-specific injection scenarios before deployment.
Summary TLDR
The paper builds data-flow prompt-injection attacks that target tool-using LLM agents (a banking agent). The authors integrate these attacks into the AgentDojo benchmark, create a richer synthetic banking conversation dataset, and test six LLMs across 16 then 48 user tasks. Average targeted attack success rates (ASR) are ~15–20% across evaluated models and tasks; some defenses (prompt-injection detector, repeating user prompt) can reduce ASR to near zero on the smaller suite but none fully prevent leakage on the expanded 48-task set. Models rarely disclose passwords alone, but password leakage rises when passwords are requested together with other personal fields. Tasks that retrieve data or
Problem Statement
Tool-using LLM agents can be tricked by injected text inside inputs (prompt injection) into revealing personal data they saw while performing a task. The paper asks: how effective are simple data-flow prompt injections at exfiltrating agent-observed data, which models and tasks are most vulnerable, and which lightweight defenses reduce leakage?
Main Contribution
Design of data-flow prompt injection attacks that target data exfiltration from tool-calling agents.
Integration of those attacks into AgentDojo's banking suite to measure leakage on practical user tasks.
Creation of a richer synthetic human–AI banking conversation dataset and an expanded 48-task evaluation set.
Key Findings
Average attack success rate (ASR) across models and tasks is around 15–20%.
Most models rarely leak passwords when asked alone, but leakage rises when passwords are requested together with other fields.
Defenses can reduce ASR to near-zero but often at the cost of reduced benign utility.
Attack effectiveness depends on task type: data-retrieval and authorization tasks are more vulnerable.
Attack phrasing and attacker knowledge matter: 'Important message' and adaptive (Max) phrasing are more effective.
Results
Average targeted ASR (16 AgentDojo tasks)
Average targeted ASR (expanded 48 tasks)
Utility drop under attack
Defense effectiveness (GPT-4o, 16 tasks)
Defense effectiveness (GPT-4o, 48 tasks)
Who Should Care
What To Try In 7 Days
Run AgentDojo-style tests (or the paper's 48-task suite) against your agent to measure ASR and utility under attack.
Add a lightweight prompt-injection detector and a repeat-user-prompt check; measure the utility trade-off.
Audit flows that return multiple personal fields together (especially passwords) and avoid bundling credentials with other data.
Agent Features
Memory
- short-term execution observations (data seen during task)
Planning
- multi-step task planning
Tool Use
- email tool
- account lookup functions
- calendar/storage/email style tools
Frameworks
- AgentDojo
Is Agentic
true
Architectures
- LLM with function-calling
Collaboration
- single-agent interacting with external tools
Reproducibility
Code Urls
- AgentDojo GitHub (paper states AgentDojo repo contains code/data)
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Threat model assumes attacker can inject content into application inputs and may know the retrieval system (white-box), which is stronger than some production threat models.
- Synthetic dataset approximates real conversations; results may differ on real user data.
- Experiments focus on a banking environment; other domains (insurance, crypto, stocks) may behave differently.
- Dataset and tool implementations influence measured ASR; results are sensitive to task wording and model system prompts.
When Not To Use
- If your agent has strong design-level isolation like CaMeL-style architectures (not evaluated here).
- If your production system never exposes attacker-controlled content to the agent.
- If your model architecture does not use function-calling or external tools.
Failure Modes
- Adding detectors reduces agent utility and may break legitimate flows.
- Defenses tuned on a small task set may fail on broader or domain-specific prompts.
- Model updates can change leakage behavior, invalidating prior tests.
- Simulated attacker knowledge (name guessing) underestimates skilled adversaries.
Core Entities
Models
- GPT-3.5 Turbo
- GPT-4 Turbo
- GPT-4o
- Llama-3 (70B)
- Llama-4 (17B)
- Claude 3.5 Sonnet
Metrics
- Benign Utility
- Utility Under Attack
- Targeted Attack Success Rate (ASR)
- Password leakage rate
Datasets
- AgentDojo banking suite
- Synthetic banking conversation dataset (this paper)
Benchmarks
- AgentDojo

