Overview
The attacks are practical and tested across multiple off-the-shelf models and an expanded task set, but results rely on synthetic data and a specific threat model; defenses show promise but reduce utility.
Citations0
Evidence Strength0.80
Confidence0.85
Risk Signals11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 60%
Why It Matters For Business
If you let LLM agents access user data, simple injected text can cause measurable leaks; test agents on task-specific injection scenarios before deployment.
Who Should Care
Summary TLDR
The paper builds data-flow prompt-injection attacks that target tool-using LLM agents (a banking agent). The authors integrate these attacks into the AgentDojo benchmark, create a richer synthetic banking conversation dataset, and test six LLMs across 16 then 48 user tasks. Average targeted attack success rates (ASR) are ~15–20% across evaluated models and tasks; some defenses (prompt-injection detector, repeating user prompt) can reduce ASR to near zero on the smaller suite but none fully prevent leakage on the expanded 48-task set. Models rarely disclose passwords alone, but password leakage rises when passwords are requested together with other personal fields. Tasks that retrieve data or
Problem Statement
Tool-using LLM agents can be tricked by injected text inside inputs (prompt injection) into revealing personal data they saw while performing a task. The paper asks: how effective are simple data-flow prompt injections at exfiltrating agent-observed data, which models and tasks are most vulnerable, and which lightweight defenses reduce leakage?
Main Contribution
Design of data-flow prompt injection attacks that target data exfiltration from tool-calling agents.
Integration of those attacks into AgentDojo's banking suite to measure leakage on practical user tasks.
Key Findings
Average attack success rate (ASR) across models and tasks is around 15–20%.
Most models rarely leak passwords when asked alone, but leakage rises when passwords are requested together with other fields.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Average targeted ASR (16 AgentDojo tasks) | ≈20% (most models); Llama-4 (17B) = 40% | — | — | AgentDojo banking 16 tasks | Section 4.1; Figures 2 and 3 | Figures 2,3; text in 4.1 |
| Average targeted ASR (expanded 48 tasks) | ≈15% | — | — | Expanded synthetic + AgentDojo 48 tasks | Section 4.4; Figure 6 | Figure 6; Table 7 |
What To Try In 7 Days
Run AgentDojo-style tests (or the paper's 48-task suite) against your agent to measure ASR and utility under attack.
Add a lightweight prompt-injection detector and a repeat-user-prompt check; measure the utility trade-off.
Audit flows that return multiple personal fields together (especially passwords) and avoid bundling credentials with other data.
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Reproducibility
Code URLs
Risks & Boundaries
Limitations
Threat model assumes attacker can inject content into application inputs and may know the retrieval system (white-box), which is stronger than some production threat models.
Synthetic dataset approximates real conversations; results may differ on real user data.
When Not To Use
If your agent has strong design-level isolation like CaMeL-style architectures (not evaluated here).
If your production system never exposes attacker-controlled content to the agent.
Failure Modes
Adding detectors reduces agent utility and may break legitimate flows.
Defenses tuned on a small task set may fail on broader or domain-specific prompts.

