Straightforward prompt injections can make tool-using LLM agents leak user data seen during a task.

June 1, 20258 min

Overview

Decision SnapshotNeeds Validation

The attacks are practical and tested across multiple off-the-shelf models and an expanded task set, but results rely on synthetic data and a specific threat model; defenses show promise but reduce utility.

Citations0

Evidence Strength0.80

Confidence0.85

Risk Signals11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 60%

Authors

Meysam Alizadeh, Zeynab Samei, Daria Stetsenko, Fabrizio Gilardi

Links

Abstract / PDF / Code

Why It Matters For Business

If you let LLM agents access user data, simple injected text can cause measurable leaks; test agents on task-specific injection scenarios before deployment.

Who Should Care

Summary TLDR

The paper builds data-flow prompt-injection attacks that target tool-using LLM agents (a banking agent). The authors integrate these attacks into the AgentDojo benchmark, create a richer synthetic banking conversation dataset, and test six LLMs across 16 then 48 user tasks. Average targeted attack success rates (ASR) are ~15–20% across evaluated models and tasks; some defenses (prompt-injection detector, repeating user prompt) can reduce ASR to near zero on the smaller suite but none fully prevent leakage on the expanded 48-task set. Models rarely disclose passwords alone, but password leakage rises when passwords are requested together with other personal fields. Tasks that retrieve data or

Problem Statement

Tool-using LLM agents can be tricked by injected text inside inputs (prompt injection) into revealing personal data they saw while performing a task. The paper asks: how effective are simple data-flow prompt injections at exfiltrating agent-observed data, which models and tasks are most vulnerable, and which lightweight defenses reduce leakage?

Main Contribution

Design of data-flow prompt injection attacks that target data exfiltration from tool-calling agents.

Integration of those attacks into AgentDojo's banking suite to measure leakage on practical user tasks.

Key Findings

Average attack success rate (ASR) across models and tasks is around 15–20%.

NumbersASR ≈15% (48 tasks) and ≈20% (16 tasks); Llama-4 (17B) hit 40% on 16 tasks.

Practical UseExpect measurable leakage from straightforward injections; run task-level ASR tests before deployment.

Evidence RefAbstract; Figures 2,6; text in Sections 4.1 and 4.4

Most models rarely leak passwords when asked alone, but leakage rises when passwords are requested together with other fields.

NumbersPassword-only ASR ≈0% for most models; 'Password + 1' shows up to 18.75% (GPT-3.5) and 12.5% (Llama-4 17B).

Practical UseDo not assume models won't leak credentials; avoid exposing passwords alongside other user fields and treat combined requests as higher risk.

Evidence RefSection 4.1; Figure 13; paragraph on password leakage rates

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Average targeted ASR (16 AgentDojo tasks)≈20% (most models); Llama-4 (17B) = 40%AgentDojo banking 16 tasksSection 4.1; Figures 2 and 3Figures 2,3; text in 4.1
Average targeted ASR (expanded 48 tasks)≈15%Expanded synthetic + AgentDojo 48 tasksSection 4.4; Figure 6Figure 6; Table 7

What To Try In 7 Days

Run AgentDojo-style tests (or the paper's 48-task suite) against your agent to measure ASR and utility under attack.

Add a lightweight prompt-injection detector and a repeat-user-prompt check; measure the utility trade-off.

Audit flows that return multiple personal fields together (especially passwords) and avoid bundling credentials with other data.

Agent Features

Memory
short-term execution observations (data seen during task)
Planning
multi-step task planning
Tool Use
email toolaccount lookup functionscalendar/storage/email style tools
Frameworks
AgentDojo
Is Agentic

Yes

Architectures
LLM with function-calling
Collaboration
single-agent interacting with external tools

Reproducibility

Code AvailableYes
Data AvailableNo
Open Source StatusPartial
LicenseUnknown

Code URLs

AgentDojo GitHub (paper states AgentDojo repo contains code/data)

Risks & Boundaries

Limitations

Threat model assumes attacker can inject content into application inputs and may know the retrieval system (white-box), which is stronger than some production threat models.

Synthetic dataset approximates real conversations; results may differ on real user data.

When Not To Use

If your agent has strong design-level isolation like CaMeL-style architectures (not evaluated here).

If your production system never exposes attacker-controlled content to the agent.

Failure Modes

Adding detectors reduces agent utility and may break legitimate flows.

Defenses tuned on a small task set may fail on broader or domain-specific prompts.

Core Entities

Models

GPT-3.5 TurboGPT-4 TurboGPT-4oLlama-3 (70B)Llama-4 (17B)Claude 3.5 Sonnet

Metrics

Benign UtilityUtility Under AttackTargeted Attack Success Rate (ASR)Password leakage rate

Datasets

AgentDojo banking suiteSynthetic banking conversation dataset (this paper)

Benchmarks

AgentDojo