Straightforward prompt injections can make tool-using LLM agents leak user data seen during a task.

Overview

Decision SnapshotNeeds Validation

The attacks are practical and tested across multiple off-the-shelf models and an expanded task set, but results rely on synthetic data and a specific threat model; defenses show promise but reduce utility.

Citations0

Evidence Strength0.80

Confidence0.85

Risk Signals11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 60%

Authors

Meysam Alizadeh, Zeynab Samei, Daria Stetsenko, Fabrizio Gilardi

Links

Abstract / PDF / Code

Why It Matters For Business

If you let LLM agents access user data, simple injected text can cause measurable leaks; test agents on task-specific injection scenarios before deployment.

Who Should Care

CTO Product Manager ML Engineer Engineering Lead Data Scientist CEO

Summary TLDR

The paper builds data-flow prompt-injection attacks that target tool-using LLM agents (a banking agent). The authors integrate these attacks into the AgentDojo benchmark, create a richer synthetic banking conversation dataset, and test six LLMs across 16 then 48 user tasks. Average targeted attack success rates (ASR) are ~15–20% across evaluated models and tasks; some defenses (prompt-injection detector, repeating user prompt) can reduce ASR to near zero on the smaller suite but none fully prevent leakage on the expanded 48-task set. Models rarely disclose passwords alone, but password leakage rises when passwords are requested together with other personal fields. Tasks that retrieve data or

Problem Statement

Tool-using LLM agents can be tricked by injected text inside inputs (prompt injection) into revealing personal data they saw while performing a task. The paper asks: how effective are simple data-flow prompt injections at exfiltrating agent-observed data, which models and tasks are most vulnerable, and which lightweight defenses reduce leakage?

Main Contribution

Design of data-flow prompt injection attacks that target data exfiltration from tool-calling agents.

Integration of those attacks into AgentDojo's banking suite to measure leakage on practical user tasks.

Key Findings

Average attack success rate (ASR) across models and tasks is around 15–20%.

NumbersASR ≈15% (48 tasks) and ≈20% (16 tasks); Llama-4 (17B) hit 40% on 16 tasks.

Practical UseExpect measurable leakage from straightforward injections; run task-level ASR tests before deployment.

Evidence RefAbstract; Figures 2,6; text in Sections 4.1 and 4.4

Most models rarely leak passwords when asked alone, but leakage rises when passwords are requested together with other fields.

NumbersPassword-only ASR ≈0% for most models; 'Password + 1' shows up to 18.75% (GPT-3.5) and 12.5% (Llama-4 17B).

Practical UseDo not assume models won't leak credentials; avoid exposing passwords alongside other user fields and treat combined requests as higher risk.

Evidence RefSection 4.1; Figure 13; paragraph on password leakage rates

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average targeted ASR (16 AgentDojo tasks)	≈20% (most models); Llama-4 (17B) = 40%	—	—	AgentDojo banking 16 tasks	Section 4.1; Figures 2 and 3	Figures 2,3; text in 4.1
Average targeted ASR (expanded 48 tasks)	≈15%	—	—	Expanded synthetic + AgentDojo 48 tasks	Section 4.4; Figure 6	Figure 6; Table 7

What To Try In 7 Days

Run AgentDojo-style tests (or the paper's 48-task suite) against your agent to measure ASR and utility under attack.

Add a lightweight prompt-injection detector and a repeat-user-prompt check; measure the utility trade-off.

Audit flows that return multiple personal fields together (especially passwords) and avoid bundling credentials with other data.

Agent Features

Memory

short-term execution observations (data seen during task)

Planning

multi-step task planning

Tool Use

email toolaccount lookup functionscalendar/storage/email style tools

Frameworks

AgentDojo

Is Agentic

Yes

Architectures

LLM with function-calling

Collaboration

single-agent interacting with external tools

Reproducibility

Code AvailableYes

Data AvailableNo

Open Source StatusPartial

LicenseUnknown

Code URLs

AgentDojo GitHub (paper states AgentDojo repo contains code/data)

Risks & Boundaries

Limitations

Threat model assumes attacker can inject content into application inputs and may know the retrieval system (white-box), which is stronger than some production threat models.

Synthetic dataset approximates real conversations; results may differ on real user data.

When Not To Use

If your agent has strong design-level isolation like CaMeL-style architectures (not evaluated here).

If your production system never exposes attacker-controlled content to the agent.

Failure Modes

Adding detectors reduces agent utility and may break legitimate flows.

Defenses tuned on a small task set may fail on broader or domain-specific prompts.

Core Entities

Models

GPT-3.5 TurboGPT-4 TurboGPT-4oLlama-3 (70B)Llama-4 (17B)Claude 3.5 Sonnet

Metrics

Benign UtilityUtility Under AttackTargeted Attack Success Rate (ASR)Password leakage rate

Datasets

AgentDojo banking suiteSynthetic banking conversation dataset (this paper)

Benchmarks

AgentDojo

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Average attack success rate (ASR) across models and tasks is around 15–20%.

Most models rarely leak passwords when asked alone, but leakage rises when passwords are requested together with other fields.

Results

What To Try In 7 Days

Agent Features

Reproducibility

Code URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

Short adversarial suffixes can flip LLM-as-a-Judge decisions; CUA >30% success

Key finding

BackdoorAgent: a stage-aware framework and benchmark showing memory backdoors persist across multi-step LLM agents

Key finding

JudgeDeceiver: automatically craft prompts that reliably trick LLM-as-a-Judge to pick an attacker’s response

Key finding

Make tool-using LLM agents provably safe by combining safety engineering, info-flow labels, and MCP extensions

Key finding

A systematic, practitioner-focused map of 193 multi-agent security threats and how 16 frameworks cover them

Key finding