Straightforward prompt injections can make tool-using LLM agents leak user data seen during a task.

June 1, 20258 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Meysam Alizadeh, Zeynab Samei, Daria Stetsenko, Fabrizio Gilardi

Links

Abstract / PDF

Why It Matters For Business

If you let LLM agents access user data, simple injected text can cause measurable leaks; test agents on task-specific injection scenarios before deployment.

Summary TLDR

The paper builds data-flow prompt-injection attacks that target tool-using LLM agents (a banking agent). The authors integrate these attacks into the AgentDojo benchmark, create a richer synthetic banking conversation dataset, and test six LLMs across 16 then 48 user tasks. Average targeted attack success rates (ASR) are ~15–20% across evaluated models and tasks; some defenses (prompt-injection detector, repeating user prompt) can reduce ASR to near zero on the smaller suite but none fully prevent leakage on the expanded 48-task set. Models rarely disclose passwords alone, but password leakage rises when passwords are requested together with other personal fields. Tasks that retrieve data or

Problem Statement

Tool-using LLM agents can be tricked by injected text inside inputs (prompt injection) into revealing personal data they saw while performing a task. The paper asks: how effective are simple data-flow prompt injections at exfiltrating agent-observed data, which models and tasks are most vulnerable, and which lightweight defenses reduce leakage?

Main Contribution

Design of data-flow prompt injection attacks that target data exfiltration from tool-calling agents.

Integration of those attacks into AgentDojo's banking suite to measure leakage on practical user tasks.

Creation of a richer synthetic human–AI banking conversation dataset and an expanded 48-task evaluation set.

Key Findings

Average attack success rate (ASR) across models and tasks is around 15–20%.

NumbersASR ≈15% (48 tasks) and ≈20% (16 tasks); Llama-4 (17B) hit 40% on 16 tasks.

Most models rarely leak passwords when asked alone, but leakage rises when passwords are requested together with other fields.

NumbersPassword-only ASR ≈0% for most models; 'Password + 1' shows up to 18.75% (GPT-3.5) and 12.5% (Llama-4 17B).

Defenses can reduce ASR to near-zero but often at the cost of reduced benign utility.

NumbersOn 16 tasks GPT-4o: PI detector and Repeat Prompt ASR = 0%; tool filter ASR = 3.1%; benign utility falls (e.g., Repeat:

Attack effectiveness depends on task type: data-retrieval and authorization tasks are more vulnerable.

NumbersAccount Info and Profile/Auth groups show higher ASR (~>15%) and lower utility under attack vs other groups.

Attack phrasing and attacker knowledge matter: 'Important message' and adaptive (Max) phrasing are more effective.

Numbers'Important message' outperforming others; adaptive Max adds ~2.5% ASR; correct user/model names raise ASR by ~4.1%.

Results

Average targeted ASR (16 AgentDojo tasks)

Value≈20% (most models); Llama-4 (17B) = 40%

Average targeted ASR (expanded 48 tasks)

Value≈15%

Utility drop under attack

Value15%–50% absolute drop for many LLMs

BaselineBenign utility

Defense effectiveness (GPT-4o, 16 tasks)

ValuePI detector & Repeat prompt: ASR = 0%; Tool filter ASR = 3.1%

BaselineNo defense ASR = 7.8%

Defense effectiveness (GPT-4o, 48 tasks)

ValuePI detector ASR ≈1.5%; Tool filter ASR ≈1.0%; No defense ASR ≈11.4%

BaselineNo defense

Who Should Care

What To Try In 7 Days

Run AgentDojo-style tests (or the paper's 48-task suite) against your agent to measure ASR and utility under attack.

Add a lightweight prompt-injection detector and a repeat-user-prompt check; measure the utility trade-off.

Audit flows that return multiple personal fields together (especially passwords) and avoid bundling credentials with other data.

Agent Features

Memory

  • short-term execution observations (data seen during task)

Planning

  • multi-step task planning

Tool Use

  • email tool
  • account lookup functions
  • calendar/storage/email style tools

Frameworks

  • AgentDojo

Is Agentic

true

Architectures

  • LLM with function-calling

Collaboration

  • single-agent interacting with external tools

Reproducibility

Code Urls

  • AgentDojo GitHub (paper states AgentDojo repo contains code/data)

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Threat model assumes attacker can inject content into application inputs and may know the retrieval system (white-box), which is stronger than some production threat models.
  • Synthetic dataset approximates real conversations; results may differ on real user data.
  • Experiments focus on a banking environment; other domains (insurance, crypto, stocks) may behave differently.
  • Dataset and tool implementations influence measured ASR; results are sensitive to task wording and model system prompts.

When Not To Use

  • If your agent has strong design-level isolation like CaMeL-style architectures (not evaluated here).
  • If your production system never exposes attacker-controlled content to the agent.
  • If your model architecture does not use function-calling or external tools.

Failure Modes

  • Adding detectors reduces agent utility and may break legitimate flows.
  • Defenses tuned on a small task set may fail on broader or domain-specific prompts.
  • Model updates can change leakage behavior, invalidating prior tests.
  • Simulated attacker knowledge (name guessing) underestimates skilled adversaries.

Core Entities

Models

  • GPT-3.5 Turbo
  • GPT-4 Turbo
  • GPT-4o
  • Llama-3 (70B)
  • Llama-4 (17B)
  • Claude 3.5 Sonnet

Metrics

  • Benign Utility
  • Utility Under Attack
  • Targeted Attack Success Rate (ASR)
  • Password leakage rate

Datasets

  • AgentDojo banking suite
  • Synthetic banking conversation dataset (this paper)

Benchmarks

  • AgentDojo