Overview
The benchmark is comprehensive and tested on 30 agents, giving strong evidence of vulnerabilities; results are limited to single-turn tests and a fixed enhanced prompt.
Citations5
Evidence Strength0.90
Confidence0.90
Risk Signals10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Tool-enabled LLM agents can be hijacked by content they retrieve, causing unauthorized transactions or data leaks; firms must test agents with realistic IPI cases before deployment.
Who Should Care
Summary TLDR
This paper introduces INJECAGENT, a public benchmark of 1,054 test cases (17 user tools × 62 attacker cases) that measures how tool-enabled LLM agents respond when external content contains malicious instructions (indirect prompt injection, IPI). Evaluating 30 agents, the authors find prompted agents are often vulnerable (e.g., ReAct-prompted GPT‑4 ASR-valid 24% base, 47% with a 'hacking' prompt) while fine-tuning for tool calls substantially lowers vulnerability (fine-tuned GPT‑4 ASR-valid ≈ 6.6%–7.1%). Key practical lessons: reduce free-text placeholders, avoid blindly concatenating untrusted content, and prefer tool-call fine-tuning or stricter output checks.
Problem Statement
LLM agents are being given tools and access to external content. That content can contain hidden instructions from attackers (indirect prompt injection, IPI) that redirect agents to perform harmful actions or leak private data. We lack a systematic, realistic benchmark to measure these risks for tool‑integrated agents and to compare defenses.
Main Contribution
Formalize indirect prompt injection (IPI) against tool-integrated LLM agents and define measurable attack success.
Release INJECAGENT: 1,054 realistic test cases combining 17 user-facing tools and 62 attacker instructions, with base and enhanced (hacking-prompt) settings.
Key Findings
INJECAGENT covers 1,054 test cases built from 17 user tools and 62 attacker instructions.
Prompted GPT‑4 (ReAct) is vulnerable: ASR-valid = 24% (base) and 47% (enhanced with hacking prompt).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Benchmark size | 1,054 test cases (17 user × 62 attacker) | — | — | INJECAGENT | Table 2: 17 user cases × 62 attacker cases = 1,054 | Table 2 |
| Prompted GPT-4 ASR-valid (vulnerable rate) | 24% (base); 47% (enhanced) | — | ↑23pp | INJECAGENT | Table 3 reports GPT-4 ASR-valid 24% base and 47% enhanced | Table 3 |
What To Try In 7 Days
Run INJECAGENT or a subset on your agent to get an IPI risk baseline.
Identify high-content-freedom integrations (free-text fields) and apply strict parsing or sanitization.
Add mandatory user confirmation for any high-risk tool call (payments, locks, data exports).
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Reproducibility
Risks & Boundaries
Limitations
Enhanced setting uses a single fixed hacking prompt; other prompts may behave differently.
Test cases are single-turn and limit attacker instructions to at most two steps.
When Not To Use
For multi-turn attack simulations involving long adversarial dialogues.
To evaluate prompt-injection defenses that require model parameter edits not covered by this benchmark.
Failure Modes
Model outputs that do not follow the ReAct format are excluded and reduce measurement coverage.
Attacker instructions missing required tool parameters can produce false-negative attack failures.

