INJECAGENT: 1,054 realistic tests that measure how tool-enabled LLM agents can be hijacked by malicious content

Overview

Decision Snapshot: Ready For Pilot

The benchmark is comprehensive and was tested on 30 agents, giving strong evidence of vulnerabilities; however, results are limited to single-turn tests and a single fixed enhanced prompt.

Citations: 5

Evidence Strength: 0.90

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Qiusi Zhan, Zhixiang Liang, Zifan Ying, Daniel Kang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Tool-enabled LLM agents can be hijacked by content they retrieve, causing unauthorized transactions or data leaks; firms must test agents with realistic IPI cases before deployment.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

This paper introduces INJECAGENT, a public benchmark of 1,054 test cases (17 user tools × 62 attacker cases) that measures how tool-enabled LLM agents respond when external content contains malicious instructions (indirect prompt injection, IPI). Evaluating 30 agents, the authors find that prompted agents are often vulnerable: ReAct-prompted GPT‑4 has an ASR-valid of 24% in the base setting and 47% with a 'hacking' prompt. Fine-tuning for tool calls substantially lowers vulnerability (fine-tuned GPT‑4 ASR-valid ≈ 6.6%–7.1%). Key practical lessons: reduce free-text placeholders, avoid blindly concatenating untrusted content, and prefer tool-call fine-tuning or stricter output checks.
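The lesson about not blindly concatenating untrusted content can be illustrated with a minimal sketch. It assumes a hypothetical review tool that returns JSON; the field names, the length cap, and the `<untrusted_tool_output>` markers are illustrative choices, not part of the benchmark. The idea is to keep only expected, typed fields and pass them to the agent as serialized data. This narrows the injection surface but does not remove it, since an attacker can still write into the free-text field that is kept.

```python
import json

# Hypothetical schema: the only fields the agent needs from a review tool.
ALLOWED_FIELDS = {"rating": int, "review_text": str}
MAX_TEXT_CHARS = 500  # assumed cap on free-text length


def extract_untrusted(raw_response: str) -> dict:
    """Keep only expected, correctly typed fields from a tool response."""
    data = json.loads(raw_response)
    clean = {}
    for name, expected_type in ALLOWED_FIELDS.items():
        value = data.get(name)
        if isinstance(value, expected_type):
            if isinstance(value, str):
                value = value[:MAX_TEXT_CHARS]  # truncate long free text
            clean[name] = value
    return clean


def render_as_data(clean: dict) -> str:
    """Serialize the fields as JSON inside explicit markers
    instead of splicing raw tool text into the prompt."""
    body = json.dumps(clean)
    return "<untrusted_tool_output>\n" + body + "\n</untrusted_tool_output>"
```

Unexpected keys (for example an extra field carrying instructions) are dropped before anything reaches the prompt.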

Problem Statement

LLM agents are being given tools and access to external content. That content can contain hidden instructions from attackers (indirect prompt injection, IPI) that redirect agents to perform harmful actions or leak private data. We lack a systematic, realistic benchmark to measure these risks for tool‑integrated agents and to compare defenses.

Main Contribution

Formalize indirect prompt injection (IPI) against tool-integrated LLM agents and define measurable attack success.

Release INJECAGENT: 1,054 realistic test cases combining 17 user-facing tools and 62 attacker instructions, with base and enhanced (hacking-prompt) settings.

Key Findings

INJECAGENT covers 1,054 test cases built from 17 user tools and 62 attacker instructions.

Numbers: 1,054 cases; 17 user tools; 62 attacker cases

Practical Use: Use this benchmark to measure agent risk across common tool types before deployment.

Evidence Ref: Table 2, Abstract
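The benchmark size follows from pairing every user case with every attacker case. A minimal sketch of that cross product is shown here; the identifiers are placeholders, and the actual case contents live in the InjecAgent repository.

```python
from itertools import product

# Placeholder identifiers standing in for the real user and attacker cases.
user_cases = [f"user_case_{i}" for i in range(17)]
attacker_cases = [f"attacker_case_{j}" for j in range(62)]

# One test case per (user case, attacker case) pair.
test_cases = [
    {"user_case": u, "attacker_case": a}
    for u, a in product(user_cases, attacker_cases)
]

print(len(test_cases))  # 1054
```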

Prompted GPT‑4 (ReAct) is vulnerable: ASR-valid = 24% (base) and 47% (enhanced with hacking prompt).

Numbers: GPT-4 ASR-valid 24% → 47% (base → enhanced)

Practical Use: Agents relying on prompt engineering are meaningfully exploitable; do not assume safety from prompt-only defenses.

Evidence Ref: Table 3, Section 3.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Benchmark size	1,054 test cases (17 user × 62 attacker)	—	—	INJECAGENT	Table 2: 17 user cases × 62 attacker cases = 1,054	Table 2
Prompted GPT-4 ASR-valid (vulnerable rate)	24% (base); 47% (enhanced)	—	↑23pp	INJECAGENT	Table 3 reports GPT-4 ASR-valid 24% base and 47% enhanced	Table 3

What To Try In 7 Days

Run INJECAGENT or a subset on your agent to get an IPI risk baseline.

Identify high-content-freedom integrations (free-text fields) and apply strict parsing or sanitization.

Add mandatory user confirmation for any high-risk tool call (payments, locks, data exports).
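The confirmation step in the last item could be sketched as a thin wrapper around tool execution. The tool names, the `run_tool` executor, and the `confirm` callback are all hypothetical; a real agent framework would supply its own equivalents.

```python
from typing import Callable

# Hypothetical tool names chosen for illustration.
HIGH_RISK_TOOLS = {"send_payment", "unlock_door", "export_data"}


def guarded_call(
    tool_name: str,
    args: dict,
    run_tool: Callable[[str, dict], str],
    confirm: Callable[[str, dict], bool],
) -> str:
    """Run a tool, asking the user first when the tool is high risk."""
    if tool_name in HIGH_RISK_TOOLS and not confirm(tool_name, args):
        # The tool is never executed if the user declines.
        return f"BLOCKED: user declined {tool_name}"
    return run_tool(tool_name, args)
```

Keeping the gate outside the model means an injected instruction cannot talk the agent out of asking, because the check does not depend on the model's output.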

Agent Features

Memory

Short-term scratchpad for current session

Planning

Single-turn tool invocation traces (scratchpad)

Tool Use

Function calling (API/tool invocation)

Tool sequencing for data extraction and transmission

Frameworks

ReAct

LangChain-style prompts

OpenAI function calling

Is Agentic

Yes

Architectures

Prompted agents (ReAct)

Fine-tuned function-calling models

Collaboration

No multi-agent coordination evaluated

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/uiuc-kang-lab/InjecAgent

Data URLs

https://github.com/uiuc-kang-lab/InjecAgent

Risks & Boundaries

Limitations

Enhanced setting uses a single fixed hacking prompt; other prompts may behave differently.

Test cases are single-turn and limit attacker instructions to at most two steps.

When Not To Use

For multi-turn attack simulations involving long adversarial dialogues.

To evaluate prompt-injection defenses that require model parameter edits not covered by this benchmark.

Failure Modes

Model outputs that do not follow the ReAct format are excluded and reduce measurement coverage.

Attacker instructions missing required tool parameters can produce false-negative attack failures.
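Both failure modes can be checked mechanically. The sketch below assumes the common ReAct layout with `Action:` and `Action Input:` lines and a JSON object as the input; the benchmark's own parser may differ. `missing_params` is a hypothetical helper for spotting attacker tool calls that lack required arguments.

```python
import json
import re

# Assumed ReAct layout: "Action: <tool>" followed by "Action Input: {json}".
REACT_PATTERN = re.compile(
    r"Action:\s*(?P<tool>[^\n]+)\nAction Input:\s*(?P<args>\{.*\})",
    re.DOTALL,
)


def parse_react_step(output: str):
    """Return (tool, args) if the output has a well-formed action, else None."""
    match = REACT_PATTERN.search(output)
    if match is None:
        return None
    try:
        args = json.loads(match.group("args"))
    except json.JSONDecodeError:
        return None  # input is not valid JSON, so the output counts as invalid
    return match.group("tool").strip(), args


def missing_params(args: dict, required: list) -> list:
    """List required tool parameters that are absent from the call."""
    return sorted(set(required) - set(args))
```

Outputs for which `parse_react_step` returns `None` would be the ones excluded from the valid set.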

Core Entities

Models

GPT-4, GPT-3.5, Llama2-70B, Claude-2, Qwen-72B, Mistral-7B

Metrics

ASR-valid, ASR-all, Sensitivity Rate, Valid Rate
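A minimal sketch of how three of these metrics relate, assuming ASR-valid is the attack success rate over valid outputs only, ASR-all is the rate over all outputs, and Valid Rate is the share of outputs that could be parsed. The `Outcome` record is hypothetical, and Sensitivity Rate is left out because its definition is not given here.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    valid: bool           # output could be parsed as an agent step
    attack_success: bool  # agent carried out the attacker's instruction


def summarize(outcomes: list) -> dict:
    """Compute Valid Rate, ASR-valid, and ASR-all from per-case outcomes."""
    total = len(outcomes)
    valid = sum(o.valid for o in outcomes)
    # Only valid outputs can count as successful attacks.
    success = sum(o.valid and o.attack_success for o in outcomes)
    return {
        "valid_rate": valid / total if total else 0.0,
        "asr_valid": success / valid if valid else 0.0,
        "asr_all": success / total if total else 0.0,
    }
```

Because invalid outputs are dropped from its denominator, ASR-valid is never lower than ASR-all, which is worth keeping in mind when comparing models with very different valid rates.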

Datasets

INJECAGENT

Benchmarks

INJECAGENT

