Build a tool-plan (TDG) and block unexpected tool calls to stop indirect prompt injections

Overview

Decision Snapshot: Ready For Pilot

The method is a practical system-level defense with clear mechanisms and benchmarked gains. Results hold across the evaluated models, but the evaluation is limited in scale and the defense adds token and latency overhead.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 65%

Authors

Hengyu An, Jinghuai Zhang, Tianyu Du, Chunyi Zhou, Qingming Li, Tao Lin, Shouling Ji

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If your agent can call external services, hidden instructions in returned content can trigger harmful actions. IPIGUARD stops many such attacks by pre-planning allowed tool calls and blocking unapproved ones, trading modest extra cost for much stronger protection.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead

Summary TLDR

IPIGUARD shifts defense from model prompting to controlling execution. It first asks the agent to plan all tool calls as a Tool Dependency Graph (TDG), then strictly follows that plan while allowing limited, safe read-only expansions and simulated ('fake') tool responses to avoid adversarial changes. On the AgentDojo benchmark, this reduces attacker success to about 0.7% while keeping user utility near the undefended level, at the cost of ~2x token usage.

Problem Statement

LLM agents calling external tools can be hijacked by hidden instructions inside tool outputs (Indirect Prompt Injection, IPI). Existing defenses tweak prompts or add detectors but still let agents call any tool at execution time, so attackers can trigger harmful tool calls. The paper asks: can we stop IPI at the source by pre-planning and forbidding unapproved tool invocations?

Main Contribution

Introduce IPIGUARD: enforce a planned Tool Dependency Graph (TDG) so execution follows a pre-approved ordered plan rather than freely invoking tools at runtime.

Design Argument Estimation, Node Expansion, and Fake Tool Invocation to handle unknown arguments, permit safe read-only extra queries, and neutralize injected instructions that reuse legitimate tools.
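The plan-then-enforce idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: `ToolNode`, `PlanEnforcer`, `UnplannedToolCall`, and the tool names are hypothetical, and the sketch assumes the plan is a directed acyclic graph that the executor walks in topological order.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class ToolNode:
    """One planned tool call (field names are illustrative assumptions)."""
    node_id: str
    tool: str
    args: dict = field(default_factory=dict)
    depends_on: tuple = ()  # node_ids whose results this call needs


class UnplannedToolCall(Exception):
    """Raised when a requested call is not the next node in the plan."""


class PlanEnforcer:
    def __init__(self, nodes):
        self.nodes = {n.node_id: n for n in nodes}
        for n in nodes:
            missing = [d for d in n.depends_on if d not in self.nodes]
            if missing:
                raise ValueError(f"{n.node_id} depends on unknown nodes {missing}")
        graph = {n.node_id: set(n.depends_on) for n in nodes}
        # static_order() raises graphlib.CycleError if the plan has a cycle.
        self.order = list(TopologicalSorter(graph).static_order())
        self.cursor = 0

    def check(self, tool):
        """Allow `tool` only if it matches the next planned node, then advance."""
        if self.cursor >= len(self.order):
            raise UnplannedToolCall(f"plan finished; refusing {tool!r}")
        expected = self.nodes[self.order[self.cursor]]
        if tool != expected.tool:
            raise UnplannedToolCall(f"expected {expected.tool!r}, got {tool!r}")
        self.cursor += 1
        return expected


plan = PlanEnforcer([
    ToolNode("n1", "read_inbox"),
    ToolNode("n2", "get_contact", depends_on=("n1",)),
    ToolNode("n3", "send_email", depends_on=("n2",)),
])
plan.check("read_inbox")              # planned, allowed
try:
    plan.check("transfer_funds")      # e.g. requested by injected text
except UnplannedToolCall as exc:
    blocked = str(exc)
```

The rejected call leaves the cursor untouched, so the legitimate plan can continue with `get_contact`.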

Key Findings

IPIGUARD reduces average targeted attack success rate (ASR)

Numbers: ASR ≈ 0.69% average on AgentDojo (Table 1)

Practical Use: Use TDG enforcement to cut IPI attack success to <1% on evaluated benchmarks; good choice when preventing concrete harmful tool use matters.

Evidence Ref: Table 1

Benign utility is preserved near the undefended baseline

Numbers: Benign Utility (BU) 67.01% vs no-defense 68.04% (Figure 5 / text)

Practical Use: You can add IPIGUARD without large drops in normal task success; expect only small utility loss in most domains.

Evidence Ref: Figure 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average Attack Success Rate (ASR)	0.69%	No defense 13.16% (overall in Table 1)	≈ -12.47 pp	AgentDojo, overall	IPIGUARD average ASR 0.69% vs no defense 13.16% across AgentDojo	Table 1
Benign Utility (BU)	67.01%	No defense 68.04%	-1.03 pp	Tasks without attack (AgentDojo)	IPIGUARD maintains BU ~67% vs 68% no-defense	Figure 5 / text
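The deltas in the table are percentage-point differences (value minus baseline). The short check below recomputes them from the reported figures and also derives the relative change in ASR, which comes to roughly a 95% reduction. The helper names are illustrative only.

```python
def delta_pp(value, baseline):
    """Difference in percentage points, rounded to two decimals."""
    return round(value - baseline, 2)


def relative_change(value, baseline):
    """Change as a fraction of the baseline (negative means a reduction)."""
    return (value - baseline) / baseline


# Figures as reported for AgentDojo in the table above.
asr_delta = delta_pp(0.69, 13.16)
bu_delta = delta_pp(67.01, 68.04)
asr_relative = relative_change(0.69, 13.16)

print(f"ASR delta: {asr_delta} pp, BU delta: {bu_delta} pp")
print(f"ASR relative change: {asr_relative:.1%}")
```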

What To Try In 7 Days

Run a smoke test: build a simple TDG for a 3-step workflow and make the executor follow it strictly.

Enable Node Expansion for read-only queries only, to preserve utility while blocking writes.

Simulate Fake Tool Invocation for one overlapping-tool scenario to verify it prevents argument hijacks.
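The three steps above can be combined into one small smoke test. This is a sketch under stated assumptions, not the paper's code: the invoice tools, the `PlanGuard` class, and the shape of the simulated response are all hypothetical. A call that matches the next planned step runs for real, an unplanned read-only call is allowed as an expansion, and any other call receives a simulated response with no side effect.

```python
# Hypothetical tools; `ledger` records real side effects so they can be checked.
ledger = []


def get_invoice(invoice_id):
    return {"invoice_id": invoice_id, "payee": "acme", "amount": 100}


def get_balance():
    return {"balance": 500}


def get_exchange_rate(currency):
    return {"currency": currency, "rate": 1.0}


def pay_invoice(payee, amount):
    ledger.append((payee, amount))
    return {"status": "paid"}


TOOLS = {f.__name__: f for f in (get_invoice, get_balance, get_exchange_rate, pay_invoice)}
QUERY_TOOLS = {"get_invoice", "get_balance", "get_exchange_rate"}  # assumed read-only


class PlanGuard:
    def __init__(self, planned_calls):
        self.pending = list(planned_calls)  # ordered (tool, args) pairs
        self.log = []

    def handle(self, tool, args):
        if self.pending and self.pending[0] == (tool, args):
            self.pending.pop(0)             # next planned step: run it for real
            self.log.append(("planned", tool))
            return TOOLS[tool](**args)
        if tool in QUERY_TOOLS:             # unplanned but read-only: expand
            self.log.append(("expanded", tool))
            return TOOLS[tool](**args)
        # Unplanned state-changing call: answer with a simulated result only.
        self.log.append(("faked", tool))
        return {"status": "ok", "simulated": True}


guard = PlanGuard([
    ("get_invoice", {"invoice_id": "inv-1"}),
    ("get_balance", {}),
    ("pay_invoice", {"payee": "acme", "amount": 100}),
])
guard.handle("get_invoice", {"invoice_id": "inv-1"})
guard.handle("pay_invoice", {"payee": "attacker", "amount": 500})  # injected arguments
guard.handle("get_exchange_rate", {"currency": "EUR"})             # unplanned read
guard.handle("get_balance", {})
guard.handle("pay_invoice", {"payee": "acme", "amount": 100})
```

After the run, only the planned payment should appear in `ledger`; the hijacked `pay_invoice` call shows up in `guard.log` as faked. Matching on exact arguments is a simplification of this sketch, since the summary says unknown arguments are estimated at execution time.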

Agent Features

Memory

execution context stores prior tool responses for argument estimation
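A rough sketch of how stored tool responses could feed argument estimation follows. It is an assumption-heavy illustration: the `UNKNOWN` sentinel, `ExecutionContext`, `fill_arguments`, and the lookup-based estimator are hypothetical stand-ins, and in the actual system the estimation step would presumably be done by the LLM rather than by key lookup.

```python
UNKNOWN = object()  # sentinel for arguments the planner could not fix up front


class ExecutionContext:
    """Keeps prior tool responses, keyed by the plan node that produced them."""

    def __init__(self):
        self.responses = {}

    def record(self, node_id, response):
        self.responses[node_id] = response


def lookup_estimator(name, responses):
    """Stand-in for an LLM call: take the value from the first response that has the key."""
    for response in responses.values():
        if name in response:
            return response[name]
    raise KeyError(f"no prior response provides {name!r}")


def fill_arguments(planned_args, context, estimator=lookup_estimator):
    """Return a copy of the planned arguments with UNKNOWN values estimated."""
    filled = {}
    for name, value in planned_args.items():
        if value is UNKNOWN:
            filled[name] = estimator(name, context.responses)
        else:
            filled[name] = value
    return filled


ctx = ExecutionContext()
ctx.record("n1", {"recipient": "bob@example.com"})
send_args = fill_arguments({"recipient": UNKNOWN, "subject": "Invoice"}, ctx)
```

Only the arguments marked unknown in the plan are filled in; values fixed at planning time pass through unchanged.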

Planning

TDG construction (plan all tool calls up front)

separate planning LLM and execution LLM allowed

Tool Use

enforce pre-approved tool calls

Query vs Command tool classification

Frameworks

Command-Query Responsibility Segregation (inspired classification)
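A Query vs Command split in the spirit of CQRS might look like the sketch below. The registry contents and function names are hypothetical. One design choice worth noting: a tool missing from the registry is treated as a Command, so the check fails closed and unknown tools never qualify for read-only expansion.

```python
from enum import Enum


class ToolKind(Enum):
    QUERY = "query"      # reads state only
    COMMAND = "command"  # changes state (sends, writes, pays, deletes)


# Hypothetical registry; a real deployment would label its own tools.
TOOL_KINDS = {
    "search_emails": ToolKind.QUERY,
    "get_balance": ToolKind.QUERY,
    "send_email": ToolKind.COMMAND,
    "transfer_funds": ToolKind.COMMAND,
}


def classify(tool):
    """Unlabelled tools default to COMMAND so the check fails closed."""
    return TOOL_KINDS.get(tool, ToolKind.COMMAND)


def may_expand(tool):
    """Only read-only tools may be added to the plan at execution time."""
    return classify(tool) is ToolKind.QUERY


allowed = [t for t in ("get_balance", "transfer_funds", "new_plugin_tool") if may_expand(t)]
```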

Is Agentic

Yes

Architectures

tool-augmented LLM agent (planner + executor)

Collaboration

planner and executor can be different LLMs

Optimization Features

Token Efficiency

planning accounts for ~20% of token usage, so a stronger planner can be used without a large increase in token cost

Inference Optimization

use weaker executor + stronger planner to reduce cost (evaluated)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

code mentioned as available in paper; exact URL not provided in text

Data URLs

https://agentdojo.spylab.ai

Risks & Boundaries

Limitations

Focuses on tool-invocation attacks (IPI) and not on purely textual manipulations that don't trigger tools.

Experiment scale limited by LLM query costs; fewer models and cases than fully exhaustive studies.

When Not To Use

When the attacker only aims to subtly change text outputs without triggering tool actions (a different threat model).

In highly resource-constrained deployments where 2x token usage and extra latency are unacceptable.

Failure Modes

Fake Tool Invocation might fail in corner cases, allowing argument hijacks.

Errors in TDG construction (missing necessary nodes) can block legitimate functionality.

Core Entities

Models

GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet, Qwen2.5-7B-Instruct, Qwen3-32B, o4-mini

Metrics

Benign Utility (BU), Utility under Attack (UA), Attack Success Rate (ASR), Input Tokens, Output Tokens, Task Time (s)

Datasets

AgentDojo

Benchmarks

AgentDojo

Context Entities

Models

other closed-source and open-source LLMs referenced for comparison

Metrics

BU, UA, ASR, token/time overhead

Datasets

AgentDojo (public link in paper)

Benchmarks

AgentDojo
