A practical security and implementation guide for Plan‑then‑Execute LLM agents

Overview

Decision SnapshotReady For Pilot

The paper combines prior research, security reasoning, and runnable examples to give immediately usable patterns; empirical claims are supported by cited experiments or framework demos.

Citations1

Evidence Strength0.80

Confidence0.85

Risk Signals8

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/2

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 80%

Novelty: 60%

Authors

Ron F. Del Rosario, Klaudia Krawiecka, Christian Schroeder de Witt

Links

Abstract / PDF

Why It Matters For Business

P‑t‑E gives auditable, predictable automation and architectural defenses against prompt injection, lowering risk for regulated or high‑value workflows.

Who Should Care

CTO Product Manager ML Engineer Engineering Lead Data Scientist

Summary TLDR

This paper argues that splitting agent behavior into a Planner (one strong LLM) and an Executor (simpler component) yields more predictable, cost‑efficient, and secure LLM agents than reactive loops. It gives a security-first blueprint: lock the plan before running tools, scope tool access per task, sandbox code execution (Docker), and add verifier/HITL and re-planning loops. The guide includes runnable examples and framework-specific recipes for LangGraph, CrewAI, and AutoGen. Use P-t-E for multi-step, auditable workflows; beware upfront token/latency costs and add layered controls.

Problem Statement

LLM agents that call tools can be powerful but are vulnerable, unpredictable, and costly when implemented as stepwise reactive loops. The paper asks: how do architects build agentic systems that are predictable, cost‑effective, and resistant to prompt‑injection and unsafe tool use?

Main Contribution

Clear exposition of the Plan‑then‑Execute (P‑t‑E) pattern: planner vs executor vs optional verifier/refiner

Security blueprint: control-flow integrity, least privilege tool scoping, input/output sanitization, Docker sandboxing, and HITL/Plan‑Validate‑Execute

Key Findings

Plan‑then‑Execute locks control flow before ingesting untrusted tool outputs, reducing risk of indirect prompt injection.

Practical UseDesign the Planner to emit a full plan first; prevent tool outputs from changing the approved action sequence.

Evidence RefSection 2.1

P‑t‑E front-loads LLM reasoning to one or few planner calls instead of one LLM call per action.

NumbersPlanner: 1 call vs ReAct: 1 call per step

Practical UseUse a strong planner LLM sparingly and execute steps with cheaper code or smaller models to cut API cost and latency overall.

Evidence RefSections 1.2, 1.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Parallel execution speedup	up to 3.6×	sequential P-t-E execution	3.6× faster on I/O-bound tasks	I/O-bound tasks (cited experiment)	Section 7.2.1 reports up to 3.6× speed boost from DAG parallelism	Section 7.2.1
Planner token consumption	3,000–4,500 tokens per plan call	single-step ReAct token usage for simple tasks	Front-loaded token cost	Complex planning prompts (paper estimate)	Section 7.4.2 states planning calls can use 3,000–4,500 tokens	Section 7.4.2

What To Try In 7 Days

Prototype a Planner + simple Executor for a 3‑step internal workflow to measure token and latency trade-offs

Add task-level tool scoping to an existing multi-tool agent to test least‑privilege enforcement

Enable Dockerized sandboxing for any code-executing agents and run a controlled exploit test on a dev sandbox

Agent Features

Memory

Persistent state object with past_steps historyTypedDict/StateGraph passed between nodes

Planning

Single-shot Planner with structured outputRe-planning loopsDAG-based parallel planningHierarchical/sub-planner decomposition

Tool Use

Task-scoped tool provisioningRBAC-style role mappingGraphQL queries as compact tools

Frameworks

LangChainLangGraphCrewAIAutoGen

Is Agentic

Yes

Architectures

Planner-ExecutorHierarchical (manager-worker)Stateful Graph (LangGraph)

Collaboration

Manager-worker delegation (CrewAI)Group chat orchestration with speaker selection (AutoGen)

Optimization Features

Token Efficiency

GraphQL to reduce returned fields and tokensSub-planners to lower single-call token budgets

Infra Optimization

Docker sandboxing for code executionTiered sandboxing based on task role

System Optimization

DAG-based parallel executionExecutor use of smaller models or deterministic functions

Reproducibility

Code AvailableYes

Data AvailableNo

Open Source StatusPartial

LicenseUnknown

Risks & Boundaries

Limitations

High upfront latency and token cost for the Planner phase

A flawed plan remains dangerous—Plan validation is required for high-risk tasks

When Not To Use

Single-step, low-latency queries where time-to-first-action matters

Simple chatbots or one-off Q&A where ReAct is cheaper and faster

Failure Modes

Convincingly wrong plan approved and executed (human oversight applied too late)

Malicious tool output contaminates data passed between steps, causing downstream harm

A practical security and implementation guide for Plan‑then‑Execute LLM agents

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Plan‑then‑Execute locks control flow before ingesting untrusted tool outputs, reducing risk of indirect prompt injection.

P‑t‑E front-loads LLM reasoning to one or few planner calls instead of one LLM call per action.

Results

What To Try In 7 Days

Agent Features

Optimization Features

Reproducibility

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Context Entities

Models

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Plan‑then‑Execute locks control flow before ingesting untrusted tool outputs, reducing risk of indirect prompt injection.

P‑t‑E front-loads LLM reasoning to one or few planner calls instead of one LLM call per action.

Results

What To Try In 7 Days

Agent Features

Optimization Features

Reproducibility

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Context Entities

Models

You May Also Want to Read

Survey: Reframe LLMs as agents that plan, act, and continually learn

Key finding

Reference architecture, multi-agent taxonomy, and enterprise hardening for LLM agents

Key finding

Systematizes reusable 'agentic skills' for LLM agents, their lifecycle, design patterns, risks, and evaluation

Key finding

A closed-loop Sensing→Regulating→Correcting system that routes LLM execution by uncertainty to cut errors and API cost

Key finding

Diffusion-backed agents match accuracy but run ~30% faster and can reach up to 8× speedups in some cases

Key finding