A practical security and implementation guide for Plan‑then‑Execute LLM agents

September 10, 20257 min

Overview

Decision SnapshotReady For Pilot

The paper combines prior research, security reasoning, and runnable examples to give immediately usable patterns; empirical claims are supported by cited experiments or framework demos.

Citations1

Evidence Strength0.80

Confidence0.85

Risk Signals8

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/2

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 80%

Novelty: 60%

Authors

Ron F. Del Rosario, Klaudia Krawiecka, Christian Schroeder de Witt

Links

Abstract / PDF

Why It Matters For Business

P‑t‑E gives auditable, predictable automation and architectural defenses against prompt injection, lowering risk for regulated or high‑value workflows.

Who Should Care

Summary TLDR

This paper argues that splitting agent behavior into a Planner (one strong LLM) and an Executor (simpler component) yields more predictable, cost‑efficient, and secure LLM agents than reactive loops. It gives a security-first blueprint: lock the plan before running tools, scope tool access per task, sandbox code execution (Docker), and add verifier/HITL and re-planning loops. The guide includes runnable examples and framework-specific recipes for LangGraph, CrewAI, and AutoGen. Use P-t-E for multi-step, auditable workflows; beware upfront token/latency costs and add layered controls.

Problem Statement

LLM agents that call tools can be powerful but are vulnerable, unpredictable, and costly when implemented as stepwise reactive loops. The paper asks: how do architects build agentic systems that are predictable, cost‑effective, and resistant to prompt‑injection and unsafe tool use?

Main Contribution

Clear exposition of the Plan‑then‑Execute (P‑t‑E) pattern: planner vs executor vs optional verifier/refiner

Security blueprint: control-flow integrity, least privilege tool scoping, input/output sanitization, Docker sandboxing, and HITL/Plan‑Validate‑Execute

Key Findings

Plan‑then‑Execute locks control flow before ingesting untrusted tool outputs, reducing risk of indirect prompt injection.

Practical UseDesign the Planner to emit a full plan first; prevent tool outputs from changing the approved action sequence.

Evidence RefSection 2.1

P‑t‑E front-loads LLM reasoning to one or few planner calls instead of one LLM call per action.

NumbersPlanner: 1 call vs ReAct: 1 call per step

Practical UseUse a strong planner LLM sparingly and execute steps with cheaper code or smaller models to cut API cost and latency overall.

Evidence RefSections 1.2, 1.3

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Parallel execution speedupup to 3.6×sequential P-t-E execution3.6× faster on I/O-bound tasksI/O-bound tasks (cited experiment)Section 7.2.1 reports up to 3.6× speed boost from DAG parallelismSection 7.2.1
Planner token consumption3,0004,500 tokens per plan callsingle-step ReAct token usage for simple tasksFront-loaded token costComplex planning prompts (paper estimate)Section 7.4.2 states planning calls can use 3,000–4,500 tokensSection 7.4.2

What To Try In 7 Days

Prototype a Planner + simple Executor for a 3‑step internal workflow to measure token and latency trade-offs

Add task-level tool scoping to an existing multi-tool agent to test least‑privilege enforcement

Enable Dockerized sandboxing for any code-executing agents and run a controlled exploit test on a dev sandbox

Agent Features

Memory
Persistent state object with past_steps historyTypedDict/StateGraph passed between nodes
Planning
Single-shot Planner with structured outputRe-planning loopsDAG-based parallel planningHierarchical/sub-planner decomposition
Tool Use
Task-scoped tool provisioningRBAC-style role mappingGraphQL queries as compact tools
Frameworks
LangChainLangGraphCrewAIAutoGen
Is Agentic

Yes

Architectures
Planner-ExecutorHierarchical (manager-worker)Stateful Graph (LangGraph)
Collaboration
Manager-worker delegation (CrewAI)Group chat orchestration with speaker selection (AutoGen)

Optimization Features

Token Efficiency
GraphQL to reduce returned fields and tokensSub-planners to lower single-call token budgets
Infra Optimization
Docker sandboxing for code executionTiered sandboxing based on task role
System Optimization
DAG-based parallel executionExecutor use of smaller models or deterministic functions

Reproducibility

Code AvailableYes
Data AvailableNo
Open Source StatusPartial
LicenseUnknown

Risks & Boundaries

Limitations

High upfront latency and token cost for the Planner phase

A flawed plan remains dangerous—Plan validation is required for high-risk tasks

When Not To Use

Single-step, low-latency queries where time-to-first-action matters

Simple chatbots or one-off Q&A where ReAct is cheaper and faster

Failure Modes

Convincingly wrong plan approved and executed (human oversight applied too late)

Malicious tool output contaminates data passed between steps, causing downstream harm

Core Entities

Models

GPT-4Claude 3 Opusgpt-4o

Metrics

execution latencytoken consumptionparallel speedup

Context Entities

Models

gpt-4o-mini