AutoPDL: AutoML that finds and returns editable, executable prompt programs for LLM agents

April 6, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

AutoPDL is practical: it produces editable prompt programs and shows clear gains, but it needs non-trivial compute to evaluate many candidates and depends on the quality of the demonstration templates.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Claudio Spiess, Mandana Vaziri, Louis Mandel, Martin Hirzel

Links

Abstract / PDF / Code

Why It Matters For Business

AutoPDL automates prompt and agent-pattern selection and outputs editable, executable prompt programs, reducing manual tuning time and allowing reuse across models while sometimes improving accuracy substantially.

Who Should Care

Summary TLDR

AutoPDL frames prompt engineering as an AutoML problem: it jointly searches prompting patterns (Zero-Shot, CoT, ReAct, ReWOO) and prompt content (instructions and few-shot examples) expressed in a human-editable PDL program. It uses successive halving to prune candidates and returns executable PDL programs. Evaluated on FEVER, GSM8K, GSM-Hard and MBPP+ across seven models (3B–70B), it yields an average accuracy gain of 9.21 ± 15.46 percentage points over zero-shot, and up to 67.5 pp on a single run. Optimized PDL programs are human-readable and can transfer to stronger models in some cases.

Problem Statement

Prompt performance depends on both the high-level prompting pattern and the concrete prompt content. Manual tuning is slow, model-specific, and hard to reuse. The problem is to automatically find a prompting pattern and prompt (including few-shot examples) that minimize task loss, and to return a result that is both executable and editable by humans.
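One way to write this objective down is sketched below. The symbols are introduced here for illustration only and are not notation taken from the paper: $\mathcal{P}$ is the set of prompting patterns, $\mathcal{C}_p$ the set of prompt contents (instructions and few-shot examples) usable with pattern $p$, $D$ a validation set, $f_M$ the output of model $M$, and $\ell$ a task loss such as 0/1 error.

```latex
(p^{\star}, c^{\star})
  = \arg\min_{p \in \mathcal{P},\; c \in \mathcal{C}_p}
    \frac{1}{|D|} \sum_{(x, y) \in D}
    \ell\bigl(f_M(x;\, p, c),\; y\bigr)
```

The second requirement, that the result be executable and editable, is a constraint on the form of $(p^{\star}, c^{\star})$ rather than on the loss: the optimum is returned as a program, not as an opaque set of parameters.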

Main Contribution

Formulate joint search over prompting patterns and prompt content and solve it with AutoML.

Provide a pattern library and a PDL-based source-to-source optimizer that outputs human-readable, executable prompt programs.
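To make the joint search concrete, here is a minimal sketch of how a candidate space over patterns and demonstration counts could be enumerated. The `Candidate` class and `enumerate_candidates` function are hypothetical names, not part of the PDL or AutoPDL API, and the rule that zero-shot only pairs with zero demonstrations is an assumption made for this example.

```python
from dataclasses import dataclass
from itertools import product

# Pattern names and demo counts as listed in this summary.
PATTERNS = ["zero-shot", "cot", "react", "rewoo"]
DEMO_COUNTS = [0, 3, 5]


@dataclass(frozen=True)
class Candidate:
    """One point in the (hypothetical) search space."""
    pattern: str
    num_demos: int


def enumerate_candidates(patterns=PATTERNS, demo_counts=DEMO_COUNTS):
    """Return every pattern/demo-count pair that makes sense."""
    candidates = []
    for pattern, k in product(patterns, demo_counts):
        # Assumption: a zero-shot prompt carries no demonstrations.
        if pattern == "zero-shot" and k != 0:
            continue
        candidates.append(Candidate(pattern, k))
    return candidates


if __name__ == "__main__":
    for candidate in enumerate_candidates():
        print(candidate)
```

In a real run each candidate would also carry instruction text and a specific choice of examples, so the space is larger than this grid suggests.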

Key Findings

On average, AutoPDL improved accuracy across the evaluated benchmarks.

Numbers: Mean gain 9.21 ± 15.46 percentage points (accuracy)

Practical Use: Running AutoPDL is likely to improve accuracy over zero-shot on similar classification and generation tasks.

Evidence Ref: Abstract; Table 1

The largest single improvement was very large, but gains of that size were rare.

Numbers: Up to +67.5 percentage points (Granite 13B Instruct V2 on FEVER)

Practical Use: For weak or misaligned models, AutoPDL can sometimes produce very large gains, so it is worth trying when the baseline is poor.

Evidence Ref: Table 1 (FEVER: Granite 13B Instruct V2)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 9.21 ± 15.46 pp | Zero-shot | +9.21 pp | Aggregate across FEVER, GSM8K, MBPP+ | Abstract; Table 1 | Abstract; Table 1
Maximum single-run improvement | +67.5 pp | Zero-shot | +67.5 pp | FEVER, Granite 13B Instruct V2 | Table 1 (FEVER) | Table 1
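The Delta column is an absolute difference in percentage points, not a relative change. The sketch below shows that arithmetic and how a "mean ± spread" figure could be aggregated from per-run deltas. It assumes the ± denotes a sample standard deviation, which this summary does not state, and the accuracy pairs are made-up placeholders, not results from the paper.

```python
from statistics import mean, stdev


def pp_delta(baseline_pct, optimized_pct):
    """Absolute gain in percentage points (both inputs are accuracies in %)."""
    return optimized_pct - baseline_pct


def summarize(deltas):
    """Mean and sample standard deviation of per-run deltas."""
    return mean(deltas), stdev(deltas)


# Placeholder (baseline %, optimized %) pairs, purely illustrative.
PLACEHOLDER_RUNS = [(50.0, 62.0), (70.0, 70.0), (20.0, 65.0)]

if __name__ == "__main__":
    deltas = [pp_delta(b, o) for b, o in PLACEHOLDER_RUNS]
    avg, spread = summarize(deltas)
    print(f"deltas = {deltas}")
    print(f"mean gain = {avg:.2f} ± {spread:.2f} pp")
```

A spread larger than the mean, as in the reported 9.21 ± 15.46, indicates that gains vary widely between model-task pairs and that some runs gain little or nothing.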

What To Try In 7 Days

Clone the PDL repo and run AutoPDL on one model-task pair to compare to your zero-shot baseline.

Inspect and hand-edit the returned PDL program to match your toolchain and rerun on validation data.

If resources are limited, optimize on an open model and test the result once on a frontier API.

Agent Features

Memory
Few-shot demonstration bank (example bank); In-context learning (no fine-tuning)
Planning
Agentic TAO loop (Thought-Action-Observation); ReWOO (reasoning without observations)
Tool Use
Calc (SymPy evaluator); Search (Wikipedia summary); Execute (Python execution); Finish (end trajectory)
Frameworks
PDL (Prompt Declaration Language); AutoPDL
Is Agentic

Yes

Architectures
PDL prompt programs; LLM agent patterns (ReAct, ReWOO, CoT, Zero-Shot)
Collaboration
Human-in-the-loop editing of PDL programs
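
The Thought-Action-Observation loop and the Calc and Finish tools listed above can be illustrated with a small sketch. This is not PDL and not the paper's implementation: `run_tao_loop`, the `Action: Name[argument]` text format, and `scripted_model` are assumptions made for the example, and `calc` is a toy arithmetic evaluator standing in for the SymPy-based tool.

```python
import ast
import operator
import re

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def calc(expression):
    """Toy stand-in for the Calc tool: +, -, *, / on numeric literals only."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError(f"unsupported expression: {expression!r}")
    return str(ev(ast.parse(expression, mode="eval").body))


TOOLS = {"Calc": calc}
ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*)\]")


def run_tao_loop(model, question, max_steps=5):
    """Alternate model steps and tool observations until Finish or the step cap."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)              # Thought + Action text
        transcript += step + "\n"
        match = ACTION_RE.search(step)
        if match is None:                     # no parsable action: stop
            break
        name, arg = match.groups()
        if name == "Finish":                  # Finish ends the trajectory
            return arg, transcript
        tool = TOOLS.get(name)
        observation = tool(arg) if tool else f"unknown tool {name}"
        transcript += f"Observation: {observation}\n"
    return None, transcript


def scripted_model(transcript):
    """Hypothetical stand-in for an LLM call, so the loop runs offline."""
    if "Observation:" not in transcript:
        return "Thought: I should compute the product.\nAction: Calc[6 * 7]"
    last_observation = transcript.strip().splitlines()[-1].split("Observation: ", 1)[1]
    return f"Thought: I have the result.\nAction: Finish[{last_observation}]"


if __name__ == "__main__":
    answer, trace = run_tao_loop(scripted_model, "What is 6 times 7?")
    print(trace)
    print("answer:", answer)
```

ReWOO differs in that the model plans all tool calls up front and the observations are filled in afterwards, rather than being fed back after each step.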

Optimization Features

Token Efficiency
Limit number of demos to {0,3,5} to control context size
System Optimization
Source-to-source optimization: search space and result are PDL programs; Successive halving to prune candidates cheaply
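
Successive halving, mentioned above, scores all candidates on a small budget, keeps the better half, and doubles the budget for the survivors. The sketch below is a generic version of that idea, not AutoPDL's code: the function name, the `evaluate(candidate, example)` callback returning a score in [0, 1], and the budget schedule are all assumptions.

```python
import random


def successive_halving(candidates, evaluate, examples, initial_budget=4, seed=0):
    """Return the candidate that survives repeated halving rounds.

    evaluate(candidate, example) is a caller-supplied scoring function
    (higher is better); in practice it would involve LLM calls.
    """
    if not candidates or not examples:
        raise ValueError("need at least one candidate and one example")
    rng = random.Random(seed)
    pool = list(examples)
    rng.shuffle(pool)                       # fixed seed keeps runs repeatable
    survivors = list(candidates)
    budget = initial_budget
    while len(survivors) > 1:
        batch = pool[:min(budget, len(pool))]
        scored = [
            (sum(evaluate(c, ex) for ex in batch) / len(batch), c)
            for c in survivors
        ]
        # Stable sort: on ties, earlier candidates keep their position.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        keep = max(1, len(survivors) // 2)  # drop the worse half
        survivors = [c for _, c in scored[:keep]]
        budget *= 2                         # survivors get more examples
    return survivors[0]


if __name__ == "__main__":
    # Placeholder qualities, purely illustrative.
    quality = {"a": 0.2, "b": 0.5, "c": 0.9, "d": 0.7}
    best = successive_halving(list(quality), lambda c, ex: quality[c], range(40))
    print("selected:", best)
```

The saving comes from spending most evaluations on the few candidates that survive early rounds. The trade-off is that a strong candidate can be dropped early if the first small batch happens to be unrepresentative.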

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Optimization is compute-intensive: evaluates many candidates with multiple LLM calls.

The search space is limited (demo counts restricted to {0, 3, 5}); other choices may yield better results.

When Not To Use

You lack budget for repeated LLM calls or large-scale candidate evaluation.

You need instant deployment without time to validate optimized prompts.

Failure Modes

Optimizer returns zero-shot baseline when search space lacks helpful options.

Template-generated demonstrations mislead the search and degrade performance.

Core Entities

Models

LLaMA 3.1 8B; LLaMA 3.2 3B; LLaMA 3.3 70B; Granite 3.1 8B; Granite 13B Instruct V2; Granite 20B Code; Granite 34B Code; gpt-4o-mini-2024-07-18

Metrics

Accuracy; Absolute percentage-point delta

Datasets

FEVER; GSM8K; GSM-Hard; MBPP+; MBPP

Benchmarks

FEVER (binary fact verification); GSM8K (grade-school math); GSM-Hard (hardified GSM8K); MBPP+ (code generation with extended tests)