AutoPDL: AutoML that finds and returns editable, executable prompt programs for LLM agents

Overview

Decision Snapshot: Ready For Pilot

AutoPDL is practical: it produces editable prompt programs and shows clear gains, but evaluating many candidates requires non-trivial compute, and results depend on the quality of the demonstration templates.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Claudio Spiess, Mandana Vaziri, Louis Mandel, Martin Hirzel

Links

Abstract / PDF / Code

Why It Matters For Business

AutoPDL automates prompt and agent-pattern selection and outputs editable, executable prompt programs, reducing manual tuning time and allowing reuse across models while sometimes improving accuracy substantially.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, CTO, Founder

Summary TLDR

AutoPDL frames prompt engineering as an AutoML problem: it jointly searches over prompting patterns (Zero-Shot, CoT, ReAct, ReWOO) and prompt content (instructions and few-shot examples), both expressed in a human-editable PDL program. It uses successive halving to prune candidates and returns executable PDL programs. Evaluated on FEVER, GSM8K, GSM-Hard and MBPP+ across seven models (3B–70B), it yields an average accuracy gain of 9.21 ± 15.46 percentage points over the zero-shot baseline, with a maximum gain of 67.5 pp for a single model-task pair. The optimized PDL programs are human-readable and, in some cases, transfer to stronger models.

Problem Statement

Prompt performance depends on both the high-level prompting pattern and the concrete prompt content. Manual tuning is slow, model-specific, and hard to reuse. The problem is to automatically find a prompting pattern and prompt (including few-shot examples) that minimize task loss, and to return a result that is both executable and editable by humans.
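One way to write the objective described above is as a joint minimization over patterns and prompt content. The notation is an assumption introduced here for illustration (it is not taken from the paper): \mathcal{P} is the set of prompting patterns, \mathcal{C}_{p} the prompt contents available for pattern p, f_{p,c} the resulting prompt program, \ell the task loss, and D_{\mathrm{val}} a validation set.

```latex
\[
(p^{*}, c^{*}) \in \operatorname*{arg\,min}_{p \in \mathcal{P},\; c \in \mathcal{C}_{p}}
\; \frac{1}{|D_{\mathrm{val}}|} \sum_{(x, y) \in D_{\mathrm{val}}} \ell\bigl(f_{p,c}(x),\, y\bigr)
\]
```

The requirement that the result be executable and editable is a constraint on the form of f_{p^{*},c^{*}} (a PDL program) rather than a term in the loss.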

Main Contribution

Formulate joint search over prompting patterns and prompt content and solve it with AutoML.

Provide a pattern library and a PDL-based source-to-source optimizer that outputs human-readable, executable prompt programs.

Key Findings

AutoPDL improved accuracy on evaluated benchmarks on average.

Numbers: Mean gain 9.21 ± 15.46 percentage points (accuracy)

Practical Use: Run AutoPDL when you want a likely accuracy improvement over zero-shot on similar classification or generation tasks.

Evidence Ref: Abstract; Table 1

The largest single improvement observed was huge but rare.

Numbers: Up to +67.5 percentage points (Granite 13B Instruct V2 on FEVER)

Practical Use: For weak or misaligned models, AutoPDL can sometimes produce very large gains, so it is worth trying when the baseline is poor.

Evidence Ref: Table 1 (FEVER: Granite 13B Instruct V2)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Mean accuracy gain	9.21 ± 15.46 pp	Zero-shot	+9.21 pp	Aggregate across FEVER, GSM8K, MBPP+	Abstract; Table 1	Abstract; Table 1
Maximum single-run improvement	+67.5 pp	Zero-shot	+67.5 pp	FEVER, Granite 13B Instruct V2	Table 1 (FEVER)	Table 1

What To Try In 7 Days

Clone the PDL repo and run AutoPDL on one model-task pair to compare to your zero-shot baseline.

Inspect and hand-edit the returned PDL program to match your toolchain and rerun on validation data.

If you have low resources, optimize on an open model and test the result once on a frontier API.
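The baseline comparison in the first two steps can be sketched as a small evaluation harness. This is a minimal sketch under stated assumptions: `zero_shot_stub` and `optimized_stub` are hypothetical stand-ins for calling your zero-shot prompt and the returned PDL program, the validation examples are made-up placeholders, and exact string match is assumed as the correctness check.

```python
from typing import Callable, Dict, Sequence, Tuple

Example = Tuple[str, str]  # (input text, expected answer)


def accuracy(program: Callable[[str], str], dataset: Sequence[Example]) -> float:
    """Fraction of examples whose prediction exactly matches the expected answer."""
    correct = sum(program(x).strip() == y for x, y in dataset)
    return correct / len(dataset)


def compare(
    baseline: Callable[[str], str],
    optimized: Callable[[str], str],
    dataset: Sequence[Example],
) -> Dict[str, float]:
    """Report both accuracies and the gain in percentage points."""
    base = accuracy(baseline, dataset)
    opt = accuracy(optimized, dataset)
    return {"baseline": base, "optimized": opt, "delta_pp": round((opt - base) * 100, 2)}


# Hypothetical validation set: placeholder claim ids with FEVER-style labels.
VALIDATION = [
    ("claim-1", "SUPPORTS"),
    ("claim-2", "REFUTES"),
    ("claim-3", "SUPPORTS"),
    ("claim-4", "REFUTES"),
]


def zero_shot_stub(x: str) -> str:
    # Stand-in for the zero-shot prompt: always predicts the same label.
    return "SUPPORTS"


_CANNED = {"claim-1": "SUPPORTS", "claim-2": "REFUTES", "claim-3": "SUPPORTS"}


def optimized_stub(x: str) -> str:
    # Stand-in for running the optimized prompt program.
    return _CANNED.get(x, "SUPPORTS")


report = compare(zero_shot_stub, optimized_stub, VALIDATION)
print(report)
```

Keeping the validation set separate from the examples used during optimization avoids overstating the gain.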

Agent Features

Memory

Few-shot demonstration bank (example bank)

In-context learning (no fine-tuning)

Planning

Agentic TAO loop (Thought-Action-Observation)

ReWOO (reasoning without observations)

Tool Use

Calc (SymPy evaluator)

Search (Wikipedia summary)

Execute (Python execution)

Finish (end trajectory)
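The Thought-Action-Observation loop with a Calc tool and a Finish action can be illustrated with a small sketch. It is not the AutoPDL or PDL implementation: `scripted_model` is a hypothetical stand-in for an LLM call, and the Calc tool here swaps the SymPy evaluator for a tiny standard-library arithmetic evaluator so the example has no dependencies.

```python
import ast
import operator
from typing import Callable, Dict, List, Tuple

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def calc(expression: str) -> str:
    """Evaluate +, -, *, / arithmetic by walking the AST (no eval())."""
    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expression!r}")
    return str(walk(ast.parse(expression, mode="eval").body))


TOOLS: Dict[str, Callable[[str], str]] = {"Calc": calc}
Step = Tuple[str, str, str]  # (thought, action name, action argument)


def run_tao_loop(model: Callable[[List[str]], Step], max_steps: int = 5) -> Tuple[str, List[str]]:
    """Alternate model steps and tool calls until the model emits Finish."""
    trajectory: List[str] = []
    for _ in range(max_steps):
        thought, action, argument = model(trajectory)
        trajectory.append(f"Thought: {thought}")
        trajectory.append(f"Action: {action}[{argument}]")
        if action == "Finish":
            return argument, trajectory
        tool = TOOLS.get(action)
        observation = tool(argument) if tool else f"unknown tool {action}"
        trajectory.append(f"Observation: {observation}")
    return "", trajectory  # step limit reached without Finish


def scripted_model(trajectory: List[str]) -> Step:
    # Hypothetical stand-in for an LLM: calls Calc once, then finishes
    # with whatever the last observation was.
    if not trajectory:
        return ("I need to compute the total.", "Calc", "12 * 4 + 2")
    last_observation = trajectory[-1].split(": ", 1)[1]
    return ("I have the result.", "Finish", last_observation)


answer, trajectory = run_tao_loop(scripted_model)
print(answer)
```

The `max_steps` cap is a design choice in this sketch to keep a model that never emits Finish from looping indefinitely.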

Frameworks

PDL (Prompt Declaration Language)

AutoPDL

Is Agentic

Yes

Architectures

PDL prompt programs

LLM agent patterns (ReAct, ReWOO, CoT, Zero-Shot)

Collaboration

Human-in-the-loop editing of PDL programs

Optimization Features

Token Efficiency

Limit number of demos to {0,3,5} to control context size
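To see how a small set of demo counts bounds both the search space and the prompt size, the sketch below enumerates pattern and demo-count combinations. The pattern names come from this summary; pairing Zero-Shot only with 0 demos and the token figures (`base_tokens`, `tokens_per_demo`) are assumptions chosen for illustration, not values from the paper.

```python
from itertools import product
from typing import Iterator, List, Sequence, Tuple

PATTERNS = ["zero_shot", "cot", "react", "rewoo"]
DEMO_COUNTS = [0, 3, 5]


def enumerate_candidates(
    patterns: Sequence[str], demo_counts: Sequence[int]
) -> Iterator[Tuple[str, int]]:
    """Yield (pattern, num_demos) pairs for the search space."""
    for pattern, n in product(patterns, demo_counts):
        if pattern == "zero_shot" and n != 0:
            continue  # assumption: zero-shot carries no demonstrations
        yield pattern, n


def estimated_prompt_tokens(num_demos: int, base_tokens: int = 150, tokens_per_demo: int = 200) -> int:
    """Rough prompt size: fixed instructions plus a per-demo cost (hypothetical figures)."""
    return base_tokens + num_demos * tokens_per_demo


candidates: List[Tuple[str, int]] = list(enumerate_candidates(PATTERNS, DEMO_COUNTS))
for pattern, n in candidates:
    print(pattern, n, estimated_prompt_tokens(n))
```

Under these assumptions the space has ten candidates, and the largest prompt is bounded by the 5-demo case.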

System Optimization

Source-to-source optimization: search space and result are PDL programs

Successive halving to prune candidates cheaply
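Successive halving can be sketched in a few lines: score every surviving candidate on a small batch of examples, keep the better half, and double the batch for the next round. The version below is a generic illustration, not AutoPDL's code; the candidate names, `HYPOTHETICAL_QUALITY`, and `toy_score` are made up so the example runs deterministically without an LLM.

```python
from typing import Callable, Dict, List, Sequence


def successive_halving(
    candidates: Sequence[str],
    score: Callable[[str, Sequence[int]], float],
    examples: Sequence[int],
    initial_budget: int = 4,
) -> str:
    """Return the candidate that survives repeated halving.

    Each rung scores the survivors on a prefix of `examples`, keeps the
    better-scoring half, and doubles the prefix length.
    """
    survivors: List[str] = list(candidates)
    budget = initial_budget
    while len(survivors) > 1:
        subset = examples[: min(budget, len(examples))]
        ranked = sorted(survivors, key=lambda c: score(c, subset), reverse=True)
        survivors = ranked[: max(1, len(ranked) // 2)]
        budget *= 2
    return survivors[0]


# Hypothetical per-candidate quality, standing in for real LLM evaluations.
HYPOTHETICAL_QUALITY: Dict[str, float] = {
    "zero_shot": 0.3,
    "cot_3": 0.5,
    "react_3": 0.7,
    "rewoo_5": 0.6,
}


def toy_score(candidate: str, subset: Sequence[int]) -> float:
    # Example i counts as "correct" when (i % 10) / 10 is below the quality.
    quality = HYPOTHETICAL_QUALITY[candidate]
    return sum((i % 10) / 10 < quality for i in subset) / len(subset)


best = successive_halving(list(HYPOTHETICAL_QUALITY), toy_score, list(range(40)), initial_budget=10)
print(best)
```

Weak candidates are dropped after seeing only a few examples, so most of the evaluation budget goes to the candidates still in contention.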

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/IBM/prompt-declaration-language

Risks & Boundaries

Limitations

Optimization is compute-intensive: it evaluates many candidates, each with multiple LLM calls.

The search space is limited (for example, demo counts are restricted to {0, 3, 5}); other choices may yield better results.
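A rough budget estimate helps decide whether the compute cost is acceptable. The sketch below counts LLM calls under an assumed schedule in which each rung halves the candidates and doubles the examples. Both the schedule and `calls_per_example` are assumptions for illustration; agentic patterns may need several calls per example.

```python
def successive_halving_calls(num_candidates: int, initial_examples: int, calls_per_example: int = 1) -> int:
    """Total LLM calls when each rung halves candidates and doubles examples."""
    total = 0
    n, m = num_candidates, initial_examples
    while n > 1:
        total += n * m * calls_per_example
        n //= 2
        m *= 2
    return total


def exhaustive_calls(num_candidates: int, examples: int, calls_per_example: int = 1) -> int:
    """Total LLM calls when every candidate sees every example."""
    return num_candidates * examples * calls_per_example


# 16 candidates, starting at 10 examples: rungs of 16x10, 8x20, 4x40, 2x80.
halving = successive_halving_calls(16, 10)
full = exhaustive_calls(16, 80)
print(halving, full)
```

In this example every rung costs the same 160 calls, so the total grows with the number of rungs rather than with the number of candidates times the full example count.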

When Not To Use

You lack budget for repeated LLM calls or large-scale candidate evaluation.

You need instant deployment without time to validate optimized prompts.

Failure Modes

Optimizer returns zero-shot baseline when search space lacks helpful options.

Template-generated demonstrations mislead the search and degrade performance.

Core Entities

Models

LLaMA 3.1 8B

LLaMA 3.2 3B

LLaMA 3.3 70B

Granite 3.1 8B

Granite 13B Instruct V2

Granite 20B Code

Granite 34B Code

gpt-4o-mini-2024-07-18

Metrics

Accuracy

Absolute percentage-point delta
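An absolute percentage-point delta is not the same as a relative improvement, and the difference matters when reading the reported gains. The sketch below computes both and aggregates deltas as a mean with a spread. The accuracy pairs are hypothetical, and reading the "±" as a sample standard deviation is an assumption, since this summary does not say how the spread was computed.

```python
from statistics import mean, stdev
from typing import List, Tuple


def pp_delta(baseline_acc: float, optimized_acc: float) -> float:
    """Absolute change in accuracy, in percentage points (inputs in [0, 1])."""
    return (optimized_acc - baseline_acc) * 100


def relative_change(baseline_acc: float, optimized_acc: float) -> float:
    """Change as a percentage of the baseline accuracy."""
    return (optimized_acc - baseline_acc) / baseline_acc * 100


# Hypothetical (baseline, optimized) accuracies for three model-task pairs.
runs: List[Tuple[float, float]] = [(0.50, 0.52), (0.40, 0.44), (0.20, 0.29)]

deltas = [pp_delta(b, o) for b, o in runs]
summary = (mean(deltas), stdev(deltas))  # mean gain and sample standard deviation, in pp

print(f"mean gain: {summary[0]:.2f} pp, spread: {summary[1]:.2f} pp")
print(f"last pair: {deltas[-1]:.1f} pp absolute, {relative_change(*runs[-1]):.0f}% relative")
```

When the spread is larger than the mean, gains vary widely across model-task pairs, so a single pilot result may land far from the average.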

Datasets

FEVER

GSM8K

GSM-Hard

MBPP+

MBPP

Benchmarks

FEVER (binary fact verification)

GSM8K (grade-school math)

GSM-Hard (hardified GSM8K)

MBPP+ (code generation with extended tests)
