AutoPDL: AutoML that finds and returns editable, executable prompt programs for LLM agents

April 6, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Claudio Spiess, Mandana Vaziri, Louis Mandel, Martin Hirzel


Why It Matters For Business

AutoPDL automates prompt and agent-pattern selection and outputs editable, executable prompt programs, reducing manual tuning time and allowing reuse across models while sometimes improving accuracy substantially.

Summary TLDR

AutoPDL frames prompt engineering as an AutoML problem: it jointly searches over prompting patterns (Zero-Shot, CoT, ReAct, ReWOO) and prompt content (instructions and few-shot examples), all expressed as a human-editable PDL program. It uses successive halving to prune candidates and returns an executable PDL program. Evaluated on FEVER, GSM8K, GSM-Hard, and MBPP+ across seven models (3B–70B), it yields a mean accuracy gain of 9.21 ± 15.46 percentage points over the zero-shot baseline, and up to 67.5 pp in a single case. The optimized PDL programs are human-readable and, in some cases, transfer to stronger models.

Problem Statement

Prompt performance depends on both the high-level prompting pattern and the concrete prompt content. Manual tuning is slow, model-specific, and hard to reuse. The problem is to automatically find a prompting pattern and prompt (including few-shot examples) that minimize task loss, and to return a result that is both executable and editable by humans.
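
One way to write this objective down, as a sketch only. The symbols are assumptions introduced here for illustration and are not taken from the paper: P is the set of prompting patterns, C_p the prompt contents available for pattern p, D_val the validation set, f_{p,c} the prompt program built from (p, c) and run on the chosen model, and ℓ the task loss.

```latex
(p^{*}, c^{*}) \;=\; \operatorname*{arg\,min}_{p \in \mathcal{P},\; c \in \mathcal{C}_p}
  \frac{1}{|\mathcal{D}_{\mathrm{val}}|}
  \sum_{(x, y) \in \mathcal{D}_{\mathrm{val}}} \ell\bigl(f_{p,c}(x),\, y\bigr)
```

For an accuracy metric, ℓ can be read as the 0/1 error, so minimizing the average loss is the same as maximizing validation accuracy.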

Main Contribution

Formulate the joint search over prompting patterns and prompt content as an AutoML problem and solve it with successive halving.

Provide a pattern library and a PDL-based source-to-source optimizer that outputs human-readable, executable prompt programs.

Empirically show gains across tasks and models and that optimal prompting varies by model and task.

Key Findings

AutoPDL improved accuracy on evaluated benchmarks on average.

Numbers: Mean gain of 9.21 ± 15.46 percentage points (accuracy)

The largest single improvement was far above the mean; gains of that size were rare.

Numbers: Up to +67.5 percentage points (Granite 13B Instruct V2 on FEVER)

Optimal prompting pattern varies by model and task.

Numbers: Depending on the dataset, different models performed best with CoT, ReAct, or ReWOO

Optimized programs sometimes transfer to stronger closed-source models.

Numbers: Up to +13.1 pp on GPT-4o-mini for GSM8K when using a PDL program optimized on LLaMA

Cross-dataset few-shot reuse can help low-resource tasks.

Numbers: Up to +6.5 pp on GSM-Hard using GSM8K demonstrations

Results

Accuracy

Value: 9.21 ± 15.46 pp

Baseline: Zero-shot

Maximum single-run improvement

Value: +67.5 pp

Baseline: Zero-shot

Transfer to frontier model (example)

Value: +13.1 pp

Baseline: Zero-shot on GPT-4o-mini

Who Should Care

What To Try In 7 Days

Clone the PDL repo and run AutoPDL on one model-task pair to compare to your zero-shot baseline.

Inspect and hand-edit the returned PDL program to match your toolchain and rerun on validation data.

If your compute budget is limited, optimize on an open model and test the result once on a frontier API.

Agent Features

Memory

  • Few-shot demonstration bank (example bank)
  • In-context learning (no fine-tuning)
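
As a rough illustration of how a demonstration bank feeds in-context learning, the sketch below samples k examples and splices them into a prompt. The function names, the `Question:`/`Answer:` layout, and the toy bank are assumptions made for this example; they are not AutoPDL's actual prompt format.

```python
import random

def sample_demos(bank, k, seed=0):
    """Pick k demonstrations from the example bank without replacement."""
    rng = random.Random(seed)
    return rng.sample(bank, k)

def build_prompt(instruction, demos, question):
    """Join the instruction, the few-shot demos, and the new question."""
    parts = [instruction]
    for d in demos:
        parts.append(f"Question: {d['question']}\nAnswer: {d['answer']}")
    # The final block is left open so the model completes the answer.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

# Toy bank, invented for illustration only.
bank = [
    {"question": "2 + 3?", "answer": "5"},
    {"question": "10 - 4?", "answer": "6"},
    {"question": "6 * 7?", "answer": "42"},
    {"question": "9 / 3?", "answer": "3"},
    {"question": "8 + 8?", "answer": "16"},
]

demos = sample_demos(bank, k=3)
prompt = build_prompt("Solve the math problem.", demos, "7 + 8?")
print(prompt)
```

Setting `k=0` yields a zero-shot prompt from the same code path, which is convenient when the number of demonstrations is itself a search dimension.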

Planning

  • Agentic TAO loop (Thought-Action-Observation)
  • ReWOO (reasoning without observations)
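
The ReWOO idea can be sketched as follows: the whole plan is written up front with placeholder variables, and tools are only run afterwards, with earlier results substituted into later steps. The `#E1` placeholder syntax, the tuple-based plan format, and both toy tools are assumptions for this example rather than the pattern library's real interface.

```python
import re

def run_rewoo(plan, tools):
    """Execute a plan whose steps were all written before any tool ran."""
    evidence = {}
    for var, tool_name, arg in plan:
        # Replace references such as "#E1" with evidence gathered earlier.
        resolved = re.sub(r"#E\d+", lambda m: evidence[m.group(0)], arg)
        evidence[var] = tools[tool_name](resolved)
    return evidence

def multiply(expr):
    """Toy tool: multiply two integers written as 'a * b'."""
    a, b = expr.split("*")
    return str(int(a) * int(b))

# Hypothetical stand-in tools; a real setup would call search or a calculator.
tools = {
    "Lookup": lambda key: {"apples per box": "12"}[key],
    "Multiply": multiply,
}

# In a real agent the planner LLM would emit this plan in one call.
plan = [
    ("#E1", "Lookup", "apples per box"),
    ("#E2", "Multiply", "#E1 * 4"),
]

evidence = run_rewoo(plan, tools)
print(evidence["#E2"])
```

Because the planner never sees tool outputs, the model is called once for the plan (and typically once more to compose the final answer), instead of once per step as in a Thought-Action-Observation loop.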

Tool Use

  • Calc (SymPy evaluator)
  • Search (Wikipedia summary)
  • Execute (Python execution)
  • Finish (end trajectory)
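
A minimal sketch of how a Thought-Action-Observation loop might dispatch these tools. Two substitutions are made to keep the example self-contained: `calc` is a small `ast`-based arithmetic evaluator standing in for the SymPy evaluator, and `scripted_model` is a hypothetical stub that replaces the LLM call. The `Action: Name[arg]` transcript format is also an assumption.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calc(expr):
    """Evaluate +, -, *, / arithmetic safely (stand-in for a SymPy Calc tool)."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError(f"unsupported expression: {expr!r}")
    return str(ev(ast.parse(expr, mode="eval").body))

TOOLS = {"Calc": calc}

def run_agent(model, question, max_steps=5):
    """Alternate model steps and tool calls until the model emits Finish."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, arg = model(transcript)
        transcript.append(f"Thought: {thought}")
        transcript.append(f"Action: {action}[{arg}]")
        if action == "Finish":
            return arg, transcript  # Finish ends the trajectory
        transcript.append(f"Observation: {TOOLS[action](arg)}")
    return None, transcript  # step budget exhausted without an answer

def scripted_model(transcript):
    """Hypothetical stub for an LLM: decides based on the last transcript line."""
    last = transcript[-1]
    if last.startswith("Observation: "):
        return ("I have the result.", "Finish", last[len("Observation: "):])
    return ("I should compute this.", "Calc", "12 * 4 + 2")

answer, transcript = run_agent(scripted_model, "What is 12 * 4 + 2?")
print("\n".join(transcript))
```

The `max_steps` cap is a design choice worth keeping in any real loop: without it, a model that never emits Finish would keep issuing tool calls indefinitely.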

Frameworks

  • PDL (Prompt Declaration Language)
  • AutoPDL

Is Agentic

true

Architectures

  • PDL prompt programs
  • LLM agent patterns (ReAct, ReWOO, CoT, Zero-Shot)

Collaboration

  • Human-in-the-loop editing of PDL programs

Optimization Features

Token Efficiency

  • Limit number of demos to {0,3,5} to control context size
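
To make the size of such a search space concrete, the sketch below crosses the four prompting patterns with the three demonstration counts. The dictionary layout and the rule that zero-shot only pairs with zero demonstrations are assumptions for illustration; the real search space is expressed in PDL and may be structured differently.

```python
from itertools import product

PATTERNS = ["zero_shot", "cot", "react", "rewoo"]
DEMO_COUNTS = [0, 3, 5]

def enumerate_candidates(patterns, demo_counts):
    """Cross patterns with demo counts, skipping zero-shot with demos."""
    candidates = []
    for pattern, k in product(patterns, demo_counts):
        if pattern == "zero_shot" and k != 0:
            continue  # assumption: zero-shot carries no demonstrations
        candidates.append({"pattern": pattern, "num_demos": k})
    return candidates

candidates = enumerate_candidates(PATTERNS, DEMO_COUNTS)
for c in candidates:
    print(c)
```

Keeping the demonstration counts to a small discrete set bounds both the number of candidates and the worst-case context length of any single prompt.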

System Optimization

  • Source-to-source optimization: search space and result are PDL programs
  • Successive halving to prune candidates cheaply
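
Successive halving can be sketched in a few lines: score every candidate on a small data subset, keep the better half, double the subset, and repeat until one candidate remains. The function signature, the doubling schedule, and the toy candidates are assumptions for this example; AutoPDL's actual budgets and scoring may differ.

```python
import random

def successive_halving(candidates, score, data, initial_size=4, seed=0):
    """Keep the best half each round while doubling the evaluation subset."""
    rng = random.Random(seed)
    pool = list(data)
    rng.shuffle(pool)
    survivors = list(candidates)
    n = initial_size
    while len(survivors) > 1:
        subset = pool[:min(n, len(pool))]
        ranked = sorted(survivors, key=lambda c: score(c, subset), reverse=True)
        survivors = ranked[:max(1, len(ranked) // 2)]
        n *= 2  # survivors earn a larger evaluation budget
    return survivors[0]

def accuracy(candidate, subset):
    """Fraction of (x, y) pairs the candidate's function predicts exactly."""
    return sum(candidate["fn"](x) == y for x, y in subset) / len(subset)

# Toy task: predict y = 2x. Each "candidate" stands in for a prompt program.
data = [(x, 2 * x) for x in range(3, 35)]
candidates = [
    {"name": "identity", "fn": lambda x: x},
    {"name": "square", "fn": lambda x: x * x},
    {"name": "double", "fn": lambda x: 2 * x},
    {"name": "plus_one", "fn": lambda x: x + 1},
]

best = successive_halving(candidates, accuracy, data)
print(best["name"])
```

In the real setting each call to `score` costs LLM inference on every example in the subset, which is why discarding weak candidates on small subsets saves most of the budget.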

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Optimization is compute-intensive: evaluates many candidates with multiple LLM calls.
  • Search space limited (demo counts only {0,3,5}); other choices may yield better results.
  • Agent trajectories are generated by rule-based templates that may be simplistic.
  • Risk of overfitting to validation split if search is not configured carefully.

When Not To Use

  • You lack budget for repeated LLM calls or large-scale candidate evaluation.
  • You need instant deployment without time to validate optimized prompts.
  • Your task requires fine-tuning or model changes rather than prompt-level fixes.

Failure Modes

  • Optimizer returns zero-shot baseline when search space lacks helpful options.
  • Template-generated demonstrations mislead the search and degrade performance.
  • Selected prompt program may not transfer to different domains or models.

Core Entities

Models

  • LLaMA 3.1 8B
  • LLaMA 3.2 3B
  • LLaMA 3.3 70B
  • Granite 3.1 8B
  • Granite 13B Instruct V2
  • Granite 20B Code
  • Granite 34B Code
  • gpt-4o-mini-2024-07-18

Metrics

  • Accuracy
  • Absolute percentage-point delta
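
The two metrics combine as shown in this sketch: accuracy is computed per run, and the gain is the absolute difference in percentage points between the optimized and baseline runs, summarized as mean ± standard deviation. The sample (n − 1) standard deviation is an assumption, since the summary does not say which variant was used, and the accuracy values below are invented for illustration, not taken from the paper.

```python
from statistics import mean, stdev

def accuracy(predictions, labels):
    """Accuracy in percent."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)

def summarize_gains(baseline_acc, optimized_acc):
    """Mean and sample standard deviation of per-run pp deltas."""
    deltas = [o - b for b, o in zip(baseline_acc, optimized_acc)]
    return mean(deltas), stdev(deltas)

# Invented per-run accuracies (percent), one entry per model-task pair.
baseline = [50.0, 62.0, 71.0]
optimized = [55.0, 62.0, 80.0]

gain_mean, gain_std = summarize_gains(baseline, optimized)
print(f"{gain_mean:.2f} ± {gain_std:.2f} pp")
```

A standard deviation larger than the mean, as in the reported 9.21 ± 15.46 pp, indicates that gains are concentrated in a subset of model-task pairs rather than spread evenly.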

Datasets

  • FEVER
  • GSM8K
  • GSM-Hard
  • MBPP+
  • MBPP

Benchmarks

  • FEVER (binary fact verification)
  • GSM8K (grade-school math)
  • GSM-Hard (harder variant of GSM8K)
  • MBPP+ (code generation with extended tests)