Overview
AutoPDL is practical: it produces editable prompt programs and shows clear gains, but evaluating many candidates requires non-trivial compute, and results depend on the quality of the demonstration templates.
Citations: 0
Evidence Strength: 0.85
Confidence: 0.90
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
AutoPDL automates prompt and agent-pattern selection and outputs editable, executable prompt programs, reducing manual tuning time and allowing reuse across models while sometimes improving accuracy substantially.
Who Should Care
Summary TLDR
AutoPDL frames prompt engineering as an AutoML problem: it jointly searches over prompting patterns (Zero-Shot, CoT, ReAct, ReWOO) and prompt content (instructions and few-shot examples), both expressed in a human-editable PDL program. It uses successive halving to prune candidates and returns executable PDL programs. Evaluated on FEVER, GSM8K, GSM-Hard and MBPP+ across seven models (3B–70B parameters), it yields an average accuracy gain of 9.21 ± 15.46 percentage points over the zero-shot baseline, with a maximum of 67.5 pp in a single run (FEVER, Granite 13B Instruct V2). Optimized PDL programs are human-readable and, in some cases, transfer to stronger models.
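The pruning step can be illustrated with a minimal successive-halving sketch. This is not AutoPDL's implementation: the function name, the `score(candidate, budget)` callback, the initial budget, and the halving fraction are all assumptions chosen for illustration. Each round scores the surviving candidates on a budget of examples, keeps the better-scoring fraction, and doubles the budget for the next round.

```python
from typing import Callable, Sequence, TypeVar

C = TypeVar("C")


def successive_halving(
    candidates: Sequence[C],
    score: Callable[[C, int], float],
    initial_budget: int = 8,
    keep_fraction: float = 0.5,
) -> C:
    """Return the last surviving candidate.

    `score(candidate, budget)` is a hypothetical callback assumed to return
    the candidate's mean accuracy on `budget` validation examples.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    budget = initial_budget
    while len(pool) > 1:
        # Rank survivors on the current budget, best first.
        ranked = sorted(pool, key=lambda c: score(c, budget), reverse=True)
        # Keep a fraction, but always drop at least one so the loop ends.
        keep = max(1, min(len(ranked) - 1, int(len(ranked) * keep_fraction)))
        pool = ranked[:keep]
        # Spend more examples per candidate as the pool shrinks.
        budget *= 2
    return pool[0]
```

With four candidates and the defaults above, the first round scores all four on 8 examples and the second scores the best two on 16, so most of the evaluation budget goes to the candidates that look promising early.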
Problem Statement
Prompt performance depends on both the high-level prompting pattern and the concrete prompt content. Manual tuning is slow, model-specific, and hard to reuse. The problem is to automatically find a prompting pattern and prompt (including few-shot examples) that minimize task loss, and to return a result that is both executable and editable by humans.
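One way to write this objective down, as a sketch only: the notation below is an assumption made for illustration and is not taken from the paper. Let $\mathcal{P}$ be the set of prompting patterns, $\mathcal{C}_p$ the set of prompt contents (instructions and few-shot examples) valid for pattern $p$, $f_M$ the model $M$ run under a given pattern and content, $D$ a validation set, and $\ell$ a per-example task loss.

```latex
(p^{*}, c^{*}) \;=\; \operatorname*{arg\,min}_{p \in \mathcal{P},\; c \in \mathcal{C}_p}
\; \frac{1}{|D|} \sum_{(x, y) \in D} \ell\bigl(f_M(x;\, p, c),\; y\bigr)
```

The pair $(p^{*}, c^{*})$ is then what gets emitted as an executable, editable prompt program.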
Main Contribution
Formulate joint search over prompting patterns and prompt content and solve it with AutoML.
Provide a pattern library and a PDL-based source-to-source optimizer that outputs human-readable, executable prompt programs.
Key Findings
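AutoPDL improved accuracy over the zero-shot baseline by 9.21 ± 15.46 percentage points on average across the evaluated benchmarks.
The largest single improvement was +67.5 pp (FEVER, Granite 13B Instruct V2); the standard deviation of 15.46 pp around a 9.21 pp mean indicates that gains of this size are rare.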
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Mean accuracy gain | 9.21 ± 15.46 pp | Zero-shot | +9.21 pp | Aggregate across FEVER, GSM8K, MBPP+ | Abstract; Table 1 | Abstract; Table 1 |
| Maximum single-run accuracy gain | 67.5 pp | Zero-shot | +67.5 pp | FEVER, Granite 13B Instruct V2 | Table 1 (FEVER) | Table 1 |
What To Try In 7 Days
Clone the PDL repo and run AutoPDL on one model-task pair to compare to your zero-shot baseline.
Inspect and hand-edit the returned PDL program to match your toolchain and rerun on validation data.
If you have low resources, optimize on an open model and test the result once on a frontier API.
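For the baseline comparison in the steps above, a small helper along these lines may be enough. It is a sketch under assumptions: the function names are hypothetical, predictions are assumed to be already extracted as comparable strings, and exact match is assumed as the correctness criterion. It reports the difference in percentage points, the same unit the Results table uses.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def delta_pp(
    baseline_preds: Sequence[str],
    optimized_preds: Sequence[str],
    labels: Sequence[str],
) -> float:
    """Accuracy of the optimized program minus the baseline, in percentage points."""
    return 100.0 * (accuracy(optimized_preds, labels) - accuracy(baseline_preds, labels))


# Toy data, made up for illustration only.
labels = ["A", "B", "A", "B"]
baseline = ["A", "A", "A", "A"]   # 2 of 4 correct
optimized = ["A", "B", "A", "A"]  # 3 of 4 correct
print(delta_pp(baseline, optimized, labels))  # 25.0
```

Both prompt programs should be scored on the same held-out examples, so that the delta reflects the prompt change and not a change in data.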
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
System Optimization
Reproducibility
Risks & Boundaries
Limitations
Optimization is compute-intensive: it evaluates many candidates, each requiring multiple LLM calls.
The search space is limited (for example, demonstration counts are restricted to {0, 3, 5}); choices outside it may yield better results.
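To make the size of such a search space concrete, the sketch below enumerates a grid of pattern and demonstration-count pairs. The four pattern names and the demo counts {0, 3, 5} come from this summary; the dictionary layout, the function name, and the rule that zero-shot is only paired with zero demonstrations are assumptions for illustration, not AutoPDL's actual configuration format.

```python
from itertools import product

# Pattern names as listed in the summary; the identifiers are illustrative.
PATTERNS = ("zero_shot", "cot", "react", "rewoo")
DEMO_COUNTS = (0, 3, 5)


def enumerate_candidates(patterns=PATTERNS, demo_counts=DEMO_COUNTS):
    """List every (pattern, number of demonstrations) pair in the grid."""
    candidates = []
    for pattern, num_demos in product(patterns, demo_counts):
        # Assumption: zero-shot by definition carries no demonstrations.
        if pattern == "zero_shot" and num_demos != 0:
            continue
        candidates.append({"pattern": pattern, "num_demos": num_demos})
    return candidates


grid = enumerate_candidates()
print(len(grid))  # 10
```

Adding one more demonstration count to this grid raises the candidate total from 10 to 13, and every added candidate costs further LLM calls to evaluate, which is the trade-off between the two limitations listed above.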
When Not To Use
You lack budget for repeated LLM calls or large-scale candidate evaluation.
You need instant deployment without time to validate optimized prompts.
Failure Modes
The optimizer returns the zero-shot baseline when the search space contains no option that improves on it.
Template-generated demonstrations can mislead the search and degrade performance.

