Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
AutoPDL automates prompt and agent-pattern selection and outputs editable, executable prompt programs, reducing manual tuning time and allowing reuse across models while sometimes improving accuracy substantially.
Summary TLDR
AutoPDL frames prompt engineering as an AutoML problem: it jointly searches over prompting patterns (Zero-Shot, CoT, ReAct, ReWOO) and prompt content (instructions and few-shot examples), both expressed in a human-editable PDL program. It uses successive halving to prune candidates and returns an executable PDL program. Evaluated on FEVER, GSM8K, GSM-Hard and MBPP+ across seven models (3B–70B), it yields an average accuracy gain of 9.21 ± 15.46 percentage points, with up to 67.5 percentage points on a single run. The optimized PDL programs are human-readable and, in some cases, transfer to stronger models.
Problem Statement
Prompt performance depends on both the high-level prompting pattern and the concrete prompt content. Manual tuning is slow, model-specific, and hard to reuse. The problem is to automatically find a prompting pattern and prompt (including few-shot examples) that minimize task loss, and to return a result that is both executable and editable by humans.
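One way to write the objective described above is sketched here. The notation is assumed for illustration and is not taken from the source: P is the set of prompting patterns, C_p is the set of prompt contents (instructions and few-shot examples) available for pattern p, D_val is a validation set, f_M(x; p, c) is the output of model M on input x under the prompt program defined by (p, c), and ℓ is the task loss.

```latex
(p^{*}, c^{*}) \;=\; \operatorname*{arg\,min}_{p \in \mathcal{P},\; c \in \mathcal{C}_p}
\; \frac{1}{|D_{\mathrm{val}}|} \sum_{(x,\, y) \in D_{\mathrm{val}}}
\ell\bigl(f_{M}(x;\, p, c),\; y\bigr)
```

Because the minimizer depends on M, a program that is optimal for one model need not be optimal for another.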
Main Contribution
Formulate joint search over prompting patterns and prompt content and solve it with AutoML.
Provide a pattern library and a PDL-based source-to-source optimizer that outputs human-readable, executable prompt programs.
Empirically show gains across tasks and models and that optimal prompting varies by model and task.
Key Findings
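AutoPDL improved accuracy on the evaluated benchmarks by 9.21 ± 15.46 percentage points on average.
The largest single-run improvement was 67.5 percentage points; the high variance around the mean indicates that gains of this size are not typical.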
Optimal prompting pattern varies by model and task.
Optimized programs sometimes transfer to stronger closed-source models.
Cross-dataset few-shot reuse can help low-resource tasks.
Results
Accuracy
+9.21 ± 15.46 percentage points on average
Maximum single-run improvement
+67.5 percentage points
Transfer to frontier model (example)
Who Should Care
What To Try In 7 Days
Clone the PDL repo and run AutoPDL on one model-task pair to compare to your zero-shot baseline.
Inspect and hand-edit the returned PDL program to match your toolchain and rerun on validation data.
If your budget is limited, optimize on an open model and test the resulting program once on a frontier API.
Agent Features
Memory
- Few-shot demonstration bank (example bank)
- In-context learning (no fine-tuning)
Planning
- Agentic TAO loop (Thought-Action-Observation)
- ReWOO (reasoning without observations)
Tool Use
- Calc (SymPy evaluator)
- Search (Wikipedia summary)
- Execute (Python execution)
- Finish (end trajectory)
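The Thought-Action-Observation loop and the tool actions listed above can be sketched roughly as follows. This is a minimal illustration and not AutoPDL's or PDL's implementation: `run_tao_loop`, `scripted_model`, and the `TOOLS` registry are hypothetical names, the model is a scripted stand-in for an LLM, and `calc` is a small standard-library arithmetic evaluator used in place of the SymPy-backed Calc tool.

```python
import ast
import operator

# Supported binary operators for the stand-in calculator.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calc(expression):
    """Evaluate +, -, *, / arithmetic safely (stand-in for a SymPy-backed Calc)."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

# Hypothetical registry; Search and Execute would be registered the same way.
TOOLS = {"Calc": calc}

def run_tao_loop(model_step, question, max_steps=5):
    """Alternate model steps and tool calls until the model emits Finish."""
    trajectory = [("Question", question)]
    for _ in range(max_steps):
        thought, action, argument = model_step(trajectory)
        trajectory.append(("Thought", thought))
        trajectory.append(("Action", f"{action}[{argument}]"))
        if action == "Finish":
            return argument, trajectory  # Finish ends the trajectory
        tool = TOOLS.get(action)
        observation = tool(argument) if tool else f"unknown tool: {action}"
        trajectory.append(("Observation", str(observation)))
    return None, trajectory  # step budget exhausted without Finish

def scripted_model(trajectory):
    """Hypothetical stand-in for an LLM: calls Calc once, then finishes."""
    observations = [text for kind, text in trajectory if kind == "Observation"]
    if not observations:
        return ("I need to multiply.", "Calc", "12 * 7")
    return ("I have the result.", "Finish", observations[-1])

answer, trajectory = run_tao_loop(scripted_model, "What is 12 * 7?")
```

In a real agent, `model_step` would prompt the LLM with the trajectory so far and parse its reply into a thought, an action name, and an argument.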
Frameworks
- PDL (Prompt Declaration Language)
- AutoPDL
Is Agentic
true
Architectures
- PDL prompt programs
- LLM agent patterns (ReAct, ReWOO, CoT, Zero-Shot)
Collaboration
- Human-in-the-loop editing of PDL programs
Optimization Features
Token Efficiency
- Limit number of demos to {0,3,5} to control context size
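To make the size of this search space concrete, the sketch below enumerates pattern and demo-count combinations. The pattern names and the demo counts {0, 3, 5} come from this summary; the dictionary layout, the rule that Zero-Shot takes no demonstrations, and the token figures in `estimated_prompt_tokens` are assumptions made for illustration.

```python
from itertools import product

PATTERNS = ["zero_shot", "cot", "react", "rewoo"]
DEMO_COUNTS = [0, 3, 5]

def enumerate_candidates(patterns=PATTERNS, demo_counts=DEMO_COUNTS):
    """Return one candidate configuration per allowed (pattern, demo count) pair."""
    candidates = []
    for pattern, num_demos in product(patterns, demo_counts):
        # Assumption: a zero-shot prompt carries no demonstrations.
        if pattern == "zero_shot" and num_demos > 0:
            continue
        candidates.append({"pattern": pattern, "num_demos": num_demos})
    return candidates

def estimated_prompt_tokens(num_demos, base_tokens=200, tokens_per_demo=150):
    """Rough context-size estimate; both default token figures are made up."""
    return base_tokens + num_demos * tokens_per_demo

candidates = enumerate_candidates()
```

Capping the demo count bounds the second term of the estimate, which is why a small fixed set of counts keeps every candidate within a predictable context size.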
System Optimization
- Source-to-source optimization: search space and result are PDL programs
- Successive halving to prune candidates cheaply
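Successive halving can be sketched as below: score every candidate on a small validation sample, keep the best fraction, increase the sample size, and repeat until one candidate remains. This is a generic version of the technique, not AutoPDL's code; the function name, the `evaluate(candidate, budget)` callback, and the default budget and keep fraction are assumptions.

```python
def successive_halving(candidates, evaluate, initial_budget=8, keep_fraction=0.5):
    """Return the candidate that survives repeated evaluate-and-prune rounds.

    evaluate(candidate, budget) must return a score where higher is better;
    budget is the number of validation examples used for that evaluation.
    """
    survivors = list(candidates)
    if not survivors:
        raise ValueError("need at least one candidate")
    if not 0 < keep_fraction < 1:
        raise ValueError("keep_fraction must be between 0 and 1")
    budget = initial_budget
    while len(survivors) > 1:
        scored = [(evaluate(c, budget), i, c) for i, c in enumerate(survivors)]
        # Best score first; ties keep their original order.
        scored.sort(key=lambda item: (-item[0], item[1]))
        keep = max(1, int(len(scored) * keep_fraction))
        survivors = [c for _, _, c in scored[:keep]]
        budget *= 2  # spend more validation examples on fewer candidates
    return survivors[0]

# Toy usage with fixed scores standing in for measured validation accuracy.
toy_scores = {"zero_shot": 0.40, "cot": 0.55, "react": 0.62, "rewoo": 0.58}
best = successive_halving(toy_scores, lambda name, budget: toy_scores[name])
```

Weak candidates are dropped after only a few examples, so most of the LLM calls go to the configurations that are still in contention.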
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Optimization is compute-intensive: evaluates many candidates with multiple LLM calls.
- Search space limited (demo counts only {0,3,5}); other choices may yield better results.
- Agent trajectories are generated by rule-based templates that may be simplistic.
- Risk of overfitting to validation split if search is not configured carefully.
When Not To Use
- You lack budget for repeated LLM calls or large-scale candidate evaluation.
- You need instant deployment without time to validate optimized prompts.
- Your task requires fine-tuning or model changes rather than prompt-level fixes.
Failure Modes
- Optimizer returns zero-shot baseline when search space lacks helpful options.
- Template-generated demonstrations mislead the search and degrade performance.
- Selected prompt program may not transfer to different domains or models.
Core Entities
Models
- LLaMA 3.1 8B
- LLaMA 3.2 3B
- LLaMA 3.3 70B
- Granite 3.1 8B
- Granite 13B Instruct V2
- Granite 20B Code
- Granite 34B Code
- gpt-4o-mini-2024-07-18
Metrics
- Accuracy
- Absolute percentage-point delta
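The percentage-point delta is the plain difference between two accuracies, not a relative change. A small sketch with hypothetical helper names and made-up accuracy pairs follows. The use of the sample standard deviation is an assumption, since the summary does not say how its ± figure was computed.

```python
from statistics import mean, stdev

def percentage_point_delta(baseline_acc, optimized_acc):
    """Absolute difference between two accuracies (0-1 scale), in percentage points."""
    return (optimized_acc - baseline_acc) * 100.0

def summarize_deltas(pairs):
    """Mean and sample standard deviation of deltas over (baseline, optimized) pairs."""
    deltas = [percentage_point_delta(b, o) for b, o in pairs]
    return mean(deltas), stdev(deltas)

# Made-up (baseline, optimized) accuracies for three model-task pairs.
example_pairs = [(0.50, 0.75), (0.60, 0.60), (0.80, 0.85)]
avg_delta, spread = summarize_deltas(example_pairs)
```

Going from 0.50 to 0.75 is a gain of 25 percentage points, even though the relative improvement is 50%.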
Datasets
- FEVER
- GSM8K
- GSM-Hard
- MBPP+
- MBPP
Benchmarks
- FEVER (binary fact verification)
- GSM8K (grade-school math)
- GSM-Hard (GSM8K problems with the numbers replaced by larger, less common values)
- MBPP+ (code generation with extended tests)

