Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
4
Why It Matters For Business
AutoFlow reduces manual workflow design time by automatically producing readable, executable agent workflows and can raise task performance on image/text benchmarks, lowering operational cost for multi‑step agent tasks.
Summary TLDR
AutoFlow is a framework that automatically generates natural‑language workflows (CoRE programs) for LLM-based agents. It uses either LoRA fine‑tuning with REINFORCE for open models or reward‑conditioned in‑context prompting for closed models. On the OpenAGI benchmark, AutoFlow-produced workflows outperform a manual CoRE baseline (e.g., average score 0.3597 vs 0.2483 with a Mixtral interpreter; 0.6501 vs 0.6104 with a GPT-4 interpreter). Code is available in the project's GitHub repository.
Problem Statement
Designing step-by-step workflows for LLM agents is manual, slow, and dependent on domain expertise, which limits how far agent deployment can scale. The paper asks: can readable, executable workflows be generated automatically in natural language and then optimized using execution feedback?
Main Contribution
AutoFlow framework that generates workflows as natural‑language programs (CoRE) and iteratively improves them using execution feedback.
Two generator methods: LoRA fine‑tuning + REINFORCE for open LLMs, and reward‑conditioned in‑context refinement for closed LLMs.
Empirical validation on OpenAGI showing higher valid‑plan rates and better average task scores than manual CoRE workflows.
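The REINFORCE half of the second contribution can be illustrated with a toy sketch. It is not the paper's training code: the policy here is a plain softmax over a few fixed candidate workflows rather than a LoRA-tuned LLM, and the function `reinforce`, its parameters, and the reward values are all hypothetical. It only shows how an execution reward shifts probability toward better-scoring workflows.

```python
import math
import random


def softmax(logits):
    """Turn raw logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce(rewards, steps=2000, lr=0.5, seed=0):
    """Toy REINFORCE over a fixed set of candidate workflows.

    rewards[i] stands in for the execution score of candidate i
    (an assumption; a real setup would run the workflow to get it).
    """
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    baseline = 0.0  # running average of rewards, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        # Gradient of log pi(action) w.r.t. logit i is 1[i == action] - probs[i].
        for i in range(len(logits)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)


if __name__ == "__main__":
    # Made-up rewards for three hypothetical candidate workflows.
    print(reinforce([0.2, 0.6, 0.35]))
```

The running-average baseline is a common variance-reduction choice and is an assumption here; the summary does not say whether the paper uses one.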
Key Findings
AutoFlow improves average OpenAGI score compared to manual CoRE when Mixtral is the interpreter and GPT‑4 is the generator.
AutoFlow improves average OpenAGI score compared to manual CoRE when GPT‑4 is the interpreter and Mixtral is the generator.
Cross‑model combinations show complementary strengths: best results often come from mixing generator and interpreter models.
Results
Average over tasks (Mixtral interpreter): AutoFlow 0.3597 vs manual CoRE 0.2483
Average over tasks (GPT-4 interpreter): AutoFlow 0.6501 vs manual CoRE 0.6104
Per-task best scores (examples)
Who Should Care
What To Try In 7 Days
Run AutoFlow on a small OpenAGI-like task to compare generated workflows to your manual baseline.
Try the in‑context method with a closed model (GPT‑4) before settling on fine‑tuning.
Swap generator/interpreter models to test cross-model gains (e.g., closed generator + open interpreter).
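The model-swapping experiment in the last item could be organized as a small grid search. This is a minimal sketch under stated assumptions: `compare_pairings` and the `evaluate` callback are hypothetical names, and the stub scores are placeholders, not results from the paper. In practice `evaluate` would run AutoFlow with the given generator and interpreter and return the average benchmark score.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple

Pair = Tuple[str, str]


def compare_pairings(
    generators: Sequence[str],
    interpreters: Sequence[str],
    evaluate: Callable[[str, str], float],
) -> Tuple[Dict[Pair, float], Pair]:
    """Score every (generator, interpreter) pairing and return the best one."""
    scores = {
        (gen, interp): evaluate(gen, interp)
        for gen, interp in product(generators, interpreters)
    }
    best = max(scores, key=scores.get)
    return scores, best


def stub_evaluate(generator: str, interpreter: str) -> float:
    """Stand-in for a real benchmark run. Values are arbitrary placeholders."""
    placeholder = {
        ("model-a", "model-a"): 0.30,
        ("model-a", "model-b"): 0.40,
        ("model-b", "model-a"): 0.20,
        ("model-b", "model-b"): 0.10,
    }
    return placeholder[(generator, interpreter)]


if __name__ == "__main__":
    models = ["model-a", "model-b"]
    all_scores, best_pair = compare_pairings(models, models, stub_evaluate)
    print(best_pair, all_scores[best_pair])
```

Keeping the evaluation behind a callback means the same grid can be reused with the manual baseline as an extra "generator" for a like-for-like comparison.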
Agent Features
Memory
- step-level retrieval from short-term memory
Planning
- workflow generation
- iterative reward-based optimization
Tool Use
- external tools selected by workflow interpreter
Frameworks
- CoRE
Is Agentic
true
Architectures
- closed-source LLM generator (GPT-4)
- open-source LLM generator (Mixtral-8x7B)
Collaboration
- paired generator and interpreter LLMs (collaborative learning)
Optimization Features
Model Optimization
- LoRA
Training Optimization
- RL
- reward-conditioned in-context refinement
Inference Optimization
- in-context prompting for closed models
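For the closed-model path, reward-conditioned in-context refinement can be pictured as a loop that feeds earlier workflows and their rewards back into the prompt. The sketch below is an illustration under assumptions, not the paper's prompt: the prompt wording, `build_refinement_prompt`, `refine`, and the `generate` and `execute` callbacks (an LLM call and a workflow runner that returns a score) are all hypothetical.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]  # (workflow text, execution reward)


def build_refinement_prompt(task: str, history: History, top_k: int = 3) -> str:
    """Show the top_k highest-reward workflows and ask for a better one."""
    ranked = sorted(history, key=lambda item: item[1], reverse=True)[:top_k]
    lines = [f"Task: {task}", "", "Previous workflows and their execution rewards:"]
    for rank, (workflow, reward) in enumerate(ranked, start=1):
        lines.append(f"[{rank}] reward={reward:.4f}")
        lines.append(workflow)
        lines.append("")
    lines.append("Write an improved workflow that should earn a higher reward.")
    return "\n".join(lines)


def refine(
    task: str,
    generate: Callable[[str], str],
    execute: Callable[[str], float],
    rounds: int = 3,
) -> Tuple[str, float]:
    """Alternate generation and execution; return the best (workflow, reward)."""
    history: History = []
    workflow = generate(f"Task: {task}\nWrite a step-by-step workflow.")
    for _ in range(rounds):
        history.append((workflow, execute(workflow)))
        workflow = generate(build_refinement_prompt(task, history))
    history.append((workflow, execute(workflow)))
    return max(history, key=lambda item: item[1])
```

No model weights change in this loop; the only learning signal is the reward text placed in the context, which is why it suits models that cannot be fine-tuned.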
Reproducibility
Data Urls
- OpenAGI benchmark (reference [7] in paper)
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Experiments only run on the OpenAGI benchmark, so generality to other domains is untested.
- The fine‑tuning path for open models relied on GPT‑4 as a parser to correct grammar errors in Mixtral outputs.
- Generator training uses REINFORCE, which can be sample-inefficient and unstable.
When Not To Use
- When you need formal, provable correctness rather than natural‑language steps.
- When no validation data or execution feedback exists to provide a reward signal.
- When the RL sample budget or compute is too limited for iterative fine‑tuning.
Failure Modes
- Generator outputs grammatically invalid or non-executable workflows (observed for Mixtral before parsing).
- Reward hacking: workflows that score well on the chosen metric but fail to meet users' actual needs.
- Overfitting to the OpenAGI reward metric and dataset specifics.
Core Entities
Models
- GPT-4
- Mixtral-8x7B
Metrics
- CLIP Score
- BERTScore
- ViT Score
- Average over tasks
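As a rough illustration of how per-task metric values might be rolled up into an "average over tasks" alongside a valid-plan rate, here is a minimal sketch. The function `summarize` is hypothetical, and counting a task with an invalid plan as a score of zero is an assumption made for this example, not something this summary confirms about the benchmark.

```python
from typing import Optional, Sequence, Tuple


def summarize(scores: Sequence[Optional[float]]) -> Tuple[float, float]:
    """Return (valid_plan_rate, average_over_tasks).

    Each entry is the metric value for one task (e.g. a CLIP, BERTScore,
    or ViT score), or None when the workflow did not yield a valid plan.
    Assumption: invalid plans contribute 0 to the average.
    """
    if not scores:
        raise ValueError("need at least one task")
    valid = [s for s in scores if s is not None]
    valid_rate = len(valid) / len(scores)
    average = sum(valid) / len(scores)
    return valid_rate, average


if __name__ == "__main__":
    # Arbitrary illustrative values, not benchmark results.
    print(summarize([0.5, None, 1.0, 0.0]))
```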
Datasets
- OpenAGI
Benchmarks
- OpenAGI
Context Entities
Metrics
- CLIP Score (text-image similarity)
- BERTScore (text similarity)
- ViT Score (image similarity)
Datasets
- OpenAGI benchmark (reference [7])

