AutoFlow: automatically generate readable natural‑language workflows so LLM agents solve complex tasks with less human work

July 1, 2024 · 6 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

4

Authors

Zelong Li, Shuyuan Xu, Kai Mei, Wenyue Hua, Balaji Rama, Om Raheja, Hao Wang, He Zhu, Yongfeng Zhang

Links

Abstract / PDF

Why It Matters For Business

AutoFlow reduces manual workflow design time by automatically producing readable, executable agent workflows and can raise task performance on image/text benchmarks, lowering operational cost for multi‑step agent tasks.

Summary TLDR

AutoFlow is a framework that automatically generates natural‑language workflows (CoRE programs) for LLM-based agents. It uses either LoRA fine‑tuning with REINFORCE for open models or reward‑conditioned in‑context prompting for closed models. On the OpenAGI benchmark, AutoFlow-produced workflows outperform a manual CoRE baseline (e.g., average score 0.3597 vs 0.2483 with a Mixtral interpreter and GPT-4 generator; 0.6501 vs 0.6104 with a GPT-4 interpreter and Mixtral generator). Code is available at the project GitHub.

Problem Statement

Designing step-by-step workflows for LLM agents is manual and slow, and it requires domain expertise, which limits how far agent deployment can scale. The paper asks: can we automatically generate readable, executable workflows in natural language and optimize them using execution feedback?

Main Contribution

AutoFlow framework that generates workflows as natural‑language programs (CoRE) and iteratively improves them using execution feedback.

Two generator methods: LoRA fine‑tuning + REINFORCE for open LLMs, and reward‑conditioned in‑context refinement for closed LLMs.

Empirical validation on OpenAGI showing higher valid‑plan rates and better average task scores than manual CoRE workflows.
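The reward‑conditioned in‑context method can be pictured as a loop: generate a workflow, execute it to get a reward, then feed the workflow and its reward back into the next prompt. The sketch below is illustrative only. `build_prompt`, `refine_workflow`, the prompt format, and the toy generator and evaluator are all hypothetical stand-ins for a real LLM call and a real benchmark run; the paper's actual prompts are not reproduced here.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]


def build_prompt(task: str, history: History) -> str:
    """Place earlier workflows and their rewards in the prompt (hypothetical format)."""
    lines = [f"Task: {task}", "Previous workflows and rewards:"]
    for workflow, reward in history:
        lines.append(f"[reward={reward:.4f}]")
        lines.append(workflow)
    lines.append("Write an improved workflow.")
    return "\n".join(lines)


def refine_workflow(
    task: str,
    generate: Callable[[str], str],    # stand-in for the generator LLM
    evaluate: Callable[[str], float],  # stand-in for executing the workflow and scoring it
    iterations: int = 4,
) -> Tuple[str, float]:
    """Keep the best-scoring workflow seen across the refinement rounds."""
    history: History = []
    best_workflow, best_reward = "", float("-inf")
    for _ in range(iterations):
        workflow = generate(build_prompt(task, history))
        reward = evaluate(workflow)
        history.append((workflow, reward))
        if reward > best_reward:
            best_workflow, best_reward = workflow, reward
    return best_workflow, best_reward


def toy_generate(prompt: str) -> str:
    """Toy generator: emits one more step than the number of earlier attempts."""
    n_steps = prompt.count("[reward=") + 1
    return "\n".join(f"Step {i}: placeholder instruction" for i in range(1, n_steps + 1))


def toy_evaluate(workflow: str) -> float:
    """Toy reward: saturates at 1.0 once the workflow has three steps."""
    return min(1.0, len(workflow.splitlines()) / 3)


best, reward = refine_workflow("caption an image", toy_generate, toy_evaluate)
print(reward)
print(best)
```

In practice `generate` would wrap a closed-model API call and `evaluate` would run the interpreter on validation tasks and return the benchmark metric.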

Key Findings

AutoFlow improves average OpenAGI score compared to manual CoRE when Mixtral is the interpreter and GPT‑4 is the generator.

Numbers: avg 0.3597 vs 0.2483 (Δ +0.1114, +44.9%)

AutoFlow improves average OpenAGI score compared to manual CoRE when GPT‑4 is the interpreter and Mixtral is the generator.

Numbers: avg 0.6501 vs 0.6104 (Δ +0.0397, +6.5%)

Cross‑model combinations show complementary strengths: best results often come from mixing generator and interpreter models.
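The absolute and relative gains quoted in the findings follow directly from the reported averages. A minimal sketch of the arithmetic (the helper name `improvement` is ours, not from the paper):

```python
def improvement(autoflow: float, baseline: float) -> tuple:
    """Return (absolute delta, relative gain in percent) over the baseline."""
    delta = autoflow - baseline
    return round(delta, 4), round(100 * delta / baseline, 1)


# Averages reported for the two interpreter settings.
mixtral_interpreter = improvement(0.3597, 0.2483)
gpt4_interpreter = improvement(0.6501, 0.6104)
print(mixtral_interpreter)  # (0.1114, 44.9)
print(gpt4_interpreter)     # (0.0397, 6.5)
```

The relative gain is much larger with the weaker interpreter because the baseline it is divided by is lower, even though both settings use the same formula.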

Results

Average over tasks (Mixtral interpreter)

Value: AutoFlow (GPT generator) 0.3597; CoRE 0.2483

Baseline: CoRE 0.2483

Average over tasks (GPT-4 interpreter)

Value: AutoFlow (Mixtral generator) 0.6501; CoRE 0.6104

Baseline: CoRE 0.6104

Per-task best scores (examples)

Value: Task 3 ViT Score: AutoFlow (GPT) 0.5720 (Mixtral interpreter) and 0.6899 (GPT interpreter)

Baseline: CoRE Task 3: 0.2437 (Mixtral) and 0.648 (GPT)

Who Should Care

What To Try In 7 Days

Run AutoFlow on a small OpenAGI-like task to compare generated workflows to your manual baseline.

Try the in‑context method with a closed model (GPT‑4) before settling on fine‑tuning.

Swap generator/interpreter models to test cross-model gains (e.g., closed generator + open interpreter).

Agent Features

Memory

  • step-level retrieval from short-term memory

Planning

  • workflow generation
  • iterative reward-based optimization

Tool Use

  • external tools selected by workflow interpreter

Frameworks

  • CoRE

Is Agentic

true

Architectures

  • closed-source LLM generator (GPT-4)
  • open-source LLM generator (Mixtral-8x7B)

Collaboration

  • paired generator and interpreter LLMs (collaborative learning)

Optimization Features

Model Optimization

  • LoRA

Training Optimization

  • RL
  • reward-conditioned in-context refinement

Inference Optimization

  • in-context prompting for closed models

Reproducibility

Data Urls

  • OpenAGI benchmark (reference [7] in paper)

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Experiments only run on the OpenAGI benchmark, so generality to other domains is untested.
  • Fine‑tuning for open models required a GPT‑4 parser to fix grammar for Mixtral outputs.
  • The generator learning uses REINFORCE, which can be sample inefficient and unstable.
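
For readers unfamiliar with REINFORCE, the toy sketch below shows the update rule on a three-armed bandit, where each arm stands in for one candidate workflow. It is not the paper's setup: there is no LLM and no LoRA, the rewards are made up, and the moving-average baseline is our addition to reduce variance. It uses 3000 sampled updates to settle on the best of only three options, which hints at why sample efficiency is a concern.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce(rewards, steps=3000, lr=0.2, seed=0):
    """Toy REINFORCE: learn a softmax policy over a fixed set of candidates."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    baseline = 0.0  # moving average of observed rewards (assumed variance reducer)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        advantage = rewards[action] - baseline
        # Gradient of log pi(action) w.r.t. logit k is 1[k == action] - probs[k].
        for k in range(len(logits)):
            grad = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad
        baseline += 0.1 * (rewards[action] - baseline)
    return softmax(logits)


toy_rewards = [0.2, 0.5, 0.9]  # made-up rewards for three candidate workflows
probs = reinforce(toy_rewards)
print([round(p, 3) for p in probs])
```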

When Not To Use

  • When you need formal, provable correctness rather than natural‑language steps.
  • When no validation data or execution feedback exists to provide a reward signal.
  • When RL sample budget or compute is extremely limited for iterative fine‑tuning.

Failure Modes

  • Generator outputs grammatically invalid or non-executable workflows (observed for Mixtral before parsing).
  • Reward hacking: workflows that score well on the chosen metric but fail human needs.
  • Overfitting to the OpenAGI reward metric and dataset specifics.

Core Entities

Models

  • GPT-4
  • Mixtral-8x7B

Metrics

  • CLIP Score
  • BERTScore
  • ViT Score
  • Average over tasks

Datasets

  • OpenAGI

Benchmarks

  • OpenAGI

Context Entities

Metrics

  • CLIP Score (text-image similarity)
  • BERTScore (text similarity)
  • ViT Score (image similarity)

Datasets

  • OpenAGI benchmark (reference [7])