OpenAGI: an open platform that lets LLMs plan and call specialist models to solve multi-step tasks

April 10, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The platform is a practical, open pipeline, and its experiments show that RLTF and LoRA materially improve planning by open models. Code and datasets are available, but many tasks use small datasets, and real-world robustness and out-of-distribution (OOD) generalization remain open problems.

Citations: 76

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/3

Reproducibility

Status: Code + data available

Open source: Partial

License: Creative Commons Attribution 4.0 International (data); code repo states open‑sor

At A Glance

Cost impact: 45%

Production readiness: 50%

Novelty: 60%

Authors

Yingqiang Ge, Wenyue Hua, Kai Mei, Jianchao Ji, Juntao Tan, Shuyuan Xu, Zelong Li, Yongfeng Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

OpenAGI shows that you can compose existing specialist models under LLM control and use RL-style tuning to make smaller, cheaper models competitive. This is useful for building product workflows that call vision, text, or web tools.

Summary TLDR

OpenAGI is an open-source research platform that tests LLMs as controllers that select, sequence, and execute domain expert models (vision, language, vision-language) to solve complex multi-step tasks. The platform provides 185 multi-step tasks (117 linear, 68 non-linear), small task datasets, evaluation metrics (CLIP, BERTScore, ViT), and an LLM tuning loop called RLTF (Reinforcement Learning from Task Feedback). Experiments show closed-source LLMs (GPT-4) lead in zero/few-shot settings, but open-source models (LLaMA-2-13B, Vicuna-7B, Flan-T5-Large) gain large, practical improvements from fine-tuning and RLTF, making smaller models competitive. Code, data, and benchmarks are published.
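
The tuning loop described above (sample a plan, score the executed result, shift the policy toward higher-scoring plans) can be illustrated with a toy policy-gradient sketch. It assumes a REINFORCE-style update with a running-mean baseline over three fixed candidate plans. The reward values, learning rate, and function names are illustrative assumptions; this is not the platform's RLTF implementation, which tunes an LLM rather than a three-entry table of logits.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce(rewards, steps=2000, lr=0.5, seed=0):
    """Tune logits over candidate plans from sampled task feedback (toy REINFORCE)."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        choice = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[choice]              # task feedback for the sampled plan
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)  # running-mean baseline
        # d log pi(choice) / d logit_i = one_hot(choice)_i - probs_i
        for i in range(len(logits)):
            grad = (1.0 if i == choice else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)

# Hypothetical task scores for three candidate plans.
probs = reinforce([0.1, 0.5, 0.9])
print([round(p, 3) for p in probs])
```

In this toy setting the probability mass should drift toward the plan with the highest task score, which is the qualitative behavior the feedback loop relies on.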

Problem Statement

LLMs can reason and call tools, but current tool-using systems struggle with expandability, non-linear (tree) planning, and numeric evaluation. We need a shared platform to (1) combine many domain expert models, (2) create multi-step linear and non-linear tasks, and (3) measure whether an LLM can plan, execute, and improve via task feedback.

Main Contribution

OpenAGI: an open-source platform with 185 multi-step tasks, datasets, evaluation metrics, and a UI to test LLM-driven model synthesis.

A practical pipeline in which an LLM generates a plan over domain expert models, constrained beam search restricts the plan to valid model names, and the selected models are then executed to produce outputs and scores.
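
The pipeline in the second contribution can be sketched as follows. This is a minimal illustration, assuming a plan is written as step names joined by `->` and that each expert model is a plain function from string to string. The `EXPERTS` registry, `parse_plan`, and `run_plan` are hypothetical names, not part of the OpenAGI codebase.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for domain expert models; real ones would be
# vision or language models rather than string functions.
EXPERTS: Dict[str, Callable[[str], str]] = {
    "denoise": lambda x: x.replace("~", ""),
    "caption": lambda x: f"caption({x})",
    "summarize": lambda x: x[:20],
}

def parse_plan(plan_text: str) -> List[str]:
    """Split a plan such as 'denoise -> caption' into step names."""
    return [step.strip() for step in plan_text.split("->") if step.strip()]

def run_plan(plan_text: str, task_input: str) -> str:
    """Run each named expert in order, feeding each output to the next step."""
    data = task_input
    for name in parse_plan(plan_text):
        if name not in EXPERTS:
            raise KeyError(f"unknown expert model: {name}")
        data = EXPERTS[name](data)
    return data

print(run_plan("denoise -> caption", "no~isy im~age"))  # caption(noisy image)
```

The final output would then be compared with a reference using the task's metric to produce a score.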

Key Findings

A large, general LLM (GPT-4) achieves the highest overall OpenAGI scores in zero/few-shot.

Numbers: GPT-4 overall: 0.2378 (zero-shot) -> 0.5281 (few-shot)

Practical Use: If you need good out-of-the-box planning, use a large closed LLM like GPT-4 for zero/few-shot workflows.

Evidence Ref: Table 1

Fine-tuning + RLTF substantially lifts open-source LLMs' task-planning performance.

Numbers: LLaMA-2-13B overall: 0.1533 (zero-shot) -> 0.2967 (fine-tune) -> 0.3735 (RLTF)

Practical Use: Apply LoRA + RLTF to open models to get big gains and approach closed-model performance for model-selection tasks.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Overall (closed-source LLMs) | GPT-4 few-shot 0.5281; GPT-3.5 few-shot 0.3772 | GPT-4 zero-shot 0.2378 | GPT-4 few-shot vs zero-shot: +0.2903 | OpenAGI benchmark (avg of CLIP/BERT/ViT) | Table 1 shows closed-source LLM scores | Table 1
Overall (open-source LLMs) | LLaMA-2-13B RLTF 0.3735; Vicuna-7B RLTF 0.3018 | LLaMA-2-13B zero-shot 0.1533 | LLaMA-2-13B zero-shot -> RLTF: +0.2202 | OpenAGI benchmark | Table 2 quantifies tuning gains for open models | Table 2
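
The deltas can be rechecked from the scores quoted in Key Findings and the Results table. The short sketch below only does the subtraction; the `scores` dictionary layout and the `delta` helper are illustrative.

```python
# Overall scores as quoted in Key Findings and the Results table.
scores = {
    "GPT-4": {"zero-shot": 0.2378, "few-shot": 0.5281},
    "LLaMA-2-13B": {"zero-shot": 0.1533, "fine-tune": 0.2967, "RLTF": 0.3735},
}

def delta(model: str, baseline: str, setting: str) -> float:
    """Absolute gain of `setting` over `baseline`, rounded to 4 decimals."""
    return round(scores[model][setting] - scores[model][baseline], 4)

print(delta("GPT-4", "zero-shot", "few-shot"))        # 0.2903
print(delta("LLaMA-2-13B", "zero-shot", "fine-tune"))  # 0.1434
print(delta("LLaMA-2-13B", "zero-shot", "RLTF"))       # 0.2202
```

On these numbers, supervised fine-tuning accounts for roughly two thirds of the LLaMA-2-13B gain (0.1434 of 0.2202), with RLTF adding the rest.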

What To Try In 7 Days

Clone the OpenAGI repo and run a few benchmark tasks with your LLM.

Plug in one domain model (e.g., image denoiser) and test LLM planning with constrained beam search.

Fine-tune an open LLM with LoRA on a few task plans and compare zero/few-shot vs fine-tune outputs on 10 tasks from the suite.
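
To see why LoRA keeps the fine-tuning step cheap, it helps to count trainable parameters. LoRA freezes a weight matrix of shape d_out x d_in and trains a low-rank update B·A, where B is d_out x r and A is r x d_in. The layer size and rank below are illustrative assumptions, not values taken from the paper.

```python
from typing import Tuple

def lora_param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Trainable parameters for full fine-tuning vs. a rank-`rank` LoRA update."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, lora

# Illustrative sizes: one 4096 x 4096 projection with rank 8.
full, lora = lora_param_counts(4096, 4096, rank=8)
print(full, lora, f"{lora / full:.2%}")  # 16777216 65536 0.39%
```

For a layer of this size, the adapter trains well under 1% of the parameters that full fine-tuning would update.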

Agent Features

Memory
Short-term plan outputs used for immediate execution (no long-term retrieval described)
Planning
Non-linear task planning (tree-structured plans); constrained beam search for valid model sequences
Tool Use
Selects, sequences and executes specialist models; calls web tools/APIs via LangChain (Google/Wikipedia/Wolfram)
Frameworks
LangChain; Hugging Face (transformers, diffusers)
Is Agentic

Yes

Architectures
LLM controller + external domain expert models
Collaboration
Orchestrates multiple models in parallel or sequence
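
A non-linear plan differs from a linear one in that a step may consume the outputs of several earlier steps. A minimal sketch, assuming the plan is given as a dependency graph and using the standard-library `graphlib` module (Python 3.9+) to order execution. The node names and toy functions are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Tuple

# Each node maps to (function, names of upstream nodes). Names are illustrative.
Plan = Dict[str, Tuple[Callable[..., str], List[str]]]

def run_tree_plan(plan: Plan, inputs: Dict[str, str]) -> Dict[str, str]:
    """Execute nodes in dependency order; raw inputs count as finished results."""
    results = dict(inputs)
    order = TopologicalSorter({name: deps for name, (_, deps) in plan.items()})
    for name in order.static_order():
        if name in results:  # raw input, nothing to run
            continue
        fn, deps = plan[name]
        results[name] = fn(*(results[d] for d in deps))
    return results

plan: Plan = {
    "denoised": (lambda img: img.replace("~", ""), ["image"]),
    "normalized": (lambda q: q.lower(), ["question"]),
    "answer": (lambda img, q: f"vqa({img}, {q})", ["denoised", "normalized"]),
}
out = run_tree_plan(plan, {"image": "c~at", "question": "WHAT ANIMAL?"})
print(out["answer"])  # vqa(cat, what animal?)
```

The two branches (`denoised` and `normalized`) are independent, so they could also run in parallel before the merging step.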

Optimization Features

Model Optimization
LoRA
System Optimization
Plan parsing via a GPT-3.5-based parser to map free text to module sequences
Training Optimization
RL; fine-tuning with human-labeled plan solutions
Inference Optimization
Constrained beam search to ensure valid model-name outputs
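
The idea behind constrained beam search can be shown on a toy example: at each step, only tokens that keep the partial output a prefix of some valid model name are allowed, so every finished beam is a real model name. The sketch below assumes model names are pre-split into token tuples and that no name is a prefix of another. `toy_log_prob` is a hypothetical stand-in for the LLM's next-token log-probabilities.

```python
import math
from typing import Callable, List, Sequence, Tuple

Seq = Tuple[str, ...]

def allowed_next(prefix: Seq, names: Sequence[Seq]) -> List[str]:
    """Tokens that keep `prefix` a prefix of at least one valid model name."""
    n = len(prefix)
    return sorted({nm[n] for nm in names if len(nm) > n and nm[:n] == prefix})

def constrained_beam_search(
    names: Sequence[Seq],
    log_prob: Callable[[Seq, str], float],
    beam_width: int = 2,
) -> List[Tuple[Seq, float]]:
    """Return the complete valid names found by beam search, best score first."""
    beams: List[Tuple[Seq, float]] = [((), 0.0)]
    finished: List[Tuple[Seq, float]] = []
    while beams:
        candidates = [
            (prefix + (tok,), score + log_prob(prefix, tok))
            for prefix, score in beams
            for tok in allowed_next(prefix, names)
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            (finished if seq in names else beams).append((seq, score))
    return sorted(finished, key=lambda c: c[1], reverse=True)

NAMES = [("image", "denoising"), ("image", "captioning"), ("text", "summarization")]
PREFS = {"image": 0.6, "text": 0.3, "captioning": 0.7,
         "denoising": 0.2, "summarization": 0.9}

def toy_log_prob(prefix: Seq, token: str) -> float:
    """Hypothetical scorer; a real system would query the LLM here."""
    return math.log(PREFS[token])

best = constrained_beam_search(NAMES, toy_log_prob)
print(best[0][0])  # ('image', 'captioning')
```

Because invalid continuations are never scored, the search cannot emit a name outside the registry, whatever the underlying model prefers.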

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Creative Commons Attribution 4.0 International (data); code repo states open‑sor

Risks & Boundaries

Limitations

Benchmarks use small per-task datasets (100 samples each), limiting statistical strength

Performance depends on the quality and domain coverage of plugged expert models
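
The first limitation can be made concrete with a rough bound. Assuming per-sample scores lie in [0, 1] and samples are independent (both assumptions; the source does not state the score ranges), the standard deviation is at most 0.5, which caps the normal-approximation 95% confidence half-width for a task's mean score.

```python
import math

def worst_case_halfwidth(n: int, z: float = 1.96) -> float:
    """Upper bound on the 95% CI half-width for the mean of n scores in [0, 1]."""
    max_std = 0.5  # a [0, 1]-bounded variable has standard deviation at most 0.5
    return z * max_std / math.sqrt(n)

print(round(worst_case_halfwidth(100), 3))   # 0.098
print(round(worst_case_halfwidth(1000), 3))  # 0.031
```

With 100 samples, a per-task difference smaller than about 0.1 may not be distinguishable from noise in the worst case; real score distributions are usually tighter than this bound.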

When Not To Use

Do not use as-is for safety-critical systems without human oversight

Not ideal for single-step tasks where direct specialist models are already sufficient

Failure Modes

LLM produces invalid or suboptimal model sequences that degrade output quality

Plan hallucination: proposing nonexistent or inappropriate model calls
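
A simple guard against the second failure mode is to check every planned step against the registry of available models before execution. The sketch below assumes plans arrive as lists of step names. The registry contents and `check_plan` are hypothetical, and the fuzzy-match suggestion uses the standard-library `difflib.get_close_matches`.

```python
from difflib import get_close_matches
from typing import Dict, List, Optional

# Hypothetical registry of available expert-model names.
REGISTRY = ["image_denoising", "image_captioning",
            "text_summarization", "object_detection"]

def check_plan(steps: List[str], registry: List[str] = REGISTRY) -> Dict[str, Optional[str]]:
    """Map each unknown step name to its closest registered name, or None."""
    problems: Dict[str, Optional[str]] = {}
    for step in steps:
        if step not in registry:
            match = get_close_matches(step, registry, n=1, cutoff=0.6)
            problems[step] = match[0] if match else None
    return problems

print(check_plan(["image_denoising", "image_caption", "ocr"]))
# {'image_caption': 'image_captioning', 'ocr': None}
```

An empty result means every step names a registered model. A suggestion can be used to repair a near-miss, while `None` signals a step that should be rejected or re-planned.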

Core Entities

Models

GPT-3.5-turbo, GPT-4, Claude-2, Flan-T5-Large, Vicuna-7B, LLaMA-2-13B, Stable Diffusion, Restormer, Swin2SR, GIT, DETR, ViT, BART, T5, DistilRoBERTa, DistilBERT

Metrics

CLIP Score, BERT Score, ViT Score

Datasets

ImageNet-1K, COCO, CNN/Daily Mail, SST2, TextVQA, SQuAD, OpenAGI benchmark datasets (185 tasks, 100 samples each)

Benchmarks

OpenAGI multi-step task suite