OpenAGI: an open platform that lets LLMs plan and call specialist models to solve multi-step tasks

April 10, 2023 · 8 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.45

Citation Count

76

Authors

Yingqiang Ge, Wenyue Hua, Kai Mei, Jianchao Ji, Juntao Tan, Shuyuan Xu, Zelong Li, Yongfeng Zhang

Why It Matters For Business

OpenAGI shows that you can compose existing specialist models under LLM control and use RL-style tuning to make smaller, cheaper models competitive. This is useful when building product workflows that call vision, text, or web tools.

Summary TLDR

OpenAGI is an open-source research platform that tests LLMs as controllers that select, sequence, and execute domain expert models (vision, language, vision-language) to solve complex multi-step tasks. The platform provides 185 multi-step tasks (117 linear, 68 non-linear), small task datasets, evaluation metrics (CLIP, BERTScore, ViT), and an LLM tuning loop called RLTF (Reinforcement Learning from Task Feedback). Experiments show closed-source LLMs (GPT-4) lead in zero/few-shot settings, but open-source models (LLaMA-2-13B, Vicuna-7B, Flan-T5-Large) gain large, practical improvements from fine-tuning and RLTF, making smaller models competitive. Code, data, and benchmarks are published.

Problem Statement

LLMs can reason and call tools, but current tool-using systems struggle with expandability, non-linear (tree) planning, and numeric evaluation. We need a shared platform to (1) combine many domain expert models, (2) create multi-step linear and non-linear tasks, and (3) measure whether an LLM can plan, execute, and improve via task feedback.

Main Contribution

OpenAGI: an open-source platform with 185 multi-step tasks, datasets, evaluation metrics, and a UI to test LLM-driven model synthesis.

A practical pipeline in which an LLM generates a plan of domain expert models, constrained beam search keeps the generated plan to valid model names, and the selected models are then executed to produce outputs and scores.

RLTF (Reinforcement Learning from Task Feedback): tune LLMs using task-level reward signals (REINFORCE) so planning improves from task outcomes.

Empirical comparison of closed- and open-source LLMs across zero/few-shot, fine-tuning, and RLTF showing RLTF strongly helps smaller open models.
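The RLTF idea (a REINFORCE update driven by a task-level reward) can be illustrated with a deliberately small sketch. This is not the platform's training code: the policy here is a softmax over three hypothetical candidate plans rather than an LLM, and the plan names, `TASK_SCORE` values, learning rate, and moving-average baseline are all assumptions made for illustration.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, reward_fn, lr=0.5, baseline=0.0, rng=random):
    """One REINFORCE update for a softmax policy over discrete actions.

    For a softmax policy, d log pi(a) / d theta_k = 1[k == a] - pi_k,
    so each logit moves by lr * (reward - baseline) * that gradient.
    """
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    reward = reward_fn(action)
    advantage = reward - baseline
    new_logits = [
        theta + lr * advantage * ((1.0 if k == action else 0.0) - probs[k])
        for k, theta in enumerate(logits)
    ]
    return new_logits, action, reward

# Hypothetical candidate plans and the task score each would receive
# (illustrative values, not taken from the benchmark).
PLANS = [("GIT",), ("Swin2SR", "GIT"), ("Restormer", "Swin2SR", "GIT")]
TASK_SCORE = [0.1, 0.3, 0.8]

def train(steps=2000, seed=0):
    """Tune plan preferences from task feedback; returns plan probabilities."""
    rng = random.Random(seed)
    logits = [0.0] * len(PLANS)
    baseline = 0.0
    for _ in range(steps):
        logits, _, reward = reinforce_step(
            logits, lambda a: TASK_SCORE[a], baseline=baseline, rng=rng
        )
        # Moving-average baseline to reduce the variance of the update.
        baseline = 0.9 * baseline + 0.1 * reward
    return softmax(logits)

if __name__ == "__main__":
    for plan, p in zip(PLANS, train()):
        print(" -> ".join(plan), round(p, 3))
```

In an LLM setting the same update would apply to the log-probability of the generated plan tokens rather than to three logits, but the structure (sample a plan, score the task output, scale the log-probability gradient by the reward) is the same.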

Key Findings

A large, general LLM (GPT-4) achieves the highest overall OpenAGI scores in zero/few-shot.

Numbers: GPT-4 overall: 0.2378 (zero-shot) -> 0.5281 (few-shot)

Fine-tuning + RLTF substantially lifts open-source LLMs' task-planning performance.

Numbers: LLaMA-2-13B overall: 0.1533 (zero-shot) -> 0.2967 (fine-tune) -> 0.3735 (RLTF)

OpenAGI contains 185 multi-step tasks with both linear and non-linear planning structures.

Numbers: 185 tasks total: 117 linear, 68 non-linear

Providing detailed model descriptions in prompts helps large closed models but can confuse smaller open models.

Numbers: Zero-shot Prompt-2 improves GPT-3.5/GPT-4 but can reduce open-model performance (see Tab. 3/4 differences)

Results

Overall (closed-source LLMs)

Value: GPT-4 few-shot 0.5281; GPT-3.5 few-shot 0.3772

Baseline: zero-shot GPT-4 0.2378

Overall (open-source LLMs)

Value: LLaMA-2-13B RLTF 0.3735; Vicuna-7B RLTF 0.3018

Baseline: LLaMA-2-13B zero-shot 0.1533

Task split

Value: 185 total tasks: 117 linear, 68 non-linear
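To put the reported overall scores on a common footing, the short sketch below computes each setting's score as a multiple of the same model's zero-shot score. The numbers are copied from the findings above; the dictionary layout and the function name `gains_over_zero_shot` are illustrative choices.

```python
# Overall OpenAGI scores as reported in the summary above.
SCORES = {
    "GPT-4": {"zero-shot": 0.2378, "few-shot": 0.5281},
    "LLaMA-2-13B": {"zero-shot": 0.1533, "fine-tune": 0.2967, "RLTF": 0.3735},
}

def gains_over_zero_shot(scores):
    """Return each non-baseline score divided by the model's zero-shot score."""
    gains = {}
    for model, by_setting in scores.items():
        base = by_setting["zero-shot"]
        gains[model] = {
            setting: value / base
            for setting, value in by_setting.items()
            if setting != "zero-shot"
        }
    return gains

if __name__ == "__main__":
    for model, by_setting in gains_over_zero_shot(SCORES).items():
        for setting, ratio in by_setting.items():
            print(f"{model} {setting}: {ratio:.2f}x zero-shot")
```

On these figures, GPT-4 few-shot is roughly 2.2x its zero-shot score and LLaMA-2-13B with RLTF is roughly 2.4x its zero-shot score, although GPT-4 few-shot (0.5281) remains higher in absolute terms than LLaMA-2-13B with RLTF (0.3735).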

Who Should Care

What To Try In 7 Days

Clone the OpenAGI repo and run a few benchmark tasks with your LLM.

Plug in one domain model (e.g., image denoiser) and test LLM planning with constrained beam search.

Fine-tune an open LLM with LoRA on a few task plans and compare zero/few-shot vs fine-tune outputs on 10 tasks from the suite.

Agent Features

Memory

  • Short-term plan outputs used for immediate execution (no long-term retrieval described)

Planning

  • Non-linear task planning (tree-structured plans)
  • Constrained beam search for valid model sequences
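
A tree-structured (non-linear) plan can be thought of as a node per model call whose inputs are either raw task inputs or the outputs of other calls. The sketch below is one possible representation, not the platform's own data structure: `PlanNode`, `run_plan`, and the string-building stand-in "models" in `REGISTRY` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class PlanNode:
    """One model call in a plan (hypothetical representation).

    Leaf nodes read a raw task input named by `source`; inner nodes
    consume the outputs of their `children`.
    """
    model: str
    children: List["PlanNode"] = field(default_factory=list)
    source: Optional[str] = None

def run_plan(node: PlanNode,
             registry: Dict[str, Callable[..., str]],
             raw_inputs: Dict[str, str]) -> str:
    """Execute a plan tree bottom-up and return the root's output."""
    if node.children:
        args = [run_plan(child, registry, raw_inputs) for child in node.children]
    else:
        args = [raw_inputs[node.source]]
    return registry[node.model](*args)

# Stand-in "models" that only record the call structure as a string.
REGISTRY: Dict[str, Callable[..., str]] = {
    "denoise": lambda image: f"denoise({image})",
    "caption": lambda image: f"caption({image})",
    "translate": lambda text: f"translate({text})",
    "qa": lambda context, question: f"qa({context}, {question})",
}

# Two branches (image and question) that merge in a final QA step.
PLAN = PlanNode(
    model="qa",
    children=[
        PlanNode("caption", children=[PlanNode("denoise", source="image")]),
        PlanNode("translate", source="question"),
    ],
)

if __name__ == "__main__":
    print(run_plan(PLAN, REGISTRY, {"image": "IMG", "question": "Q"}))
```

A linear plan is the special case where every node has at most one child, so the same executor would cover both task types.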

Tool Use

  • Selects, sequences and executes specialist models
  • Calls web tools/APIs via LangChain (Google/Wikipedia/Wolfram)

Frameworks

  • LangChain
  • Hugging Face (transformers, diffusers)

Is Agentic

true

Architectures

  • LLM controller + external domain expert models

Collaboration

  • Orchestrates multiple models in parallel or sequence

Optimization Features

Model Optimization

  • LoRA

System Optimization

  • Plan parsing via a GPT-3.5-based parser to map free text to module sequences

Training Optimization

  • Reinforcement learning from task feedback (RLTF, REINFORCE-based)
  • Fine-tuning with human-labeled plan solutions

Inference Optimization

  • Constrained beam search to ensure valid model-name outputs
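
As a rough illustration of how a constraint keeps decoding on valid model names, the sketch below runs beam search over whole model names and drops any candidate outside an allowed set. It is a toy under stated assumptions: `toy_next_log_probs` is a hypothetical stand-in for an LLM's next-step distribution, the invalid names are invented to show the filter working, and `ALLOWED` simply reuses a few model names from this summary. A real implementation would constrain token-level generation and renormalize probabilities after masking, which this sketch does not do.

```python
import math

END = "<end>"
ALLOWED = {"Restormer", "Swin2SR", "GIT", "BART"}

def toy_next_log_probs(prefix):
    """Hypothetical next-step distribution, keyed only on prefix length."""
    if len(prefix) == 0:
        probs = {"ImageFixer9000": 0.5, "Restormer": 0.3, "Swin2SR": 0.2}
    elif len(prefix) == 1:
        probs = {"GIT": 0.6, "MadeUpCaptioner": 0.3, END: 0.1}
    else:
        probs = {END: 0.9, "BART": 0.1}
    return {name: math.log(p) for name, p in probs.items()}

def constrained_beam_search(next_log_probs, allowed, beam_width=2, max_len=3):
    """Return finished (sequence, log_prob) pairs, best first.

    Candidates whose next name is not in `allowed` are discarded, so
    every returned sequence consists only of valid model names.
    """
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len + 1):
        candidates = []
        for seq, score in beams:
            for name, lp in next_log_probs(seq).items():
                if name == END:
                    finished.append((seq, score + lp))
                elif name in allowed and len(seq) < max_len:
                    candidates.append((seq + (name,), score + lp))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return sorted(finished, key=lambda c: c[1], reverse=True)

if __name__ == "__main__":
    for seq, score in constrained_beam_search(toy_next_log_probs, ALLOWED):
        print(" -> ".join(seq), round(math.exp(score), 4))
```

Note that the highest-probability first step in the toy distribution is an invalid name; without the filter, greedy decoding would start the plan with a model that does not exist.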

Reproducibility

License

  • Creative Commons Attribution 4.0 International (data); code repo states open‑sor

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Benchmarks use small per-task datasets (100 samples each), limiting statistical strength
  • Performance depends on the quality and domain coverage of plugged expert models
  • OpenAGI relies on closed-source LLMs for best zero-shot results, reducing reproducibility for some users
  • OOD generalization and optimal plan search remain unsolved and can cause poor outputs

When Not To Use

  • Do not use as-is for safety-critical systems without human oversight
  • Not ideal for single-step tasks where direct specialist models are already sufficient
  • Avoid expecting guaranteed best-order model sequencing; planning can be suboptimal

Failure Modes

  • LLM produces invalid or suboptimal model sequences that degrade output quality
  • Plan hallucination: proposing nonexistent or inappropriate model calls
  • Out-of-distribution inputs break specialist model performance leading to bad final outputs
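
The first two failure modes can be caught cheaply before any model runs by checking a proposed plan against a registry of known models. The sketch below is one hypothetical way to do that for linear plans: `validate_plan` and the `REGISTRY` layout are assumptions, and the input/output modality labels are illustrative annotations for a few of the models listed in this summary.

```python
# Illustrative registry: model name -> (expected input type, produced output type).
REGISTRY = {
    "Restormer": ("image", "image"),
    "Swin2SR": ("image", "image"),
    "GIT": ("image", "text"),
    "BART": ("text", "text"),
    "Stable Diffusion": ("text", "image"),
}

def validate_plan(plan, registry, input_type):
    """Return a list of problems found in a linear plan (empty if none).

    Flags model names missing from the registry (hallucinated calls) and
    steps whose expected input type does not match what the previous
    step produced (invalid sequencing).
    """
    problems = []
    current = input_type
    for i, name in enumerate(plan):
        if name not in registry:
            problems.append(f"step {i}: unknown model {name!r}")
            continue  # output type unknown; keep the last known type
        expects, produces = registry[name]
        if expects != current:
            problems.append(f"step {i}: {name} expects {expects}, got {current}")
        current = produces
    return problems

if __name__ == "__main__":
    print(validate_plan(["Restormer", "GIT", "BART"], REGISTRY, "image"))
    print(validate_plan(["Restormer", "ImageFixer9000"], REGISTRY, "image"))
    print(validate_plan(["GIT", "Swin2SR"], REGISTRY, "image"))
```

A check like this does not address the third failure mode: a plan can be valid and still produce poor output when inputs fall outside what the specialist models were trained on.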

Core Entities

Models

  • GPT-3.5-turbo
  • GPT-4
  • Claude-2
  • Flan-T5-Large
  • Vicuna-7B
  • LLaMA-2-13B
  • Stable Diffusion
  • Restormer
  • Swin2SR
  • GIT
  • DETR
  • ViT
  • BART
  • T5
  • DistilRoBERTa
  • DistilBERT

Metrics

  • CLIP Score
  • BERTScore
  • ViT Score

Datasets

  • ImageNet-1K
  • COCO
  • CNN/Daily Mail
  • SST-2
  • TextVQA
  • SQuAD
  • OpenAGI benchmark datasets (185 tasks, 100 samples each)

Benchmarks

  • OpenAGI multi-step task suite