OpenAGI: an open platform that lets LLMs plan and call specialist models to solve multi-step tasks

Overview

Decision Snapshot: Needs Validation

The platform is a practical, open pipeline, and its experiments show that RLTF and LoRA materially improve planning by open models. Code and datasets are available, but many tasks use small datasets, and real-world robustness and out-of-distribution (OOD) generalization remain open problems.

Citations: 76

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/3

Reproducibility

Status: Code + data available

Open source: Partial

License: Creative Commons Attribution 4.0 International (data); code repo states open‑sor

At A Glance

Cost impact: 45%

Production readiness: 50%

Novelty: 60%

Authors

Yingqiang Ge, Wenyue Hua, Kai Mei, Jianchao Ji, Juntao Tan, Shuyuan Xu, Zelong Li, Yongfeng Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

OpenAGI shows you can compose existing specialist models under LLM control and use RL-style tuning to make smaller, cheaper models competitive—useful for building product workflows that call vision, text, or web tools.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist, Founder

Summary TLDR

OpenAGI is an open-source research platform that tests LLMs as controllers that select, sequence, and execute domain expert models (vision, language, vision-language) to solve complex multi-step tasks. The platform provides 185 multi-step tasks (117 linear, 68 non-linear), small task datasets, evaluation metrics (CLIP, BERTScore, ViT), and an LLM tuning loop called RLTF (Reinforcement Learning from Task Feedback). Experiments show closed-source LLMs (GPT-4) lead in zero/few-shot settings, but open-source models (LLaMA-2-13B, Vicuna-7B, Flan-T5-Large) gain large, practical improvements from fine-tuning and RLTF, making smaller models competitive. Code, data, and benchmarks are published.

Problem Statement

LLMs can reason and call tools, but current tool-using systems struggle with expandability, non-linear (tree) planning, and numeric evaluation. We need a shared platform to (1) combine many domain expert models, (2) create multi-step linear and non-linear tasks, and (3) measure whether an LLM can plan, execute, and improve via task feedback.

Main Contribution

OpenAGI: an open-source platform with 185 multi-step tasks, datasets, evaluation metrics, and a UI to test LLM-driven model synthesis.

A practical pipeline in which an LLM generates a plan over domain expert models, constrained beam search keeps the generated model names valid, and the selected models are then executed to produce outputs and scores.
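
The plan-then-execute flow described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `REGISTRY` dictionary, the stand-in lambdas, and the comma-separated plan format are hypothetical and are not OpenAGI's actual API.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for domain expert models (not OpenAGI's real modules).
REGISTRY: Dict[str, Callable[[str], str]] = {
    "Image Denoising": lambda x: f"denoised({x})",
    "Image Captioning": lambda x: f"caption({x})",
    "Text Summarization": lambda x: f"summary({x})",
}


def parse_plan(plan_text: str) -> List[str]:
    """Split a comma-separated plan string into module names (assumed format)."""
    return [step.strip() for step in plan_text.split(",") if step.strip()]


def execute_plan(steps: List[str], task_input: str) -> str:
    """Run the modules in order, feeding each output into the next module."""
    output = task_input
    for name in steps:
        output = REGISTRY[name](output)
    return output


# A plan as an LLM controller might emit it for a "caption a noisy image" task.
result = execute_plan(parse_plan("Image Denoising, Image Captioning"), "noisy.png")
print(result)
```

In a real setup the lambdas would be replaced by calls to the actual expert models, and the final output would be scored against a reference with the benchmark metrics.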

Key Findings

A large, general LLM (GPT-4) achieves the highest overall OpenAGI scores in zero/few-shot.

Numbers: GPT-4 overall: 0.2378 (zero) -> 0.5281 (few)

Practical Use: If you need good out-of-the-box planning, use a large closed LLM like GPT-4 for zero/few-shot workflows.

Evidence Ref: Table 1

Fine-tuning + RLTF substantially lifts open-source LLMs' task-planning performance.

Numbers: LLaMA-2-13B overall: 0.1533 (zero) -> 0.2967 (fine-tune) -> 0.3735 (RLTF)

Practical Use: Apply LoRA + RLTF to open models to get big gains and approach closed-model performance for model-selection tasks.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Overall (closed-source LLMs)	GPT-4 few-shot 0.5281; GPT-3.5 few-shot 0.3772	zero-shot GPT-4 0.2378	GPT-4 few vs zero +0.2903	OpenAGI benchmark (avg of CLIP/BERT/ViT)	Table 1 shows closed-source LLM scores	Table 1
Overall (open-source LLMs)	LLaMA-2-13B RLTF 0.3735; Vicuna-7B RLTF 0.3018	LLaMA zero-shot 0.1533	LLaMA zero -> RLTF +0.2202	OpenAGI benchmark	Table 2 quantifies tuning gains for open models	Table 2
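
The deltas in the table can be re-derived from the quoted scores. The short Python check below uses only numbers that appear in this summary; the helper names `delta` and `relative_gain` are introduced here for illustration.

```python
# Overall scores quoted in the Results table and Key Findings.
gpt4_zero, gpt4_few = 0.2378, 0.5281
gpt35_few = 0.3772
llama_zero, llama_ft, llama_rltf = 0.1533, 0.2967, 0.3735


def delta(new: float, old: float) -> float:
    """Absolute score change, rounded to the table's four decimals."""
    return round(new - old, 4)


def relative_gain(new: float, old: float) -> float:
    """Score change as a fraction of the starting score."""
    return (new - old) / old


gpt4_gain = delta(gpt4_few, gpt4_zero)        # 0.2903, matches the table
llama_gain = delta(llama_rltf, llama_zero)    # 0.2202, matches the table
ft_step = delta(llama_ft, llama_zero)         # 0.1434 from fine-tuning
rltf_step = delta(llama_rltf, llama_ft)       # 0.0768 added by RLTF
gap_to_gpt35 = delta(gpt35_few, llama_rltf)   # 0.0037 behind GPT-3.5 few-shot

print(gpt4_gain, llama_gain, ft_step, rltf_step, gap_to_gpt35)
print(relative_gain(gpt4_few, gpt4_zero), relative_gain(llama_rltf, llama_zero))
```

Read this way, most of the LLaMA-2-13B gain comes from fine-tuning, RLTF adds a smaller second step, and the tuned model lands close to GPT-3.5 few-shot while remaining well below GPT-4 few-shot.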

What To Try In 7 Days

Clone the OpenAGI repo and run a few benchmark tasks with your LLM.

Plug in one domain model (e.g., image denoiser) and test LLM planning with constrained beam search.

Fine-tune an open LLM with LoRA on a few task plans and compare zero/few-shot vs fine-tune outputs on 10 tasks from the suite.
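
As a rough guide to why LoRA makes the fine-tuning step above cheap, the sketch below counts trainable parameters for a single weight matrix. LoRA keeps the original d_out x d_in matrix frozen and trains two low-rank factors, B (d_out x r) and A (r x d_in). The hidden size 4096 and rank 8 are hypothetical values chosen for the arithmetic, not settings reported for OpenAGI.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def full_params(d_out: int, d_in: int) -> int:
    """Parameters in the full (frozen) weight matrix."""
    return d_out * d_in


d = 4096  # hypothetical hidden size
r = 8     # hypothetical LoRA rank

lora = lora_trainable_params(d, d, r)  # 65536
full = full_params(d, d)               # 16777216
fraction = lora / full                 # 0.00390625, i.e. about 0.4%

print(lora, full, fraction)
```

Under these assumptions each adapted matrix trains about 0.4% of the parameters that full fine-tuning would update, which is what makes a quick comparison on 10 tasks feasible on modest hardware.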

Agent Features

Memory

Short-term plan outputs used for immediate execution (no long-term retrieval described)

Planning

Non-linear task planning (tree-structured plans)

Constrained beam search for valid model sequences
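
A tree-structured plan differs from a linear one in that a module can consume the outputs of several upstream branches. The following Python sketch shows one way to represent and execute such a plan; the `PlanNode` class, the `MODULES` table, and the example task are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PlanNode:
    """One module in the plan; children are the branches that feed it."""
    module: str
    children: List["PlanNode"] = field(default_factory=list)


# Hypothetical stand-ins: each module takes a list of inputs, returns one output.
MODULES: Dict[str, Callable[[List[str]], str]] = {
    "Image Deblurring": lambda inputs: f"deblur({inputs[0]})",
    "Machine Translation": lambda inputs: f"translate({inputs[0]})",
    "Visual Question Answering": lambda inputs: f"vqa({', '.join(inputs)})",
}


def run(node: PlanNode, leaf_inputs: Dict[str, str]) -> str:
    """Evaluate the tree bottom-up: leaves read task inputs, parents merge branches."""
    if not node.children:
        return MODULES[node.module]([leaf_inputs[node.module]])
    child_outputs = [run(child, leaf_inputs) for child in node.children]
    return MODULES[node.module](child_outputs)


# Two independent branches (image and text) merge in the final module.
plan = PlanNode(
    "Visual Question Answering",
    [PlanNode("Image Deblurring"), PlanNode("Machine Translation")],
)
answer = run(plan, {"Image Deblurring": "blurry.png", "Machine Translation": "frage.txt"})
print(answer)
```

Because the two branches do not depend on each other, an orchestrator could also run them in parallel before the merging step.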

Tool Use

Selects, sequences and executes specialist models

Calls web tools/APIs via LangChain (Google/Wikipedia/Wolfram)

Frameworks

LangChain

Hugging Face (transformers, diffusers)

Is Agentic

Yes

Architectures

LLM controller + external domain expert models

Collaboration

Orchestrates multiple models in parallel or sequence

Optimization Features

Model Optimization

LoRA

System Optimization

Plan parsing via a GPT-3.5-based parser to map free text to module sequences

Training Optimization

RL

Fine-tuning with human-labeled plan solutions
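
The idea of learning from task feedback can be sketched as a policy-gradient loop: sample a plan, execute it, and use the task score as the reward. The toy Python example below assumes a REINFORCE-style update with a moving-average baseline, which this summary does not specify; the candidate plans and their rewards are invented for illustration, and a real setup would update LLM (or LoRA) weights rather than three logits.

```python
import math
import random
from typing import List

PLANS = ["denoise->caption", "caption->denoise", "caption"]
# Hypothetical task feedback: the score each plan would earn when executed.
REWARD = {"denoise->caption": 0.8, "caption->denoise": 0.2, "caption": 0.4}


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps: int = 3000, lr: float = 0.2, seed: int = 0) -> List[float]:
    """REINFORCE over a softmax policy on plans, with a moving-average baseline."""
    rng = random.Random(seed)
    logits = [0.0] * len(PLANS)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        i = rng.choices(range(len(PLANS)), weights=probs)[0]  # sample a plan
        reward = REWARD[PLANS[i]]                             # task feedback
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log pi(i) w.r.t. logit j is (1 if j == i else 0) - probs[j].
        for j in range(len(PLANS)):
            indicator = 1.0 if j == i else 0.0
            logits[j] += lr * advantage * (indicator - probs[j])
    return softmax(logits)


final_probs = train()
print(dict(zip(PLANS, final_probs)))
```

The policy shifts probability toward the plan with the highest task score without ever seeing a labeled plan, which is the role feedback-based tuning plays alongside supervised fine-tuning.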

Inference Optimization

Constrained beam search to ensure valid model-name outputs
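
A minimal Python sketch of constrained beam search follows. At each decoding step only tokens that extend a prefix of some valid model name are allowed, so an invalid name can never be produced. The word-level tokens, the `VALID_NAMES` list, and the toy log-probabilities in `TOY_LOGP` are assumptions for illustration; the sketch also assumes no valid name is a strict prefix of another.

```python
import math
from typing import Callable, List, Tuple

Seq = Tuple[str, ...]

# Hypothetical valid model names, pre-split into word-level tokens.
VALID_NAMES: List[Seq] = [
    ("image", "denoising"),
    ("image", "captioning"),
    ("text", "summarization"),
]


def allowed_next(prefix: Seq) -> List[str]:
    """Tokens that extend `prefix` toward at least one valid name."""
    n = len(prefix)
    return sorted({name[n] for name in VALID_NAMES
                   if len(name) > n and name[:n] == prefix})


def constrained_beam_search(score: Callable[[Seq, str], float],
                            beam_width: int = 2) -> List[Tuple[Seq, float]]:
    """Beam search that only expands tokens permitted by `allowed_next`."""
    beams: List[Tuple[Seq, float]] = [((), 0.0)]
    finished: List[Tuple[Seq, float]] = []
    while beams:
        candidates = [(prefix + (tok,), logp + score(prefix, tok))
                      for prefix, logp in beams
                      for tok in allowed_next(prefix)]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, logp in candidates[:beam_width]:
            # Complete names leave the beam; partial names keep expanding.
            (finished if seq in VALID_NAMES else beams).append((seq, logp))
    return sorted(finished, key=lambda c: c[1], reverse=True)


# Toy stand-in for the LLM's next-token log-probabilities.
TOY_LOGP = {"image": math.log(0.6), "text": math.log(0.4),
            "denoising": math.log(0.3), "captioning": math.log(0.7),
            "summarization": math.log(1.0)}

results = constrained_beam_search(lambda prefix, tok: TOY_LOGP.get(tok, -10.0))
print(results[0][0])
```

With a Hugging Face model, a comparable constraint is usually supplied to the decoder rather than implemented by hand; the loop above only shows the mechanism.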

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Creative Commons Attribution 4.0 International (data); code repo states open‑sor

Code URLs

https://github.com/agiresearch/OpenAGI

https://drive.google.com/drive/folders/1AjT6y7qLIMxcmHhUBG5IE1_5SnCPR57e

Data URLs

https://drive.google.com/drive/folders/1AjT6y7qLIMxcmHhUBG5IE1_5SnCPR57e

https://github.com/agiresearch/OpenAGI

Risks & Boundaries

Limitations

Benchmarks use small per-task datasets (100 samples each), limiting statistical strength
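
To make the statistical limitation concrete, the Python sketch below estimates the 95% confidence half-width of a mean score from 100 samples. It assumes independent samples and a normal approximation; the per-sample standard deviation of 0.25 is a hypothetical value, not one reported for the benchmark.

```python
import math


def ci_half_width(sample_std: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a mean."""
    return z * sample_std / math.sqrt(n)


hw_100 = ci_half_width(0.25, 100)  # 1.96 * 0.25 / 10 = 0.049
hw_400 = ci_half_width(0.25, 400)  # 1.96 * 0.25 / 20 = 0.0245

print(hw_100, hw_400)
```

Under these assumptions, a difference of less than about 0.05 between two systems on a single 100-sample task could be noise, and halving that uncertainty would take four times as many samples.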

Performance depends on the quality and domain coverage of plugged expert models

When Not To Use

Do not use as-is for safety-critical systems without human oversight

Not ideal for single-step tasks where direct specialist models are already sufficient

Failure Modes

LLM produces invalid or suboptimal model sequences that degrade output quality

Plan hallucination: proposing nonexistent or inappropriate model calls
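
One simple guard against hallucinated model calls is to check every plan step against the list of installed modules before execution. The Python sketch below does this and uses the standard-library `difflib.get_close_matches` to suggest the nearest valid name; `KNOWN_MODULES` and the example plan are hypothetical.

```python
import difflib
from typing import List, Optional, Tuple

# Hypothetical list of modules actually available to the controller.
KNOWN_MODULES = ["Image Denoising", "Image Captioning", "Text Summarization"]


def check_plan(steps: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Return (step, suggestion) for every step that is not a known module."""
    problems: List[Tuple[str, Optional[str]]] = []
    for step in steps:
        if step in KNOWN_MODULES:
            continue
        close = difflib.get_close_matches(step, KNOWN_MODULES, n=1, cutoff=0.6)
        problems.append((step, close[0] if close else None))
    return problems


issues = check_plan(["Image Denoiser", "Image Captioning", "Video Dubbing"])
for step, suggestion in issues:
    print(f"unknown module {step!r}; closest known module: {suggestion!r}")
```

A check like this catches nonexistent names but not plans that are valid yet suboptimal; those still need the task-level scores to detect.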

Core Entities

Models

GPT-3.5-turbo, GPT-4, Claude-2, Flan-T5-Large, Vicuna-7B, LLaMA-2-13B, Stable Diffusion, Restormer, Swin2SR, GIT, DETR, ViT, BART, T5, DistilRoBERTa, DistilBERT

Metrics

CLIP Score, BERT Score, ViT Score

Datasets

ImageNet-1K, COCO, CNN/Daily Mail, SST2, TextVQA, SQuAD, OpenAGI benchmark datasets (185 tasks; 100 samples each)

Benchmarks

OpenAGI multi-step task suite
