Overview
OpenAGI is a practical, open pipeline. Its experiments show that RLTF and LoRA fine-tuning materially improve task planning in open models. Code and datasets are available, but many tasks use small datasets, and real-world robustness and out-of-distribution (OOD) generalization remain open problems.
Citations: 76
Evidence Strength: 0.70
Confidence: 0.78
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 2/3
Reproducibility
Status: Code + data available
Open source: Partial
License: Creative Commons Attribution 4.0 International (data); code repo states open‑sor
At A Glance
Cost impact: 45%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
OpenAGI shows that you can compose existing specialist models under LLM control and use RL-style tuning to make smaller, cheaper models more competitive. This is useful for building product workflows that call vision, text, or web tools.
Summary TLDR
OpenAGI is an open-source research platform that tests LLMs as controllers that select, sequence, and execute domain expert models (vision, language, vision-language) to solve complex multi-step tasks. The platform provides 185 multi-step tasks (117 linear, 68 non-linear), small task datasets, evaluation metrics (CLIP, BERTScore, ViT), and an LLM tuning loop called RLTF (Reinforcement Learning from Task Feedback). Experiments show that closed-source LLMs (GPT-4) lead in zero-shot and few-shot settings, but open-source models (LLaMA-2-13B, Vicuna-7B, Flan-T5-Large) gain large, practical improvements from fine-tuning and RLTF: LLaMA-2-13B with RLTF scores 0.3735, close to few-shot GPT-3.5 at 0.3772. Code, data, and benchmarks are published.
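The RLTF loop can be pictured as a policy-gradient update in which the task score acts as the reward. The following is a toy sketch, not the OpenAGI implementation: it assumes a REINFORCE-style update with a running-mean baseline, replaces the LLM with a softmax over three hypothetical candidate plans, and uses made-up rewards in place of real CLIP, BERTScore, or ViT scores.

```python
import math
import random

# Hypothetical candidate plans and made-up task scores (rewards in [0, 1]).
PLANS = ["denoise -> caption", "caption -> denoise", "caption only"]
REWARDS = [0.9, 0.2, 0.4]


def softmax(logits):
    """Convert logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=2000, lr=0.1, seed=0):
    """REINFORCE with a running-mean baseline over a fixed set of plans."""
    rng = random.Random(seed)
    logits = [0.0] * len(PLANS)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a plan from the current policy (stand-in for LLM decoding).
        i = rng.choices(range(len(PLANS)), weights=probs)[0]
        reward = REWARDS[i]  # stand-in for executing the plan and scoring it
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log softmax w.r.t. logit j: one_hot(i)[j] - probs[j].
        for j in range(len(logits)):
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += lr * advantage * grad
    return softmax(logits)


final_probs = train()
best = PLANS[final_probs.index(max(final_probs))]
print(best, [round(p, 3) for p in final_probs])
```

In a real setting the sampled plan would be executed by the expert models and scored against ground truth, and the update would be applied to the LLM's parameters (or to LoRA adapters) rather than to three logits.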
Problem Statement
LLMs can reason and call tools, but current tool-using systems struggle with expandability, non-linear (tree) planning, and numeric evaluation. We need a shared platform to (1) combine many domain expert models, (2) create multi-step linear and non-linear tasks, and (3) measure whether an LLM can plan, execute, and improve via task feedback.
Main Contribution
OpenAGI: an open-source platform with 185 multi-step tasks, datasets, evaluation metrics, and a UI to test LLM-driven model synthesis.
A practical pipeline in which an LLM generates a plan over domain expert models, constrained beam search parses the plan into a sequence of available models, and the selected models are executed to produce outputs and scores.
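The plan-then-constrain step can be illustrated with a small beam search that only expands models whose input modality matches the current intermediate result. This is a minimal sketch under stated assumptions: the model names, the modality table, and `toy_log_prob` are hypothetical stand-ins, and a real system would score candidates with the LLM's token log-probabilities.

```python
import math

# Hypothetical expert models: name -> (input modality, output modality).
MODELS = {
    "image_denoising": ("image", "image"),
    "image_captioning": ("image", "text"),
    "text_translation": ("text", "text"),
    "text_to_image": ("text", "image"),
}


def toy_log_prob(prefix, candidate):
    """Stand-in for the LLM's log-probability of the next model name."""
    preferred = {"image_denoising": 0.6, "image_captioning": 0.3}
    return math.log(preferred.get(candidate, 0.05))


def constrained_beam_search(start, goal, max_steps=3, beam_width=2,
                            score=toy_log_prob):
    """Return the best-scoring model sequence mapping `start` to `goal`.

    Only models whose input modality matches the current modality are
    expanded, so every hypothesis is executable by construction.
    """
    beams = [(0.0, [], start)]  # (cumulative log-prob, plan, current modality)
    finished = []
    for _ in range(max_steps):
        candidates = []
        for logp, plan, modality in beams:
            for name, (src, dst) in MODELS.items():
                if src != modality:
                    continue  # constraint: skip models that cannot run here
                candidates.append((logp + score(plan, name), plan + [name], dst))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
        finished.extend(b for b in beams if b[2] == goal)
    if not finished:
        return None
    return max(finished, key=lambda b: b[0])[1]


plan = constrained_beam_search("image", "text")
print(plan)
```

Because cumulative log-probabilities only decrease with length, this sketch favors short plans; length normalization would be one way to counter that.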
Key Findings
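A large, general LLM (GPT-4) achieves the highest overall OpenAGI scores in zero-shot and few-shot settings (few-shot 0.5281, versus 0.3772 for GPT-3.5).
Fine-tuning plus RLTF substantially lifts open-source LLMs' task-planning performance (LLaMA-2-13B rises from 0.1533 zero-shot to 0.3735 with RLTF).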
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Overall (closed-source LLMs) | GPT-4 few-shot 0.5281; GPT-3.5 few-shot 0.3772 | zero-shot GPT-4 0.2378 | GPT-4 few vs zero +0.2903 | OpenAGI benchmark (avg of CLIP/BERT/ViT) | Table 1 shows closed-source LLM scores | Table 1 |
| Overall (open-source LLMs) | LLaMA-2-13B RLTF 0.3735; Vicuna-7B RLTF 0.3018 | LLaMA zero-shot 0.1533 | LLaMA zero -> RLTF +0.2202 | OpenAGI benchmark | Table 2 quantifies tuning gains for open models | Table 2 |
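
The Delta column is the reported value minus the baseline. A quick arithmetic check, using only the scores copied from the table above:

```python
# Scores copied from the Results table: label -> (value, baseline).
rows = {
    "GPT-4 few-shot vs zero-shot": (0.5281, 0.2378),
    "LLaMA-2-13B RLTF vs zero-shot": (0.3735, 0.1533),
}

# Delta = value - baseline, rounded to the table's four decimal places.
deltas = {
    name: round(value - baseline, 4)
    for name, (value, baseline) in rows.items()
}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```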
What To Try In 7 Days
Clone the OpenAGI repo and run a few benchmark tasks with your LLM.
Plug in one domain model (e.g., image denoiser) and test LLM planning with constrained beam search.
Fine-tune an open LLM with LoRA on a few task plans, then compare zero-shot, few-shot, and fine-tuned outputs on 10 tasks from the suite.
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Benchmarks use small per-task datasets (100 samples each), limiting statistical strength
Performance depends on the quality and domain coverage of plugged expert models
When Not To Use
Do not use as-is for safety-critical systems without human oversight
Not ideal for single-step tasks where direct specialist models are already sufficient
Failure Modes
LLM produces invalid or suboptimal model sequences that degrade output quality
Plan hallucination: proposing nonexistent or inappropriate model calls
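Both failure modes can be caught before execution by checking a generated plan against the set of models that actually exist and against their input/output types. The following is a minimal sketch, assuming plans arrive as arrow- or comma-separated model names; `REGISTRY` and `validate_plan` are hypothetical names for illustration, not part of the OpenAGI code.

```python
import re

# Hypothetical registry of expert models the platform can actually run:
# name -> (input modality, output modality).
REGISTRY = {
    "image_denoising": ("image", "image"),
    "image_captioning": ("image", "text"),
    "text_translation": ("text", "text"),
}


def validate_plan(plan_text, input_modality, registry=REGISTRY):
    """Return a list of problems found in an arrow- or comma-separated plan."""
    steps = [s.strip() for s in re.split(r"->|,", plan_text) if s.strip()]
    if not steps:
        return ["plan is empty"]
    problems = []
    modality = input_modality
    for position, name in enumerate(steps, start=1):
        if name not in registry:
            # Hallucinated model: stop, since later types cannot be checked.
            problems.append(f"step {position}: unknown model '{name}'")
            break
        src, dst = registry[name]
        if src != modality:
            problems.append(
                f"step {position}: '{name}' expects {src}, got {modality}"
            )
        modality = dst
    return problems


print(validate_plan("image_denoising -> image_captioning", "image"))
print(validate_plan("image_denoising -> image_to_poem", "image"))
```

A check like this only rejects invalid plans. It does not detect plans that are valid but suboptimal, which still need task-level scoring.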

