Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.45
Citation Count
76
Why It Matters For Business
OpenAGI shows that you can compose existing specialist models under LLM control and use RL-style tuning to make smaller, cheaper models competitive. This is useful for building product workflows that call vision, text, or web tools.
Summary TLDR
OpenAGI is an open-source research platform that tests LLMs as controllers that select, sequence, and execute domain expert models (vision, language, vision-language) to solve complex multi-step tasks. The platform provides 185 multi-step tasks (117 linear, 68 non-linear), small task datasets, evaluation metrics (CLIP Score, BERT Score, ViT Score), and an LLM tuning loop called RLTF (Reinforcement Learning from Task Feedback). Experiments show that closed-source LLMs (GPT-4) lead in zero-shot and few-shot settings, but open-source models (LLaMA-2-13B, Vicuna-7B, Flan-T5-Large) gain large, practical improvements from fine-tuning and RLTF, which makes smaller models competitive. Code, data, and benchmarks are published.
Problem Statement
LLMs can reason and call tools, but current tool-using systems struggle with expandability, non-linear (tree) planning, and numeric evaluation. We need a shared platform to (1) combine many domain expert models, (2) create multi-step linear and non-linear tasks, and (3) measure whether an LLM can plan, execute, and improve via task feedback.
Main Contribution
OpenAGI: an open-source platform with 185 multi-step tasks, datasets, evaluation metrics, and a UI to test LLM-driven model synthesis.
A practical pipeline in which an LLM generates a plan of domain expert models, constrained beam search keeps the generated plan to valid model names, and the selected models are then executed to produce outputs and scores.
RLTF (Reinforcement Learning from Task Feedback): tune LLMs using task-level reward signals (REINFORCE) so planning improves from task outcomes.
Empirical comparison of closed- and open-source LLMs across zero/few-shot, fine-tuning, and RLTF showing RLTF strongly helps smaller open models.
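The RLTF idea above (REINFORCE driven by a task-level reward) can be illustrated with a toy sketch. This is not the platform's training code: the policy here is a plain softmax over three hypothetical candidate plans rather than an LLM, and `PLANS`, `TASK_SCORES`, `reinforce_step`, and `train` are illustrative names and made-up values, not taken from OpenAGI.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, reward_fn, lr, baseline, rng):
    """One REINFORCE update on a categorical policy over candidate plans."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    reward = reward_fn(action)          # task-level feedback for the sampled plan
    advantage = reward - baseline       # baseline reduces gradient variance
    # d/d logit_i of log pi(action) is 1[i == action] - probs[i].
    new_logits = [
        logit + lr * advantage * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]
    return new_logits, action, reward

# Hypothetical candidate plans and made-up task scores (assumptions for illustration).
PLANS = [
    ["image_denoising", "image_captioning"],
    ["image_captioning", "image_denoising"],
    ["image_captioning"],
]
TASK_SCORES = [0.9, 0.2, 0.5]

def train(steps=500, seed=0):
    """Run the toy loop and return the final plan probabilities."""
    rng = random.Random(seed)
    logits = [0.0] * len(PLANS)
    baseline = 0.0
    for _ in range(steps):
        logits, _, reward = reinforce_step(
            logits, lambda a: TASK_SCORES[a], lr=0.5, baseline=baseline, rng=rng
        )
        baseline = 0.9 * baseline + 0.1 * reward  # running average of rewards
    return softmax(logits)
```

Calling `train()` returns one probability per candidate plan. In a real setup the logits would be replaced by LLM parameters and the reward by the benchmark's task score.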
Key Findings
A large, general LLM (GPT-4) achieves the highest overall OpenAGI scores in zero-shot and few-shot settings.
Fine-tuning + RLTF substantially lifts open-source LLMs' task-planning performance.
OpenAGI contains 185 multi-step tasks with both linear and non-linear planning structures.
Providing detailed model descriptions in prompts helps large closed models but can confuse smaller open models.
Results
Overall (closed-source LLMs)
Overall (open-source LLMs)
Task split
Who Should Care
What To Try In 7 Days
Clone the OpenAGI repo and run a few benchmark tasks with your LLM.
Plug in one domain model (e.g., image denoiser) and test LLM planning with constrained beam search.
Fine-tune an open LLM with LoRA on a few task plans and compare zero/few-shot vs fine-tune outputs on 10 tasks from the suite.
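For the comparison step in the last item, a small helper like the one below may be enough to summarize per-task scores from two settings. It is a sketch under assumptions: `compare_settings` is a hypothetical helper, the task ids are invented, and the numbers in the usage lines are placeholders, not results from the paper.

```python
from statistics import mean
from typing import Dict

def compare_settings(baseline: Dict[str, float], tuned: Dict[str, float]) -> Dict[str, float]:
    """Paired comparison of per-task scores from two settings.

    Only tasks present in both dictionaries are compared.
    """
    shared = sorted(set(baseline) & set(tuned))
    if not shared:
        raise ValueError("no tasks in common between the two settings")
    deltas = [tuned[t] - baseline[t] for t in shared]
    return {
        "tasks": len(shared),
        "mean_baseline": mean(baseline[t] for t in shared),
        "mean_tuned": mean(tuned[t] for t in shared),
        "mean_delta": mean(deltas),
        "wins": sum(d > 0 for d in deltas),  # tasks where the tuned setting scored higher
    }

# Placeholder scores for two invented task ids (not real results).
few_shot = {"task_001": 0.2, "task_002": 0.5}
fine_tuned = {"task_001": 0.4, "task_002": 0.4}
summary = compare_settings(few_shot, fine_tuned)
```

With only 10 tasks, the mean difference is a rough signal, so it helps to read it together with the per-task win count.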
Agent Features
Memory
- Short-term plan outputs used for immediate execution (no long-term retrieval described)
Planning
- Non-linear task planning (tree-structured plans)
- Constrained beam search for valid model sequences
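To make the tree-structured planning item concrete, here is a minimal sketch of how a non-linear plan could be represented and executed. The `PlanNode` class, the `REGISTRY` of stand-in callables, and the model names are assumptions for illustration; real expert models would replace the string-returning lambdas.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class PlanNode:
    """One step of a plan; `inputs` are the sub-plans whose outputs feed this step."""
    model: str
    inputs: List["PlanNode"] = field(default_factory=list)

# Hypothetical stand-ins for expert models: each just returns a descriptive string.
REGISTRY: Dict[str, Callable[..., str]] = {
    "load_image": lambda: "noisy_image",
    "load_question": lambda: "what is shown?",
    "image_denoising": lambda img: f"denoised({img})",
    "visual_question_answering": lambda img, q: f"answer({img}, {q})",
}

def execute(node: PlanNode, registry: Dict[str, Callable[..., str]] = REGISTRY) -> str:
    """Post-order traversal: run every input branch first, then the node's own model."""
    args = [execute(child, registry) for child in node.inputs]
    return registry[node.model](*args)

# A non-linear plan: two branches (image and question) merge at the final step.
plan = PlanNode(
    "visual_question_answering",
    [
        PlanNode("image_denoising", [PlanNode("load_image")]),
        PlanNode("load_question"),
    ],
)
result = execute(plan)  # "answer(denoised(noisy_image), what is shown?)"
```

A linear plan is the special case where every node has at most one input, so the same executor covers both task types.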
Tool Use
- Selects, sequences and executes specialist models
- Calls web tools/APIs via LangChain (Google/Wikipedia/Wolfram)
Frameworks
- LangChain
- Hugging Face (transformers, diffusers)
Is Agentic
true
Architectures
- LLM controller + external domain expert models
Collaboration
- Orchestrates multiple models in parallel or sequence
Optimization Features
Model Optimization
- LoRA
System Optimization
- Plan parsing via a GPT-3.5-based parser to map free text to module sequences
Training Optimization
- RLTF (REINFORCE with task-level reward signals)
- Fine-tuning with human-labeled plan solutions
Inference Optimization
- Constrained beam search to ensure valid model-name outputs
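The constraint can be sketched as a beam search that only expands tokens lying on a path to some valid model name. This is a toy under assumptions: `VALID_NAMES` and the context-independent `TOKEN_LOGPROB` table are made up, whereas a real decoder would score tokens with the LLM and constrain its own vocabulary.

```python
import math
from typing import List, Set, Tuple

Tokens = Tuple[str, ...]

# Hypothetical model names, split into tokens.
VALID_NAMES: List[Tokens] = [
    ("image", "denoising"),
    ("image", "captioning"),
    ("text", "summarization"),
]

# Made-up token log-probabilities; "banana" is likely but never valid.
TOKEN_LOGPROB = {
    "banana": math.log(0.5),
    "image": math.log(0.2),
    "text": math.log(0.1),
    "captioning": math.log(0.15),
    "denoising": math.log(0.04),
    "summarization": math.log(0.01),
}

def allowed_next(prefix: Tokens) -> Set[str]:
    """Tokens that keep `prefix` on a path to at least one valid name."""
    n = len(prefix)
    return {name[n] for name in VALID_NAMES if len(name) > n and name[:n] == prefix}

def constrained_beam_search(beam_width: int = 2) -> List[Tuple[Tokens, float]]:
    """Return completed valid names, best first, with their log-probabilities."""
    beams: List[Tuple[Tokens, float]] = [((), 0.0)]
    finished: List[Tuple[Tokens, float]] = []
    while beams:
        # Expand each beam only with tokens the constraint allows.
        candidates = [
            (prefix + (tok,), score + TOKEN_LOGPROB[tok])
            for prefix, score in beams
            for tok in allowed_next(prefix)
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            # This toy assumes no valid name is a prefix of another valid name.
            (finished if seq in VALID_NAMES else beams).append((seq, score))
    return sorted(finished, key=lambda c: c[1], reverse=True)

best_name, best_score = constrained_beam_search()[0]  # best_name == ("image", "captioning")
```

Unconstrained greedy decoding over the same table would pick "banana" first; the constraint removes that option before scoring, so every returned sequence is a registered model name.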
Reproducibility
License
- Creative Commons Attribution 4.0 International (data); code repo states open‑sor
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Benchmarks use small per-task datasets (100 samples each), which limits statistical power
- Performance depends on the quality and domain coverage of plugged expert models
- The best zero-shot results rely on closed-source LLMs, which reduces reproducibility for some users
- Out-of-distribution (OOD) generalization and optimal plan search remain unsolved and can cause poor outputs
When Not To Use
- Do not use as-is for safety-critical systems without human oversight
- Not ideal for single-step tasks where direct specialist models are already sufficient
- Do not expect guaranteed optimal model ordering; plans can be suboptimal
Failure Modes
- LLM produces invalid or suboptimal model sequences that degrade output quality
- Plan hallucination: proposing nonexistent or inappropriate model calls
- Out-of-distribution inputs degrade specialist model performance, leading to poor final outputs
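The first two failure modes (invalid sequences and hallucinated model calls) can often be caught before execution with a cheap static check. The sketch below assumes a hypothetical `MODEL_IO` table of input and output modalities and a linear plan; it is not part of OpenAGI, and the model names are illustrative.

```python
from typing import Dict, List, Tuple

# Hypothetical registry: model name -> (input modality, output modality).
MODEL_IO: Dict[str, Tuple[str, str]] = {
    "image_denoising": ("image", "image"),
    "image_captioning": ("image", "text"),
    "text_summarization": ("text", "text"),
}

def validate_plan(plan: List[str], input_type: str, output_type: str) -> List[str]:
    """Return a list of problems found in a linear plan; an empty list means it passed."""
    errors: List[str] = []
    current = input_type
    for step, name in enumerate(plan):
        if name not in MODEL_IO:
            # Hallucinated model: not in the registry, so it cannot be executed.
            errors.append(f"step {step}: unknown model '{name}'")
            continue
        expected_in, produced = MODEL_IO[name]
        if expected_in != current:
            errors.append(
                f"step {step}: '{name}' expects {expected_in} but receives {current}"
            )
        current = produced
    if current != output_type:
        errors.append(f"plan ends with {current}, task requires {output_type}")
    return errors

ok = validate_plan(["image_denoising", "image_captioning"], "image", "text")
bad = validate_plan(["image_captioning", "image_upscaler_9000"], "image", "text")
```

Here `ok` is empty, while `bad` reports the unregistered model. A check like this does not detect suboptimal but valid orderings, which still need task-level scoring.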
Core Entities
Models
- GPT-3.5-turbo
- GPT-4
- Claude-2
- Flan-T5-Large
- Vicuna-7B
- LLaMA-2-13B
- Stable Diffusion
- Restormer
- Swin2SR
- GIT
- DETR
- ViT
- BART
- T5
- DistilRoBERTa
- DistilBERT
Metrics
- CLIP Score
- BERT Score
- ViT Score
Datasets
- ImageNet-1K
- COCO
- CNN/Daily Mail
- SST2
- TextVQA
- SQuAD
- OpenAGI benchmark datasets (185 tasks, 100 samples each)
Benchmarks
- OpenAGI multi-step task suite

