Create, customize, and run multi-step LLM agents from plain language — no code needed

February 9, 2025 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Jiabin Tang, Tianyu Fan, Chao Huang

Why It Matters For Business

AutoAgent lowers the engineering barrier: product teams can prototype custom assistants, retrieval workflows, and API-backed tools from plain language, cutting specialist developer time and speeding deployment.

Summary TLDR

AutoAgent is a zero-code framework that turns plain-language requests into working multi-agent systems, tools, and workflows. Key pieces: a modular Orchestrator-Workers agent stack, an LLM-driven Actionable Engine (supports direct and XML-transformed tool calls), a self-managing file system that stores documents as vector DB chunks, and a self-play customization loop that generates agents and workflows as XML. Evaluations show strong results: second place on the GAIA generalist-agent leaderboard and clear gains on a multihop RAG benchmark. Code: https://github.com/HKUDS/AutoAgent.

Problem Statement

Building capable LLM agents today requires programming skill and prompt engineering. The authors argue this limits adoption since only a tiny fraction of people can code. They aim to let anyone create, customize, and run multi-agent workflows using only natural language, with automatic tool creation, debugging, and orchestration.

Main Contribution

A zero-code, language-driven OS for LLM agents that converts plain-language specs into runnable agents, tools, and workflows.

A modular Agentic System Utilities stack (Orchestrator, Web, Coding, Local File agents) with clear tool APIs and sandboxed execution.

An LLM-powered Actionable Engine that supports both direct tool-use and an XML-based transformed-tool paradigm for broader model support.

A Self-Managing File System that converts user files into vector DB chunks for retrieval-augmented workflows.

A Self-Play Agent Customization pipeline that generates and iteratively refines agents and workflows (XML forms), including automatic tool creation and self-debugging.

Empirical validation: GAIA leaderboard (2nd place) and improved results on a multi-hop RAG benchmark; practical case studies (image agent, financial agent, majority-vote workflow).

Key Findings

Strong GAIA performance: second on the leaderboard, about 8.5 points behind the top entry.

Numbers: GAIA average success rate: AutoAgent 55.15% vs top entry h2oGPTe 63.64%; Level 1: 71.7%

Agent-based RAG gives large accuracy gains on multihop retrieval tasks.

Numbers: MultiHop-RAG accuracy: AutoAgent 73.51% vs LangChain 62.83%; error rate 14.20% vs 20.50%

Zero-code workflows can improve multi-model math performance.

Numbers: majority-voting workflow pass@1: 75.6% vs best single model 74.2% (deepseek-v3)

System automates tool creation and debugging from APIs and docs.

Results

GAIA average success rate

Value: 55.15%

Baseline: h2oGPTe Agent v1.6.8 (63.64%)

GAIA Level 1 success rate

Value: 71.7%

Baseline: other state-of-the-art agents (none above 70%)

MultiHop-RAG accuracy

Value: 73.51% accuracy / 14.20% error

Baseline: LangChain 62.83% accuracy / 20.50% error

Majority voting pass@1 (math reasoning)

Value: 75.6% pass@1

Baseline: best single model, deepseek-v3, 74.2% pass@1
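
The gaps between AutoAgent and each baseline can be read off directly from the figures above. The short script below only does that subtraction, in percentage points; the metric labels are shorthand chosen for this sketch, and no numbers beyond those reported above are introduced.

```python
# (AutoAgent value, baseline value) in percent, copied from the results above.
results = {
    "GAIA average success": (55.15, 63.64),   # baseline: h2oGPTe Agent v1.6.8
    "MultiHop-RAG accuracy": (73.51, 62.83),  # baseline: LangChain
    "MultiHop-RAG error": (14.20, 20.50),     # baseline: LangChain (lower is better)
    "Majority-vote pass@1": (75.6, 74.2),     # baseline: deepseek-v3 alone
}

# Difference in percentage points: positive means AutoAgent's number is higher.
deltas = {name: auto - base for name, (auto, base) in results.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f} points")
```

Read this way, the RAG gain (about 10.7 points) is much larger than the majority-voting gain (1.4 points), and AutoAgent trails the top GAIA entry by about 8.5 points.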

What To Try In 7 Days

Use the repo to generate a simple zero-code agent that answers domain docs (upload PDFs, let AutoAgent build the vector DB, run a query).

Prototype an agentic RAG pipeline on a small QA task and compare accuracy vs your current RAG setup.

Create a short workflow (e.g., parallel model voting) to see if majority voting improves correctness on a target reasoning task.

Agent Features

Memory

  • self-managing vector DB (documents -> 4096-token chunks)
  • action-observation pairs as short-term context
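
To make the short-term memory idea concrete, here is a minimal sketch of a buffer of action-observation pairs rendered into prompt context. The class names, the bounded window, and the text layout are all assumptions for illustration; the summary does not describe how AutoAgent stores or truncates these pairs.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Step:
    action: str       # what the agent did (e.g. a tool call)
    observation: str  # what came back from the environment


class ShortTermMemory:
    """Keeps the most recent steps; the window size is a hypothetical choice."""

    def __init__(self, max_steps: int = 3):
        self.steps = deque(maxlen=max_steps)  # oldest steps fall off automatically

    def record(self, action: str, observation: str) -> None:
        self.steps.append(Step(action, observation))

    def as_context(self) -> str:
        # Render the pairs as plain text to prepend to the next model prompt.
        return "\n".join(
            f"Action: {s.action}\nObservation: {s.observation}" for s in self.steps
        )


memory = ShortTermMemory(max_steps=3)
memory.record("list_files", "report.pdf, notes.txt")
memory.record("open_file report.pdf", "page 1 of 12")
memory.record("search 'revenue'", "3 matches")
memory.record("goto_page 4", "table of quarterly revenue")
print(memory.as_context())
```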

Planning

  • event-driven workflows
  • automatic workflow generation (XML forms)
  • self-play iterative refinement

Tool Use

  • direct tool-use (when supported by model)
  • transformed tool-use via XML (for models without native tool APIs)
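
The transformed paradigm can be pictured as follows: the model writes the tool call as XML inside its text reply, and the engine parses it and dispatches to a function. This is a sketch under assumptions, not AutoAgent's parser: the `<tool_call>`, `<name>`, and `<arg key="...">` tags and the `TOOLS` registry are hypothetical.

```python
import xml.etree.ElementTree as ET


def add(a: str, b: str) -> str:
    """Toy tool; arguments arrive as strings because they come from XML text."""
    return str(int(a) + int(b))


# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"add": add}


def run_xml_tool_call(model_text: str) -> str:
    """Find one <tool_call> element in free-form model output and execute it."""
    open_tag, close_tag = "<tool_call>", "</tool_call>"
    start = model_text.find(open_tag)
    end = model_text.find(close_tag)
    if start == -1 or end == -1:
        raise ValueError("no tool call found in model output")
    root = ET.fromstring(model_text[start:end + len(close_tag)])
    name = root.findtext("name")
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    kwargs = {arg.get("key"): (arg.text or "") for arg in root.findall("arg")}
    return TOOLS[name](**kwargs)


reply = (
    "I will add the two numbers. "
    '<tool_call><name>add</name><arg key="a">2</arg><arg key="b">3</arg></tool_call>'
)
print(run_xml_tool_call(reply))
```

Because the call travels as ordinary text, this approach works with any model that can follow a format instruction, which is the stated motivation for supporting models without native tool APIs.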

Frameworks

  • Agentic-SDK
  • LiteLLM
  • BrowserGym
  • E2B sandbox

Is Agentic

true

Architectures

  • Orchestrator-Workers
  • modular multi-agent

Collaboration

  • handoff tool for transfers between agents
  • orchestrator-driven task delegation
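
A handoff can be modeled as a special return value that tells the runtime to pass control to another agent. The sketch below illustrates that pattern only; the `Handoff` type, the agent names, and the keyword routing rule (standing in for an LLM decision) are hypothetical and not taken from the AutoAgent code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Handoff:
    """Returned by an agent to transfer the task to another agent."""
    target: str
    task: str


def orchestrator(task: str) -> Handoff:
    # Toy routing rule; a real orchestrator would ask an LLM to choose.
    if "http" in task:
        return Handoff("web", task)
    return Handoff("coding", task)


def web_agent(task: str) -> str:
    return f"[web] fetched: {task}"


def coding_agent(task: str) -> str:
    return f"[coding] ran: {task}"


AGENTS: Dict[str, Callable[[str], Union[str, Handoff]]] = {
    "orchestrator": orchestrator,
    "web": web_agent,
    "coding": coding_agent,
}


def run(task: str, start: str = "orchestrator", max_hops: int = 5) -> str:
    """Follow handoffs until some agent returns a final string answer."""
    current = start
    for _ in range(max_hops):
        result = AGENTS[current](task)
        if isinstance(result, Handoff):
            current, task = result.target, result.task
            continue
        return result
    raise RuntimeError("too many handoffs")


print(run("open http://example.com"))
print(run("sort a list of numbers"))
```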

Optimization Features

Token Efficiency

  • chunking strategy (256-token chunks for RAG retrieval)
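
Fixed-size chunking is simple to sketch. The version below splits on whitespace as a stand-in for real tokenizer tokens, which is an assumption: the summary gives the chunk size (256 tokens) but not the tokenizer or any overlap setting, so neither is modeled here.

```python
def chunk_tokens(text: str, chunk_size: int = 256) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size tokens.

    Whitespace-separated words approximate tokens in this sketch.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


# A synthetic 600-token document.
doc = " ".join(f"w{i}" for i in range(600))
chunks = chunk_tokens(doc)
print(len(chunks), [len(c.split()) for c in chunks])
```

Smaller chunks keep each retrieved passage cheap to include in the prompt, at the cost of more entries in the vector DB.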

Infra Optimization

  • Docker sandboxing for code execution
  • support for external sandboxes (E2B)

System Optimization

  • agent-level task delegation to reduce repeated context
  • paginated terminal and markdown browsing to manage long outputs
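
Pagination keeps long tool output from flooding the context window: the agent sees one page at a time and asks for the next only if needed. A minimal sketch, assuming a line-based page size of 20 and a footer format that are both illustrative choices:

```python
def paginate(text: str, page: int, lines_per_page: int = 20) -> str:
    """Return one page of text (1-indexed) with a footer showing position."""
    lines = text.splitlines()
    # Ceiling division; an empty output still counts as one (empty) page.
    total = max(1, -(-len(lines) // lines_per_page))
    if not 1 <= page <= total:
        raise IndexError(f"page {page} out of range 1..{total}")
    start = (page - 1) * lines_per_page
    body = "\n".join(lines[start:start + lines_per_page])
    return f"{body}\n[page {page}/{total}]"


terminal_output = "\n".join(f"line {i}" for i in range(45))
print(paginate(terminal_output, page=3))
```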

Inference Optimization

  • test-time scaling via parallel multi-model workflows (majority voting)
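
The voting step of such a workflow reduces to counting answers. The sketch below assumes each model has already returned a final answer string; returning `None` on a tie is a choice made for this example, since the summary does not say how ties are resolved.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(answers: List[str]) -> Optional[str]:
    """Return the most common answer, or None if there is no clear winner."""
    counts = Counter(a.strip() for a in answers)
    ranked = counts.most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the top answers
    return ranked[0][0]


# Hypothetical final answers from three different models.
print(majority_vote(["42", "42 ", "41"]))
print(majority_vote(["1", "2", "3"]))
```

Voting helps only when models err independently; if two models share the same mistake, the majority is wrong, which matches the failure mode listed under Risks & Boundaries.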

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • GAIA evaluation uses strict string matching, which can undercount semantically correct answers.
  • Web-based tasks face anti-automation and dynamic content issues during browsing.
  • Performance depends on external LLM providers and their tool-use capabilities.
  • No formal safety or bias audits reported for generated agents and tools.
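
The first limitation is easy to see in a toy comparison. The two functions below are illustrative only and do not reproduce GAIA's official scorer; the normalization rules (lowercasing, dropping commas, collapsing whitespace) are assumptions chosen to show how a strict comparison can reject an answer that is correct in substance.

```python
def exact_match(pred: str, gold: str) -> bool:
    """Strict comparison: any formatting difference counts as wrong."""
    return pred == gold


def normalized_match(pred: str, gold: str) -> bool:
    """Lenient comparison after simple, hypothetical normalization."""
    def norm(s: str) -> str:
        return " ".join(s.lower().replace(",", "").split())
    return norm(pred) == norm(gold)


pred, gold = "1,000 ", "1000"
print(exact_match(pred, gold), normalized_match(pred, gold))
```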

When Not To Use

  • High-assurance domains that require formal verification (medical, legal) without human oversight.
  • Environments with strict data governance where automatic API key embedding would violate policy.
  • Scenarios requiring deterministic, auditable computation not suited to LLM-driven code generation.

Failure Modes

  • XML parsing or syntax errors during auto-generated tool/agent creation (paper shows SyntaxError recovery traces).
  • Conflicting outputs from different models in multi-model workflows leading to wrong majority decisions.
  • Web scraping failures due to anti-bot measures or content drift.
  • Over-reliance on LLM hallucinations when source documents are missing or poorly indexed.
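
Recovery from the first failure mode typically follows a generate, check, retry loop: the syntax error is fed back to the generator until the output is valid. The sketch below shows that loop for Python source using the built-in `compile`; `fake_generator` is a stand-in for an LLM call, and the loop structure is an assumption rather than the paper's actual self-debugging procedure.

```python
from typing import Callable, Optional, Tuple


def fake_generator(feedback: Optional[str]) -> str:
    """Stand-in for an LLM: the first draft is broken, the retry is fixed."""
    if feedback is None:
        return "def double(x) return x * 2"  # missing colon
    return "def double(x):\n    return x * 2"


def generate_with_self_debug(
    generate: Callable[[Optional[str]], str], max_attempts: int = 3
) -> Tuple[str, int]:
    """Regenerate code until it compiles, passing the error back each time."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        source = generate(feedback)
        try:
            compile(source, "<generated_tool>", "exec")  # syntax check only
            return source, attempt
        except SyntaxError as err:
            feedback = f"SyntaxError: {err.msg} (line {err.lineno})"
    raise RuntimeError("could not produce syntactically valid code")


source, attempts = generate_with_self_debug(fake_generator)
print(f"valid after {attempts} attempt(s)")
```

A syntax check catches only the cheapest class of errors; generated tools still need to be executed in a sandbox before they can be trusted.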

Core Entities

Models

  • gpt-4o-mini
  • gpt-4o-2024-08-06
  • claude-3-5-sonnet-20241022
  • deepseek-v3
  • text-embedding-3-small
  • LiteLLM

Metrics

  • success rate
  • accuracy
  • error rate
  • pass@1

Datasets

  • GAIA
  • MultiHop-RAG
  • MATH-500

Benchmarks

  • GAIA
  • MultiHop-RAG
  • MATH-500