Create, customize, and run multi-step LLM agents from plain language — no code needed

February 9, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

AutoAgent provides a practical zero-code path to working agents and RAG pipelines; evidence comes from GAIA leaderboard placement and targeted RAG/math experiments, but real-world readiness depends on your model provider, security needs, and test coverage.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Jiabin Tang, Tianyu Fan, Chao Huang

Links

Abstract / PDF / Code

Why It Matters For Business

AutoAgent lowers the engineering barrier: product teams can prototype custom assistants, retrieval workflows, and API-backed tools from plain language, cutting specialist developer time and speeding deployment.

Who Should Care

Summary TLDR

AutoAgent is a zero-code framework that turns plain-language requests into working multi-agent systems, tools, and workflows. Key pieces: a modular Orchestrator-Workers agent stack, an LLM-driven Actionable Engine (supports direct and XML-transformed tool calls), a self-managing file system that stores documents as vector DB chunks, and a self-play customization loop that generates agents and workflows as XML. Evaluations report second place on the GAIA generalist-agent leaderboard (55.15% average success) and 73.51% accuracy on MultiHop-RAG versus 62.83% for LangChain. Code: https://github.com/HKUDS/AutoAgent.

Problem Statement

Building capable LLM agents today requires programming skill and prompt engineering. The authors argue this limits adoption since only a tiny fraction of people can code. They aim to let anyone create, customize, and run multi-agent workflows using only natural language, with automatic tool creation, debugging, and orchestration.

Main Contribution

A zero-code, language-driven OS for LLM agents that converts plain-language specs into runnable agents, tools, and workflows.

A modular Agentic System Utilities stack (Orchestrator, Web, Coding, Local File agents) with clear tool APIs and sandboxed execution.

Key Findings

Strong GAIA performance — close to top commercial agents.

Numbers: GAIA average success rate: AutoAgent 55.15% vs. top-ranked h2oGPTe 63.64%; Level 1: 71.7%

Practical Use: AutoAgent can handle many everyday assistant tasks out of the box; expect near-state-of-the-art generalist multi-agent behavior without coding.

Evidence Ref: Table 1 (GAIA leaderboard)

Agent-based RAG gives large accuracy gains on multihop retrieval tasks.

Numbers: MultiHop-RAG accuracy: AutoAgent 73.51% vs. LangChain 62.83%; error rate: 14.20% vs. 20.50%

Practical Use: Use AutoAgent's agentic retrieval and orchestration when accuracy on multi-hop document QA matters; it reduces confident wrong answers.

Evidence Ref: Table 2 (RAG results)
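The deltas behind these findings are simple percentage-point differences. As a quick check, the arithmetic can be reproduced as below; `delta_pp` is a hypothetical helper name, and the inputs are the figures quoted in the findings above.

```python
def delta_pp(value: float, baseline: float) -> float:
    """Percentage-point difference between a reported value and its baseline."""
    return round(value - baseline, 2)


# Figures quoted in the Key Findings (all in percent).
gaia_delta = delta_pp(55.15, 63.64)      # AutoAgent vs. h2oGPTe on GAIA average
rag_acc_delta = delta_pp(73.51, 62.83)   # AutoAgent vs. LangChain, MultiHop-RAG accuracy
rag_err_delta = delta_pp(14.20, 20.50)   # AutoAgent vs. LangChain, MultiHop-RAG error rate

print(gaia_delta, rag_acc_delta, rag_err_delta)  # -8.49 10.68 -6.3
```

Read this way, AutoAgent trails the GAIA leader by 8.49 points while gaining 10.68 points of accuracy and shedding 6.30 points of error rate on MultiHop-RAG.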

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
GAIA average success rate | 55.15% | h2oGPTe Agent v1.6.8 (63.64%) | -8.49 pp | GAIA validation | Table 1 reports AutoAgent 55.15 avg vs h2oGPTe 63.64 | Table 1
GAIA Level 1 success rate | 71.7% | other state-of-the-art agents (no competitor >70%) | first >70% reported on L1 | GAIA level 1 | Table 1 lists Level 1 = 71.7% for AutoAgent | Table 1

What To Try In 7 Days

Use the repo to generate a simple zero-code agent that answers domain docs (upload PDFs, let AutoAgent build the vector DB, run a query).

Prototype an agentic RAG pipeline on a small QA task and compare accuracy vs your current RAG setup.

Create a short workflow (e.g., parallel model voting) to see if majority voting improves correctness on a target reasoning task.
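For the voting experiment, it can help to see the aggregation step on its own. The following is a minimal sketch, not AutoAgent's implementation: `majority_vote`, `run_voting_workflow`, and the lambda solvers are hypothetical stand-ins for real model calls, and a tie for first place is treated as "no decision" by assumption.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Sequence


def majority_vote(answers: Sequence[str]) -> Optional[str]:
    """Return the most common normalized answer, or None if empty or tied."""
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        return None
    ranked = Counter(normalized).most_common()
    # A tie for first place means there is no majority to trust.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def run_voting_workflow(
    question: str, solvers: Sequence[Callable[[str], str]]
) -> Optional[str]:
    """Query every solver in parallel threads, then aggregate by majority."""
    with ThreadPoolExecutor(max_workers=max(1, len(solvers))) as pool:
        answers = list(pool.map(lambda solve: solve(question), solvers))
    return majority_vote(answers)


# Hypothetical solvers standing in for calls to three different models.
solvers = [lambda q: "42", lambda q: " 42 ", lambda q: "41"]
result = run_voting_workflow("What is 6 * 7?", solvers)
print(result)  # 42
```

Returning `None` on a tie makes disagreement visible, so you can log those cases and compare them against single-model accuracy on your target task.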

Agent Features

Memory: self-managing vector DB (documents -> 4096-token chunks); action-observation pairs as short-term context
Planning: event-driven workflows; automatic workflow generation (XML forms); self-play iterative refinement
Tool Use: direct tool-use (when supported by model); transformed tool-use via XML (for models without native tool APIs)
Frameworks: Agentic-SDK; LiteLLM; BrowserGym; E2B sandbox
Is Agentic: Yes
Architectures: Orchestrator-Workers; modular multi-agent
Collaboration: handoff tool for transfers between agents; orchestrator-driven task delegation
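To make the transformed tool-use path concrete, the sketch below parses an XML tool call out of a model reply and dispatches it to a registered function. The `<tool_call>`/`<name>`/`<arg>` schema, the `TOOLS` registry, and the function names are assumptions made for illustration; the summary does not specify AutoAgent's actual XML format.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Optional, Tuple

# Hypothetical tool registry; a real system would register many tools.
TOOLS: Dict[str, Callable[..., str]] = {
    "add": lambda a, b: str(int(a) + int(b)),
}


def parse_tool_call(text: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Parse an assumed <tool_call> element; return None if it is malformed."""
    try:
        root = ET.fromstring(text.strip())
    except ET.ParseError:
        return None  # the caller can re-prompt the model instead of crashing
    if root.tag != "tool_call":
        return None
    name = root.findtext("name")
    if not name:
        return None
    args = {arg.get("name", ""): (arg.text or "") for arg in root.findall("arg")}
    return name.strip(), args


def dispatch(text: str) -> str:
    """Run the requested tool and return its result as an observation string."""
    parsed = parse_tool_call(text)
    if parsed is None:
        return "error: malformed tool call"
    name, args = parsed
    tool = TOOLS.get(name)
    if tool is None:
        return f"error: unknown tool {name!r}"
    return tool(**args)


reply = '<tool_call><name>add</name><arg name="a">2</arg><arg name="b">3</arg></tool_call>'
observation = dispatch(reply)
print(observation)  # 5
```

Returning an error string rather than raising keeps the action-observation loop running, which is one plausible way to recover from the XML syntax errors that auto-generated calls can contain.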

Optimization Features

Token Efficiency: chunking strategy (256-token chunks for RAG retrieval)
Infra Optimization: Docker sandboxing for code execution; support for external sandboxes (E2B)
System Optimization: agent-level task delegation to reduce repeated context; paginated terminal and markdown browsing to manage long outputs
Inference Optimization: test-time scaling via parallel multi-model workflows (majority voting)
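The chunking strategy listed under Token Efficiency amounts to slicing a token sequence into fixed-size windows before embedding. A minimal sketch follows, assuming whitespace-separated words as a stand-in for a real tokenizer; `chunk_tokens` is a hypothetical name, and the default of 256 is the chunk size quoted above for RAG retrieval.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 256) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# Assumption: whitespace splitting approximates tokenization for illustration only.
document = "word " * 600
tokens = document.split()
chunks = chunk_tokens(tokens)
sizes = [len(chunk) for chunk in chunks]
print(sizes)  # [256, 256, 88]
```

Smaller chunks mean less text is sent to the model per retrieved hit, at the cost of more vectors to store and search; a model-specific tokenizer would give different boundaries than this word-based stand-in.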

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

GAIA evaluation uses strict string matching, which can undercount semantically correct answers.

Web-based tasks face anti-automation and dynamic content issues during browsing.

When Not To Use

High-assurance domains (e.g., medical, legal) that require formal verification, unless a human reviews the output.

Environments with strict data governance where automatic API key embedding would violate policy.

Failure Modes

XML parsing or syntax errors during auto-generated tool/agent creation (paper shows SyntaxError recovery traces).

Conflicting outputs from different models in multi-model workflows leading to wrong majority decisions.

Core Entities

Models

gpt-4o-mini, gpt-4o-2024-08-06, claude-3-5-sonnet-20241022, deepseek-v3, text-embedding-3-small, LiteLLM

Metrics

success rate, accuracy, error rate, pass@1

Datasets

GAIA, MultiHop-RAG, MATH-500

Benchmarks

GAIA, MultiHop-RAG, MATH-500