Create, customize, and run multi-step LLM agents from plain language — no code needed

Overview

Decision Snapshot: Needs Validation

AutoAgent provides a practical zero-code path to working agents and RAG pipelines; evidence comes from GAIA leaderboard placement and targeted RAG/math experiments, but real-world readiness depends on your model provider, security needs, and test coverage.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Jiabin Tang, Tianyu Fan, Chao Huang

Links

Abstract / PDF / Code

Why It Matters For Business

AutoAgent lowers the engineering barrier: product teams can prototype custom assistants, retrieval workflows, and API-backed tools from plain language, cutting specialist developer time and speeding deployment.

Who Should Care

Product Manager, Founder, CTO, Engineering Lead, ML Engineer, Data Scientist

Summary TLDR

AutoAgent is a zero-code framework that turns plain-language requests into working multi-agent systems, tools, and workflows. Key pieces: a modular Orchestrator-Workers agent stack, an LLM-driven Actionable Engine (supports direct and XML-transformed tool calls), a self-managing file system that stores documents as vector DB chunks, and a self-play customization loop that generates agents and workflows as XML. Evaluations show strong results: second place on the GAIA generalist-agent leaderboard and clear gains on a multihop RAG benchmark. Code: https://github.com/HKUDS/AutoAgent.

Problem Statement

Building capable LLM agents today requires programming skill and prompt engineering. The authors argue this limits adoption since only a tiny fraction of people can code. They aim to let anyone create, customize, and run multi-agent workflows using only natural language, with automatic tool creation, debugging, and orchestration.

Main Contribution

A zero-code, language-driven OS for LLM agents that converts plain-language specs into runnable agents, tools, and workflows.

A modular Agentic System Utilities stack (Orchestrator, Web, Coding, Local File agents) with clear tool APIs and sandboxed execution.

Key Findings

Strong GAIA performance — close to top commercial agents.

Numbers: GAIA avg success: AutoAgent 55.15% vs top h2oGPTe 63.64%; Level 1: 71.7%

Practical Use: AutoAgent can handle many everyday assistant tasks out of the box; expect near-state-of-the-art generalist multi-agent behavior without coding.

Evidence Ref: Table 1 (GAIA leaderboard)

Agent-based RAG gives large accuracy gains on multihop retrieval tasks.

Numbers: MultiHop-RAG accuracy: AutoAgent 73.51% vs LangChain 62.83%; error rate: 14.20% vs 20.50%

Practical Use: Use AutoAgent's agentic retrieval and orchestration when accuracy on multi-hop document QA matters; it reduces confident wrong answers.

Evidence Ref: Table 2 (RAG results)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
GAIA average success rate	55.15%	h2oGPTe Agent v1.6.8 (63.64%)	-8.49 pp	GAIA validation	Table 1 reports AutoAgent 55.15 avg vs h2oGPTe 63.64	Table 1
GAIA Level 1 success rate	71.7%	other state-of-the-art agents (no competitor >70%)	first >70% reported on L1	GAIA level 1	Table 1 lists Level 1 = 71.7% for AutoAgent	Table 1
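The deltas reported here are percentage-point differences between AutoAgent and the named baseline. A minimal sketch of that arithmetic, using only the figures quoted on this page; the helper names pp_delta and relative_change are illustrative and do not come from the paper or the repo:

```python
def pp_delta(value: float, baseline: float) -> float:
    """Percentage-point difference between two rates given in percent."""
    return round(value - baseline, 2)


def relative_change(value: float, baseline: float) -> float:
    """Relative change versus the baseline, in percent of the baseline."""
    return round(100 * (value - baseline) / baseline, 1)


# Figures quoted from Table 1 (GAIA) and Table 2 (MultiHop-RAG).
gaia_avg_delta = pp_delta(55.15, 63.64)        # -8.49 pp vs h2oGPTe
rag_acc_delta = pp_delta(73.51, 62.83)         # +10.68 pp vs LangChain
rag_err_delta = pp_delta(14.20, 20.50)         # -6.3 pp vs LangChain
rag_err_relative = relative_change(14.20, 20.50)  # about -30.7% relative

print(gaia_avg_delta, rag_acc_delta, rag_err_delta, rag_err_relative)
```

Percentage points and relative change answer different questions: the error rate drops by 6.3 points, which is roughly a 31% reduction relative to the LangChain baseline.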

What To Try In 7 Days

Use the repo to generate a simple zero-code agent that answers domain docs (upload PDFs, let AutoAgent build the vector DB, run a query).

Prototype an agentic RAG pipeline on a small QA task and compare accuracy vs your current RAG setup.

Create a short workflow (e.g., parallel model voting) to see if majority voting improves correctness on a target reasoning task.

Agent Features

Memory

self-managing vector DB (documents -> 4096-token chunks)

action-observation pairs as short-term context
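Fixed-size chunking of a document before it is embedded can be sketched as follows. This is an illustration only: it uses whitespace-separated words as a stand-in for model tokens, and chunk_tokens is a hypothetical helper, not AutoAgent's implementation. A chunk size of 4 is used so the example stays readable; the page above quotes 4096 tokens for stored documents.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# Assumption: whitespace splitting approximates tokenization for this demo.
doc = "AutoAgent stores uploaded documents as chunks in a vector database".split()
chunks = chunk_tokens(doc, chunk_size=4)

for index, chunk in enumerate(chunks):
    print(index, " ".join(chunk))
```

Each chunk would then be embedded and written to the vector store; the last chunk is simply shorter when the document length is not a multiple of the chunk size.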

Planning

event-driven workflows

automatic workflow generation (XML forms)

self-play iterative refinement

Tool Use

direct tool-use (when supported by model)

transformed tool-use via XML (for models without native tool APIs)
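For models without a native tool API, a tool call can be emitted as XML text and parsed by the framework before dispatch. The sketch below shows the general idea with Python's standard xml.etree.ElementTree. The tag layout (tool_call, name, arg) and the add tool are assumptions made for this example; the page does not specify AutoAgent's actual schema.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Tuple


def parse_tool_call(xml_text: str) -> Tuple[str, Dict[str, str]]:
    """Extract the tool name and string arguments from an XML tool call."""
    root = ET.fromstring(xml_text)
    name = root.findtext("name")
    if name is None:
        raise ValueError("tool call is missing a <name> element")
    args = {arg.get("name"): (arg.text or "") for arg in root.findall("arg")}
    return name, args


def add(a: str, b: str) -> int:
    """Hypothetical tool: arguments arrive as strings and are converted here."""
    return int(a) + int(b)


TOOLS: Dict[str, Callable[..., object]] = {"add": add}

# Hypothetical model output in the assumed schema.
model_output = (
    '<tool_call><name>add</name>'
    '<arg name="a">2</arg><arg name="b">3</arg></tool_call>'
)
tool_name, tool_args = parse_tool_call(model_output)
result = TOOLS[tool_name](**tool_args)
print(tool_name, tool_args, result)
```

Because every argument arrives as text, type conversion has to happen inside the tool or in a validation layer in front of it.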

Frameworks

Agentic-SDK, LiteLLM, BrowserGym, E2B sandbox

Is Agentic

Yes

Architectures

Orchestrator-Workers, modular multi-agent

Collaboration

handoff tool for transfers between agents

orchestrator-driven task delegation
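One way to picture a handoff between agents is a loop in which each agent either returns a final answer or a request to transfer the task. The sketch below is an assumption-laden illustration: the Handoff class, the routing rule, and the agent functions are all hypothetical and stand in for LLM-backed agents.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Handoff:
    """Request to transfer the task to another agent (hypothetical type)."""
    target: str
    task: str


AgentFn = Callable[[str], Union[str, Handoff]]


def orchestrator(task: str) -> Union[str, Handoff]:
    # Toy routing rule standing in for an LLM decision.
    if "http" in task:
        return Handoff("web", task)
    return Handoff("coding", task)


def web_agent(task: str) -> str:
    return f"web agent handled: {task}"


def coding_agent(task: str) -> str:
    return f"coding agent handled: {task}"


AGENTS: Dict[str, AgentFn] = {
    "orchestrator": orchestrator,
    "web": web_agent,
    "coding": coding_agent,
}


def run(task: str, start: str = "orchestrator", max_hops: int = 5) -> str:
    """Follow handoffs until an agent returns a final answer."""
    current = start
    for _ in range(max_hops):
        outcome = AGENTS[current](task)
        if isinstance(outcome, Handoff):
            current, task = outcome.target, outcome.task
            continue
        return outcome
    raise RuntimeError("too many handoffs without a final answer")


print(run("summarize https://example.com"))
print(run("sort a list of numbers"))
```

The hop limit guards against agents that keep passing a task back and forth.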

Optimization Features

Token Efficiency

chunking strategy (256-token chunks for RAG retrieval)

Infra Optimization

Docker sandboxing for code execution

support for external sandboxes (E2B)

System Optimization

agent-level task delegation to reduce repeated context

paginated terminal and markdown browsing to manage long outputs
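Pagination keeps a long terminal or document output from flooding the model's context: the agent sees one page and asks for the next only when needed. A minimal line-based sketch, where paginate is a hypothetical helper and the page size is arbitrary:

```python
from typing import List


def paginate(text: str, lines_per_page: int) -> List[str]:
    """Split text into pages of at most lines_per_page lines each."""
    if lines_per_page <= 0:
        raise ValueError("lines_per_page must be positive")
    lines = text.splitlines()
    return [
        "\n".join(lines[i:i + lines_per_page])
        for i in range(0, len(lines), lines_per_page)
    ]


# Stand-in for a long command output.
output = "\n".join(f"line {n}" for n in range(1, 8))
pages = paginate(output, lines_per_page=3)

print(f"page 1 of {len(pages)}")
print(pages[0])
```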

Inference Optimization

test-time scaling via parallel multi-model workflows (majority voting)
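Majority voting across parallel model runs can be sketched in a few lines. The function below is an illustration, not the workflow AutoAgent generates: it normalizes answers by case and surrounding whitespace (an assumption) and returns None on a tie so the caller can escalate instead of picking arbitrarily.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(answers: List[str]) -> Optional[str]:
    """Return the most common normalized answer, or None if empty or tied."""
    if not answers:
        return None
    counts = Counter(answer.strip().lower() for answer in answers)
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # top two answers are tied; no clear majority
    return ranked[0][0]


# Hypothetical outputs from three model runs on the same question.
print(majority_vote(["42", " 42", "41"]))
print(majority_vote(["yes", "no"]))
```

Voting only helps when the models' errors are not strongly correlated; if most runs share the same mistake, the majority is confidently wrong.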

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/HKUDS/AutoAgent

Risks & Boundaries

Limitations

GAIA evaluation uses strict string matching, which can undercount semantically correct answers.

Web-based tasks face anti-automation and dynamic content issues during browsing.
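The string-matching limitation is easy to see with a toy comparison. The sketch below contrasts a strict equality check with a lightly normalized one; neither function is GAIA's actual scorer, and the normalization rules (lowercasing, punctuation stripping, whitespace collapsing) are assumptions chosen for illustration.

```python
import string


def exact_match(prediction: str, gold: str) -> bool:
    """Strict comparison: any difference in case or punctuation fails."""
    return prediction == gold


def normalized_match(prediction: str, gold: str) -> bool:
    """Compare after lowercasing, stripping punctuation, collapsing spaces."""
    def normalize(text: str) -> str:
        text = text.strip().lower()
        text = text.translate(str.maketrans("", "", string.punctuation))
        return " ".join(text.split())
    return normalize(prediction) == normalize(gold)


print(exact_match("Paris.", "Paris"))
print(normalized_match("Paris.", "paris"))
```

Punctuation stripping is itself lossy (it would turn "3.14" into "314"), so any relaxed scorer needs separate handling for numeric answers.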

When Not To Use

High-assurance domains that require formal verification (medical, legal) without human oversight.

Environments with strict data governance where automatic API key embedding would violate policy.

Failure Modes

XML parsing or syntax errors during auto-generated tool/agent creation (paper shows SyntaxError recovery traces).

Conflicting outputs from different models in multi-model workflows leading to wrong majority decisions.
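For the XML failure mode, a common mitigation is to validate the generated XML and feed the parser error back to the generator for another attempt. The sketch below assumes a hypothetical generate callable that accepts the previous error message (or None on the first try); fake_model is a stub standing in for an LLM call, and none of this is taken from the AutoAgent codebase.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional


def generate_valid_xml(
    generate: Callable[[Optional[str]], str],
    max_attempts: int = 3,
) -> ET.Element:
    """Call generate until it returns well-formed XML or attempts run out."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        candidate = generate(feedback)
        try:
            return ET.fromstring(candidate)
        except ET.ParseError as err:
            # Pass the parser message back so the next attempt can fix it.
            feedback = f"XML parse error: {err}"
    raise ValueError(f"no well-formed XML after {max_attempts} attempts")


def fake_model(feedback: Optional[str]) -> str:
    """Stub generator: malformed on the first call, corrected after feedback."""
    if feedback is None:
        return "<agent><name>helper</agent>"
    return "<agent><name>helper</name></agent>"


root = generate_valid_xml(fake_model)
print(root.tag, root.findtext("name"))
```

Well-formedness is only the first check; a generated agent definition can parse cleanly and still reference tools that do not exist.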

Core Entities

Models

gpt-4o-mini, gpt-4o-2024-08-06, claude-3-5-sonnet-20241022, deepseek-v3, text-embedding-3-small, LiteLLM

Metrics

success rate, accuracy, error rate, pass@1

Datasets

GAIA, MultiHop-RAG, MATH-500

Benchmarks

GAIA, MultiHop-RAG, MATH-500

