Let an LLM program better agents in code: Meta Agent Search discovers agent workflows that beat hand‑designed agents on several benchmarks

August 15, 2024 · 9 min

Overview

Production Readiness

0.4

Novelty Score

0.8

Cost Impact Score

0.6

Citation Count

7

Authors

Shengran Hu, Cong Lu, Jeff Clune

Links

Abstract / PDF

Why It Matters For Business

Automated agent design can reduce manual engineering time and produce stronger task-specific agents, cutting error rates on QA and math tasks and enabling faster iteration on agent workflows.

Summary TLDR

This paper proposes Automated Design of Agentic Systems (ADAS): automatically discovering agentic systems by searching a space of agents defined in code. The authors introduce Meta Agent Search, an algorithm in which a foundation model (the meta agent) writes new agent code, self-reflects on it, evaluates it on a validation set, and archives the result. On multiple benchmarks (ARC, DROP, MGSM, MMLU, GPQA), the discovered agents outperform common hand-designed agents and transfer well across domains and models. The code is open-sourced. The method is promising for automating agent design, but it depends on the FM's coding quality, is currently costly, and was tested mainly on single-step QA tasks.

Problem Statement

Building effective agents usually needs hand-crafted workflows, prompt hacks, and tool glue. Can we automate invention and assembly of agentic systems by letting a foundation model program agents in code, search that code space, and iteratively improve discoveries?

Main Contribution

Define Automated Design of Agentic Systems (ADAS) and formalize it with three components: a search space, a search algorithm, and an evaluation function.

Propose Meta Agent Search: a foundation model acting as a meta agent programs new agents in code, self-reflects on them, tests them, and archives them for iterative discovery.

Empirical demonstration: discovered agents outperform several hand-designed baselines across ARC, DROP, MGSM, MMLU, and GPQA and transfer across domains and models.

Release a small framework (<100 lines) and open-source code to reproduce the meta-agent-in-code workflow.
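The loop described above (program a new agent, self-reflect, evaluate on a validation set, archive) can be sketched as follows. This is a minimal illustration rather than the authors' code: `propose_agent`, `reflect`, `evaluate`, and `ArchiveEntry` are hypothetical names, and the foundation-model call is replaced by a deterministic stub so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A candidate agent is just a function from a task input to an answer.
Agent = Callable[[int], int]

@dataclass
class ArchiveEntry:
    name: str
    agent: Agent
    score: float  # validation accuracy in [0, 1]

def evaluate(agent: Agent, validation: List[Tuple[int, int]]) -> float:
    """Fraction of validation examples the agent answers correctly."""
    correct = sum(1 for x, y in validation if agent(x) == y)
    return correct / len(validation)

def propose_agent(archive: List[ArchiveEntry], step: int) -> Agent:
    """Hypothetical stand-in for the meta agent writing new agent code.
    A real system would prompt a foundation model with the archive."""
    offset = step  # stub: try a different constant offset each iteration
    return lambda x: 2 * x + offset

def reflect(agent: Agent) -> Agent:
    """Hypothetical stand-in for self-reflection; here it is a no-op."""
    return agent

def meta_agent_search(seed, validation, iterations):
    # Seed the archive with hand-designed agents, then grow it iteratively.
    archive = [ArchiveEntry(n, a, evaluate(a, validation)) for n, a in seed]
    for step in range(1, iterations + 1):
        candidate = reflect(propose_agent(archive, step))
        score = evaluate(candidate, validation)
        archive.append(ArchiveEntry(f"agent_{step}", candidate, score))
    return archive

# Toy task (an assumption for illustration): the target mapping is y = 2x + 3.
validation_set = [(x, 2 * x + 3) for x in range(10)]
seed_agents = [("identity", lambda x: x)]
archive = meta_agent_search(seed_agents, validation_set, iterations=5)
best = max(archive, key=lambda e: e.score)
print(best.name, best.score)
```

In this toy setup every evaluated candidate is archived, so later proposals could condition on both strong and weak earlier designs; whether a real run should filter the archive is a design choice the sketch does not settle.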

Key Findings

Meta Agent Search finds agents that substantially improve reading-comprehension performance over hand-designed agents.

Numbers: DROP F1 +13.6 pp (paper claim)

Discovered agents deliver large gains on math benchmarks when searched within math tasks.

Numbers: MGSM accuracy +14.4 pp (paper claim)

Agents discovered on a math domain transfer to other math datasets with big gains.

Numbers: GSM8K accuracy +25.9 pp; GSM-Hard +13.2 pp (paper claim)

Discovered agents generalize across foundation models and domains.

Numbers: on ARC, the top discovered agent reached up to ~48.3% on Claude-Sonnet (Table 3)

Results

DROP F1

Value: 79.4 ± 0.8

Baseline: Chain-of-Thought 64.2 ± 0.9

Accuracy

Value: 53.4 ± 3.5

Baseline: Chain-of-Thought 28.0 ± 3.1

Accuracy

Value: 69.5 ± 3.2 (Dynamic Role-Playing Architecture)

Baseline: Chain-of-Thought 34.9 ± 3.2

Accuracy

Value: ≈13.7 ± 3.9 (best discovered agent)

Baseline: Chain-of-Thought 6.0 ± 2.7

API cost

Value: $300–$500 per search run (reported)
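As a quick arithmetic check, the gap between each reported value and its Chain-of-Thought baseline can be computed directly from the rows above. The sketch below uses only those numbers; the row labels and the `gain_pp` helper are illustrative, and the fourth value is approximate in the source. These differences are relative to the Chain-of-Thought rows only; the gains quoted under Key Findings may be measured against a different baseline, so the two sets of numbers need not match.

```python
# (value, Chain-of-Thought baseline) pairs copied from the Results rows above.
results = {
    "DROP F1": (79.4, 64.2),
    "Accuracy (row 2)": (53.4, 28.0),
    "Accuracy (row 3)": (69.5, 34.9),
    "Accuracy (row 4)": (13.7, 6.0),  # value reported as approximate
}

def gain_pp(value: float, baseline: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(value - baseline, 1)

gains = {name: gain_pp(v, b) for name, (v, b) in results.items()}
for name, g in gains.items():
    print(f"{name}: +{g} pp over Chain-of-Thought")
```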

What To Try In 7 Days

Run the authors' repo on a small in-house QA task and compare to your current prompt pipeline.

Seed the archive with 2–3 strong human agents (e.g., CoT, Self-Refine) and run 10–20 meta iterations to see early patterns.

Inspect discovered 'forward' functions to harvest reusable workflow patterns (ensembles, expert critics).

Agent Features

Memory

  • archive of discovered agents (external archive used for conditioning)
  • iteration-indexed Info objects passed between modules

Planning

  • chain-of-thought prompting (CoT)
  • self-reflection / iterative refinement
  • ensemble + expert critique feedback loops

Tool Use

  • FM query APIs (meta agent & modules)
  • code-execution and test functions (ARC)
  • expert critic modules and simulated human feedback

Frameworks

  • small custom framework (provided by authors)
  • LangChain mentioned as a potential seed

Is Agentic

true

Architectures

  • code-defined agents (forward(taskInfo) functions)
  • meta-agent that programs agents iteratively
  • archive-based evolutionary stepping-stone design
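To make "code-defined agents" concrete, here is a minimal sketch of an agent written as a `forward(taskInfo)` function that combines a small ensemble with a critic step. It is an illustration under assumptions, not the authors' framework: the `Info` field names, `stub_model`, and `critic` are hypothetical, and the foundation-model query is replaced by canned answers so the example runs offline.

```python
from collections import Counter, namedtuple

# Illustrative message container passed between modules; field names are assumptions.
Info = namedtuple("Info", ["name", "author", "content", "iteration_idx"])

def stub_model(role: str, question: str) -> str:
    """Stand-in for a foundation-model query; returns canned answers per role."""
    canned = {"solver_a": "4", "solver_b": "4", "solver_c": "5"}
    return canned[role]

def critic(answers):
    """Toy 'expert critic': keep the answer most solvers agree on."""
    return Counter(a.content for a in answers).most_common(1)[0][0]

def forward(taskInfo: Info) -> Info:
    """An agent expressed as code: query several solvers, then critique."""
    answers = [
        Info("answer", role, stub_model(role, taskInfo.content), 0)
        for role in ("solver_a", "solver_b", "solver_c")
    ]
    final = critic(answers)
    return Info("final_answer", "critic", final, 1)

task = Info("task", "user", "What is 2 + 2?", -1)
result = forward(task)
print(result.content)
```

Because the whole workflow lives in one function body, a meta agent can change the control flow (add solvers, swap the critic, loop for refinement) simply by emitting different code.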

Collaboration

  • role assignment and multi-expert committees
  • peer review and critique between modules

Reproducibility

Data URLs

  • ARC (Abstraction and Reasoning Corpus)
  • DROP
  • MGSM
  • MMLU
  • GPQA
  • GSM8K
  • GSM-Hard

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Search and evaluation are costly (authors report $300–$500 per run).
  • Experiments target mainly single-step QA tasks, not complex interactive environments.
  • Evaluation optimizes a single metric (performance); latency, cost, and safety are not jointly optimized.
  • Quality depends on the meta agent FM's coding and reasoning ability; weaker FMs may yield poorer designs.

When Not To Use

  • Safety-critical deployments where generated code cannot be fully sandboxed or audited.
  • Low-budget scenarios where API cost for repeated search is prohibitive.
  • Interactive, multi-step environment control tasks not covered by single-step QA experiments.

Failure Modes

  • The meta agent can generate buggy or malicious code, so runs require containerization and manual review.
  • Overfitting to the validation set, or archiving stepping-stones that are not broadly useful.
  • Reliance on the FM's internal knowledge limits gains when the base model lacks the necessary facts.
  • Search may get stuck producing variants of the same design without stronger novelty incentives.
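One small piece of the mitigation for buggy generated code can be sketched as running each candidate in a separate process with a timeout, so a crash or infinite loop does not take down the search. This is an assumption-laden illustration: `run_generated_code` is a hypothetical helper, and a subprocess with a timeout is not a sandbox, so it does not replace the containerization and manual review mentioned above.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_generated_code(source: str, timeout_s: float = 5.0) -> dict:
    """Run candidate code in a separate Python process.
    Note: this isolates crashes and hangs only; it is NOT a security sandbox."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(source)
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "error": "timeout", "stdout": ""}
        return {
            "ok": proc.returncode == 0,
            "error": proc.stderr.strip(),
            "stdout": proc.stdout.strip(),
        }

good = run_generated_code("print(1 + 1)")
buggy = run_generated_code("raise ValueError('bad agent')")
print(good["ok"], buggy["ok"])
```

A failed run can be logged with its error text and fed back to the meta agent, while the search continues with the next candidate.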

Core Entities

Models

  • gpt-4 / gpt-4o (meta agent in search)
  • gpt-3.5-turbo (evaluation of discovered agents)
  • claude-3-haiku
  • claude-3-5-sonnet

Metrics

  • Accuracy
  • F1
  • 95% bootstrap confidence interval
  • API cost (USD)
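For readers unfamiliar with the bootstrap confidence intervals listed above, the following is a minimal percentile-bootstrap sketch for an accuracy score. The exact procedure used in the paper is not described here, so the resample count, the fixed seed, and the `bootstrap_ci` helper are all assumptions for illustration, as is the toy data.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Toy data (an assumption): 60 correct answers out of 100 questions.
outcomes = [1] * 60 + [0] * 40
low, high = bootstrap_ci(outcomes)
mean = sum(outcomes) / len(outcomes)
print(f"accuracy = {mean:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

With small evaluation sets the interval is wide, which is consistent with the ± 3 point ranges reported for the accuracy results.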

Datasets

  • ARC
  • DROP
  • MGSM
  • MMLU
  • GPQA
  • GSM8K
  • GSM-Hard
  • SVAMP
  • ASDiv

Benchmarks

  • ARC
  • DROP
  • MGSM
  • MMLU
  • GPQA
  • GSM8K
  • GSM-Hard