Let an LLM program better agents in code: Meta Agent Search discovers agent workflows that beat hand‑designed agents on several benchmarks

August 15, 2024 | 9 min

Overview

Decision Snapshot: Needs Validation

The method shows clear empirical gains across standard QA and math benchmarks, but the experiments focus on single-step QA and use API-based foundation models (FMs). Production maturity still requires safer evaluation, cheaper evaluation functions, and multi-objective search.

Citations: 7

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 80%

Authors

Shengran Hu, Cong Lu, Jeff Clune

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Automated agent design can reduce manual engineering time and produce stronger task-specific agents, cutting error rates on QA and math tasks and enabling faster iteration on agent workflows.

Summary TLDR

This paper proposes Automated Design of Agentic Systems (ADAS): automatically discovering agentic systems by searching a space of agent code. The authors introduce Meta Agent Search, an algorithm in which a foundation model (the meta agent) writes new agent code, self-reflects on it, evaluates it on a validation set, and archives the result. On multiple benchmarks (ARC, DROP, MGSM, MMLU, GPQA) the discovered agents outperform common hand-designed agents and transfer well across domains and models. Code is open-sourced. The method is promising for automating agent design, but it depends on the FM's coding quality, is currently costly, and was tested mainly on single-step QA tasks.

Problem Statement

Building effective agents usually needs hand-crafted workflows, prompt hacks, and tool glue. Can we automate invention and assembly of agentic systems by letting a foundation model program agents in code, search that code space, and iteratively improve discoveries?

Main Contribution

Define Automated Design of Agentic Systems (ADAS) and formalize it with three components: a search space, a search algorithm, and an evaluation function.

Propose Meta Agent Search: a foundation-model meta agent that programs new agents in code, self-reflects on them, tests them, and archives them for iterative discovery.
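The loop described above can be sketched in a few lines of Python. This is a minimal, offline illustration, not the authors' implementation: `propose_agent` is a hypothetical stand-in for the FM meta agent (it wraps an archived agent with a trivial normalization step instead of writing new code), and `ArchiveEntry`, `evaluate`, and `meta_agent_search` are names chosen here for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

# An agent is modeled as a function from a task input to an answer.
Agent = Callable[[str], str]


@dataclass
class ArchiveEntry:
    name: str
    agent: Agent
    score: float  # validation accuracy in [0, 1]


def evaluate(agent: Agent, validation_set: List[Tuple[str, str]]) -> float:
    """Evaluation function: fraction of validation questions answered correctly."""
    correct = sum(1 for question, answer in validation_set if agent(question) == answer)
    return correct / len(validation_set)


def propose_agent(archive: List[ArchiveEntry], rng: random.Random) -> ArchiveEntry:
    """Hypothetical stand-in for the meta agent.

    A real meta agent would ask an FM to write new agent code conditioned on
    the archive. Here we only wrap a randomly chosen archived agent so the
    loop runs without any API access.
    """
    parent = rng.choice(archive)

    def child(question: str, _parent: Agent = parent.agent) -> str:
        return _parent(question).strip().lower()

    return ArchiveEntry(name=parent.name + "+normalize", agent=child, score=0.0)


def meta_agent_search(seed_agents, validation_set, iterations=5, seed=0):
    """Propose, evaluate, and archive agents; return the best entry and the archive."""
    rng = random.Random(seed)
    archive = [ArchiveEntry(n, a, evaluate(a, validation_set)) for n, a in seed_agents]
    for _ in range(iterations):
        candidate = propose_agent(archive, rng)
        candidate.score = evaluate(candidate.agent, validation_set)
        archive.append(candidate)  # every candidate is kept as a stepping stone
    return max(archive, key=lambda entry: entry.score), archive


# Toy usage: one seed agent that answers with inconsistent formatting.
validation = [("capital of France?", "paris"), ("2+2?", "4")]


def shouty(question: str) -> str:
    return {"capital of France?": " PARIS ", "2+2?": "4"}[question]


best, archive = meta_agent_search([("shouty", shouty)], validation, iterations=3)
```

The three ADAS components map onto the sketch directly: the search space is the set of Python callables, the search algorithm is the propose-and-archive loop, and the evaluation function is validation accuracy.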

Key Findings

Meta Agent Search finds agents that substantially improve reading-comprehension performance over hand-designed agents.

Numbers: DROP F1 +13.6 pp (paper claim)

Practical Use: If you automate agent design with a meta agent, you can cut error rates on reading-comprehension tasks by double-digit points versus common manual agent designs; try automated search for QA pipelines before hand-tuning.

Evidence Ref: Main text (Section 4.2) and Table 1

Discovered agents deliver large gains on math benchmarks when searched within math tasks.

Numbers: MGSM accuracy +14.4 pp (paper claim)

Practical Use: For math/problem-solving workloads, running meta agent search can yield much stronger reasoners than typical Chain-of-Thought or role-based prompts.

Evidence Ref: Main text (Section 4.2) and Table 1

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| DROP F1 | 79.4 ± 0.8 | Chain-of-Thought 64.2 ± 0.9 | +15.2 pp (vs CoT) / +13.6 pp (paper claim vs SOTA) | DROP (Reading Comprehension) | Table 1 shows Meta Agent Search F1 79.4 ± 0.8 and CoT 64.2 ± 0.9 | Table 1 |
| Accuracy | 53.4 ± 3.5 | Chain-of-Thought 28.0 ± 3.1 | +25.4 pp (vs CoT) / +14.4 pp (paper claim vs baselines aggregated) | MGSM (Math) | Table 1 reports 53.4 ± 3.5 for Meta Agent Search vs 28.0 ± 3.1 for CoT | Table 1 |
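Each row above carries two deltas. The first is a plain subtraction against the Chain-of-Thought baseline in the same row, which the short check below reproduces. The second ("paper claim") uses a different reference point, as the row labels state, so it is not expected to match this subtraction. The dictionary keys and the helper name `delta_pp` are illustrative only.

```python
# (Meta Agent Search score, Chain-of-Thought score) as listed in the Results rows.
results = {
    "DROP F1": (79.4, 64.2),
    "MGSM accuracy": (53.4, 28.0),
}


def delta_pp(system: float, baseline: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal place."""
    return round(system - baseline, 1)


deltas = {name: delta_pp(system, baseline) for name, (system, baseline) in results.items()}
```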

What To Try In 7 Days

Run the authors' repo on a small in-house QA task and compare to your current prompt pipeline.

Seed the archive with 2–3 strong human agents (e.g., CoT, Self-Refine) and run 10–20 meta iterations to see early patterns.

Inspect discovered 'forward' functions to harvest reusable workflow patterns (ensembles, expert critics).
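To make "harvesting a workflow pattern" concrete, here is a minimal sketch of one such pattern, an ensemble with majority voting, written as a `forward` function. It is not taken from the authors' repo: `make_ensemble_forward`, `QueryFn`, and `fake_fm` are hypothetical names, and `fake_fm` returns canned answers in place of a real FM call.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical stand-in for an FM call: (role, question) -> answer text.
QueryFn = Callable[[str, str], str]


def make_ensemble_forward(query_fm: QueryFn, roles: List[str]) -> Callable[[str], str]:
    """Build a forward(task_info) that polls several role-prompted experts
    and returns the majority answer (ties go to the earliest answer seen)."""

    def forward(task_info: str) -> str:
        answers = [query_fm(role, task_info) for role in roles]
        # Counter.most_common lists equally common items in first-seen order.
        return Counter(answers).most_common(1)[0][0]

    return forward


def fake_fm(role: str, question: str) -> str:
    """Canned answers so the example runs without API access."""
    canned = {"mathematician": "42", "teacher": "42", "skeptic": "41"}
    return canned[role]


forward = make_ensemble_forward(fake_fm, ["mathematician", "teacher", "skeptic"])
answer = forward("What is 6 * 7?")
```

Once a pattern is isolated like this, it can be dropped into an existing pipeline by replacing `fake_fm` with whatever FM client that pipeline already uses.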

Agent Features

Memory
archive of discovered agents (external archive used for conditioning); iteration-indexed Info objects passed between modules
Planning
chain-of-thought prompting (CoT); self-reflection / iterative refinement; ensemble + expert critique feedback loops
Tool Use
FM query APIs (meta agent & modules); code-execution and test functions (ARC); expert critic modules and simulated human feedback
Frameworks
small custom framework (provided by authors); LangChain mentioned as a potential seed
Is Agentic

Yes

Architectures
code-defined agents (forward(taskInfo) functions); meta-agent that programs agents iteratively; archive-based evolutionary stepping-stone design
Collaboration
role assignment and multi-expert committees; peer-review and critic modules across modules

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Data URLs

ARC (Abstraction and Reasoning Corpus), DROP, MGSM, MMLU, GPQA, GSM8K, GSM-Hard

Risks & Boundaries

Limitations

Search and evaluation are costly (authors report $300–$500 per run).

Experiments target mainly single-step QA tasks, not complex interactive environments.

When Not To Use

Safety-critical deployments where generated code cannot be fully sandboxed or audited.

Low-budget scenarios where API cost for repeated search is prohibitive.

Failure Modes

Meta agent can generate buggy or malicious code; requires containerization and manual review.

Overfitting to validation sets or discovered stepping-stones that are not broadly useful.
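As a first line of defense against the buggy-code failure mode, generated code can be run in a separate process with a wall-clock limit. The sketch below assumes nothing about the authors' setup, and `run_untrusted` is a name chosen here. It only provides process isolation and a timeout; it is not a security sandbox and does not replace containerization or manual review.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_untrusted(code: str, timeout_s: float = 5.0) -> dict:
    """Run generated code in a separate Python process with a wall-clock limit.

    NOTE: this catches crashes and infinite loops only. It does not restrict
    file, network, or system access.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                # -I runs Python in isolated mode (ignores env vars and user site-packages).
                [sys.executable, "-I", str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "reason": "timeout", "stdout": "", "stderr": ""}
    return {
        "ok": proc.returncode == 0,
        "reason": "exit code %d" % proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

A candidate agent that fails or times out here can be scored as zero and skipped before it ever reaches the validation set.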

Core Entities

Models

gpt-4 / gpt-4o (meta agent in search); gpt-3.5-turbo (evaluation of discovered agents); claude-3-haiku; claude-3-5-sonnet

Metrics

Accuracy; F1; 95% bootstrap confidence interval; API cost (USD)
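For readers unfamiliar with the bootstrap interval listed above, the sketch below shows one common variant, the percentile bootstrap, applied to per-question 0/1 outcomes. This summary does not state the paper's exact resampling procedure, so the variant, the resample count, and the function name `bootstrap_ci` are assumptions, and the `outcomes` list is made-up illustration data.

```python
import random
from typing import List, Tuple


def bootstrap_ci(correct: List[int], n_resamples: int = 2000,
                 level: float = 0.95, seed: int = 0) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for mean accuracy over per-question 0/1 outcomes.

    Returns (observed mean, lower bound, upper bound).
    """
    rng = random.Random(seed)
    n = len(correct)
    # Resample the questions with replacement and record each resample's accuracy.
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    tail = (1.0 - level) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return sum(correct) / n, lower, upper


# Illustrative data only: 53 correct answers out of 100 questions.
outcomes = [1] * 53 + [0] * 47
mean, lower, upper = bootstrap_ci(outcomes)
```

The width of the interval shrinks as the evaluation set grows, which is one reason small validation sets make it hard to tell two candidate agents apart.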

Datasets

ARC, DROP, MGSM, MMLU, GPQA, GSM8K, GSM-Hard, SVAMP, ASDiv

Benchmarks

ARC, DROP, MGSM, MMLU, GPQA, GSM8K, GSM-Hard