Let an LLM program better agents in code: Meta Agent Search discovers agent workflows that beat hand‑designed agents on several benchmarks

Overview

Decision Snapshot: Needs Validation

The method shows clear empirical gains across standard QA and math benchmarks, but the experiments focus on single-step QA and use API-based foundation models (FMs). Production use would require safer evaluation, cheaper evaluation functions, and multi-objective search.

Citations: 7

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 80%

Authors

Shengran Hu, Cong Lu, Jeff Clune

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Automated agent design can reduce manual engineering time and produce stronger task-specific agents, cutting error rates on QA and math tasks and enabling faster iteration on agent workflows.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, CTO, Data Scientist

Summary TLDR

This paper proposes Automated Design of Agentic Systems (ADAS): automatically discovering agentic systems by searching a space of agents defined in code. The authors introduce Meta Agent Search, an algorithm in which a foundation model (the meta agent) writes new agent code, self-reflects on it, evaluates it on a validation set, and archives the result. On multiple benchmarks (ARC, DROP, MGSM, MMLU, GPQA), the discovered agents outperform common hand-designed agents and transfer well across domains and models. The code is open source. The method is promising for automating agent design, but it depends on the FM's coding quality, is currently costly, and was tested mainly on single-step QA tasks.

Problem Statement

Building effective agents usually needs hand-crafted workflows, prompt hacks, and tool glue. Can we automate invention and assembly of agentic systems by letting a foundation model program agents in code, search that code space, and iteratively improve discoveries?

Main Contribution

Define Automated Design of Agentic Systems (ADAS) and formalize it in terms of three components: a search space, a search algorithm, and an evaluation function.

Propose Meta Agent Search: a foundation model acting as a meta agent programs new agents in code, self-reflects, tests the agents, and archives them for iterative discovery.
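The loop described above (propose code, self-reflect, evaluate, archive) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `ArchiveEntry`, `evaluate`, `meta_agent_search`, `toy_propose`, and `toy_reflect` are hypothetical names, the meta agent is replaced by toy functions so the sketch runs without any model API, and exact-match accuracy stands in for whatever evaluation function a real task would use.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, expected answer); a toy task format


@dataclass
class ArchiveEntry:
    name: str
    code: str       # source text defining forward(taskInfo)
    fitness: float  # fraction of validation examples answered correctly


def evaluate(code: str, validation_set: List[Example]) -> float:
    """Load the agent's forward function and score it on the validation set."""
    namespace: dict = {}
    exec(code, namespace)  # generated code is untrusted: isolate this in real use
    forward = namespace["forward"]
    hits = sum(1 for question, answer in validation_set if forward(question) == answer)
    return hits / len(validation_set)


def meta_agent_search(
    propose: Callable[[List[ArchiveEntry]], str],
    reflect: Callable[[str, List[ArchiveEntry]], str],
    seed_archive: List[ArchiveEntry],
    validation_set: List[Example],
    n_iterations: int,
) -> List[ArchiveEntry]:
    archive = list(seed_archive)  # copy so the caller's seed list is untouched
    for _ in range(n_iterations):
        draft = propose(archive)        # meta agent writes new agent code
        code = reflect(draft, archive)  # meta agent revises its own draft
        try:
            fitness = evaluate(code, validation_set)
        except Exception:
            continue                    # discard agents that fail to load or run
        archive.append(ArchiveEntry(f"agent_{len(archive)}", code, fitness))
    return archive


# Toy stand-ins (assumptions) so the sketch runs without any model API.
def toy_propose(archive: List[ArchiveEntry]) -> str:
    return "def forward(taskInfo):\n    return taskInfo.upper()\n"


def toy_reflect(draft: str, archive: List[ArchiveEntry]) -> str:
    return draft


seed = [ArchiveEntry("seed_echo", "def forward(taskInfo):\n    return taskInfo\n", 0.0)]
archive = meta_agent_search(toy_propose, toy_reflect, seed, [("a", "A"), ("b", "B")], 2)
print([(entry.name, entry.fitness) for entry in archive])
```

In a real run, `propose` and `reflect` would call a foundation model with the archive in the prompt, and `evaluate` would execute the generated code in an isolated environment rather than with a bare `exec`.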

Key Findings

Meta Agent Search finds agents that substantially improve reading-comprehension performance over hand-designed agents.

Numbers: DROP F1 +13.6 pp (paper claim)

Practical Use: If you automate agent design with a meta agent, you can cut error rates on reading-comprehension tasks by double-digit points versus common manual agent designs; try automated search for QA pipelines before hand-tuning.

Evidence Ref: Main text (Section 4.2) and Table 1

Discovered agents deliver large gains on math benchmarks when searched within math tasks.

Numbers: MGSM accuracy +14.4 pp (paper claim)

Practical Use: For math/problem-solving workloads, running meta agent search can yield much stronger reasoners than typical Chain-of-Thought or role-based prompts.

Evidence Ref: Main text (Section 4.2) and Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
DROP F1	79.4 ± 0.8	Chain-of-Thought 64.2 ± 0.9	+15.2 pp (vs COT) / +13.6 pp (paper claim vs SOTA)	DROP (Reading Comprehension)	Table 1 shows Meta Agent Search F1 79.4 ±0.8 and COT 64.2 ±0.9	Table 1
Accuracy	53.4 ± 3.5	Chain-of-Thought 28.0 ± 3.1	+25.4 pp (vs COT) / +14.4 pp (paper claim vs baselines aggregated)	MGSM (Math)	Table 1 reports 53.4 ±3.5 for Meta Agent Search vs 28.0 ±3.1 for COT	Table 1
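The "vs COT" deltas in the table are plain differences in percentage points between the reported score and the Chain-of-Thought baseline. The short check below reproduces that arithmetic from the values in the table; `delta_pp` is a hypothetical helper name, and the confidence-interval half-widths are ignored here.

```python
def delta_pp(value: float, baseline: float) -> float:
    """Difference between two percentage scores, in percentage points."""
    return round(value - baseline, 1)


drop_delta = delta_pp(79.4, 64.2)  # DROP F1: Meta Agent Search vs Chain-of-Thought
mgsm_delta = delta_pp(53.4, 28.0)  # MGSM accuracy: Meta Agent Search vs Chain-of-Thought
print(f"DROP: +{drop_delta} pp, MGSM: +{mgsm_delta} pp")
```

The smaller "paper claim" figures in the same column use a different reference point than Chain-of-Thought, so they are not reproduced by this subtraction.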

What To Try In 7 Days

Run the authors' repo on a small in-house QA task and compare to your current prompt pipeline.

Seed the archive with 2–3 strong human agents (e.g., CoT, Self-Refine) and run 10–20 meta iterations to see early patterns.

Inspect discovered 'forward' functions to harvest reusable workflow patterns (ensembles, expert critics).

Agent Features

Memory

archive of discovered agents (external archive used for conditioning)

iteration-indexed Info objects passed between modules

Planning

chain-of-thought prompting (CoT)

self-reflection / iterative refinement

ensemble + expert critique feedback loops

Tool Use

FM query APIs (meta agent & modules)

code-execution and test functions (ARC)

expert critic modules and simulated human feedback

Frameworks

small custom framework (provided by authors)

LangChain mentioned as a potential seed

Is Agentic

Yes

Architectures

code-defined agents (forward(taskInfo) functions)

meta-agent that programs agents iteratively

archive-based evolutionary stepping-stone design
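To make "code-defined agent" concrete, here is a minimal sketch of what a `forward(taskInfo)` function could look like for a simple ensemble workflow: sample several chain-of-thought answers and return the majority vote. It is an illustration under stated assumptions, not code from the authors' framework: `make_ensemble_agent`, `query_fm`, and `fake_fm` are hypothetical names, and the foundation model is replaced by a canned stub so the example runs offline.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def make_ensemble_agent(query_fm: Callable[[str], str], n_samples: int = 5):
    """Build an agent whose forward() takes a majority vote over sampled answers."""

    def forward(taskInfo: str) -> str:
        prompt = f"Think step by step, then give only the final answer.\n\n{taskInfo}"
        # Query the model several times and normalize whitespace in each answer.
        answers = [query_fm(prompt).strip() for _ in range(n_samples)]
        # Return the most frequent answer across the samples.
        return Counter(answers).most_common(1)[0][0]

    return forward


# Stub model (assumption): cycles through canned answers instead of calling an API.
_canned = cycle(["4", "5", "4", "4", "3"])


def fake_fm(prompt: str) -> str:
    return next(_canned)


agent = make_ensemble_agent(fake_fm, n_samples=5)
print(agent("What is 2 + 2?"))
```

Because the whole workflow lives in one function, a meta agent can rewrite any part of it (the prompt, the number of samples, the voting rule) simply by emitting different code.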

Collaboration

role assignment and multi-expert committees

peer-review and critic modules across modules

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/ShengranHu/ADAS

Data URLs

ARC (Abstraction and Reasoning Corpus), DROP, MGSM, MMLU, GPQA, GSM8K, GSM-Hard

Risks & Boundaries

Limitations

Search and evaluation are costly (authors report $300–$500 per run).

Experiments target mainly single-step QA tasks, not complex interactive environments.

When Not To Use

Safety-critical deployments where generated code cannot be fully sandboxed or audited.

Low-budget scenarios where API cost for repeated search is prohibitive.

Failure Modes

Meta agent can generate buggy or malicious code; requires containerization and manual review.

Overfitting to validation sets or discovered stepping-stones that are not broadly useful.
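For the first failure mode above, one basic precaution is to never run generated agent code inside the search process itself. The sketch below runs a code string in a separate Python interpreter with a timeout. It is only a first layer and not a sandbox: a child process can still touch the file system and network, so containerization and review remain necessary. `RunResult` and `run_untrusted` are hypothetical names.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class RunResult:
    ok: bool      # True only if the child exited with status 0 before the timeout
    stdout: str
    stderr: str


def run_untrusted(code: str, timeout_s: float = 5.0) -> RunResult:
    """Run a code string in a child interpreter; kill it if it exceeds the timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores user site and env vars
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return RunResult(False, "", "timed out")
    return RunResult(proc.returncode == 0, proc.stdout, proc.stderr)


result = run_untrusted("print(1 + 1)")
print(result.ok, result.stdout.strip())
```

A crash or an infinite loop in a generated agent then shows up as `ok=False` instead of taking down the search run.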

Core Entities

Models

gpt-4 / gpt-4o (meta agent in search)

gpt-3.5-turbo (evaluation of discovered agents)

claude-3-haiku

claude-3-5-sonnet

Metrics

Accuracy, F1, 95% bootstrap confidence interval, API cost (USD)
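A 95% bootstrap confidence interval, as listed among the metrics, can be computed by resampling per-example scores with replacement and taking percentiles of the resampled means. The sketch below uses the percentile method with the standard library only; the choice of percentile bootstrap, the function name `bootstrap_ci`, and the toy scores are assumptions, since the summary does not specify the exact procedure.

```python
import random
from typing import List, Tuple


def bootstrap_ci(
    scores: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-example scores."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Mean of each resample drawn with replacement, sorted for percentile lookup.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = min(n_resamples - 1, int(n_resamples * (1 - alpha / 2)))
    return means[lo_idx], means[hi_idx]


# Toy data (assumption): 80 correct and 20 incorrect answers out of 100.
toy_scores = [1.0] * 80 + [0.0] * 20
low, high = bootstrap_ci(toy_scores)
print(f"accuracy 0.80, 95% CI [{low:.2f}, {high:.2f}]")
```

Reporting the interval alongside the point estimate makes it easier to judge whether a gap between two agents is larger than the evaluation noise.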

Datasets

ARC, DROP, MGSM, MMLU, GPQA, GSM8K, GSM-Hard, SVAMP, ASDiv

Benchmarks

ARC, DROP, MGSM, MMLU, GPQA, GSM8K, GSM-Hard
