Overview
The method shows clear empirical gains on standard QA and math benchmarks, but the experiments focus on single-step QA and rely on API-based foundation models (FMs). Production use would require safer evaluation, cheaper evaluation functions, and multi-objective search.
Citations: 7
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 80%
Why It Matters For Business
Automated agent design can reduce manual engineering time and produce stronger task-specific agents, cutting error rates on QA and math tasks and enabling faster iteration on agent workflows.
Summary TLDR
This paper proposes Automated Design of Agentic Systems (ADAS): automatically discovering agentic systems by searching a space of agents defined in code. The authors introduce Meta Agent Search, an algorithm in which a foundation model (the meta agent) writes new agent code, self-reflects, evaluates each agent on a validation set, and archives its discoveries. On multiple benchmarks (ARC, DROP, MGSM, MMLU, GPQA), the discovered agents outperform common hand-designed agents and transfer well across domains and models. The code is open-sourced. The method is promising for automating agent design, but it depends on the FM's coding quality, is currently costly, and was tested mainly on single-step QA tasks.
Problem Statement
Building effective agents usually needs hand-crafted workflows, prompt hacks, and tool glue. Can we automate invention and assembly of agentic systems by letting a foundation model program agents in code, search that code space, and iteratively improve discoveries?
Main Contribution
Define Automated Design of Agentic Systems (ADAS) and formalize it in terms of three components: a search space, a search algorithm, and an evaluation function.
Propose Meta Agent Search: a foundation-model meta agent that programs new agents in code, self-reflects, tests the agents, and archives them for iterative discovery.
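The propose, evaluate, and archive loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `ArchiveEntry`, `evaluate`, `meta_agent_search`, and `toy_propose` are hypothetical names, agents are reduced to plain functions from a question to an answer string, and the foundation-model call (including self-reflection) is replaced by a hard-coded stand-in.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumption: an "agent" is reduced to a function from question to answer.
Agent = Callable[[str], str]


@dataclass
class ArchiveEntry:
    name: str
    agent: Agent
    score: float  # validation accuracy in [0, 1]


def evaluate(agent: Agent, validation: List[Tuple[str, str]]) -> float:
    """Evaluation function: fraction of validation questions answered exactly."""
    correct = sum(1 for question, answer in validation if agent(question) == answer)
    return correct / len(validation)


def meta_agent_search(propose, seeds, validation, iterations: int) -> List[ArchiveEntry]:
    """Search algorithm: grow an archive by repeatedly proposing and scoring agents."""
    archive = [ArchiveEntry(name, agent, evaluate(agent, validation))
               for name, agent in seeds]
    for step in range(iterations):
        # In the real method this step is an FM writing code and self-reflecting;
        # here `propose` is any callable that maps the archive to a new agent.
        candidate = propose(archive)
        try:
            score = evaluate(candidate, validation)
        except Exception:
            continue  # discard candidates that crash during evaluation
        archive.append(ArchiveEntry(f"generated_{step}", candidate, score))
    return archive


def toy_propose(archive: List[ArchiveEntry]) -> Agent:
    """Hypothetical stand-in for the meta agent: always returns an adder."""
    return lambda question: str(sum(int(term) for term in question.split("+")))


validation_set = [("1+1", "2"), ("2+3", "5"), ("4+4", "8")]
seed_agents = [("constant_two", lambda question: "2")]

archive = meta_agent_search(toy_propose, seed_agents, validation_set, iterations=2)
best = max(archive, key=lambda entry: entry.score)
print(best.name, best.score)
```

Keeping the evaluation function separate from the proposal step mirrors the three-part formalization (search space, search algorithm, evaluation function), so either piece can be swapped without touching the other.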
Key Findings
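Meta Agent Search finds agents that substantially improve reading-comprehension performance over hand-designed agents: on DROP, F1 reaches 79.4 ± 0.8 versus 64.2 ± 0.9 for Chain-of-Thought (Table 1).
Discovered agents deliver large gains on math benchmarks when searched within math tasks: on MGSM, accuracy reaches 53.4 ± 3.5 versus 28.0 ± 3.1 for Chain-of-Thought (Table 1).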
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| DROP F1 | 79.4 ± 0.8 | Chain-of-Thought (CoT) 64.2 ± 0.9 | +15.2 pp (vs CoT) / +13.6 pp (paper claim vs SOTA) | DROP (Reading Comprehension) | Table 1 shows Meta Agent Search F1 79.4 ± 0.8 and CoT 64.2 ± 0.9 | Table 1 |
| MGSM Accuracy | 53.4 ± 3.5 | Chain-of-Thought (CoT) 28.0 ± 3.1 | +25.4 pp (vs CoT) / +14.4 pp (paper claim vs baselines aggregated) | MGSM (Math) | Table 1 reports 53.4 ± 3.5 for Meta Agent Search vs 28.0 ± 3.1 for CoT | Table 1 |
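The Chain-of-Thought deltas in the table are plain differences in percentage points and can be recomputed from the reported means. The snippet below is only an arithmetic check; the dictionary keys are illustrative labels, and the error bars are not propagated.

```python
# Reported means from the table: (Meta Agent Search, Chain-of-Thought baseline).
results = {
    "DROP F1": (79.4, 64.2),
    "MGSM accuracy": (53.4, 28.0),
}

# Delta in percentage points, rounded to one decimal like the table.
deltas = {
    name: round(value - baseline, 1)
    for name, (value, baseline) in results.items()
}

for name, delta in deltas.items():
    print(f"{name}: +{delta} pp vs Chain-of-Thought")
```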
What To Try In 7 Days
Run the authors' repo on a small in-house QA task and compare to your current prompt pipeline.
Seed the archive with 2–3 strong human agents (e.g., CoT, Self-Refine) and run 10–20 meta iterations to see early patterns.
Inspect discovered 'forward' functions to harvest reusable workflow patterns (ensembles, expert critics).
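One workflow pattern worth looking for when inspecting discovered agents is an ensemble with majority voting. The sketch below shows that pattern in isolation. The `forward` signature used here (a single question string in, an answer string out) and the helper `make_ensemble_forward` are assumptions for illustration; the signature in the authors' repo may differ.

```python
from collections import Counter
from typing import Callable, List

Solver = Callable[[str], str]


def make_ensemble_forward(solvers: List[Solver]) -> Solver:
    """Build a `forward` function that returns the majority answer of its solvers."""

    def forward(question: str) -> str:
        answers = [solve(question) for solve in solvers]
        # most_common(1) gives the most frequent answer; on a tie, the answer
        # that was seen first wins.
        return Counter(answers).most_common(1)[0][0]

    return forward


# Hypothetical solvers standing in for separate FM calls.
solvers = [lambda q: "4", lambda q: "5", lambda q: "4"]
forward = make_ensemble_forward(solvers)
print(forward("What is 2 + 2?"))
```

In practice each solver would wrap a model call with a different prompt or role, for example a reasoning agent plus an expert critic.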
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Search and evaluation are costly (authors report $300–$500 per run).
Experiments target mainly single-step QA tasks, not complex interactive environments.
When Not To Use
Safety-critical deployments where generated code cannot be fully sandboxed or audited.
Low-budget scenarios where API cost for repeated search is prohibitive.
Failure Modes
The meta agent can generate buggy or malicious code, so generated agents should run in a container and be reviewed manually.
Overfitting to validation sets or discovered stepping-stones that are not broadly useful.
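As a first line of defense against buggy generated code, a candidate can be executed in a separate process with a timeout. The sketch below assumes generated agents arrive as Python source strings; `run_untrusted` is a hypothetical helper. It only guards against crashes and infinite loops. It is not a sandbox and does not replace containerization or manual review.

```python
import os
import subprocess
import sys
import tempfile
from typing import Tuple


def run_untrusted(code: str, timeout_s: float = 5.0) -> Tuple[bool, str]:
    """Run `code` in a child Python process; return (succeeded, stdout or reason)."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "candidate.py")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(code)
        try:
            proc = subprocess.run(
                [sys.executable, "-I", path],  # -I: isolated mode, ignores user site and env vars
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return False, "timeout"
        return proc.returncode == 0, proc.stdout.strip()


ok, output = run_untrusted("print(1 + 1)")
print(ok, output)

crashed_ok, _ = run_untrusted("raise ValueError('bug in generated agent')")
print(crashed_ok)
```

A child process still has the parent's file and network access, which is why the limitation above calls for containers rather than process isolation alone.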

