Overview
Production Readiness
0.4
Novelty Score
0.8
Cost Impact Score
0.6
Citation Count
7
Why It Matters For Business
Automated agent design can reduce manual engineering time and produce stronger task-specific agents, cutting error rates on QA and math tasks and enabling faster iteration on agent workflows.
Summary TLDR
This paper proposes Automated Design of Agentic Systems (ADAS): automatically discovering agentic systems by searching over agent code. The authors introduce Meta Agent Search, an algorithm in which a foundation model (the meta agent) writes new agent code, self-reflects on it, evaluates it on a validation set, and archives the result. On multiple benchmarks (ARC, DROP, MGSM, MMLU, GPQA), the discovered agents outperform common hand-designed agents and transfer well across domains and models. Code is open-sourced. The method is promising for automating agent design, but it depends on the FM's coding quality, is currently costly, and was tested mainly on single-step QA tasks.
Problem Statement
Building effective agents usually requires hand-crafted workflows, prompt hacks, and tool glue. Can we automate the invention and assembly of agentic systems by letting a foundation model program agents in code, search that code space, and iteratively improve on its discoveries?
Main Contribution
Define Automated Design of Agentic Systems (ADAS) and formalize it in terms of three components: a search space, a search algorithm, and an evaluation function.
Propose Meta Agent Search: a foundation model acting as a meta agent programs new agents in code, self-reflects, tests the agents, and archives them for iterative discovery.
Empirical demonstration: discovered agents outperform several hand-designed baselines across ARC, DROP, MGSM, MMLU, and GPQA and transfer across domains and models.
Release a small framework (<100 lines) and open-source code to reproduce the meta-agent-in-code workflow.
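The search loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `propose_agent`, `evaluate`, `ArchiveEntry`, and the toy validation set are hypothetical names, the foundation-model call is replaced by a deterministic stub so the sketch runs offline, and the self-reflection step is omitted. Discarding candidates that raise errors is also a simplification chosen for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Agent = Callable[[str], str]  # a code-defined agent: task text -> answer text


@dataclass
class ArchiveEntry:
    name: str
    agent: Agent
    score: float  # validation accuracy in [0, 1]


def evaluate(agent: Agent, validation: List[Tuple[str, str]]) -> float:
    """Evaluation function: fraction of validation tasks answered correctly."""
    correct = sum(1 for question, answer in validation if agent(question) == answer)
    return correct / len(validation)


def propose_agent(archive: List[ArchiveEntry], iteration: int) -> Tuple[str, Agent]:
    """Hypothetical stand-in for the meta agent.

    In the paper a foundation model writes new agent code conditioned on the
    archive; here we return hand-written candidates so the sketch runs offline.
    """
    if iteration == 0:
        return "echo", lambda q: q  # weak candidate: repeats the question
    return "adder", lambda q: str(sum(int(t) for t in q.split("+")))


def meta_agent_search(validation, seed: List[Tuple[str, Agent]], iterations: int):
    """Search algorithm: propose, evaluate, and archive candidates in a loop."""
    archive = [ArchiveEntry(n, a, evaluate(a, validation)) for n, a in seed]
    for i in range(iterations):
        name, agent = propose_agent(archive, i)
        try:
            score = evaluate(agent, validation)
        except Exception:
            continue  # sketch-only choice: drop candidates that fail at run time
        archive.append(ArchiveEntry(name, agent, score))
    return archive


validation_set = [("1+2", "3"), ("10+5", "15"), ("7+7", "14")]
seed_agents = [("constant", lambda q: "0")]
archive = meta_agent_search(validation_set, seed_agents, iterations=2)
best = max(archive, key=lambda e: e.score)
print(best.name, best.score)
```

The three pieces of the formalization map onto the sketch as follows: the search space is the set of Python callables, the search algorithm is `meta_agent_search`, and the evaluation function is `evaluate`.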
Key Findings
Meta Agent Search finds agents that substantially improve reading-comprehension performance over hand-designed agents.
Discovered agents deliver large gains on math benchmarks when searched within math tasks.
Agents discovered in the math domain transfer to other math datasets with large gains.
Discovered agents generalize across foundation models and domains.
Results
- F1 (DROP)
- Accuracy
- API cost (USD)
What To Try In 7 Days
Run the authors' repo on a small in-house QA task and compare to your current prompt pipeline.
Seed the archive with 2–3 strong hand-designed agents (e.g., CoT, Self-Refine) and run 10–20 meta iterations to see early patterns.
Inspect discovered 'forward' functions to harvest reusable workflow patterns (ensembles, expert critics).
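To make concrete the kind of pattern worth harvesting, here is a hypothetical code-defined agent in the `forward(task_info)` style: it samples several chain-of-thought answers and returns the majority vote. `query_fm` is an assumed placeholder for a real model call, stubbed with canned answers so the sketch runs offline; it is not the authors' API, and the prompt wording is illustrative only.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, List

# Hypothetical stub: a real agent would call a foundation-model API here.
_canned = cycle(["42", "41", "42"])


def query_fm(prompt: str) -> str:
    """Return the next canned answer; stands in for a sampled FM response."""
    return next(_canned)


def forward(task_info: str, fm: Callable[[str], str] = query_fm, samples: int = 3) -> str:
    """Ensemble agent: sample several answers and return the most common one."""
    prompt = f"Think step by step, then give only the final answer.\n\n{task_info}"
    answers: List[str] = [fm(prompt) for _ in range(samples)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


print(forward("What is 6 * 7?"))
```

Passing the model call in as the `fm` parameter keeps the workflow logic separate from any particular provider, which makes a harvested pattern easier to reuse in an existing pipeline.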
Agent Features
Memory
- archive of discovered agents (external archive used for conditioning)
- iteration-indexed Info objects passed between modules
Planning
- chain-of-thought prompting (CoT)
- self-reflection / iterative refinement
- ensemble + expert critique feedback loops
Tool Use
- FM query APIs (meta agent & modules)
- code-execution and test functions (ARC)
- expert critic modules and simulated human feedback
Frameworks
- small custom framework (provided by authors)
- LangChain mentioned as a potential seed
Is Agentic
true
Architectures
- code-defined agents (forward(taskInfo) functions)
- meta-agent that programs agents iteratively
- archive-based evolutionary stepping-stone design
Collaboration
- role assignment and multi-expert committees
- peer review and critic modules that check the outputs of other modules
Reproducibility
Data Urls
- ARC (Abstraction and Reasoning Corpus)
- DROP
- MGSM
- MMLU
- GPQA
- GSM8K
- GSM-Hard
Open Source Status
- yes
Risks & Boundaries
Limitations
- Search and evaluation are costly (authors report $300–$500 per run).
- Experiments target mainly single-step QA tasks, not complex interactive environments.
- Evaluation optimizes a single metric (performance); latency, cost, and safety are not jointly optimized.
- Quality depends on the meta agent FM's coding and reasoning ability; weaker FMs may yield poorer designs.
When Not To Use
- Safety-critical deployments where generated code cannot be fully sandboxed or audited.
- Low-budget scenarios where API cost for repeated search is prohibitive.
- Interactive, multi-step environment control tasks not covered by single-step QA experiments.
Failure Modes
- The meta agent can generate buggy or malicious code, so runs require containerization and manual review.
- Overfitting to the validation set, or archiving stepping-stones that are not broadly useful.
- Reliance on FM internal knowledge limits gains when base model lacks necessary facts.
- Search may get stuck producing variants of the same design without stronger novelty incentives.
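As a small illustration of isolating generated code, the sketch below runs a candidate script in a separate Python process with a timeout and reports whether it finished, crashed, or hung. This is one assumed precaution, not the authors' setup, and `run_candidate` is a hypothetical helper. A subprocess alone is not a sandbox: it does not restrict file or network access, so containerization and review are still needed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(code: str, timeout_s: float = 5.0) -> dict:
    """Run generated code in a child Python process and capture the outcome."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,  # the child is killed if it exceeds this
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "reason": "timeout", "stdout": ""}
    if proc.returncode != 0:
        return {"ok": False, "reason": "error", "stdout": proc.stdout}
    return {"ok": True, "reason": "", "stdout": proc.stdout}


good = run_candidate("print(1 + 1)")
buggy = run_candidate("raise ValueError('bad agent')")
stuck = run_candidate("while True: pass", timeout_s=1.0)
print(good["ok"], buggy["reason"], stuck["reason"])
```

Returning a structured result rather than raising lets a search loop log the failure and move on to the next candidate.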
Core Entities
Models
- gpt-4 / gpt-4o (meta agent in search)
- gpt-3.5-turbo (evaluation of discovered agents)
- claude-3-haiku
- claude-3-5-sonnet
Metrics
- Accuracy
- F1
- 95% bootstrap confidence interval
- API cost (USD)
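One common way to compute a 95% bootstrap confidence interval for accuracy is the percentile method, sketched below. The resample count, the seed, and the per-example scores are illustrative assumptions, and `bootstrap_ci` is a hypothetical helper; the paper's exact procedure may differ.

```python
import random
from typing import List, Tuple


def bootstrap_ci(scores: List[float], n_resamples: int = 2000,
                 level: float = 0.95, seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using the percentile bootstrap."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the per-example scores with replacement and record each mean.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return sum(scores) / n, lower, upper


# Made-up per-example correctness (1 = correct, 0 = wrong), for illustration only.
per_example = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1] * 10
mean, lo, hi = bootstrap_ci(per_example)
print(f"accuracy = {mean:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Reporting the interval alongside the point estimate helps when comparing a discovered agent against a baseline on a small in-house validation set, where differences of a few points may fall within sampling noise.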
Datasets
- ARC
- DROP
- MGSM
- MMLU
- GPQA
- GSM8K
- GSM-Hard
- SVAMP
- ASDiv
Benchmarks
- ARC
- DROP
- MGSM
- MMLU
- GPQA
- GSM8K
- GSM-Hard

