Overview
The method pairs a compact surrogate with an uncertainty-aware MCTS to find better agents under a limited evaluation budget; results on seven benchmarks, together with ablations, support its claims.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 65%
Why It Matters For Business
AgentSwift reduces the number of expensive full evaluations by using a learned performance predictor and uncertainty-aware search, letting teams find stronger multi-step LLM agents faster and at lower API cost.
Who Should Care
Summary TLDR
AgentSwift automates agent design by searching a hierarchical space that combines workflow structure with plug-in components (memory, planning, tool use). It trains a lightweight value model on a purposely sampled 220-agent dataset to predict agent performance cheaply, then uses uncertainty-aware MCTS (recombination, mutation, refinement) to guide search. On seven benchmarks across embodied, math, web, tool, and game tasks, AgentSwift finds agents that outperform manual and prior automated methods, with an average gain of 8.34% on evaluated tasks and faster convergence under a 60-evaluation budget.
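The uncertainty-aware selection step described above can be illustrated with a minimal sketch. The summary does not give the exact scoring rule, so the formula below (predicted value plus a visit-count exploration term plus an uncertainty bonus) is an assumption, as are the names `Node`, `selection_score`, `select_child`, `c_explore`, and `c_uncert`.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """A candidate agent in the search tree (hypothetical structure)."""
    name: str
    predicted: float      # surrogate's predicted performance
    uncertainty: float    # surrogate's uncertainty estimate, e.g. a std dev
    visits: int = 0
    children: list = field(default_factory=list)


def selection_score(node, parent_visits, c_explore=1.0, c_uncert=1.0):
    """Assumed UCB-style score: value + exploration term + uncertainty bonus."""
    explore = c_explore * math.sqrt(math.log(parent_visits + 1) / (node.visits + 1))
    return node.predicted + explore + c_uncert * node.uncertainty


def select_child(parent, c_explore=1.0, c_uncert=1.0):
    """Pick the child with the highest selection score."""
    return max(
        parent.children,
        key=lambda ch: selection_score(ch, parent.visits, c_explore, c_uncert),
    )


# Toy example: B is predicted slightly worse but is less explored and less certain.
root = Node("root", predicted=0.0, uncertainty=0.0, visits=10)
root.children = [
    Node("A", predicted=0.70, uncertainty=0.02, visits=5),
    Node("B", predicted=0.65, uncertainty=0.15, visits=1),
]
chosen = select_child(root)
```

In this toy setup the less-visited, higher-uncertainty child wins the selection, which is the behavior an uncertainty bonus is meant to produce; setting both coefficients to zero reduces the rule to picking the highest predicted value.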
Problem Statement
Automated agent design is slow and costly because (1) prior searches only tweak workflows and ignore modular components such as planning, memory, and tool use; (2) evaluating each candidate agent on real benchmarks is expensive (tens of dollars per evaluation); and (3) naive search strategies fail to explore a large joint design space efficiently.
Main Contribution
A hierarchical search space that jointly optimizes agentic workflow and composable components (memory, planning, tool use).
A lightweight value model (7B backbone + adapters) trained on a 220-sample dataset built with pairwise covering arrays and balanced Bayesian sampling to predict agent performance cheaply.
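To make the idea of a pairwise covering array concrete, here is a minimal greedy sketch. It covers only the covering-array part, not the balanced Bayesian sampling, and it is not the authors' construction: the greedy strategy, the function names, and the example component values are all assumptions for illustration.

```python
from itertools import combinations, product


def all_pairs(factors):
    """Every (factor, value, factor, value) combination that must be covered."""
    names = sorted(factors)
    pairs = set()
    for a, b in combinations(names, 2):
        for va, vb in product(factors[a], factors[b]):
            pairs.add((a, va, b, vb))
    return pairs


def pairs_of(config):
    """The value pairs that a single full configuration covers."""
    names = sorted(config)
    return {(a, config[a], b, config[b]) for a, b in combinations(names, 2)}


def greedy_pairwise(factors):
    """Greedily pick configurations until every value pair appears at least once.

    Enumerates the full product, so this is only suitable for small spaces.
    """
    names = sorted(factors)
    candidates = [dict(zip(names, vals))
                  for vals in product(*(factors[n] for n in names))]
    uncovered = all_pairs(factors)
    chosen = []
    while uncovered:
        best = max(candidates, key=lambda c: len(pairs_of(c) & uncovered))
        chosen.append(best)
        uncovered -= pairs_of(best)
    return chosen


# Hypothetical component options; the real module names are not given here.
space = {
    "memory": ["none", "episodic", "vector"],
    "planning": ["none", "decompose", "reflect"],
    "tool_use": ["none", "retrieval", "code"],
}
sample = greedy_pairwise(space)
```

For this three-factor example the full grid has 27 configurations, while a pairwise cover needs far fewer, which is why such designs suit cases where each labeled evaluation is expensive.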
Key Findings
Average performance improvement over baselines
Value model predictive accuracy
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Average improvement | +8.34% | state-of-the-art automated agent search + manual agents | — | seven benchmarks (ALFWorld, SciWorld, MATH, WebShop, M3ToolEval, TravelPlanner, PDDL) | Main results claim average +8.34% across seven benchmarks; Table 1 | Table 1 |
| ALFWorld success rate | 0.806 ± 0.007 | AgentSquare 0.701 ± 0.07 | +0.105 | ALFWorld (GPT-4o-mini eval) | Table 1 shows AgentSwift 0.806 vs AgentSquare 0.701 on ALFWorld | Table 1 |
What To Try In 7 Days
Run the AgentSwift codebase on a single internal task with a 60-evaluation cap to compare against your hand-tuned agent.
Train the provided 7B value model on a small set (≈30–50) of labeled agents for your task to get early surrogate-guided search.
Enable recombination + mutation stages; ablations show recombination is most important for gains.
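A capped comparison like the one suggested above can be prototyped before touching the real codebase. The sketch below is a deliberately simplified stand-in (rank by a surrogate, then spend real evaluations in that order), not the MCTS procedure itself; `budgeted_search`, `surrogate`, and `evaluate` are hypothetical names, and the toy scoring functions are made up.

```python
def budgeted_search(candidates, surrogate, evaluate, budget=60):
    """Spend at most `budget` real evaluations, in surrogate-ranked order.

    Returns (best_candidate, best_score, evaluations_used).
    """
    ranked = sorted(candidates, key=surrogate, reverse=True)
    best, best_score, used = None, float("-inf"), 0
    for cand in ranked:
        if used >= budget:
            break
        score = evaluate(cand)  # the expensive call you want to cap
        used += 1
        if score > best_score:
            best, best_score = cand, score
    return best, best_score, used


# Toy setup: the surrogate is biased (it peaks at 120, the truth peaks at 100).
calls = []


def toy_evaluate(x):
    calls.append(x)
    return -abs(x - 100) / 100


best, best_score, used = budgeted_search(
    candidates=range(200),
    surrogate=lambda x: -abs(x - 120),
    evaluate=toy_evaluate,
    budget=60,
)
```

Even with a biased surrogate, the true optimum is found here because it falls inside the 60 top-ranked candidates; a more strongly biased surrogate would miss it, which is the risk the calibration caveats elsewhere in this summary point at.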
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
Proposer steps (recombination/mutation/refinement) rely on manual LLM prompts rather than a learned proposer.
Value model needs labeled agent evaluations (they use 220 samples); collecting labels can still be costly in specialized domains.
When Not To Use
If you need agents that adapt in real time during deployment, this static-design search is not sufficient.
If you cannot provide any labeled evaluations for your task, the value model cannot make good predictions; it requires at least a small labeled set.
Failure Modes
Heuristic proposers may miss high-quality module implementations that a learned generator could find.
Value model miscalibration can bias search toward overpromised candidates if uncertainty estimation is poor.
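One way to watch for the miscalibration risk above is to compare surrogate predictions against a handful of real evaluations. The sketch below assumes the value model exposes a predicted score and a standard-deviation-style uncertainty per agent, which the summary does not confirm; the function names and all numbers are illustrative only.

```python
def interval_coverage(preds, stds, actuals, z=1.96):
    """Fraction of actual scores inside pred +/- z * std.

    For a well-calibrated Gaussian uncertainty, z=1.96 should give about 0.95.
    """
    hits = sum(1 for p, s, a in zip(preds, stds, actuals) if abs(a - p) <= z * s)
    return hits / len(actuals)


def mean_overprediction(preds, actuals):
    """Average of (predicted - actual); positive values mean overpromising."""
    return sum(p - a for p, a in zip(preds, actuals)) / len(actuals)


# Made-up numbers for four hypothetical agents.
preds = [0.80, 0.70, 0.60, 0.90]
stds = [0.05, 0.05, 0.05, 0.02]
actuals = [0.78, 0.72, 0.45, 0.70]

coverage = interval_coverage(preds, stds, actuals)
bias = mean_overprediction(preds, actuals)
```

Low coverage combined with a positive bias, as in this toy data, would suggest the surrogate is both overconfident and optimistic, so its top-ranked candidates deserve extra real evaluations before being trusted.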

