Overview
Production Readiness
0.7
Novelty Score
0.65
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
AgentSwift reduces the number of expensive full evaluations by using a learned performance predictor and uncertainty-aware search, letting teams find stronger multi-step LLM agents faster and at lower API cost.
Summary TLDR
AgentSwift automates agent design by searching a hierarchical space that combines workflow structure with plug-in components (memory, planning, tool use). It trains a lightweight value model on a purposely sampled 220-agent dataset to predict agent performance cheaply, then uses uncertainty-aware MCTS (recombination, mutation, refinement) to guide search. On seven benchmarks across embodied, math, web, tool, and game tasks, AgentSwift finds agents that outperform manual and prior automated methods, with an average gain of 8.34% on evaluated tasks and faster convergence under a 60-evaluation budget.
Problem Statement
Automated agent design is slow and costly because (1) prior searches only tweak workflows and ignore modular components like planning, memory, and tool use; (2) evaluating each candidate agent on real benchmarks is expensive (tens of dollars per evaluation); and (3) naive search strategies fail to explore a large joint design space efficiently.
Main Contribution
A hierarchical search space that jointly optimizes agentic workflow and composable components (memory, planning, tool use).
A lightweight value model (7B backbone + adapters) trained on a 220-sample dataset built with pairwise covering arrays and balanced Bayesian sampling to predict agent performance cheaply.
An uncertainty-guided hierarchical MCTS search (selection + recombination, mutation, refinement) that uses the value model and its prediction uncertainty to prioritize promising candidates.
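To make the uncertainty-guided selection step concrete, here is a minimal sketch of how a predicted score and its uncertainty could be folded into a UCB-style rule. This is not the paper's exact formula: the `Node` fields, the `selection_score` function, and the weights `c_explore` and `c_uncert` are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One candidate agent in the search tree (hypothetical structure)."""
    name: str
    predicted: float      # value-model estimate of task performance
    uncertainty: float    # value-model uncertainty for that estimate
    visits: int = 0
    children: list = field(default_factory=list)


def selection_score(node, parent_visits, c_explore=1.0, c_uncert=0.5):
    """UCB-style score: prediction + uncertainty bonus + visit-count bonus.

    The additive form and both weights are assumptions, not the paper's rule.
    """
    explore = c_explore * math.sqrt(
        math.log(parent_visits + 1) / (node.visits + 1)
    )
    return node.predicted + c_uncert * node.uncertainty + explore


def select_child(parent):
    """Pick the child with the highest uncertainty-aware score."""
    return max(parent.children, key=lambda ch: selection_score(ch, parent.visits))


root = Node("root", predicted=0.0, uncertainty=0.0, visits=5)
root.children = [
    Node("well-explored", predicted=0.60, uncertainty=0.0, visits=5),
    Node("uncertain", predicted=0.55, uncertainty=0.3, visits=0),
]
print(select_child(root).name)
```

In this toy tree the rarely visited, higher-uncertainty child wins selection even though its predicted score is slightly lower, which is the behavior an uncertainty bonus is meant to produce.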
Key Findings
Average performance improvement over baselines
Value model predictive accuracy
Small labeled dataset suffices for transfer
Ablation impact of recombination
Search budget and sample efficiency
Results
Average improvement
ALFWorld success rate
Accuracy
Value model fit (surrogate)
Who Should Care
What To Try In 7 Days
Run the AgentSwift codebase on a single internal task with a 60-evaluation cap to compare against your hand-tuned agent.
Train the provided 7B value model on a small set (≈30–50) of labeled agents for your task to get early surrogate-guided search.
Enable the recombination and mutation stages; the ablations indicate that recombination contributes most to the gains.
Agent Features
Memory
- Memory component with prompt, temperature and backend
- Vector DB-style backends used in examples
Planning
- Planning component as prompt + temperature
- Adaptive hierarchical planning included in discovered agents
Tool Use
- Tool invocation prompt + toolset
- Search optimizes which tool modules to attach
Frameworks
- MCTS-driven search
- Value-model surrogate
- Balanced Bayesian sampling for dataset construction
Is Agentic
true
Architectures
- Hierarchical workflow + components
- LLM-invoking node graphs
- Composable plugins (memory, planning, tool use)
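One way to picture the hierarchical space is as a workflow plus three swappable component slots. The sketch below uses the fields this summary lists for each component (prompt, temperature, backend, toolset); the class names, the `recombine` helper, and the example values are hypothetical and are not taken from the AgentSwift codebase.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Memory:
    prompt: str
    temperature: float
    backend: str


@dataclass(frozen=True)
class Planning:
    prompt: str
    temperature: float


@dataclass(frozen=True)
class ToolUse:
    prompt: str
    toolset: Tuple[str, ...]


@dataclass(frozen=True)
class AgentDesign:
    workflow: Tuple[str, ...]   # ordered LLM-invoking nodes (simplified to names)
    memory: Memory
    planning: Planning
    tool_use: ToolUse


def recombine(base, donor, slot):
    """Return a copy of `base` with one component slot taken from `donor`.

    An illustrative reading of a recombination step, not the paper's operator.
    """
    if slot not in ("memory", "planning", "tool_use"):
        raise ValueError(f"unknown component slot: {slot}")
    return replace(base, **{slot: getattr(donor, slot)})


base = AgentDesign(
    workflow=("plan", "act", "reflect"),
    memory=Memory("Recall relevant steps.", 0.2, "vector-db"),
    planning=Planning("Break the task into subgoals.", 0.3),
    tool_use=ToolUse("Pick a tool.", ("search",)),
)
donor = replace(base, planning=Planning("Plan hierarchically and revise.", 0.5))
child = recombine(base, donor, "planning")
print(child.planning.prompt)
```

Frozen dataclasses keep each candidate immutable, so a search procedure can hash, cache, and compare designs without worrying about in-place edits.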
Optimization Features
Token Efficiency
- Not explicitly quantified
Infra Optimization
- Value model trained on 3 A100 GPUs (reported)
Model Optimization
- Lightweight adapter modules on 7B backbones
System Optimization
- Uncertainty-aware selection reduces wasted full evaluations
Training Optimization
- SFT
- Dataset built with pairwise covering array + Balanced-Extreme Bayesian sampling
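A pairwise covering array picks a small set of configurations such that every pair of values from any two factors appears together at least once. The greedy sketch below illustrates the idea on a tiny design space; it is a generic textbook-style construction, not the authors' dataset builder, and the factor names and values are hypothetical.

```python
from itertools import combinations, product


def pairwise_cover(factors):
    """Greedily choose configurations until every value pair is covered.

    `factors` maps a factor name to its list of allowed values.
    Enumerates the full product, so it only suits small spaces.
    """
    names = list(factors)

    def pairs_of(cfg):
        return {((a, cfg[a]), (b, cfg[b])) for a, b in combinations(names, 2)}

    # Every (factor, value) pair across two distinct factors must appear once.
    uncovered = {
        ((a, va), (b, vb))
        for a, b in combinations(names, 2)
        for va in factors[a]
        for vb in factors[b]
    }
    candidates = [dict(zip(names, vals)) for vals in product(*factors.values())]

    chosen = []
    while uncovered:
        # Take the configuration that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda c: len(pairs_of(c) & uncovered))
        chosen.append(best)
        uncovered -= pairs_of(best)
    return chosen


space = {
    "memory": ["none", "vector-db"],
    "planning": ["flat", "hierarchical"],
    "tool_use": ["off", "on"],
}
for cfg in pairwise_cover(space):
    print(cfg)
```

The benefit grows with the number of factors: the full product grows exponentially, while the number of configurations needed for pairwise coverage grows much more slowly.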
Inference Optimization
- Value model used as a cheap surrogate to score candidates instead of running full LLM evaluations
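The cost saving comes from ranking many candidates with the surrogate and spending the full-evaluation budget only on the top few. A minimal sketch follows, assuming hypothetical `predict` and `evaluate` callables that stand in for the value model and the real benchmark run.

```python
def surrogate_filter(candidates, predict, evaluate, budget):
    """Rank candidates by surrogate score, fully evaluate only the top `budget`.

    `predict` is the cheap surrogate; `evaluate` is the expensive benchmark run.
    Both are placeholders supplied by the caller.
    """
    if budget < 1 or not candidates:
        raise ValueError("need at least one candidate and a positive budget")
    ranked = sorted(candidates, key=predict, reverse=True)
    results = {cand: evaluate(cand) for cand in ranked[:budget]}
    best = max(results, key=results.get)
    return best, results


# Toy numbers: surrogate scores and true scores are made up for illustration.
surrogate_scores = {"a": 0.1, "b": 0.9, "c": 0.8, "d": 0.2}
true_scores = {"a": 0.3, "b": 0.6, "c": 0.7, "d": 0.9}

best, evaluated = surrogate_filter(
    list(surrogate_scores),
    predict=surrogate_scores.get,
    evaluate=true_scores.get,
    budget=2,
)
print(best, evaluated)
```

In this toy case the truly best candidate, `d`, is never evaluated because the surrogate underrates it. That is the miscalibration risk that uncertainty estimates are intended to reduce.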
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Proposer steps (recombination/mutation/refinement) rely on manual LLM prompts rather than a learned proposer.
- Value model needs labeled agent evaluations (the authors use 220 samples); collecting labels can still be costly in specialized domains.
- Search optimizes static agent designs; it does not handle online dynamic adaptation during deployment.
When Not To Use
- If you need agents that adapt in real time during deployment, this static-design search is not sufficient.
- When you cannot provide any labeled evaluations for your task; the value model requires at least a small labeled set for good predictions.
- If you cannot run a small surrogate model (7B backbone + adapters) due to hardware limits.
Failure Modes
- Heuristic proposers may miss high-quality module implementations that a learned generator could find.
- Value model miscalibration can bias search toward overpromised candidates if uncertainty estimation is poor.
- Search may still explore subspaces that require expensive, domain-specific evaluations not covered by the 220-sample training set.
Core Entities
Models
- gpt-4o
- gpt-4o-mini
- DeepSeek-v3
- Mistral-7B-v0.3
- Qwen2.5-7B
- AgentSwift value model (7B + adapters)
Metrics
- Success Rate
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)
- R²
- Spearman rank correlation
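For reference, the four surrogate-fit metrics listed above can be computed from paired true and predicted scores as sketched below. The code is dependency-free for clarity; the function names and the sample numbers are illustrative only and are not results from the paper.

```python
import math


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def average_ranks(xs):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(y_true, y_pred):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    a, b = average_ranks(y_true), average_ranks(y_pred)
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


measured = [0.2, 0.5, 0.8]    # made-up benchmark scores
predicted = [0.3, 0.5, 0.7]   # made-up surrogate predictions
print(mse(measured, predicted), mae(measured, predicted))
print(r2(measured, predicted), spearman(measured, predicted))
```

For guiding a search, rank correlation is often the more relevant signal: a surrogate whose absolute errors are sizable can still order candidates correctly.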
Datasets
- ALFWorld
- ScienceWorld
- MATH (617-problem subset)
- WebShop
- M3ToolEval
- TravelPlanner
- PDDL
Benchmarks
- Embodied: ALFWorld, ScienceWorld
- Math: MATH subset
- Web: WebShop
- Tool: M3ToolEval, TravelPlanner
- Game: PDDL

