Search LLM agents faster: jointly search workflows plus memory, planning and tool modules with a learned performance model

June 6, 2025 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The method pairs a compact surrogate value model with an uncertainty-aware Monte Carlo tree search (MCTS) to find better agents under a limited evaluation budget; results across seven benchmarks and ablations support the claims.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 65%

Authors

Yu Li, Lehui Li, Zhihao Wu, Qingmin Liao, Jianye Hao, Kun Shao, Fengli Xu, Yong Li

Links

Abstract / PDF / Code

Why It Matters For Business

AgentSwift reduces the number of expensive full evaluations by using a learned performance predictor and uncertainty-aware search, letting teams find stronger multi-step LLM agents faster and at lower API cost.

Summary TLDR

AgentSwift automates agent design by searching a hierarchical space that combines workflow structure with plug-in components (memory, planning, tool use). It trains a lightweight value model on a purposely sampled 220-agent dataset to predict agent performance cheaply, then uses uncertainty-aware MCTS (recombination, mutation, refinement) to guide search. On seven benchmarks across embodied, math, web, tool, and game tasks, AgentSwift finds agents that outperform manual and prior automated methods, with an average gain of 8.34% on evaluated tasks and faster convergence under a 60-evaluation budget.

Problem Statement

Automated agent design is slow and costly because (1) prior searches only tweak workflows and ignore modular components like planning, memory, and tool use; (2) evaluating each candidate agent on real benchmarks is expensive (tens of dollars per evaluation); and (3) naive search strategies fail to explore a large joint design space efficiently.

Main Contribution

A hierarchical search space that jointly optimizes agentic workflow and composable components (memory, planning, tool use).

A lightweight value model (7B backbone + adapters) trained on a 220-sample dataset built with pairwise covering arrays and balanced Bayesian sampling to predict agent performance cheaply.
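
To illustrate the pairwise covering idea behind the dataset construction, here is a minimal greedy sketch. The option names in `space` are hypothetical. The greedy selection is a common textbook heuristic and is not necessarily the construction the authors use. The point is that every pair of component choices can be covered with far fewer agents than the full cross product.

```python
from itertools import combinations, product

def pairwise_cover(space):
    """Greedily pick configurations until every pair of option values
    (across any two factors) appears in at least one configuration."""
    factors = list(space)
    # Every (factor_a, value_a, factor_b, value_b) pair that must be covered.
    uncovered = {
        (a, va, b, vb)
        for a, b in combinations(factors, 2)
        for va in space[a]
        for vb in space[b]
    }
    all_configs = [dict(zip(factors, vals)) for vals in product(*space.values())]

    def pairs_of(cfg):
        return {(a, cfg[a], b, cfg[b]) for a, b in combinations(factors, 2)}

    chosen = []
    while uncovered:
        # Take the configuration that covers the most still-uncovered pairs.
        best = max(all_configs, key=lambda c: len(pairs_of(c) & uncovered))
        chosen.append(best)
        uncovered -= pairs_of(best)
    return chosen

# Hypothetical component options, for illustration only.
space = {
    "memory": ["none", "vector_db", "summary"],
    "planning": ["none", "hierarchical", "react"],
    "tool_use": ["none", "toolset_a", "toolset_b"],
}
configs = pairwise_cover(space)
print(len(configs), "of", 3 ** 3, "configurations cover all pairs")
```

Each selected configuration would then be evaluated once to produce a labeled training example for the value model.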

Key Findings

Average performance improvement over baselines

Numbers: avg +8.34% over baselines on seven benchmarks

Practical Use: Expect a measurable boost in task success by running AgentSwift instead of manual agents or prior search methods on similar benchmarks.

Evidence Ref: Main text; Table 1

Value model predictive accuracy

Numbers: MSE 0.006, R² 0.8068, Spearman 0.9026 (AgentSwift mistral)

Practical Use: A compact surrogate can rank and score candidates accurately, cutting the number of costly full evaluations needed in search.

Evidence Ref: Table 2
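
For readers who want to compute the same three metrics on their own surrogate, the following pure-Python sketch shows how each is defined. The `y_true` and `y_pred` values are made up for illustration and are not taken from the paper.

```python
def mse(y_true, y_pred):
    """Mean squared error between real and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual variance / total variance."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg_rank
        i = j + 1
    return out

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(y_true, y_pred):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(y_true), ranks(y_pred))

# Illustrative numbers only: real success rates vs. surrogate predictions.
y_true = [0.42, 0.55, 0.61, 0.70, 0.81]
y_pred = [0.45, 0.58, 0.52, 0.68, 0.79]
print(f"MSE={mse(y_true, y_pred):.5f}  R2={r2(y_true, y_pred):.4f}  "
      f"Spearman={spearman(y_true, y_pred):.4f}")
```

For search guidance, the rank correlation arguably matters most: a surrogate that orders candidates correctly is useful even if its absolute scores are slightly off.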

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Average improvement | +8.34% | state-of-the-art automated agent search + manual agents | - | seven benchmarks (ALFWorld, SciWorld, MATH, WebShop, M3ToolEval, TravelPlanner, PDDL) | Main results claim average +8.34% across seven benchmarks; Table 1 | Table 1 |
| ALFWorld success rate | 0.806 ± 0.007 | AgentSquare 0.701 ± 0.07 | +0.105 | ALFWorld (GPT-4o-mini eval) | Table 1 shows AgentSwift 0.806 vs AgentSquare 0.701 on ALFWorld | Table 1 |

What To Try In 7 Days

Run the AgentSwift codebase on a single internal task with a 60-evaluation cap to compare against your hand-tuned agent.

Train the provided 7B value model on a small set (≈30–50) of labeled agents for your task to get early surrogate-guided search.

Enable recombination + mutation stages; ablations show recombination is most important for gains.
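
When piloting with a fixed evaluation cap, it can help to enforce the cap in code rather than by convention. The sketch below is one way to do that. `BudgetedEvaluator` and `evaluate_agent` are hypothetical names, not part of the AgentSwift codebase, and `evaluate_agent` is a placeholder for your own benchmark harness.

```python
class BudgetExhausted(RuntimeError):
    """Raised when the full-evaluation budget has been spent."""

class BudgetedEvaluator:
    """Wraps an expensive evaluation function with a hard cap and a cache."""

    def __init__(self, evaluate_fn, max_evals=60):
        self.evaluate_fn = evaluate_fn
        self.max_evals = max_evals
        self.used = 0
        self.cache = {}

    def __call__(self, config):
        key = tuple(sorted(config.items()))
        if key in self.cache:  # repeated candidates cost nothing
            return self.cache[key]
        if self.used >= self.max_evals:
            raise BudgetExhausted(f"budget of {self.max_evals} evaluations spent")
        self.used += 1
        score = self.evaluate_fn(config)
        self.cache[key] = score
        return score

def evaluate_agent(config):
    """Placeholder: replace with a real run of the agent on your task."""
    return 0.5 + 0.1 * (config["planning"] == "hierarchical")

evaluator = BudgetedEvaluator(evaluate_agent, max_evals=60)
baseline = evaluator({"memory": "none", "planning": "none"})
candidate = evaluator({"memory": "none", "planning": "hierarchical"})
print(f"baseline={baseline:.2f} candidate={candidate:.2f} used={evaluator.used}/60")
```

Counting only cache misses keeps the comparison with a hand-tuned agent fair: both sides are charged for the same number of real benchmark runs.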

Agent Features

Memory
Memory component with prompt, temperature and backend; Vector DB-style backends used in examples
Planning
Planning component as prompt + temperature; Adaptive hierarchical planning included in discovered agents
Tool Use
Tool invocation prompt + toolset; Search optimizes which tool modules to attach
Frameworks
MCTS-driven search; Value-model surrogate; Balanced Bayesian sampling for dataset construction
Is Agentic

Yes

Architectures
Hierarchical workflow + components; LLM-invoking node graphs; Composable plugins (memory, planning, tool use)

Optimization Features

Token Efficiency
Not explicitly quantified
Infra Optimization
Value model trained on 3 A100 GPUs (reported)
Model Optimization
Lightweight adapter modules on 7B backbones
System Optimization
Uncertainty-aware selection reduces wasted full evaluations
Training Optimization
SFT; Dataset built with pairwise covering array + Balanced-Extreme Bayesian sampling
Inference Optimization
Value model used as cheap surrogate to score candidates instead of full LLM evals
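
One common way to make tree-search selection uncertainty-aware is to add an uncertainty bonus to a UCT-style score. The sketch below assumes that general shape and is not the paper's exact formula. `Node`, `predicted_value`, `uncertainty`, and the weights `c` and `beta` are all hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    total_reward: float = 0.0     # sum of real evaluation scores seen so far
    visits: int = 0               # number of real evaluations
    predicted_value: float = 0.0  # surrogate's score estimate
    uncertainty: float = 0.0      # surrogate's uncertainty estimate

def selection_score(node, parent_visits, c=1.0, beta=0.5):
    # Use the surrogate prediction until the node has real evaluations.
    mean = node.total_reward / node.visits if node.visits else node.predicted_value
    # Standard exploration term: shrinks as the node is visited more often.
    explore = c * math.sqrt(math.log(parent_visits + 1) / (node.visits + 1))
    # Extra bonus for candidates the surrogate is unsure about.
    return mean + explore + beta * node.uncertainty

def select(children, parent_visits):
    return max(children, key=lambda n: selection_score(n, parent_visits))

# Illustrative values only.
children = [
    Node("evaluated_agent", total_reward=1.4, visits=2),
    Node("untested_agent", predicted_value=0.65, uncertainty=0.3),
]
chosen = select(children, parent_visits=2)
print("next candidate to evaluate:", chosen.name)
```

Setting `beta` to zero recovers a plain UCT-style rule, which makes it easy to check how much the uncertainty term changes which candidates receive full evaluations.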

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Proposer steps (recombination/mutation/refinement) rely on manual LLM prompts rather than a learned proposer.

The value model needs labeled agent evaluations (the authors use 220 samples); collecting labels can still be costly in specialized domains.

When Not To Use

If you need agents that adapt in real time during deployment, this static-design search is not sufficient.

If you cannot provide any labeled evaluations for your task, the value model will not predict well; it requires at least a small labeled set.

Failure Modes

Heuristic proposers may miss high-quality module implementations that a learned generator could find.

Value model miscalibration can bias search toward overpromised candidates if uncertainty estimation is poor.

Core Entities

Models

gpt-4o, gpt-4o-mini, DeepSeek-v3, Mistral-7B-v0.3, Qwen2.5-7B, AgentSwift value model (7B + adapters)

Metrics

Success Rate, Mean Squared Error (MSE), Mean Absolute Error (MAE), Spearman rank correlation

Datasets

ALFWorld, ScienceWorld, MATH (617-problem subset), WebShop, M3ToolEval, TravelPlanner, PDDL

Benchmarks

Embodied: ALFWorld, ScienceWorld; Math: MATH subset; Web: WebShop; Tool: M3ToolEval, TravelPlanner; Game: PDDL