Search LLM agents faster: jointly search workflows plus memory, planning and tool modules with a learned performance model

Overview

Decision Snapshot: Ready For Pilot

AgentSwift pairs a compact surrogate value model with uncertainty-aware MCTS to find better agents under a limited evaluation budget; results on seven benchmarks, together with ablations, support these claims.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 65%

Authors

Yu Li, Lehui Li, Zhihao Wu, Qingmin Liao, Jianye Hao, Kun Shao, Fengli Xu, Yong Li

Links

Abstract / PDF / Code

Why It Matters For Business

AgentSwift reduces the number of expensive full evaluations by using a learned performance predictor and uncertainty-aware search, letting teams find stronger multi-step LLM agents faster and at lower API cost.

Who Should Care

Product Manager, ML Engineer, Founder, Engineering Lead

Summary TLDR

AgentSwift automates agent design by searching a hierarchical space that combines workflow structure with plug-in components (memory, planning, tool use). It trains a lightweight value model on a purposely sampled 220-agent dataset to predict agent performance cheaply, then uses uncertainty-aware MCTS (recombination, mutation, refinement) to guide search. On seven benchmarks across embodied, math, web, tool, and game tasks, AgentSwift finds agents that outperform manual and prior automated methods, with an average gain of 8.34% on evaluated tasks and faster convergence under a 60-evaluation budget.

Problem Statement

Automated agent design is slow and costly because (1) prior searches only tweak workflows and ignore modular components such as planning, memory, and tool use; (2) evaluating each candidate agent on real benchmarks is expensive (tens of dollars per evaluation); and (3) naive search strategies fail to explore a large joint design space efficiently.

Main Contribution

A hierarchical search space that jointly optimizes agentic workflow and composable components (memory, planning, tool use).

A lightweight value model (7B backbone + adapters) trained on a 220-sample dataset built with pairwise covering arrays and balanced Bayesian sampling to predict agent performance cheaply.
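As a rough illustration of the hierarchical space described above, the sketch below models an agent as a workflow plus three pluggable components. All names here (`Component`, `AgentDesign`, `swap`, the example workflow steps) are hypothetical and are not taken from the AgentSwift codebase; the fields simply mirror the prompt, temperature, backend and toolset attributes listed under Agent Features.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Component:
    """One pluggable module (memory, planning, or tool use). Hypothetical schema."""
    prompt: str
    temperature: float = 0.0
    backend: Optional[str] = None      # e.g. a vector-DB-style memory backend
    tools: Tuple[str, ...] = ()        # only meaningful for a tool-use component


@dataclass(frozen=True)
class AgentDesign:
    """A point in the joint space: workflow structure plus optional components."""
    workflow: Tuple[str, ...]          # ordered LLM-invoking steps
    memory: Optional[Component] = None
    planning: Optional[Component] = None
    tool_use: Optional[Component] = None

    def swap(self, slot: str, component: Optional[Component]) -> "AgentDesign":
        """Return a copy with exactly one component slot replaced."""
        if slot not in ("memory", "planning", "tool_use"):
            raise ValueError(f"unknown slot: {slot}")
        return replace(self, **{slot: component})


base = AgentDesign(workflow=("observe", "reason", "act"))
with_memory = base.swap(
    "memory", Component(prompt="Recall similar episodes.", backend="vector_db")
)
```

Keeping designs immutable makes it easy for a search procedure to derive child candidates without accidentally modifying a parent.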

Key Findings

Average performance improvement over baselines

Numbers: avg +8.34% over baselines on seven benchmarks

Practical Use: Expect a measurable boost in task success by running AgentSwift instead of manual agents or prior search methods on similar benchmarks.

Evidence Ref: Main text; Table 1

Value model predictive accuracy

Numbers: MSE 0.006, R² 0.8068, Spearman 0.9026 (AgentSwift, Mistral)

Practical Use: A compact surrogate can rank and score candidates accurately, cutting the number of costly full evaluations needed in search.

Evidence Ref: Table 2
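For readers who want to compute the same three metrics on their own surrogate, here is a minimal standard-library sketch. The example scores are made up for illustration and are unrelated to the paper's Table 2; the helper names are assumptions, not part of any released code.

```python
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared error between actual and predicted scores."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(y_true, y_pred):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(y_true), average_ranks(y_pred))


# Made-up success rates for four agents (illustrative only).
actual = [0.20, 0.45, 0.60, 0.80]
predicted = [0.25, 0.40, 0.65, 0.75]
```

MSE and R² measure how close the predicted scores are, while Spearman only measures whether candidates are ordered correctly, which is what matters most when the surrogate is used to rank agents during search.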

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average improvement	+8.34%	state-of-the-art automated agent search + manual agents	—	seven benchmarks (ALFWorld, ScienceWorld, MATH, WebShop, M3ToolEval, TravelPlanner, PDDL)	Main results claim average +8.34% across seven benchmarks; Table 1	Table 1
ALFWorld success rate	0.806 ± 0.007	AgentSquare 0.701 ± 0.07	+0.105	ALFWorld (GPT-4o-mini eval)	Table 1 shows AgentSwift 0.806 vs AgentSquare 0.701 on ALFWorld	Table 1
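The Delta in the ALFWorld row is an absolute difference in success rate. The short check below reproduces that arithmetic from the two table values and also derives the relative gain, which is not reported in the table and is computed here only for context.

```python
agentswift = 0.806    # ALFWorld success rate from the Results table
agentsquare = 0.701   # AgentSquare baseline from the same row

absolute_delta = agentswift - agentsquare
relative_gain = absolute_delta / agentsquare

print(f"absolute delta: {absolute_delta:+.3f}")
print(f"relative gain:  {relative_gain:.1%}")
```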

What To Try In 7 Days

Run the AgentSwift codebase on a single internal task with a 60-evaluation cap and compare the result against your hand-tuned agent.

Train the provided 7B value model on a small set (≈30–50) of labeled agents for your task to get early surrogate-guided search.

Enable recombination + mutation stages; ablations show recombination is most important for gains.
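To make the recombination and mutation stages concrete, here is a deliberately simplified sketch. In AgentSwift these proposer steps are driven by LLM prompts (see Limitations); the random operators below are a stand-in chosen for illustration, and the slot names, option pool and agent dictionaries are all hypothetical.

```python
import random

SLOTS = ("memory", "planning", "tool_use")


def recombine(parent, donor, rng):
    """Copy one randomly chosen component slot from a donor agent into the parent."""
    slot = rng.choice(SLOTS)
    child = dict(parent)
    child[slot] = donor[slot]
    return child


def mutate(parent, options, rng):
    """Replace one slot with a different option from a (hypothetical) module pool."""
    slot = rng.choice(SLOTS)
    alternatives = [o for o in options[slot] if o != parent[slot]]
    child = dict(parent)
    if alternatives:
        child[slot] = rng.choice(alternatives)
    return child


rng = random.Random(0)
options = {
    "memory": ["none", "vector_db"],
    "planning": ["none", "hierarchical"],
    "tool_use": ["none", "toolset_a"],
}
a = {"workflow": "react", "memory": "none", "planning": "hierarchical", "tool_use": "none"}
b = {"workflow": "react", "memory": "vector_db", "planning": "none", "tool_use": "toolset_a"}

child = recombine(a, b, rng)     # inherits one slot from b
mutant = mutate(a, options, rng)  # changes one slot to a new option
```

Recombination reuses modules that already worked in other agents, whereas mutation introduces options no existing agent has tried in that slot.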

Agent Features

Memory

Memory component with prompt, temperature and backend

Vector DB-style backends used in examples

Planning

Planning component as prompt + temperature

Adaptive hierarchical planning included in discovered agents

Tool Use

Tool invocation prompt + toolset

Search optimizes which tool modules to attach

Frameworks

MCTS-driven search

Value-model surrogate

Balanced Bayesian sampling for dataset construction

Is Agentic

Yes

Architectures

Hierarchical workflow + components

LLM-invoking node graphs

Composable plugins (memory, planning, tool use)

Optimization Features

Token Efficiency

Not explicitly quantified

Infra Optimization

Value model trained on 3 A100 GPUs (reported)

Model Optimization

Lightweight adapter modules on 7B backbones

System Optimization

Uncertainty-aware selection reduces wasted full evaluations
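One common way to make tree-search selection uncertainty-aware is to add an uncertainty bonus to a UCT-style score. The sketch below shows that generic idea; the additive form, the weights `c_explore` and `c_uncert`, and the node fields are assumptions for illustration and are not claimed to match the paper's exact formula.

```python
import math


def selection_score(mean_value, uncertainty, parent_visits, child_visits,
                    c_explore=1.0, c_uncert=0.5):
    """UCT-style score with an extra bonus for surrogate uncertainty (assumed form)."""
    explore = c_explore * math.sqrt(math.log(parent_visits + 1) / (child_visits + 1))
    return mean_value + explore + c_uncert * uncertainty


def select_child(children, parent_visits):
    """Pick the child dict with the highest selection score."""
    return max(
        children,
        key=lambda ch: selection_score(ch["mean"], ch["std"], parent_visits, ch["visits"]),
    )


children = [
    {"name": "well_explored", "mean": 0.70, "std": 0.02, "visits": 20},
    {"name": "uncertain", "mean": 0.65, "std": 0.30, "visits": 1},
]
chosen = select_child(children, parent_visits=21)
```

In this toy case the rarely visited, high-uncertainty child is selected even though its predicted mean is slightly lower, which is the behavior that directs scarce full evaluations toward candidates the surrogate knows least about.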

Training Optimization

SFT

Dataset built with pairwise covering array + Balanced-Extreme Bayesian sampling
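A pairwise covering array is a small set of configurations in which every pair of values from any two factors appears together at least once. The greedy sketch below builds one for a toy design space; the factor names and values are hypothetical, the greedy construction is a generic technique rather than the paper's procedure, and the Balanced-Extreme Bayesian sampling step is not shown.

```python
from itertools import combinations, product


def pairs_of(config, names):
    """All cross-factor (factor, value) pairs present in one configuration."""
    return {((a, config[a]), (b, config[b])) for a, b in combinations(names, 2)}


def pairwise_cover(factors):
    """Greedy pairwise covering array over a small discrete design space."""
    names = sorted(factors)
    candidates = [dict(zip(names, vals)) for vals in product(*(factors[n] for n in names))]
    uncovered = set().union(*(pairs_of(c, names) for c in candidates))
    chosen = []
    while uncovered:
        # Take the configuration that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda c: len(pairs_of(c, names) & uncovered))
        chosen.append(best)
        uncovered -= pairs_of(best, names)
    return chosen


factors = {
    "memory": ["none", "vector_db"],
    "planning": ["none", "hierarchical"],
    "tool_use": ["none", "toolset_a"],
    "workflow": ["cot", "react", "reflect"],
}
sample = pairwise_cover(factors)
print(f"{len(sample)} configurations instead of 24 exhaustive ones")
```

This enumerates the full product, so it only suits small spaces; the point is that pairwise coverage needs far fewer labeled agents than exhaustive enumeration.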

Inference Optimization

Value model used as cheap surrogate to score candidates instead of full LLM evals
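The surrogate idea can be sketched as a two-stage filter: score every candidate cheaply, then spend the full-evaluation budget only on the top-ranked ones. Both scoring functions below are toy placeholders (a real setup would call the value model and a benchmark run), and the function and field names are hypothetical.

```python
def screen_then_evaluate(candidates, surrogate, full_eval, budget):
    """Rank candidates with a cheap surrogate, then fully evaluate only the top `budget`."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    ranked = sorted(candidates, key=surrogate, reverse=True)
    results = [(c, full_eval(c)) for c in ranked[:budget]]
    return max(results, key=lambda r: r[1]), len(results)


def toy_surrogate(candidate):
    """Cheap, slightly wrong estimate (placeholder for a value model)."""
    return candidate["quality"] + candidate["noise"]


def toy_full_eval(candidate):
    """Ground-truth score (placeholder for an expensive benchmark run)."""
    return candidate["quality"]


candidates = [
    {"id": i, "quality": q, "noise": n}
    for i, (q, n) in enumerate(
        [(0.3, 0.05), (0.8, -0.10), (0.5, 0.02), (0.75, 0.00), (0.2, 0.10)]
    )
]
(best, best_score), n_evals = screen_then_evaluate(
    candidates, toy_surrogate, toy_full_eval, budget=2
)
```

Here the surrogate ranks the truly best candidate second, yet it still lands inside the budgeted shortlist, so two full evaluations are enough to find it among five candidates.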

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/Ericccc02/AgentSwift

Risks & Boundaries

Limitations

Proposer steps (recombination/mutation/refinement) rely on manual LLM prompts rather than a learned proposer.

Value model needs labeled agent evaluations (they use 220 samples); collecting labels can still be costly in specialized domains.

When Not To Use

If you need agents that adapt in real time during deployment, this static-design search is not sufficient.

When you cannot provide any labeled evaluations for your task; the value model requires at least a small labeled set for good predictions.

Failure Modes

Heuristic proposers may miss high-quality module implementations that a learned generator could find.

Value model miscalibration can bias search toward overpromised candidates if uncertainty estimation is poor.
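A simple held-out check can surface this kind of miscalibration before it biases a search. The sketch assumes the value model returns a predicted score together with a standard deviation, and that errors are roughly Gaussian (hence `z=1.96` for a nominal 95% interval); the numbers are invented for illustration.

```python
def interval_coverage(predictions, stds, actuals, z=1.96):
    """Fraction of held-out agents whose true score lies within prediction ± z * std."""
    hits = sum(abs(a - p) <= z * s for p, s, a in zip(predictions, stds, actuals))
    return hits / len(actuals)


def mean_signed_error(predictions, actuals):
    """Average of (predicted - actual); positive means the surrogate over-promises."""
    return sum(p - a for p, a in zip(predictions, actuals)) / len(actuals)


# Invented held-out results for four agents.
preds = [0.80, 0.60, 0.90, 0.50]
stds = [0.02, 0.05, 0.02, 0.05]
actuals = [0.70, 0.62, 0.75, 0.48]

coverage = interval_coverage(preds, stds, actuals)
bias = mean_signed_error(preds, actuals)
```

Coverage far below the nominal level combined with a positive signed error, as in this toy data, is the pattern to watch for: the surrogate is both overconfident and optimistic about its top candidates.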

Core Entities

Models

gpt-4o, gpt-4o-mini, DeepSeek-v3, Mistral-7B-v0.3, Qwen2.5-7B, AgentSwift value model (7B + adapters)

Metrics

Success Rate, Mean Squared Error (MSE), Mean Absolute Error (MAE), R², Spearman rank correlation

Datasets

ALFWorld, ScienceWorld, MATH (617-problem subset), WebShop, M3ToolEval, TravelPlanner, PDDL

Benchmarks

Embodied: ALFWorld, ScienceWorld

Math: MATH subset

Web: WebShop

Tool: M3ToolEval, TravelPlanner

Game: PDDL

