Search LLM agents faster: jointly search workflows plus memory, planning and tool modules with a learned performance model

June 6, 2025 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.65

Cost Impact Score

0.8

Citation Count

0

Authors

Yu Li, Lehui Li, Zhihao Wu, Qingmin Liao, Jianye Hao, Kun Shao, Fengli Xu, Yong Li

Why It Matters For Business

AgentSwift reduces the number of expensive full evaluations by using a learned performance predictor and uncertainty-aware search, letting teams find stronger multi-step LLM agents faster and at lower API cost.

Summary TLDR

AgentSwift automates agent design by searching a hierarchical space that combines workflow structure with plug-in components (memory, planning, tool use). It trains a lightweight value model on a purposely sampled 220-agent dataset to predict agent performance cheaply, then uses uncertainty-aware MCTS (recombination, mutation, refinement) to guide search. On seven benchmarks across embodied, math, web, tool, and game tasks, AgentSwift finds agents that outperform manual and prior automated methods, with an average gain of 8.34% on evaluated tasks and faster convergence under a 60-evaluation budget.

Problem Statement

Automated agent design is slow and costly for three reasons: (1) prior searches only tweak workflows and ignore modular components like planning, memory, and tool use; (2) evaluating each candidate agent on real benchmarks is expensive (tens of dollars per evaluation); and (3) naive search strategies fail to explore a large joint design space efficiently.

Main Contribution

A hierarchical search space that jointly optimizes agentic workflow and composable components (memory, planning, tool use).

A lightweight value model (7B backbone + adapters) trained on a 220-sample dataset built with pairwise covering arrays and balanced Bayesian sampling to predict agent performance cheaply.

An uncertainty-guided hierarchical MCTS search (selection + recombination, mutation, refinement) that uses the value model and its prediction uncertainty to prioritize promising candidates.
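The uncertainty-guided selection idea can be sketched as follows. This is a minimal illustration, not the paper's formula: the `Candidate` fields, the `selection_score` function, and the `beta` and `c_explore` weights are all assumptions introduced here. It adds an uncertainty bonus and a UCT-style visit bonus to the surrogate's predicted score, so candidates the value model is unsure about still get explored.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    predicted: float    # value model's predicted score, assumed to lie in [0, 1]
    uncertainty: float  # value model's uncertainty estimate (assumed, e.g. a std)
    visits: int = 0     # how often the search has expanded this candidate


def selection_score(c, total_visits, beta=1.0, c_explore=0.5):
    """Hypothetical score: prediction + uncertainty bonus + UCT-style visit bonus."""
    visit_bonus = c_explore * math.sqrt(math.log(total_visits + 1) / (c.visits + 1))
    return c.predicted + beta * c.uncertainty + visit_bonus


def select(candidates, beta=1.0, c_explore=0.5):
    """Return the candidate with the highest selection score."""
    total = sum(c.visits for c in candidates)
    return max(candidates, key=lambda c: selection_score(c, total, beta, c_explore))


candidates = [
    Candidate("a", predicted=0.70, uncertainty=0.02, visits=5),
    Candidate("b", predicted=0.65, uncertainty=0.15, visits=1),
    Candidate("c", predicted=0.60, uncertainty=0.01, visits=0),
]
greedy = select(candidates, beta=0.0, c_explore=0.0)   # prediction only
curious = select(candidates, beta=1.0, c_explore=0.0)  # prediction + uncertainty
print(greedy.name, curious.name)
```

With both bonuses switched off the search is purely greedy on the predicted score; raising `beta` shifts selection toward candidates whose predictions are less certain.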

Key Findings

Average performance improvement over baselines

Numbers: avg +8.34% over baselines on seven benchmarks

Value model predictive accuracy

Numbers: MSE 0.006, R² 0.8068, Spearman 0.9026 (AgentSwift mistral)

Small labeled dataset suffices for transfer

Numbers: 30 labeled examples approach oracle performance on M3ToolEval

Ablation impact of recombination

Numbers: ALFWorld drops 0.806 → 0.739 when recombination is removed (−0.067)

Search budget and sample efficiency

Numbers: evaluation budget capped at 60 agents; AgentSwift finds top agents faster than baselines

Results

Average improvement

Value: +8.34%

Baseline: state-of-the-art automated agent search + manual agents

ALFWorld success rate

Value: 0.806 ± 0.007

Baseline: AgentSquare 0.701 ± 0.07

Accuracy

Value: 0.628 ± 0.000

Baseline: MaAS 0.597 ± 0.001

Value model fit (surrogate)

Value: MSE 0.006, R² 0.8068, Spearman 0.9026

Baseline: gpt-4o few-shot: MSE 0.0162, R² 0.4793

Who Should Care

What To Try In 7 Days

Run the AgentSwift codebase on a single internal task with a 60-evaluation cap and compare the result against your hand-tuned agent.

Train the provided 7B value model on a small set (≈30–50) of labeled agents for your task to get early surrogate-guided search.

Enable the recombination and mutation stages; ablations show recombination matters most for gains (ALFWorld drops from 0.806 to 0.739 without it).

Agent Features

Memory

  • Memory component with prompt, temperature and backend
  • Vector DB-style backends used in examples

Planning

  • Planning component as prompt + temperature
  • Adaptive hierarchical planning included in discovered agents

Tool Use

  • Tool invocation prompt + toolset
  • Search optimizes which tool modules to attach

Frameworks

  • MCTS-driven search
  • Value-model surrogate
  • Balanced Bayesian sampling for dataset construction

Is Agentic

true

Architectures

  • Hierarchical workflow + components
  • LLM-invoking node graphs
  • Composable plugins (memory, planning, tool use)
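One way to picture the hierarchical design space is as a workflow graph plus optional plug-in components. The sketch below is an assumption-laden illustration: the class names, the `attached_components` helper, and the example values are hypothetical, while the component fields (prompt, temperature, backend, toolset) follow the feature lists above.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class MemoryComponent:
    prompt: str
    temperature: float
    backend: str  # e.g. an identifier for a vector-DB-style store (hypothetical)


@dataclass(frozen=True)
class PlanningComponent:
    prompt: str
    temperature: float


@dataclass(frozen=True)
class ToolUseComponent:
    prompt: str
    toolset: Tuple[str, ...]


@dataclass(frozen=True)
class WorkflowNode:
    name: str
    instruction: str                   # prompt for the LLM call at this node
    next_nodes: Tuple[str, ...] = ()   # names of successor nodes in the graph


@dataclass(frozen=True)
class AgentDesign:
    workflow: Tuple[WorkflowNode, ...]
    memory: Optional[MemoryComponent] = None
    planning: Optional[PlanningComponent] = None
    tool_use: Optional[ToolUseComponent] = None

    def attached_components(self) -> List[str]:
        """Names of the plug-in slots that are filled in this design."""
        slots = {"memory": self.memory, "planning": self.planning, "tool_use": self.tool_use}
        return [name for name, comp in slots.items() if comp is not None]


base = AgentDesign(
    workflow=(WorkflowNode("solve", "Answer the task step by step."),),
    planning=PlanningComponent(prompt="Break the task into subgoals.", temperature=0.2),
)
# A mutation-style edit: copy the design and attach a memory component.
mutated = replace(base, memory=MemoryComponent("Recall similar episodes.", 0.0, "vector-store"))
print(base.attached_components(), mutated.attached_components())
```

Keeping designs immutable makes it cheap to derive variants (as recombination or mutation would) without altering the parent candidate.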

Optimization Features

Token Efficiency

  • Not explicitly quantified

Infra Optimization

  • Value model trained on 3 A100 GPUs (reported)

Model Optimization

  • Lightweight adapter modules on 7B backbones

System Optimization

  • Uncertainty-aware selection reduces wasted full evaluations

Training Optimization

  • SFT
  • Dataset built with pairwise covering array + Balanced-Extreme Bayesian sampling
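To illustrate what a pairwise covering array buys, the sketch below builds one greedily over a toy design space. It is a generic greedy construction, not the authors' procedure, and it does not model the Balanced-Extreme Bayesian sampling step; the factor names and options are made up for the example.

```python
from itertools import combinations, product


def pairs_of(row):
    """Every pair of (factor_index, value) settings that one configuration covers."""
    return {((i, row[i]), (j, row[j])) for i, j in combinations(range(len(row)), 2)}


def greedy_pairwise_cover(factors):
    """Pick configurations until every value pair across any two factors appears once.

    Enumerates the full product, so this is only practical for small spaces.
    """
    names = list(factors)
    all_rows = list(product(*(factors[n] for n in names)))
    uncovered = set().union(*(pairs_of(r) for r in all_rows))
    chosen = []
    while uncovered:
        # Take the configuration that covers the most still-uncovered pairs.
        best = max(all_rows, key=lambda r: len(pairs_of(r) & uncovered))
        chosen.append(dict(zip(names, best)))
        uncovered -= pairs_of(best)
    return chosen


# Hypothetical toy space: three component slots with two options each.
space = {
    "memory": ["none", "vector"],
    "planning": ["none", "hierarchical"],
    "tool_use": ["none", "search"],
}
sample = greedy_pairwise_cover(space)
for config in sample:
    print(config)
```

The point of pairwise coverage is that every two-way interaction between component choices appears in at least one labeled agent, without evaluating the full Cartesian product of options.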

Inference Optimization

  • Value model used as cheap surrogate to score candidates instead of full LLM evals

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Proposer steps (recombination/mutation/refinement) rely on manual LLM prompts rather than a learned proposer.
  • Value model needs labeled agent evaluations (the authors use 220 samples); collecting labels can still be costly in specialized domains.
  • Search optimizes static agent designs; it does not handle online dynamic adaptation during deployment.

When Not To Use

  • If you need agents that adapt in real time during deployment, this static-design search is not sufficient.
  • When you cannot provide any labeled evaluations for your task; the value model requires at least a small labeled set for good predictions.
  • If you cannot run a small surrogate model (7B backbone + adapters) due to hardware limits.

Failure Modes

  • Heuristic proposers may miss high-quality module implementations that a learned generator could find.
  • Value model miscalibration can bias search toward overpromised candidates if uncertainty estimation is poor.
  • Search may still explore subspaces that require expensive, domain-specific evaluations not covered by the 220-sample training set.

Core Entities

Models

  • gpt-4o
  • gpt-4o-mini
  • DeepSeek-v3
  • Mistral-7B-v0.3
  • Qwen2.5-7B
  • AgentSwift value model (7B + adapters)

Metrics

  • Success Rate
  • Mean Squared Error (MSE)
  • Mean Absolute Error (MAE)
  • Spearman rank correlation

Datasets

  • ALFWorld
  • ScienceWorld
  • MATH (617-problem subset)
  • WebShop
  • M3ToolEval
  • TravelPlanner
  • PDDL

Benchmarks

  • Embodied: ALFWorld, ScienceWorld
  • Math: MATH subset
  • Web: WebShop
  • Tool: M3ToolEval, TravelPlanner
  • Game: PDDL