AutoRefine: automatically extract reusable skills and subagents from past runs to continually improve LLM agents

January 30, 2026 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.6

Citation Count

0

Authors

Libin Qiu, Zhirong Gao, Junfu Chen, Yuhang Ye, Weizhi Huang, Xiaobo Xue, Wenkai Qiu, Shuo Tang

Links

Abstract / PDF

Why It Matters For Business

AutoRefine automates turning past successful runs into reusable procedures and small specialized agents. This lowers action counts and improves success on complex planning tasks, which reduces latency and operating cost for task-heavy, tool-using products.

Summary TLDR

AutoRefine is a pipeline that turns successful agent execution traces into two reusable artifacts: skill patterns (text or code guidelines) and subagent patterns (small specialized agents with memory). It extracts patterns in batches, retrieves them with multi-query embeddings, and maintains the repository by scoring patterns, pruning the bottom 20%, and merging similar entries. Evaluated on ALFWorld, ScienceWorld, and TravelPlanner, it reaches success rates of 98.4%, 70.4%, and 27.1%, respectively, and cuts steps by 20–73% relative to reflection-based baselines. The system is best suited to complex procedural tasks where encapsulating stateful subtasks helps reuse.
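To make the two artifact types concrete, here is a minimal sketch of how a pattern repository entry might be represented. The class and field names (`Pattern`, `Repository`, `retrieved`, `used`, `successes`) are illustrative assumptions, not the paper's schema.

```python
from dataclasses import dataclass, field
from typing import Literal


@dataclass
class Pattern:
    """One repository entry; every field name here is an assumption."""
    name: str
    kind: Literal["skill", "subagent"]
    context: str        # when the pattern applies (what retrieval matches on)
    body: str           # guideline text, code, or a subagent prompt
    retrieved: int = 0  # how often retrieval returned this pattern
    used: int = 0       # how often the agent actually applied it
    successes: int = 0  # uses that ended in a successful task


@dataclass
class Repository:
    patterns: list[Pattern] = field(default_factory=list)

    def add(self, pattern: Pattern) -> None:
        self.patterns.append(pattern)

    def by_kind(self, kind: str) -> list[Pattern]:
        return [p for p in self.patterns if p.kind == kind]


repo = Repository()
repo.add(Pattern("heat-then-place", "skill",
                 "object must be hot before it is placed",
                 "Heat the object first, then carry it to the target."))
repo.add(Pattern("AttractionPlanner", "subagent",
                 "choose attractions for a trip day",
                 "You plan attractions and remember earlier choices."))
```

The usage counters are kept on the entry so that maintenance can later score each pattern without a separate log.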

Problem Statement

LLM agents fail to accumulate procedural knowledge: past work stores flat text or raw trajectories that cannot capture multi-step logic or state. As experience accumulates, repositories become large and noisy because there is no active maintenance, which hurts retrieval and performance.

Main Contribution

Dual-pattern extraction: automatically create skill patterns (guidelines or code) and subagent patterns (small specialized agents with memory) from execution traces.

Batch, contrastive extraction: extract patterns from groups of successful vs failed trajectories to find generalizable strategies.

Continuous maintenance: score patterns by empirical utility, prune low-utility ones (bottom 20%), and merge redundant patterns to prevent repository bloat.
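The batch, contrastive extraction step can be sketched as follows. This is an illustration under assumptions: the trajectory dictionary keys (`task`, `success`, `actions`) and the prompt wording are hypothetical, and the call to an LLM that would consume the prompt is left out.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(trajectories: Iterable[dict], k: int) -> Iterator[list[dict]]:
    """Yield consecutive groups of up to k trajectories."""
    it = iter(trajectories)
    while chunk := list(islice(it, k)):
        yield chunk


def contrastive_split(batch: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate a batch into successful and failed trajectories."""
    wins = [t for t in batch if t["success"]]
    losses = [t for t in batch if not t["success"]]
    return wins, losses


def extraction_prompt(wins: list[dict], losses: list[dict]) -> str:
    """Ask what the successes share that the failures lack (wording is hypothetical)."""
    def fmt(ts: list[dict]) -> str:
        return "\n".join(f"- {t['task']}: {' -> '.join(t['actions'])}" for t in ts)
    return (
        "Successful runs:\n" + fmt(wins)
        + "\n\nFailed runs:\n" + fmt(losses)
        + "\n\nDescribe strategies present in the successes and absent in the failures."
    )


# Toy log: every third task fails.
logs = [{"task": f"task-{i}", "success": i % 3 != 0, "actions": ["look", "take"]}
        for i in range(25)]
batch_sizes = [len(b) for b in batched(logs, 10)]
```

Comparing groups rather than single traces is what lets the extractor separate a general strategy from an incidental detail of one run.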

Key Findings

AutoRefine improves success rates across diverse benchmarks.

Numbers: ALFWorld 98.4% ±1.5, ScienceWorld 70.4% ±1.9, TravelPlanner 27.1% ±2.4

AutoRefine reduces action steps substantially versus reflection-based baselines.

Numbers: Steps reduced by 20.5% (ALFWorld), 59.0% (ScienceWorld), 72.8% (TravelPlanner)

All three components (subagents, batch extraction, maintenance) materially matter.

Numbers: Removing subagents drops final pass by 22.3% (absolute); removing batch extraction costs ≈17.3%; without maintenance the repository grows 4.5×

Automatic subagents capture universal procedural constraints better than manual agents on some tasks.

Numbers: TravelPlanner commonsense macro: 37.9% (Ours) vs 15.59% (ATLAS manual)

Results

ALFWorld success rate (SR)

Value: 98.4% ±1.5

Baseline: ReAct+Reflexion 95.5% ±1.9

ScienceWorld pass rate

Value: 70.4% ±1.9

Baseline: ReAct+Reflexion 69.2% ±2.1

TravelPlanner final pass (test)

Value: 27.1% ±2.4

Baseline: ATLAS manual 12.1% (reported)

Step reduction vs ReAct+Reflexion

Value: ALFWorld 20.5% | ScienceWorld 59.0% | TravelPlanner 72.8%

Baseline: ReAct+Reflexion steps (16.1, 40.2, 80.2 respectively)
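As a quick arithmetic check, the baseline step counts and percentage reductions above imply approximate AutoRefine step counts. The values computed here are derived from this summary's figures and are not reported numbers; the paper's own step counts may differ slightly because of rounding.

```python
# Figures copied from the summary above.
baseline_steps = {"ALFWorld": 16.1, "ScienceWorld": 40.2, "TravelPlanner": 80.2}
reduction_pct = {"ALFWorld": 20.5, "ScienceWorld": 59.0, "TravelPlanner": 72.8}


def implied_steps(baseline: float, pct: float) -> float:
    """Steps remaining after a pct percent reduction from baseline."""
    return baseline * (1 - pct / 100)


implied = {
    name: round(implied_steps(baseline_steps[name], reduction_pct[name]), 1)
    for name in baseline_steps
}
# implied is roughly {"ALFWorld": 12.8, "ScienceWorld": 16.5, "TravelPlanner": 21.8}
```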

Repository growth without maintenance

Value: 4.5× larger repository; utilization drops 8.9×

Baseline: with maintenance

Who Should Care

What To Try In 7 Days

Run a small AutoRefine pipeline on your recent agent logs: extract patterns every K=10 tasks and measure step count and success change.

Implement maintenance: score patterns by success rate, usage, and precision, then prune the bottom 20% to control repository size.

Seed one human-designed subagent for a hard subtask and compare cold-start gains (as in the AttractionPlanner experiment).
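For the measurement part of these experiments, a small comparison helper may be enough. The sketch below assumes each run is logged as a dictionary with hypothetical `success` and `steps` keys; adapt the keys to whatever your agent logs contain.

```python
from statistics import mean


def summarize(runs: list[dict]) -> dict:
    """Success rate and mean step count over a list of logged runs."""
    return {
        "success_rate": mean(1.0 if r["success"] else 0.0 for r in runs),
        "mean_steps": mean(r["steps"] for r in runs),
    }


def compare(before: list[dict], after: list[dict]) -> dict:
    """Change in success rate and percentage step reduction, after vs before."""
    b, a = summarize(before), summarize(after)
    return {
        "success_delta": a["success_rate"] - b["success_rate"],
        "step_reduction_pct": 100 * (b["mean_steps"] - a["mean_steps"]) / b["mean_steps"],
    }


# Toy data: two runs without patterns, two runs with patterns.
before = [{"success": True, "steps": 20}, {"success": False, "steps": 20}]
after = [{"success": True, "steps": 10}, {"success": True, "steps": 10}]
report = compare(before, after)
```

Run the same task set in both conditions so the deltas reflect the patterns rather than task difficulty.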

Agent Features

Memory

  • subagent local memory (stateful)
  • metadata-driven repository usage stats

Planning

  • pattern-augmented planning
  • multi-query retrieval planning
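Multi-query retrieval can be illustrated with a small, dependency-free sketch: several reformulations of the task are scored against each pattern's context, and a pattern keeps its best score across queries. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and the defaults for `k` and `theta` are placeholders rather than values from the paper.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def multi_query_retrieve(queries: list[str], patterns: dict[str, str],
                         k: int = 3, theta: float = 0.2) -> list[str]:
    """Return up to k pattern names whose best score over all queries is >= theta."""
    best: dict[str, float] = {}
    for query in queries:
        qv = embed(query)
        for name, context in patterns.items():
            score = cosine(qv, embed(context))
            if score >= theta:
                best[name] = max(best.get(name, 0.0), score)
    return sorted(best, key=best.get, reverse=True)[:k]


patterns = {
    "heat": "heat object in microwave",
    "clean": "clean object in sink",
    "travel": "book hotel within budget",
}
hits = multi_query_retrieve(["heat the mug", "use microwave"], patterns)
```

Taking the maximum over queries means a pattern is found if any single phrasing of the task matches its context.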

Tool Use

  • register code snippets as callable tools
  • invoke subagents as callable procedures
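One plausible way to expose extracted code snippets as callable tools is a name-to-function registry. The registry, the decorator, and the example tool `count_items` are all hypothetical; the source does not describe the actual registration mechanism.

```python
from typing import Callable

# Hypothetical registry mapping tool names to Python callables.
TOOLS: dict[str, Callable[..., object]] = {}


def register_tool(name: str) -> Callable:
    """Decorator that stores a function in TOOLS under the given name."""
    def decorator(fn: Callable[..., object]) -> Callable[..., object]:
        TOOLS[name] = fn
        return fn
    return decorator


@register_tool("count_items")
def count_items(inventory: list[str], item: str) -> int:
    """Example snippet an extractor might have produced (illustrative only)."""
    return inventory.count(item)


def call_tool(name: str, **kwargs: object) -> object:
    """Look up a registered tool by name and invoke it with keyword arguments."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)
```

A subagent could be registered the same way, since from the caller's side it is just another named procedure.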

Frameworks

  • AutoRefine pattern repository
  • contrastive batch extraction agents

Is Agentic

true

Architectures

  • hierarchical delegation
  • pattern repository + retrieval

Collaboration

  • master coordinator delegates to subagents
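The delegation pattern, with each subagent keeping local memory, might look like the sketch below. Keyword routing is a deliberate simplification (a real coordinator would more likely route via retrieval or an LLM decision), and `plan_attractions` is a hypothetical handler.

```python
from typing import Callable


class Subagent:
    """A small specialized agent that keeps its own memory between calls."""

    def __init__(self, name: str, handler: Callable[[str, list[str]], str]):
        self.name = name
        self.memory: list[str] = []  # local state, not shared with the coordinator
        self._handler = handler

    def __call__(self, subtask: str) -> str:
        result = self._handler(subtask, self.memory)
        self.memory.append(subtask)  # remember what was already handled
        return result


class Coordinator:
    """Routes a subtask to the first subagent whose trigger keyword matches."""

    def __init__(self) -> None:
        self.routes: dict[str, Subagent] = {}

    def register(self, keyword: str, agent: Subagent) -> None:
        self.routes[keyword] = agent

    def run(self, subtask: str) -> str:
        for keyword, agent in self.routes.items():
            if keyword in subtask:
                return agent(subtask)
        return f"handled directly: {subtask}"


def plan_attractions(subtask: str, memory: list[str]) -> str:
    # Hypothetical handler: numbers its plans using the subagent's memory.
    return f"plan #{len(memory) + 1} for {subtask}"
```

Because memory lives inside the subagent, the coordinator's context stays small while the subagent can still avoid repeating earlier choices.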

Optimization Features

Token Efficiency

  • reduces redundant reasoning steps by delegating to subagents

Infra Optimization

  • Elasticsearch + embedding-based retrieval to scale pattern lookup

System Optimization

  • percentile-based pruning (bottom 20%) to limit repository growth
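Percentile-based pruning is simple to sketch once each pattern has a utility score. How the score is computed is left open here; the function name `prune_bottom` and the example scores are assumptions for illustration.

```python
import math


def prune_bottom(scores: dict[str, float], alpha: float = 0.2) -> dict[str, float]:
    """Drop the lowest-scoring alpha fraction of patterns (rounded down)."""
    n_drop = math.floor(len(scores) * alpha)
    ranked = sorted(scores, key=scores.get)  # ascending: worst patterns first
    dropped = set(ranked[:n_drop])
    return {name: s for name, s in scores.items() if name not in dropped}


scores = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.7, "e": 0.3}
kept = prune_bottom(scores)  # with 5 patterns and alpha=0.2, one pattern is dropped
```

Pruning by percentile rather than by a fixed score cutoff keeps the repository size bounded even when score scales drift across domains.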

Reproducibility

Data Urls

  • ALFWorld (Shridhar et al.), ScienceWorld (Wang et al.), TravelPlanner (Xie et al.)

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Pattern type classification can mislabel strategies, creating unnecessary subagents or missing stateful logic (A.2).
  • Pattern contexts can be too broad or too narrow, causing low retrieval precision or missed opportunities (A.2).
  • Hyperparameters are domain-sensitive (batch size K, pruning α, retrieval k, threshold θ) and need tuning for new domains (C.2).
  • Cold-start: requires many initial tasks to build useful patterns; pre-seeding helps but reduces autonomy (C.1).
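Because these hyperparameters need per-domain tuning, it can help to keep them in one validated configuration object. In this sketch, K=10 and the 20% pruning fraction come from the text above; the defaults for retrieval `k` and threshold `θ` are placeholders, and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoRefineConfig:
    batch_size_k: int = 10        # K: extract patterns every K tasks
    prune_alpha: float = 0.20     # alpha: fraction pruned per maintenance pass
    retrieval_top_k: int = 5      # k: placeholder, not a reported value
    threshold_theta: float = 0.8  # theta: placeholder, not a reported value

    def __post_init__(self) -> None:
        if self.batch_size_k < 1:
            raise ValueError("batch_size_k must be at least 1")
        if not 0.0 <= self.prune_alpha < 1.0:
            raise ValueError("prune_alpha must be in [0, 1)")
        if self.retrieval_top_k < 1:
            raise ValueError("retrieval_top_k must be at least 1")
        if not 0.0 <= self.threshold_theta <= 1.0:
            raise ValueError("threshold_theta must be in [0, 1]")


default_cfg = AutoRefineConfig()
# Hypothetical per-domain override for a longer-horizon task family.
travel_cfg = AutoRefineConfig(batch_size_k=20, retrieval_top_k=3)
```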

When Not To Use

  • Small, static tasks where single-shot LLM reasoning suffices and pattern overhead is unnecessary.
  • Privacy-sensitive domains where execution traces cannot be safely stored or patterns might leak sensitive data.
  • Very low compute/latency budgets where extra retrieval or subagent calls would add unacceptable overhead.

Failure Modes

  • Misclassified patterns lead to many subagent calls and invocation overhead.
  • No maintenance: repository bloat causes low utilization and degraded retrieval quality.
  • Overfitting when extracting from single-task traces instead of batch contrastive analysis.
  • Incorrectly merged patterns that conflate incompatible procedures.
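One way to reduce the risk of incorrect merges is to gate merging on both pattern type and context similarity. The sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in for embedding similarity; the threshold value and the dictionary keys are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for embedding similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def should_merge(p: dict, q: dict, theta: float = 0.85) -> bool:
    """Merge only patterns of the same kind whose contexts are near-duplicates."""
    if p["kind"] != q["kind"]:
        return False  # never fold a skill into a subagent or vice versa
    return similarity(p["context"], q["context"]) >= theta


skill_a = {"kind": "skill", "context": "heat the mug before placing it"}
skill_b = {"kind": "skill", "context": "heat the mug before placing it"}
agent_c = {"kind": "subagent", "context": "heat the mug before placing it"}
skill_d = {"kind": "skill", "context": "book a hotel in Paris"}
```

A conservative threshold trades some redundancy for fewer conflated procedures, which is usually the cheaper error to recover from.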

Core Entities

Models

  • Claude-sonnet-4
  • GPT-4-turbo
  • Qwen3-Embedding-4B

Metrics

  • Success Rate (SR / Pass@1 / Final Pass Rate)
  • Steps (mean)
  • Repository size (patterns)
  • Pattern utilization rate (u_j / r_j)
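The utilization rate can be computed directly from per-pattern counters. This sketch assumes u_j is the number of times pattern j was actually used and r_j the number of times it was retrieved; that reading of the symbols is an inference from the metric's name, not a definition quoted from the paper.

```python
def utilization_rate(used: int, retrieved: int) -> float:
    """u_j / r_j for one pattern; 0.0 when the pattern was never retrieved."""
    return used / retrieved if retrieved else 0.0


def mean_utilization(counters: dict[str, tuple[int, int]]) -> float:
    """Average utilization over patterns; counters maps name -> (used, retrieved)."""
    if not counters:
        return 0.0
    return sum(utilization_rate(u, r) for u, r in counters.values()) / len(counters)


counters = {"heat": (3, 12), "clean": (6, 8), "stale": (0, 0)}
avg = mean_utilization(counters)
```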

Datasets

  • ALFWorld
  • ScienceWorld
  • TravelPlanner

Benchmarks

  • ALFWorld
  • ScienceWorld
  • TravelPlanner

Context Entities

Models

  • GPT-4-turbo (used for ALFWorld comparisons)