AutoRefine: automatically extract reusable skills and subagents from past runs to continually improve LLM agents

January 30, 2026 · 8 min

Overview

Decision Snapshot: Needs Validation

The approach is ready for research use and pilot deployments: it shows clear, repeatable gains on public benchmarks, but its hyperparameters (K, k, θ) need tuning and the extracted patterns need careful auditing before wide deployment.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 65%

Authors

Libin Qiu, Zhirong Gao, Junfu Chen, Yuhang Ye, Weizhi Huang, Xiaobo Xue, Wenkai Qiu, Shuo Tang


Why It Matters For Business

AutoRefine automates turning past successful runs into reusable procedures and small specialized agents. This lowers action counts and improves success on complex planning tasks, which reduces latency and operating cost for task-heavy, tool-using products.

Summary TLDR

AutoRefine is a pipeline that turns successful agent execution traces into two reusable artifacts: skill patterns (text or code guidelines) and subagent patterns (small specialized agents with memory). It extracts patterns in batches, retrieves them with multi-query embeddings, and maintains the repository by scoring entries, pruning the bottom 20%, and merging similar ones. Evaluated on ALFWorld, ScienceWorld, and TravelPlanner, it reaches success rates of 98.4%, 70.4%, and 27.1% respectively and cuts steps by 20–73% relative to reflection-based baselines. It is best suited to complex procedural tasks where encapsulating stateful subtasks aids reuse.
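
The multi-query retrieval step can be pictured with a small sketch. It assumes each pattern is scored by its best cosine similarity to any of several query embeddings, and that the top `k` patterns are returned. The max-aggregation rule, the function names, and the toy 3-dimensional vectors are illustrative assumptions, not details confirmed by the source.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def multi_query_retrieve(query_vecs, pattern_vecs, k=3):
    """Return the k pattern ids whose best similarity to any query is highest."""
    scored = []
    for pattern_id, vec in pattern_vecs.items():
        best = max(cosine(q, vec) for q in query_vecs)  # assumed aggregation: max over queries
        scored.append((best, pattern_id))
    # Sort by descending score, breaking ties by id for determinism
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [pattern_id for _, pattern_id in scored[:k]]

# Toy vectors standing in for real embeddings (hypothetical pattern names)
patterns = {
    "heat_then_place": [1.0, 0.0, 0.0],
    "book_flight_first": [0.0, 1.0, 0.0],
    "measure_temperature": [0.0, 0.0, 1.0],
}
queries = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]  # two phrasings of the same task
top = multi_query_retrieve(queries, patterns, k=2)
```

Issuing several queries lets a pattern surface when it matches any one phrasing of the task, at the cost of more similarity computations per lookup.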

Problem Statement

LLM agents fail to accumulate procedural knowledge: past work stores flat text or raw trajectories that cannot capture multi-step logic or state. As experience accumulates, repositories become large and noisy because there is no active maintenance, which hurts retrieval and performance.

Main Contribution

Dual-pattern extraction: automatically create skill patterns (guidelines or code) and subagent patterns (small specialized agents with memory) from execution traces.

Batch, contrastive extraction: extract patterns from groups of successful vs failed trajectories to find generalizable strategies.
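
A minimal sketch of the batching half of contrastive extraction follows. It assumes trajectories are grouped in consecutive windows of 10 tasks and split by outcome before being handed to an LLM. The `Trajectory` type, the prompt wording, and both function names are hypothetical, and the actual LLM call is left out.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    task_id: str
    actions: list
    success: bool

def contrastive_batches(trajectories, batch_size=10):
    """Yield (successes, failures) for each consecutive batch of finished tasks."""
    for start in range(0, len(trajectories), batch_size):
        batch = trajectories[start:start + batch_size]
        successes = [t for t in batch if t.success]
        failures = [t for t in batch if not t.success]
        yield successes, failures

def build_extraction_prompt(successes, failures):
    """Format one batch as a prompt asking what the successes share (wording is assumed)."""
    def render(group):
        lines = [f"- {t.task_id}: {' -> '.join(t.actions)}" for t in group]
        return "\n".join(lines) or "- (none)"
    return (
        "Successful trajectories:\n" + render(successes) + "\n\n"
        "Failed trajectories:\n" + render(failures) + "\n\n"
        "Describe strategies present in the successes and absent from the failures."
    )

# 25 synthetic runs; every third task fails
runs = [Trajectory(f"task-{i}", ["look", "take", "put"], success=(i % 3 != 0)) for i in range(25)]
batches = list(contrastive_batches(runs, batch_size=10))
prompt = build_extraction_prompt(*batches[0])
```

Showing both groups in one prompt is what makes the extraction contrastive: the model is asked for what separates the outcomes rather than for a summary of any single run.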

Key Findings

AutoRefine improves success rates across diverse benchmarks.

Numbers: ALFWorld 98.4% ±1.5, ScienceWorld 70.4% ±1.9, TravelPlanner 27.1% ±2.4

Practical Use: Use AutoRefine to boost agent success on tasks that benefit from reusable procedures; expect the largest gains in domains with procedural structure.

Evidence Ref: Table 1; §4.2

AutoRefine reduces action steps substantially versus reflection-based baselines.

Numbers: Steps reduced by 20.5% (ALFWorld), 59.0% (ScienceWorld), 72.8% (TravelPlanner)

Practical Use: If your system pays a per-action or latency cost, adding pattern extraction can cut operational steps and cost, especially on hard planning tasks.

Evidence Ref: Table 1; §4.2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| ALFWorld success rate (SR) | 98.4% ±1.5 | ReAct+Reflexion 95.5% ±1.9 | +2.9% absolute | ALFWorld (test) | Table 1, §4.2 | Table 1 |
| ScienceWorld pass rate | 70.4% ±1.9 | ReAct+Reflexion 69.2% ±2.1 | +1.2% absolute | ScienceWorld (selected tasks) | Table 1, §4.2 | Table 1 |
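
The deltas in the table can be re-derived from the reported values, and the ± ranges compared, with a few lines of arithmetic. The source does not say what the ± figures denote (standard deviation, standard error, or a confidence interval), so the overlap check here is only a rough indicator under that assumption, not a significance test. The function names are illustrative.

```python
def absolute_delta(value, baseline):
    """Difference in percentage points, rounded to one decimal."""
    return round(value - baseline, 1)

def intervals_overlap(value, value_err, baseline, baseline_err):
    """True when [value ± err] and [baseline ± err] share at least one point."""
    return (value - value_err <= baseline + baseline_err
            and baseline - baseline_err <= value + value_err)

# (AutoRefine value, its ±, ReAct+Reflexion baseline, its ±), copied from the table
rows = {
    "ALFWorld": (98.4, 1.5, 95.5, 1.9),
    "ScienceWorld": (70.4, 1.9, 69.2, 2.1),
}
report = {
    name: (absolute_delta(v, b), intervals_overlap(v, v_err, b, b_err))
    for name, (v, v_err, b, b_err) in rows.items()
}
# report == {"ALFWorld": (2.9, True), "ScienceWorld": (1.2, True)}
```

Under this reading the ranges overlap for both rows, which fits the "Needs Validation" label in the decision snapshot.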

What To Try In 7 Days

Run a small AutoRefine pipeline on your recent agent logs: extract patterns every K=10 tasks and measure step count and success change.

Implement maintenance: score patterns by success rate, usage, and precision, then prune the bottom 20% to control repository size.

Seed one human-designed subagent for a hard subtask and compare cold-start gains (as in the AttractionPlanner experiment).
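
The maintenance step in the list above could be prototyped as follows. The equal-weight score, the definition of precision as used / retrieved, and all names are assumptions made for illustration; the source names the three signals but this card does not give the exact formula.

```python
from dataclasses import dataclass

@dataclass
class PatternStats:
    name: str
    retrieved: int   # times the pattern was retrieved
    used: int        # times it was actually applied after retrieval
    successes: int   # tasks that succeeded when it was applied

def score(p, max_used):
    """Combine the three signals with equal weights (weights are an assumption)."""
    success_rate = p.successes / p.used if p.used else 0.0
    usage = p.used / max_used if max_used else 0.0
    precision = p.used / p.retrieved if p.retrieved else 0.0  # assumed definition
    return (success_rate + usage + precision) / 3

def prune_bottom(patterns, fraction=0.2):
    """Drop the lowest-scoring fraction of patterns; return the rest, best first."""
    max_used = max((p.used for p in patterns), default=0)
    ranked = sorted(patterns, key=lambda p: score(p, max_used), reverse=True)
    n_keep = len(ranked) - int(len(ranked) * fraction)
    return ranked[:n_keep]

# Hypothetical repository statistics
repo = [
    PatternStats("heat_then_place", retrieved=20, used=18, successes=17),
    PatternStats("book_flight_first", retrieved=15, used=9, successes=6),
    PatternStats("check_inventory", retrieved=12, used=10, successes=9),
    PatternStats("open_every_drawer", retrieved=30, used=3, successes=1),
    PatternStats("measure_temperature", retrieved=10, used=7, successes=6),
]
kept = prune_bottom(repo)
```

With five patterns and a 20% cut, one pattern is dropped. Here that is the one retrieved often but rarely applied.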

Agent Features

Memory: subagent local memory (stateful); metadata-driven repository usage stats
Planning: pattern-augmented planning; multi-query retrieval planning
Tool Use: register code snippets as callable tools; invoke subagents as callable procedures
Frameworks: AutoRefine pattern repository; contrastive batch extraction agents
Is Agentic: Yes
Architectures: hierarchical delegation; pattern repository + retrieval
Collaboration: master coordinator delegates to subagents
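
The delegation features listed above (stateful subagents, code snippets registered as tools, a master coordinator) can be sketched in a few classes. Every name here, including `Subagent`, `Coordinator`, and the `suggest_attraction` handler, is hypothetical and only illustrates the shape of the idea, not the system's actual interfaces.

```python
class Subagent:
    """A small specialized agent that keeps its own memory between calls."""
    def __init__(self, name, handler):
        self.name = name
        self.handler = handler
        self.memory = []  # local state, not shared with the coordinator

    def __call__(self, request):
        result = self.handler(request, self.memory)
        self.memory.append(result)  # remember what was returned
        return result

class Coordinator:
    """Master agent that routes subtasks to registered callables."""
    def __init__(self):
        self.tools = {}

    def register(self, name, fn):
        self.tools[name] = fn

    def delegate(self, name, request):
        if name not in self.tools:
            raise KeyError(f"no tool or subagent registered as {name!r}")
        return self.tools[name](request)

def suggest_attraction(candidates, memory):
    """Hypothetical handler: return the first candidate not suggested before."""
    for candidate in candidates:
        if candidate not in memory:
            return candidate
    return None

coordinator = Coordinator()
coordinator.register("attraction_planner", Subagent("attraction_planner", suggest_attraction))
coordinator.register("word_count", lambda text: len(text.split()))  # plain code snippet as a tool

first = coordinator.delegate("attraction_planner", ["museum", "park"])
second = coordinator.delegate("attraction_planner", ["museum", "park"])
```

Because the subagent holds its own memory, the second call returns a different suggestion without the coordinator having to track earlier answers.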

Optimization Features

Token Efficiency: reduces redundant reasoning steps by delegating to subagents
Infra Optimization: Elasticsearch + embedding-based retrieval to scale pattern lookup
System Optimization: percentile-based pruning (bottom 20%) to limit repository growth

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

ALFWorld (Shridhar et al.), ScienceWorld (Wang et al.), TravelPlanner (Xie et al.)

Risks & Boundaries

Limitations

Pattern type classification can mislabel strategies, creating unnecessary subagents or missing stateful logic (A.2).

Pattern contexts can be too broad or too narrow, causing low retrieval precision or missed opportunities (A.2).

When Not To Use

Small, static tasks where single-shot LLM reasoning suffices and pattern overhead is unnecessary.

Privacy-sensitive domains where execution traces cannot be safely stored or patterns might leak sensitive data.

Failure Modes

Misclassified patterns lead to many subagent calls and invocation overhead.

No maintenance: repository bloat causes low utilization and degraded retrieval quality.
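
One maintenance step the summary names against bloat is merging similar entries. A minimal sketch, assuming a greedy pass that folds each pattern into the first already-kept pattern whose embedding is close enough. The 0.9 threshold, the greedy strategy, and the toy 2-dimensional vectors are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def merge_similar(patterns, threshold=0.9):
    """Greedy merge: fold each pattern into the first kept one it closely matches."""
    merged = []
    for name, vec, uses in patterns:
        for entry in merged:
            if cosine(vec, entry["vec"]) >= threshold:
                entry["uses"] += uses          # pool usage statistics
                entry["aliases"].append(name)  # keep a trace of what was merged
                break
        else:
            # No close match found, so keep this pattern as a new entry
            merged.append({"name": name, "vec": vec, "uses": uses, "aliases": []})
    return merged

# (name, toy embedding, usage count); all values are made up
repo = [
    ("heat_then_place", [1.0, 0.0], 12),
    ("warm_then_place", [0.98, 0.05], 4),
    ("book_flight_first", [0.0, 1.0], 7),
]
compact = merge_similar(repo)
```

A greedy pass depends on input order, so a production version would likely sort by score first or use proper clustering.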

Core Entities

Models

Claude-sonnet-4, GPT-4-turbo, Qwen3-Embedding-4B

Metrics

Success Rate (SR / Pass@1 / Final Pass Rate); Steps (mean); Repository size (patterns); Pattern utilization rate (u_j / r_j)

Datasets

ALFWorld, ScienceWorld, TravelPlanner

Benchmarks

ALFWorld, ScienceWorld, TravelPlanner

Context Entities

Models

GPT-4-turbo (used for ALFWorld comparisons)