Overview
Production Readiness
0.6
Novelty Score
0.65
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
AutoRefine automatically turns past successful runs into reusable procedures and small specialized agents. This lowers action counts and improves success on complex planning tasks, which in turn reduces latency and operating cost for task-heavy, tool-using products.
Summary TLDR
AutoRefine is a pipeline that turns successful agent execution traces into two reusable artifacts: skill patterns (text or code guidelines) and subagent patterns (small specialized agents with memory). It extracts patterns in batches, retrieves them with multi-query embeddings, and maintains the repository by scoring, pruning (bottom 20%), and merging similar entries. Evaluated on ALFWorld, ScienceWorld, and TravelPlanner, it improves success rates (98.4%, 70.4%, 27.1%) and cuts steps by 20–73%. The system is best for complex procedural tasks where encapsulating stateful subtasks helps reuse.
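The two artifact types can be pictured as plain records. This is only an illustrative sketch: the class names, field names, and example values below are assumptions made for this summary, not AutoRefine's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class SkillPattern:
    """A stateless guideline or code snippet (hypothetical schema)."""
    name: str
    context: str  # when the pattern applies; the text used for retrieval
    body: str     # guideline text or source code


@dataclass
class SubagentPattern:
    """A small specialized agent that keeps local state (hypothetical schema)."""
    name: str
    context: str
    instructions: str
    # default_factory gives every subagent its own memory dict
    memory: dict = field(default_factory=dict)

    def remember(self, key: str, value: object) -> None:
        self.memory[key] = value


# Made-up example entries, for illustration only.
skill = SkillPattern(
    name="heat_then_place",
    context="object must be heated before placement",
    body="Find a heating appliance before walking to the target location.",
)
planner = SubagentPattern(
    name="attraction_planner",
    context="multi-day trip planning",
    instructions="Pick attractions that fit the remaining budget.",
)
planner.remember("visited", ["museum"])
```

The point of the split is that a skill pattern carries no state between calls, while a subagent pattern can track progress on a subtask in its own memory.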
Problem Statement
LLM agents fail to accumulate procedural knowledge: past work stores flat text or raw trajectories that cannot capture multi-step logic or state. As experience accumulates, repositories become large and noisy because there is no active maintenance, which hurts retrieval and performance.
Main Contribution
Dual-pattern extraction: automatically create skill patterns (guidelines or code) and subagent patterns (small specialized agents with memory) from execution traces.
Batch, contrastive extraction: extract patterns from groups of successful vs failed trajectories to find generalizable strategies.
Continuous maintenance: score patterns by empirical utility, prune low-utility ones (bottom 20%), and merge redundant patterns to prevent repository bloat.
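The merge step of maintenance could look roughly like the sketch below. The summary does not say how similarity is measured or how merged entries are combined, so the cosine similarity over embeddings, the greedy first-match strategy, the `sim_threshold` parameter, and the dictionary fields are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_similar(patterns, sim_threshold=0.9):
    """Greedy merge: fold each pattern into the first kept pattern that is
    at least `sim_threshold` similar; otherwise keep it as a new entry.

    patterns: list of dicts with 'name', 'embedding', and 'uses' keys
    (a hypothetical layout).
    """
    kept = []
    for p in patterns:
        for k in kept:
            if cosine(p["embedding"], k["embedding"]) >= sim_threshold:
                k["uses"] += p["uses"]            # pool usage statistics
                k["merged_from"].append(p["name"])
                break
        else:
            # No sufficiently similar entry found; copy so the input is not mutated.
            kept.append({**p, "merged_from": []})
    return kept


patterns = [
    {"name": "a", "embedding": [1.0, 0.0], "uses": 3},
    {"name": "b", "embedding": [0.99, 0.05], "uses": 2},
    {"name": "c", "embedding": [0.0, 1.0], "uses": 1},
]
merged = merge_similar(patterns)
```

Here `a` and `b` collapse into one entry while `c` stays separate. A threshold that is too low would produce the incorrectly merged patterns listed under Failure Modes.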
Key Findings
AutoRefine improves success rates across diverse benchmarks.
AutoRefine reduces action steps substantially versus reflection-based baselines.
Each of the three components (subagents, batch extraction, maintenance) contributes materially to performance.
Automatic subagents capture universal procedural constraints better than manual agents on some tasks.
Results
ALFWorld success rate (SR): 98.4%
ScienceWorld pass rate: 70.4%
TravelPlanner final pass (test): 27.1%
Step reduction vs ReAct+Reflexion
Repository growth without maintenance
Who Should Care
What To Try In 7 Days
Run a small AutoRefine pipeline on your recent agent logs: extract patterns every K=10 tasks and measure step count and success change.
Implement maintenance: score patterns by success rate, usage, and precision, then prune the bottom 20% to control repository size.
Seed one human-designed subagent for a hard subtask and compare cold-start gains (as in the AttractionPlanner experiment).
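A first experiment with batch extraction every K=10 tasks could be wired up as in the sketch below. The `run_task` and `extract` callables are hypothetical placeholders for your own agent runner and your own LLM-based extractor, and the trajectory dictionary layout is an assumption; the stubs only make the example executable.

```python
from typing import Callable, Iterable

K = 10  # batch size suggested above


def run_with_batch_extraction(tasks: Iterable, run_task: Callable,
                              extract: Callable, k: int = K) -> list:
    """Run tasks; after every k completed tasks, pass the batch's successful
    and failed trajectories to `extract` and store the returned patterns."""
    repository, batch = [], []
    for task in tasks:
        # Assumed trajectory shape: {"task": ..., "success": bool, "steps": int}
        batch.append(run_task(task, repository))
        if len(batch) == k:
            successes = [t for t in batch if t["success"]]
            failures = [t for t in batch if not t["success"]]
            repository.extend(extract(successes, failures))
            batch = []
    return repository  # a trailing partial batch is left unextracted


# Stubs standing in for a real agent and a real extractor.
def fake_run(task, repository):
    return {"task": task, "success": task % 3 != 0, "steps": 10}


def fake_extract(successes, failures):
    return [f"pattern from {len(successes)} successes vs {len(failures)} failures"]


repo = run_with_batch_extraction(range(25), fake_run, fake_extract)
```

To measure the effect, log the mean `steps` and the success rate per batch and compare batches run before and after the repository starts filling up.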
Agent Features
Memory
- subagent local memory (stateful)
- metadata-driven repository usage stats
Planning
- pattern-augmented planning
- multi-query retrieval planning
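Multi-query retrieval can be sketched as follows: several query strings are derived from the current task, each is embedded, and every pattern is scored against all of them. The toy bag-of-words `embed` function stands in for a real embedding model, and taking the maximum similarity over queries is an assumed aggregation rule; the summary does not specify either detail.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(queries, patterns, k=2):
    """Score each pattern by its best match over all queries and return the
    names of the top-k patterns with a non-zero score."""
    query_vecs = [embed(q) for q in queries]
    scored = []
    for name, context in patterns.items():
        vec = embed(context)
        scored.append((max(cosine(q, vec) for q in query_vecs), name))
    scored.sort(reverse=True)
    return [name for score, name in scored[:k] if score > 0]


# Made-up pattern contexts, for illustration only.
patterns = {
    "heat_object": "heat an object with the microwave",
    "book_hotel": "book a hotel within budget",
    "clean_object": "clean an object in the sink",
}
queries = ["heat the egg", "use the microwave"]
top = retrieve(queries, patterns, k=2)
```

Using several queries lets a pattern surface when it matches any one phrasing of the task, rather than requiring a single query to capture every aspect.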
Tool Use
- register code snippets as callable tools
- invoke subagents as callable procedures
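Both tool-use features can share one callable registry, as in the minimal sketch below. The registry, function names, and example snippet are hypothetical, and `exec` is used only to keep the example short: it runs arbitrary code, so a real system would need sandboxing and review of stored snippets.

```python
from typing import Callable, Dict

TOOLS: Dict[str, Callable] = {}


def register_tool(name: str, source: str, entrypoint: str) -> None:
    """Compile a stored code snippet and expose one of its functions as a tool.

    Warning: exec runs arbitrary code; only use it on trusted snippets.
    """
    namespace: dict = {}
    exec(source, namespace)
    TOOLS[name] = namespace[entrypoint]


def register_subagent(name: str, handler: Callable) -> None:
    """Expose a (possibly stateful) subagent object under a tool name."""
    TOOLS[name] = handler


def call(name: str, **kwargs):
    """Invoke a registered tool or subagent by name."""
    return TOOLS[name](**kwargs)


class TrackerSubagent:
    """Toy stateful subagent: remembers every item it has been given."""

    def __init__(self):
        self.memory = []

    def __call__(self, item):
        self.memory.append(item)
        return len(self.memory)


snippet = "def total_cost(prices):\n    return sum(prices)\n"
register_tool("total_cost", snippet, "total_cost")
register_subagent("tracker", TrackerSubagent())

cost = call("total_cost", prices=[120, 80])
first = call("tracker", item="museum")
second = call("tracker", item="park")
```

From the coordinator's point of view, a code snippet and a subagent are invoked the same way; only the subagent retains state between calls.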
Frameworks
- AutoRefine pattern repository
- contrastive batch extraction agents
Is Agentic
true
Architectures
- hierarchical delegation
- pattern repository + retrieval
Collaboration
- master coordinator delegates to subagents
Optimization Features
Token Efficiency
- reduces redundant reasoning steps by delegating to subagents
Infra Optimization
- Elasticsearch + embedding-based retrieval to scale pattern lookup
System Optimization
- percentile-based pruning (bottom 20%) to limit repository growth
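Percentile-based pruning is simple to prototype once each pattern has a utility score. The sketch below takes the scores as given, because the summary does not state how success rate, usage, and precision are combined. Rounding the cutoff down with `floor` is also an assumption.

```python
import math


def prune_bottom(scores: dict, alpha: float = 0.2) -> dict:
    """Drop the lowest-scoring `alpha` fraction of patterns.

    scores: mapping of pattern name -> utility score (higher is better).
    """
    n_drop = math.floor(len(scores) * alpha)
    ranked = sorted(scores, key=scores.get)  # ascending by score
    dropped = set(ranked[:n_drop])
    return {name: s for name, s in scores.items() if name not in dropped}


scores = {f"p{i}": i / 10 for i in range(10)}  # p0 is the worst, p9 the best
kept = prune_bottom(scores)
```

With ten patterns and alpha = 0.2, the two lowest-scoring entries are removed. With fewer than five patterns nothing is pruned under this rounding choice, which gives a small repository some protection during cold start.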
Reproducibility
Data Urls
- ALFWorld (Shridhar et al.), ScienceWorld (Wang et al.), TravelPlanner (Xie et al.)
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Pattern type classification can mislabel strategies, creating unnecessary subagents or missing stateful logic (A.2).
- Pattern contexts can be too broad or too narrow, causing low retrieval precision or missed opportunities (A.2).
- Hyperparameters are domain-sensitive (batch size K, pruning α, retrieval k, threshold θ) and need tuning for new domains (C.2).
- Cold-start: requires many initial tasks to build useful patterns; pre-seeding helps but reduces autonomy (C.1).
When Not To Use
- Small, static tasks where single-shot LLM reasoning suffices and pattern overhead is unnecessary.
- Privacy-sensitive domains where execution traces cannot be safely stored or patterns might leak sensitive data.
- Very low compute/latency budgets where extra retrieval or subagent calls would add unacceptable overhead.
Failure Modes
- Misclassified patterns lead to many subagent calls and invocation overhead.
- No maintenance: repository bloat causes low utilization and degraded retrieval quality.
- Overfitting when extracting from single-task traces instead of batch contrastive analysis.
- Incorrectly merged patterns that conflate incompatible procedures.
Core Entities
Models
- Claude-sonnet-4
- GPT-4-turbo
- Qwen3-Embedding-4B
Metrics
- Success Rate (SR / Pass@1 / Final Pass Rate)
- Steps (mean)
- Repository size (patterns)
- Pattern utilization rate (u_j / r_j)
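The utilization metric can be computed directly from per-pattern counters. The sketch assumes that u_j counts how often pattern j was actually used and r_j how often it was retrieved; the summary gives only the ratio, so this reading and the example numbers are assumptions.

```python
def utilization_rate(used: int, retrieved: int) -> float:
    """u_j / r_j, defined here as 0.0 for patterns that were never retrieved."""
    return used / retrieved if retrieved else 0.0


# Made-up counters: pattern name -> (u_j, r_j)
stats = {"heat_object": (8, 10), "book_hotel": (0, 5), "new_pattern": (0, 0)}
rates = {name: utilization_rate(u, r) for name, (u, r) in stats.items()}
```

A pattern that is retrieved often but rarely used, such as `book_hotel` in this toy data, points to an overly broad context and is a natural pruning candidate.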
Datasets
- ALFWorld
- ScienceWorld
- TravelPlanner
Benchmarks
- ALFWorld
- ScienceWorld
- TravelPlanner
Context Entities
Models
- GPT-4-turbo (used for ALFWorld comparisons)

