Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
2
Why It Matters For Business
You can adapt LLM agents to specific apps without costly human labels; synthesizing and indexing environment-specific interactions boosts accuracy and reduces run-time planning costs.
Summary TLDR
Learn-by-interact is a data-first pipeline that adapts LLM agents to real apps without human labeling. It (1) generates tasks from documentation with self-instruct, (2) runs LLMs in the environment to produce long interaction trajectories, (3) applies “backward construction” to turn sub-trajectories into aligned instructions, (4) filters the results with an LLM committee, and (5) uses observation-based and model-based retrieval at inference. On four realistic benchmarks it raises ICL scores (e.g., Claude-3.5-sonnet from 12.4 to 22.5 on OSWorld) and gains up to 19.5 pp after fine-tuning (Codestral-22B on WebArena). The method trades upfront synthesis cost for lower inference latency and higher task success.
Problem Statement
Realistic agent tasks need environment-specific examples, but human labeling of long, multi-step interactions is slow and costly. Off-the-shelf LLMs often produce trajectories that do not match the original instructions, so an automated pipeline is needed to produce many high-quality, aligned instruction–trajectory pairs for both in-context and training-based adaptation.
Main Contribution
A fully automated data pipeline that synthesizes environment-specific agent trajectories from documentation and LLM rollouts without human labels.
Backward construction: create new, aligned task instructions from sub-trajectories to fix misaligned or failed long runs and multiply usable examples.
Agentic retrieval for ICL: combine observation-based and model-based retrieval to pick relevant trajectory examples at each step.
Extensive evaluation showing consistent gains across four realistic agent benchmarks and analysis on data granularity, retrieval, scaling, and inference efficiency.
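A minimal sketch of backward construction, assuming a trajectory is a list of (action, observation) steps. The names `Step`, `backward_construct`, and `toy_summarize`, and the `min_len`/`max_len` window bounds, are hypothetical and not taken from the paper; `toy_summarize` stands in for the LLM call that would write an instruction for each sub-trajectory.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Step:
    action: str
    observation: str


def backward_construct(
    trajectory: List[Step],
    summarize: Callable[[List[Step]], str],
    min_len: int = 1,
    max_len: int = 3,
) -> List[Tuple[str, List[Step]]]:
    """Pair each contiguous sub-trajectory with a newly written instruction."""
    pairs = []
    n = len(trajectory)
    for start in range(n):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > n:
                break
            sub = trajectory[start:end]
            # The instruction is written from what the agent actually did,
            # so it matches the sub-trajectory even if the original run
            # failed or drifted from its original task.
            pairs.append((summarize(sub), sub))
    return pairs


def toy_summarize(sub: List[Step]) -> str:
    # Assumption: a real pipeline would prompt an LLM here.
    return "Task: " + " then ".join(step.action for step in sub)


trajectory = [
    Step("open menu", "menu visible"),
    Step("click export", "export dialog"),
    Step("choose csv", "file saved"),
]
pairs = backward_construct(trajectory, toy_summarize, min_len=1, max_len=2)
```

One failed or misaligned long run can still yield several usable pairs this way, which is how the technique multiplies examples.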
Key Findings
In-context learning (ICL) with synthesized data raises Claude-3.5-sonnet's task success on OSWorld from 12.4% to 22.5%.
Fine-tuning on synthesized data raises Codestral-22B's task success on WebArena from 4.7% to 24.2% (+19.5 pp).
Backward construction raises data quality and training utility, adding up to about 14% improvement in some training settings.
Learn-by-interact needs less inference work than planning-heavy methods: LATS uses about 4× more tokens per instance, whereas Learn-by-interact makes fewer LLM calls and uses only slightly more tokens than the baseline.
Results
ICL task success (Claude-3.5-sonnet) on OSWorld
ICL task success (Claude-3.5-sonnet) on WebArena
Fine-tuned task success (Codestral-22B) on WebArena
Who Should Care
What To Try In 7 Days
Generate a small seed of tasks from your product docs using self-instruct and an LLM.
Roll out the LLM against the live or simulated environment, then apply backward construction to convert sub-trajectories into aligned examples.
Implement observation+model-based retrieval and test ICL performance before committing to fine-tuning.
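The retrieval step in the last item can be prototyped without any infrastructure. The sketch below assumes each synthesized example is a dict with `instruction` and `observation` fields, and it uses token-overlap (Jaccard) scoring as a stand-in for a real retriever. `write_queries` is a hypothetical callable representing the LLM that writes search queries from the instruction, history, and current observation. None of these names come from the paper.

```python
from typing import Callable, Dict, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a placeholder for a real retriever score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(query: str, examples: List[Dict[str, str]], field: str, k: int) -> List[Dict[str, str]]:
    ranked = sorted(examples, key=lambda ex: jaccard(query, ex[field]), reverse=True)
    return ranked[:k]


def agentic_retrieve(
    instruction: str,
    history: List[Tuple[str, str]],
    observation: str,
    examples: List[Dict[str, str]],
    write_queries: Callable[[str, List[Tuple[str, str]], str], List[str]],
    k_obs: int = 1,
    k_model: int = 1,
) -> List[Dict[str, str]]:
    # Observation-based: match the current observation against stored ones.
    picked = retrieve(observation, examples, "observation", k_obs)
    # Model-based: match model-written queries against stored instructions.
    for query in write_queries(instruction, history, observation):
        for ex in retrieve(query, examples, "instruction", k_model):
            if ex not in picked:
                picked.append(ex)
    return picked
```

Swapping `jaccard` for an embedding or BM25 score changes only the `retrieve` function, so ICL performance can be compared across retrievers before committing to fine-tuning.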
Agent Features
Memory
- short-term interaction history (H) kept per episode
Planning
- multi-step interaction driven by LLM
Tool Use
- observation-based retrieval
- model-based retrieval
- action prediction prompts
Frameworks
- self-instruct
- backward construction
- agentic retrieval
Is Agentic
true
Architectures
- LLM-driven single agent
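How these features might fit together in a single episode can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: `env_step`, `predict_action`, and `retrieve` are hypothetical callables for the environment, the LLM, and the retriever, and the `"DONE"` stop action and prompt layout are invented for the example.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (observation, action) pairs: the per-episode H


def build_prompt(instruction: str, history: History, observation: str, examples: List[str]) -> str:
    """Assemble an action-prediction prompt from examples and history."""
    lines = ["Examples:"]
    lines += [f"- {ex}" for ex in examples]
    lines.append(f"Instruction: {instruction}")
    for past_observation, past_action in history:
        lines.append(f"Observation: {past_observation}")
        lines.append(f"Action: {past_action}")
    lines.append(f"Observation: {observation}")
    lines.append("Action:")
    return "\n".join(lines)


def run_episode(
    instruction: str,
    first_observation: str,
    env_step: Callable[[str], str],
    predict_action: Callable[[str], str],
    retrieve: Callable[[str, History, str], List[str]],
    max_steps: int = 10,
) -> History:
    history: History = []  # short-term memory, reset for every episode
    observation = first_observation
    for _ in range(max_steps):
        # Retrieval runs at every step, so examples track the current state.
        examples = retrieve(instruction, history, observation)
        action = predict_action(build_prompt(instruction, history, observation, examples))
        history.append((observation, action))
        if action == "DONE":
            break
        observation = env_step(action)
    return history
```

Each step costs one action-prediction call plus retrieval, with no tree search or rollouts at inference time.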
Optimization Features
Token Efficiency
- fewer LLM calls at inference compared to heavy planning baselines
Model Optimization
- LoRA
System Optimization
- precompute and index trajectory examples for fast retrieval
Training Optimization
- filtering with LLM committee to raise label quality
Inference Optimization
- reduce online planning by retrieving pre-synthesized examples
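The filtering and indexing items above can be sketched as a two-stage offline job. Everything here is an assumption for illustration: each judge is a callable standing in for one LLM in the committee, the `min_votes` threshold is a hypothetical parameter (the summary does not state the voting rule), and the inverted index over instruction tokens is a placeholder for whatever retrieval index is actually used.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

Example = Tuple[str, List[str]]  # (instruction, list of actions)
Judge = Callable[[Example], bool]


def committee_filter(examples: List[Example], judges: List[Judge], min_votes: int) -> List[Example]:
    """Keep an example only if at least min_votes judges accept it."""
    kept = []
    for example in examples:
        votes = sum(1 for judge in judges if judge(example))
        if votes >= min_votes:
            kept.append(example)
    return kept


def build_index(examples: List[Example]) -> Dict[str, Set[int]]:
    """Precompute token -> example ids so lookups avoid a full scan."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for i, (instruction, _actions) in enumerate(examples):
        for token in instruction.lower().split():
            index[token].add(i)
    return dict(index)


def lookup(index: Dict[str, Set[int]], query: str) -> Set[int]:
    hits: Set[int] = set()
    for token in query.lower().split():
        hits |= index.get(token, set())
    return hits
```

Both stages run once, offline, which is where the upfront synthesis cost goes; at inference only `lookup` is on the critical path.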
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Data synthesis and LLM committee filtering require many LLM calls and compute.
- Method relies on availability and coverage of documentation or related resources for task generation.
- Generated trajectories can still contain wrong actions; filtering mitigates but may not remove all noise.
When Not To Use
- When you cannot afford large upfront LLM generation and filtering cost.
- When environment documentation is missing or misleading.
- When real interactions are extremely costly or dangerous (e.g., physical robots) without reliable simulators.
Failure Modes
- LLM rollouts choose incorrect actions, producing misaligned trajectories before backward construction.
- Trajectories that loop or detour lead to poor examples if filtering misses them.
- Retrieval mismatch: retrieved example states may appear similar but require different subsequent actions.
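A cheap rule-based check can catch some looping trajectories before they reach the LLM committee. The sketch below is an assumption, not part of the described method: it treats a trajectory as a list of (observation, action) pairs and flags it when the same pair is visited more than `max_visits` times. Both function names and the threshold are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Trajectory = List[Tuple[str, str]]  # (observation, action) pairs


def has_loop(steps: Trajectory, max_visits: int = 1) -> bool:
    """True if any (observation, action) pair occurs more than max_visits times."""
    visits: Counter = Counter()
    for observation, action in steps:
        visits[(observation, action)] += 1
        if visits[(observation, action)] > max_visits:
            return True
    return False


def drop_looping(trajectories: List[Trajectory], max_visits: int = 1) -> List[Trajectory]:
    return [t for t in trajectories if not has_loop(t, max_visits)]
```

This catches only exact repeats; detours through distinct states and retrieval mismatches still need model-based filtering or evaluation on held-out tasks.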
Core Entities
Models
- Claude-3.5-sonnet
- Gemini-1.5-pro
- Codestral-22B
- Codegemma-7B
Metrics
- task success %
- Accuracy
- LLM judge fuzzy match
Datasets
- SWE-bench
- WebArena
- OSWorld
- Spider2-V
Benchmarks
- SWE-bench
- WebArena
- OSWorld
- Spider2-V

