Synthesize agent–environment trajectories and rewrite tasks (backward construction) to adapt LLM agents without human labels

January 18, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

The pipeline shows consistent gains on four realistic benchmarks using multiple commercial and open models; synthesis cost is the main trade-off but results are reproducible in similar setups.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.75

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, Sercan Ö. Arık

Why It Matters For Business

You can adapt LLM agents to specific apps without costly human labels; synthesizing and indexing environment-specific interactions boosts accuracy and reduces run-time planning costs.

Summary TLDR

Learn-by-interact is a data-first pipeline that adapts LLM agents to real apps without human labeling. It (1) uses documentation and self-instruct to generate tasks, (2) runs LLMs in the environment to create long interaction trajectories, (3) applies “backward construction” to turn sub-trajectories into aligned instructions, (4) filters with an LLM committee, and (5) uses observation- and model-based retrieval at inference. On four realistic benchmarks it raises ICL scores (e.g., Claude-3.5-sonnet from 12.4 to 22.5 on OSWorld) and gives up to +19.5 pp after fine-tuning (Codestral-22B on WebArena, from 4.7 to 24.2). The method trades upfront synthesis cost for lower inference latency and higher task success.

Problem Statement

Realistic agent tasks need environment-specific examples, but human labeling of long, multi-step interactions is slow and costly. Off-the-shelf LLMs often produce trajectories that do not match the original instructions, so an automated pipeline is needed to produce many high-quality, aligned instruction–trajectory pairs for both in-context and training-based adaptation.

Main Contribution

A fully automated data pipeline that synthesizes environment-specific agent trajectories from documentation and LLM rollouts without human labels.

Backward construction: create new, aligned task instructions from sub-trajectories to fix misaligned or failed long runs and multiply usable examples.
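Backward construction can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Step` type, the `min_len` parameter, and the choice to enumerate every contiguous sub-trajectory are hypothetical, and `summarize` stands in for an LLM call that writes an instruction describing what a sub-trajectory actually accomplished.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Step:
    action: str
    observation: str


def backward_construct(
    trajectory: List[Step],
    summarize: Callable[[List[Step]], str],
    min_len: int = 1,
) -> List[Tuple[str, List[Step]]]:
    """Pair each contiguous sub-trajectory with a newly written instruction.

    `summarize` is a placeholder for an LLM call (an assumption of this sketch).
    """
    pairs = []
    n = len(trajectory)
    for start in range(n):
        for end in range(start + min_len, n + 1):
            sub = trajectory[start:end]
            # The instruction is derived from what happened, so it is
            # aligned with the sub-trajectory by construction.
            pairs.append((summarize(sub), sub))
    return pairs


def toy_summarize(sub: List[Step]) -> str:
    # Stand-in for the LLM: simply lists the actions that were taken.
    return "Do: " + ", then ".join(step.action for step in sub)


rollout = [
    Step("open settings", "settings page"),
    Step("click profile", "profile page"),
    Step("edit name", "name field focused"),
]
examples = backward_construct(rollout, toy_summarize)
```

A three-step rollout yields six instruction–trajectory pairs here, which is how a single long (even failed) run can be multiplied into several usable examples.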

Key Findings

In-context learning (ICL) with synthesized data improves Claude-3.5-sonnet on OSWorld from 12.4 to 22.5.

Numbers: 12.4 → 22.5 (OSWorld, Claude ICL)

Practical Use: If you only have API access, pre-synthesizing environment-specific examples and using agentic retrieval can roughly double task success on some real-world environments.

Evidence Ref: Table 2

Fine-tuning on synthesized data can give very large gains; Codestral-22B on WebArena jumps from 4.7 to 24.2 after training.

Numbers: 4.7 → 24.2 (+19.5) WebArena

Practical Use: When you can update model weights, invest in synthesized data for environment-specific fine-tuning; in the reported case it gives a roughly fivefold improvement over the base model.

Evidence Ref: Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
ICL task success (Claude-3.5-sonnet) on OSWorld | 22.5 | 12.4 | +10.1 | OSWorld | Learn-by-interact ICL results | Table 2
ICL task success (Claude-3.5-sonnet) on WebArena | 48.0 | 35.8 | +12.2 | WebArena | Learn-by-interact ICL results | Table 2
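As a quick sanity check on the reported numbers, the snippet below recomputes the absolute and relative gains from the baseline and final scores quoted in this summary. The dictionary keys and the `gains` helper are illustrative names only.

```python
# (baseline, value) pairs as quoted in this summary.
results = {
    "OSWorld ICL (Claude-3.5-sonnet)": (12.4, 22.5),
    "WebArena ICL (Claude-3.5-sonnet)": (35.8, 48.0),
    "WebArena fine-tuning (Codestral-22B)": (4.7, 24.2),
}


def gains(baseline: float, value: float):
    """Return (absolute delta in points, value as a multiple of baseline)."""
    delta = round(value - baseline, 1)
    ratio = round(value / baseline, 2)
    return delta, ratio


summary = {name: gains(b, v) for name, (b, v) in results.items()}
for name, (delta, ratio) in summary.items():
    print(f"{name}: +{delta} pp, {ratio}x baseline")
```

The ratios make the scale of each gain explicit: about 1.8x for OSWorld ICL, about 1.3x for WebArena ICL, and about 5.1x for the Codestral-22B fine-tuning result.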

What To Try In 7 Days

Generate a small seed of tasks from your product docs using self-instruct and an LLM.

Roll out the LLM against the live or simulated environment, then apply backward construction to convert sub-trajectories into aligned examples.

Implement observation-based and model-based retrieval and test ICL performance before committing to fine-tuning.
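The retrieval step could be prototyped along these lines. This is a minimal sketch under assumptions: the Jaccard token-overlap scorer is a stand-in for whatever retriever you use, the index layout (`id` and `text` keys) is hypothetical, and `write_queries` is a placeholder for an LLM call that proposes search queries from the instruction, history, and current observation.

```python
from typing import Callable, Dict, List


def tokens(text: str) -> set:
    return set(text.lower().split())


def score(query: str, doc: str) -> float:
    """Jaccard token overlap; a simple stand-in for a real retriever."""
    q, d = tokens(query), tokens(doc)
    return len(q & d) / len(q | d) if q | d else 0.0


def retrieve(queries: List[str], index: List[Dict], k: int) -> List[Dict]:
    if not queries:
        return []
    ranked = sorted(
        index,
        key=lambda ex: max(score(q, ex["text"]) for q in queries),
        reverse=True,
    )
    return ranked[:k]


def agentic_retrieve(instruction, history, observation, index,
                     write_queries: Callable, k: int = 1) -> List[Dict]:
    # Observation-based: the current observation itself is the query.
    obs_hits = retrieve([observation], index, k)
    # Model-based: an LLM (stubbed here) writes the queries.
    model_hits = retrieve(write_queries(instruction, history, observation), index, k)
    # Merge both result lists, dropping duplicates but keeping order.
    seen, merged = set(), []
    for ex in obs_hits + model_hits:
        if ex["id"] not in seen:
            seen.add(ex["id"])
            merged.append(ex)
    return merged


index = [
    {"id": 1, "text": "login page username password fields"},
    {"id": 2, "text": "repository issues list with filter box"},
    {"id": 3, "text": "create new issue form title body"},
]
hits = agentic_retrieve(
    instruction="file a bug report",
    history=[],
    observation="issues list page with filter box",
    index=index,
    write_queries=lambda ins, hist, obs: ["create new issue form"],  # LLM stub
)
```

In this toy run the observation pulls in the example that looks like the current screen, while the model-written query pulls in the example for the next thing the agent needs to do.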

Agent Features

Memory
short-term interaction history (H) kept per episode
Planning
multi-step interaction driven by LLM
Tool Use
observation-based retrieval, model-based retrieval, action prediction prompts
Frameworks
self-instruct, backward construction, agentic retrieval
Is Agentic

Yes

Architectures
LLM-driven single agent

Optimization Features

Token Efficiency
fewer LLM calls at inference compared to heavy planning baselines
Model Optimization
LoRA
System Optimization
precompute and index trajectory examples for fast retrieval
Training Optimization
filtering with LLM committee to raise label quality
Inference Optimization
reduce online planning by retrieving pre-synthesized examples
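The committee filtering idea listed above can be sketched as a simple vote over instruction–trajectory pairs. The voting threshold, the `Judge` signature, and the toy judges below are all assumptions made for illustration; in practice each judge would be an LLM call, and the source does not specify how votes are combined.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, List[str]]          # (instruction, list of actions)
Judge = Callable[[str, List[str]], bool]


def committee_filter(
    pairs: List[Pair],
    judges: List[Judge],
    min_votes: Optional[int] = None,
) -> List[Pair]:
    """Keep pairs that at least `min_votes` judges accept.

    Defaults to unanimity, which is an assumption of this sketch.
    """
    if min_votes is None:
        min_votes = len(judges)
    kept = []
    for instruction, trajectory in pairs:
        votes = sum(1 for judge in judges if judge(instruction, trajectory))
        if votes >= min_votes:
            kept.append((instruction, trajectory))
    return kept


# Toy judges standing in for LLM committee members.
def judge_nonempty(instruction, trajectory):
    return len(trajectory) > 0


def judge_short_enough(instruction, trajectory):
    return len(trajectory) <= 5


def judge_mentions_action(instruction, trajectory):
    return any(a.split()[0] in instruction.lower() for a in trajectory)


judges = [judge_nonempty, judge_short_enough, judge_mentions_action]
pairs = [
    ("click the save button", ["click save"]),
    ("rename the file", []),
    ("open the settings page", ["scroll down"] * 7),
    ("export the report", ["click export"]),
]
strict = committee_filter(pairs, judges)
lenient = committee_filter(pairs, judges, min_votes=2)
```

A stricter threshold raises label quality at the cost of discarding more synthesized data, so the threshold is worth tuning against a small hand-checked sample.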

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Data synthesis and LLM committee filtering require many LLM calls and compute.

Method relies on availability and coverage of documentation or related resources for task generation.

When Not To Use

When you cannot afford large upfront LLM generation and filtering cost.

When environment documentation is missing or misleading.

Failure Modes

LLM rollouts choose incorrect actions, producing misaligned trajectories before backward construction.

Trajectories that loop or detour lead to poor examples if filtering misses them.
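One inexpensive guard against looping trajectories is a rule-based check that runs before any LLM filtering. The heuristic below is an assumption of this write-up rather than something taken from the paper: it flags a trajectory when the same (action, observation) pair occurs more than `max_repeats` times.

```python
from typing import Dict, List, Tuple

StepPair = Tuple[str, str]  # (action, observation)


def has_loop(steps: List[StepPair], max_repeats: int = 1) -> bool:
    """Return True if any (action, observation) pair repeats too often."""
    counts: Dict[StepPair, int] = {}
    for step in steps:
        counts[step] = counts.get(step, 0) + 1
        if counts[step] > max_repeats:
            return True
    return False


looping = [
    ("click next", "page 2"),
    ("click back", "page 1"),
    ("click next", "page 2"),
]
clean = [
    ("click next", "page 2"),
    ("click submit", "confirmation"),
]
flags = (has_loop(looping), has_loop(clean))
```

This only catches exact repeats. Detours that visit distinct but irrelevant states would still need the committee filter or a similar semantic check.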

Core Entities

Models

Claude-3.5-sonnet, Gemini-1.5-pro, Codestral-22B, Codegemma-7B

Metrics

task success %, Accuracy, LLM judge fuzzy match

Datasets

SWE-bench, WebArena, OSWorld, Spider2-V

Benchmarks

SWE-bench, WebArena, OSWorld, Spider2-V