Synthesize agent–environment trajectories and rewrite tasks (backward construction) to adapt LLM agents without human labels

Overview

Decision Snapshot: Needs Validation

The pipeline shows consistent gains on four realistic benchmarks with multiple commercial and open models. Synthesis cost is the main trade-off. Results should carry over to similar setups, although no code or data is linked to confirm this.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.75

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, Sercan Ö. Arık

Why It Matters For Business

You can adapt LLM agents to specific apps without costly human labels; synthesizing and indexing environment-specific interactions boosts accuracy and reduces run-time planning costs.

Who Should Care

ML Engineer, Product Manager, Engineering Lead, CTO, Founder

Summary TLDR

Learn-by-interact is a data-first pipeline that makes LLM agents adapt to real apps without human labeling. It (1) uses documentation and self-instruct to generate tasks, (2) runs LLMs to create long interaction trajectories, (3) applies “backward construction” to turn sub-trajectories into aligned instructions, (4) filters with an LLM committee, and (5) uses observation- and model-based retrieval at inference. On four realistic benchmarks it raises ICL scores (e.g., Claude-3.5 from 12.4 to 22.5 on OSWorld) and gives up to +19.5 pp after fine-tuning (Codestral-22B on WebArena). The method trades synthesis cost for lower inference latency and higher task success.

Problem Statement

Realistic agent tasks need environment-specific examples, but human labeling of long, multi-step interactions is slow and costly. Off-the-shelf LLMs often produce trajectories that do not match the original instructions, so an automated pipeline is needed to produce many high-quality, aligned instruction–trajectory pairs for both in-context and training-based adaptation.

Main Contribution

A fully automated data pipeline that synthesizes environment-specific agent trajectories from documentation and LLM rollouts without human labels.

Backward construction: create new, aligned task instructions from sub-trajectories to fix misaligned or failed long runs and multiply usable examples.
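
Backward construction, as described above, can be illustrated with a minimal sketch. It assumes that every contiguous sub-trajectory is a candidate and that an LLM writes the new instruction; the `Step` and `Example` types, the `min_len` parameter, and `toy_summarize` (a rule-based stand-in for the LLM call) are hypothetical names chosen for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Step:
    observation: str
    action: str


@dataclass(frozen=True)
class Example:
    instruction: str
    steps: Tuple[Step, ...]


def backward_construct(
    trajectory: List[Step],
    summarize: Callable[[Tuple[Step, ...]], str],
    min_len: int = 1,
) -> List[Example]:
    """Relabel each contiguous sub-trajectory with an instruction derived from it."""
    examples = []
    n = len(trajectory)
    for start in range(n):
        for end in range(start + min_len, n + 1):
            sub = tuple(trajectory[start:end])
            # The instruction is written *after* the fact, so it matches
            # what the agent actually did, even if the original task failed.
            examples.append(Example(instruction=summarize(sub), steps=sub))
    return examples


def toy_summarize(sub: Tuple[Step, ...]) -> str:
    # Hypothetical stand-in for an LLM that describes the sub-trajectory.
    return "Task: " + " then ".join(step.action for step in sub)


rollout = [
    Step("home page", "click 'Settings'"),
    Step("settings page", "open 'Profile'"),
    Step("profile form", "type new name"),
]
examples = backward_construct(rollout, toy_summarize)
for ex in examples:
    print(len(ex.steps), ex.instruction)
```

A rollout of n steps yields n(n+1)/2 candidate examples under this assumption, which is how one long run can multiply into many aligned pairs before filtering.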

Key Findings

In-context learning (ICL) with synthesized data improves Claude-3.5-sonnet on OSWorld from 12.4 to 22.5.

Numbers: 12.4 → 22.5 (OSWorld, Claude ICL)

Practical Use: If you only have API access, pre-synthesizing environment-specific examples and using agentic retrieval can roughly double task success on some real-world environments.

Evidence Ref: Table 2

Fine-tuning on synthesized data can give very large gains; Codestral-22B on WebArena jumps from 4.7 to 24.2 after training.

Numbers: 4.7 → 24.2 (+19.5) WebArena

Practical Use: When you can update model weights, invest in synthesized data for environment-specific fine-tuning. In this case it gave a roughly fivefold improvement over the base model.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ICL task success (Claude-3.5-sonnet) on OSWorld	22.5	12.4	+10.1	OSWorld	Learn-by-interact ICL results	Table 2
ICL task success (Claude-3.5-sonnet) on WebArena	48.0	35.8	+12.2	WebArena	Learn-by-interact ICL results	Table 2
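
To put the reported scores on a common footing, the short calculation below derives the absolute delta and the relative gain for each baseline/value pair. The numbers come from the table above and from the Codestral-22B finding (Table 3); the `gain` helper and the dictionary labels are illustrative names only.

```python
# (baseline, value) pairs as reported in the summary.
reported = {
    "OSWorld, Claude-3.5-sonnet, ICL": (12.4, 22.5),
    "WebArena, Claude-3.5-sonnet, ICL": (35.8, 48.0),
    "WebArena, Codestral-22B, fine-tuned": (4.7, 24.2),
}


def gain(baseline: float, value: float) -> tuple:
    """Return (absolute delta in points, ratio of value to baseline)."""
    return round(value - baseline, 1), round(value / baseline, 2)


for name, (baseline, value) in reported.items():
    delta, ratio = gain(baseline, value)
    print(f"{name}: +{delta} points, {ratio}x baseline")
```

The ratios make clear that the same absolute gain means very different things depending on how low the baseline starts.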

What To Try In 7 Days

Generate a small seed of tasks from your product docs using self-instruct and an LLM.

Roll out the LLM against the live or simulated environment, then apply backward construction to convert sub-trajectories into aligned examples.

Implement observation+model-based retrieval and test ICL performance before committing to fine-tuning.
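
The retrieval step in the last item can be prototyped in a few lines. This sketch assumes that observation-based retrieval matches the current observation against stored observations, and that model-based retrieval lets an LLM write a query that is matched against stored instructions. Both readings are assumptions, as are the `jaccard` scorer (a stand-in for an embedding or BM25 ranker), the dictionary fields, and `toy_write_query` (a stand-in for the LLM).

```python
from typing import Callable, Dict, List

Example = Dict[str, str]


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def top_k(query: str, corpus: List[Example], field: str, k: int) -> List[Example]:
    """Rank stored examples by similarity between the query and one field."""
    ranked = sorted(corpus, key=lambda ex: jaccard(query, ex[field]), reverse=True)
    return ranked[:k]


def retrieve(
    instruction: str,
    observation: str,
    corpus: List[Example],
    write_query: Callable[[str, str], str],
    k: int = 1,
) -> List[Example]:
    """Merge observation-based and model-based hits, dropping duplicates."""
    by_observation = top_k(observation, corpus, "observation", k)
    by_model = top_k(write_query(instruction, observation), corpus, "instruction", k)
    merged: List[Example] = []
    for ex in by_observation + by_model:
        if ex not in merged:
            merged.append(ex)
    return merged


def toy_write_query(instruction: str, observation: str) -> str:
    # Hypothetical stand-in for an LLM that rewrites the task as a search query.
    return instruction


corpus = [
    {"instruction": "rename a file in the file manager",
     "observation": "file manager window with list of documents",
     "action": "right-click report.txt"},
    {"instruction": "change the desktop wallpaper",
     "observation": "settings window showing appearance tab",
     "action": "click 'Background'"},
    {"instruction": "export a spreadsheet as csv",
     "observation": "spreadsheet editor with open workbook",
     "action": "open File menu"},
]
hits = retrieve(
    instruction="export a spreadsheet from downloads as csv",
    observation="file manager window with list of downloads",
    corpus=corpus,
    write_query=toy_write_query,
)
for ex in hits:
    print(ex["instruction"], "->", ex["action"])
```

In this toy run the observation query surfaces the file-manager example and the model-written query surfaces the export example, so the prompt receives demonstrations for both where the agent is and what it is trying to do.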

Agent Features

Memory

short-term interaction history (H) kept per episode

Planning

multi-step interaction driven by LLM

Tool Use

observation-based retrieval, model-based retrieval, action prediction prompts

Frameworks

self-instruct, backward construction, agentic retrieval

Is Agentic

Yes

Architectures

LLM-driven single agent

Optimization Features

Token Efficiency

fewer LLM calls at inference compared to heavy planning baselines

Model Optimization

LoRA

System Optimization

precompute and index trajectory examples for fast retrieval

Training Optimization

filtering with LLM committee to raise label quality

Inference Optimization

reduce online planning by retrieving pre-synthesized examples
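
The LLM-committee filtering listed under Training Optimization can be sketched as a vote over independent judges. The summary does not state the voting threshold, so the unanimous default here is an assumption, and the three rule-based judges are hypothetical stand-ins for separate LLM calls.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Pair = Tuple[str, List[str]]  # (instruction, actions)
Judge = Callable[[str, Sequence[str]], bool]


def committee_accepts(
    instruction: str,
    actions: Sequence[str],
    judges: Sequence[Judge],
    min_votes: Optional[int] = None,
) -> bool:
    """Accept a pair when at least min_votes judges approve (default: all)."""
    if min_votes is None:
        min_votes = len(judges)  # assumption: unanimous agreement required
    votes = sum(1 for judge in judges if judge(instruction, actions))
    return votes >= min_votes


def filter_pairs(pairs: List[Pair], judges: Sequence[Judge]) -> List[Pair]:
    return [p for p in pairs if committee_accepts(p[0], p[1], judges)]


# Hypothetical rule-based judges; in practice each would be an LLM prompt.
def judge_nonempty(instruction: str, actions: Sequence[str]) -> bool:
    return len(actions) > 0


def judge_concise(instruction: str, actions: Sequence[str]) -> bool:
    return len(actions) <= 5


def judge_on_topic(instruction: str, actions: Sequence[str]) -> bool:
    joined = " ".join(actions).lower()
    keywords = [w for w in instruction.lower().split() if len(w) > 3]
    return any(w in joined for w in keywords)


judges = [judge_nonempty, judge_concise, judge_on_topic]
pairs = [
    ("open the profile page", ["click 'Settings'", "open 'Profile'"]),
    ("open the profile page", ["click 'Help'", "click 'About'"]),
]
kept = filter_pairs(pairs, judges)
print(kept)
```

Lowering `min_votes` trades label quality for yield, which matters because every judge adds to the upfront synthesis cost.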

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Data synthesis and LLM committee filtering require many LLM calls and compute.

Method relies on availability and coverage of documentation or related resources for task generation.

When Not To Use

When you cannot afford large upfront LLM generation and filtering cost.

When environment documentation is missing or misleading.

Failure Modes

LLM rollouts choose incorrect actions, producing misaligned trajectories before backward construction.

Trajectories that loop or detour lead to poor examples if filtering misses them.
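
As a cheap guard against the looping failure mode, a rule-based check can run before any LLM-based filtering. The heuristic below flags a trajectory when the same (observation, action) pair recurs more than a set number of times. It is an illustrative assumption, not a step described in the source, and `max_repeats` is a hypothetical parameter.

```python
from collections import Counter
from typing import List, Tuple


def has_loop(steps: List[Tuple[str, str]], max_repeats: int = 1) -> bool:
    """Return True if any (observation, action) pair occurs more than max_repeats times."""
    counts = Counter(steps)
    return any(count > max_repeats for count in counts.values())


clean = [("home page", "click 'Settings'"), ("settings page", "open 'Profile'")]
looping = [
    ("home page", "click 'Help'"),
    ("help page", "click 'Back'"),
    ("home page", "click 'Help'"),
]
print(has_loop(clean), has_loop(looping))
```

Repeating an action in a different state is not flagged, so legitimate retries after the environment changes still pass.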

Core Entities

Models

Claude-3.5-sonnet, Gemini-1.5-pro, Codestral-22B, Codegemma-7B

Metrics

task success %, Accuracy, LLM judge fuzzy match

Datasets

SWE-bench, WebArena, OSWorld, Spider2-V

Benchmarks

SWE-bench, WebArena, OSWorld, Spider2-V
