Synthesize agent–environment trajectories and rewrite tasks (backward construction) to adapt LLM agents without human labels

January 18, 2025 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

2

Authors

Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, Sercan Ö. Arık

Why It Matters For Business

You can adapt LLM agents to specific apps without costly human labels; synthesizing and indexing environment-specific interactions boosts accuracy and reduces run-time planning costs.

Summary TLDR

Learn-by-interact is a data-first pipeline that adapts LLM agents to real apps without human labeling. It (1) generates tasks from documentation via self-instruct, (2) runs LLMs against the environment to create long interaction trajectories, (3) applies “backward construction” to turn sub-trajectories into aligned instructions, (4) filters with an LLM committee, and (5) uses observation- and model-based retrieval at inference. On four realistic benchmarks it raises ICL scores (e.g., Claude-3.5-sonnet from 12.4 to 22.5 on OSWorld) and gives up to +19.5 pp after fine-tuning (Codestral-22B on WebArena). The method trades upfront synthesis cost for lower inference latency and higher task success.

Problem Statement

Realistic agent tasks need environment-specific examples, but human labeling of long, multi-step interactions is slow and costly. Off-the-shelf LLMs often produce trajectories that do not match the original instructions, so an automated pipeline is needed to produce many high-quality, aligned instruction–trajectory pairs for both in-context and training-based adaptation.

Main Contribution

A fully automated data pipeline that synthesizes environment-specific agent trajectories from documentation and LLM rollouts without human labels.

Backward construction: create new, aligned task instructions from sub-trajectories to fix misaligned or failed long runs and multiply usable examples.
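
The idea of backward construction can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Step` representation, the function names, and the `stub_writer` (a placeholder for an LLM call that writes an instruction describing what a span of steps actually did) are all assumptions.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: one step is an (action, observation) pair.
Step = Tuple[str, str]


def backward_construct(
    trajectory: List[Step],
    write_instruction: Callable[[List[Step]], str],
    min_len: int = 1,
) -> List[Tuple[str, List[Step]]]:
    """Pair every contiguous sub-trajectory with a newly written instruction."""
    examples = []
    n = len(trajectory)
    for start in range(n):
        for end in range(start + min_len, n + 1):
            sub = trajectory[start:end]
            # The instruction is derived from the steps, so it matches them
            # even when the original task was not completed.
            examples.append((write_instruction(sub), sub))
    return examples


def stub_writer(sub: List[Step]) -> str:
    # Placeholder for an LLM call that summarizes what the steps accomplished.
    return "Task: " + " then ".join(action for action, _ in sub)


traj = [
    ("open settings", "settings page"),
    ("click profile", "profile form"),
    ("save", "saved banner"),
]
examples = backward_construct(traj, stub_writer)
```

A trajectory of n steps yields n(n+1)/2 candidate pairs when `min_len` is 1, which is how a single rollout can multiply into many examples before filtering.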

Agentic retrieval for ICL: combine observation-based and model-based retrieval to pick relevant trajectory examples at each step.
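
A rough sketch of combining the two retrieval modes at a single step is shown below. It assumes a simple token-overlap scorer in place of whatever retriever the authors used, and `write_queries` is a hypothetical stand-in for an LLM that writes search queries from the instruction, history, and current observation.

```python
from typing import Callable, Dict, List


def tokenize(text: str) -> set:
    return set(text.lower().split())


def overlap_search(query: str, examples: List[Dict[str, str]], k: int) -> List[int]:
    """Rank examples by Jaccard overlap with the query; return up to k indices."""
    q = tokenize(query)
    scored = []
    for i, ex in enumerate(examples):
        t = tokenize(ex["text"])
        union = q | t
        score = len(q & t) / len(union) if union else 0.0
        scored.append((score, i))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in scored[:k] if score > 0]


def agentic_retrieve(
    instruction: str,
    history: List[str],
    observation: str,
    examples: List[Dict[str, str]],
    write_queries: Callable[[str, List[str], str], List[str]],
    k: int = 2,
) -> List[Dict[str, str]]:
    # Observation-based: the current observation itself is the query.
    picked = overlap_search(observation, examples, k)
    # Model-based: a model (stubbed here) writes queries, then we search again.
    for query in write_queries(instruction, history, observation):
        for i in overlap_search(query, examples, k):
            if i not in picked:
                picked.append(i)
    return [examples[i] for i in picked]


examples = [
    {"text": "login page username password"},
    {"text": "export table csv download"},
    {"text": "profile settings avatar upload"},
]
found = agentic_retrieve(
    "Export the table after logging in",
    [],
    "login page with username field",
    examples,
    write_queries=lambda instr, hist, obs: ["download csv export"],
)
```

Here the observation surfaces the login example and the model-written query surfaces the export example, so the prompt for the next action would carry both.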

Extensive evaluation showing consistent gains across four realistic agent benchmarks and analysis on data granularity, retrieval, scaling, and inference efficiency.

Key Findings

In-context learning (ICL) with synthesized data improves Claude-3.5-sonnet on OSWorld from 12.4 to 22.5.

Numbers: 12.4 → 22.5 (OSWorld, Claude ICL)

Fine-tuning on synthesized data can give very large gains; Codestral-22B on WebArena jumps from 4.7 to 24.2 after training.

Numbers: 4.7 → 24.2 (+19.5) WebArena

Backward construction meaningfully raises data quality and training utility, providing up to ~14% extra improvement in some training settings.

Numbers: up to +14.0% (training)

Learn-by-interact reduces inference work compared to heavy planning methods: LATS uses ~4× more tokens per instance, while Learn-by-interact needs fewer LLM calls and only slightly more tokens than the baseline.

Numbers: LATS ≈ 4× tokens; Learn-by-interact fewer calls (avg across 4 benchmarks)

Results

ICL task success (Claude-3.5-sonnet) on OSWorld

Value: 22.5

Baseline: 12.4

ICL task success (Claude-3.5-sonnet) on WebArena

Value: 48.0

Baseline: 35.8

Fine-tuned task success (Codestral-22B) on WebArena

Value: 24.2

Baseline: 4.7
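
For a quick comparison, the gains implied by the Results entries above can be computed directly. The numbers are copied from this page; the `gains` helper and the relative-gain view are added here for illustration only.

```python
# (baseline, value) pairs copied from the Results entries above.
results = {
    "OSWorld, ICL, Claude-3.5-sonnet": (12.4, 22.5),
    "WebArena, ICL, Claude-3.5-sonnet": (35.8, 48.0),
    "WebArena, fine-tuned, Codestral-22B": (4.7, 24.2),
}


def gains(baseline: float, value: float):
    """Return (absolute gain in points, relative gain in percent)."""
    diff = value - baseline
    return round(diff, 1), round(100 * diff / baseline, 1)


for name, (baseline, value) in results.items():
    absolute, relative = gains(baseline, value)
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

The fine-tuning result has the largest absolute gain (19.5 points) and, because its baseline is so low, a far larger relative gain than the ICL results.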

Who Should Care

What To Try In 7 Days

Generate a small seed of tasks from your product docs using self-instruct and an LLM.

Roll out the LLM against the live or simulated environment, then apply backward construction to convert sub-trajectories into aligned examples.

Implement observation+model-based retrieval and test ICL performance before committing to fine-tuning.
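
The rollout step in the plan above might look like the loop below. `ToyEnv` is a hypothetical stand-in for a real or simulated app, and `policy` stands in for an LLM that predicts the next action; neither comes from the paper.

```python
from typing import Callable, List, Tuple


class ToyEnv:
    """Hypothetical environment: a counter that must reach a target value."""

    def __init__(self, target: int = 3):
        self.target = target
        self.value = 0

    def reset(self) -> str:
        self.value = 0
        return f"value={self.value}"

    def step(self, action: str) -> Tuple[str, bool]:
        if action == "increment":
            self.value += 1
        return f"value={self.value}", self.value >= self.target


def rollout(
    env: ToyEnv,
    policy: Callable[[str, List[Tuple[str, str]], str], str],
    instruction: str,
    max_steps: int = 10,
) -> Tuple[List[Tuple[str, str]], str]:
    """Collect (observation, action) pairs until done or max_steps is hit."""
    history: List[Tuple[str, str]] = []
    observation = env.reset()
    for _ in range(max_steps):
        action = policy(instruction, history, observation)
        next_observation, done = env.step(action)
        history.append((observation, action))
        observation = next_observation
        if done:
            break
    return history, observation


def always_increment(instruction, history, observation):
    # Placeholder for an LLM call that chooses the next action.
    return "increment"


trajectory, final_observation = rollout(ToyEnv(), always_increment, "Reach 3")
```

The returned trajectory is the raw material for backward construction, so it is worth logging every observation even for runs that fail the original task.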

Agent Features

Memory

  • short-term interaction history (H) kept per episode

Planning

  • multi-step interaction driven by LLM

Tool Use

  • observation-based retrieval
  • model-based retrieval
  • action prediction prompts

Frameworks

  • self-instruct
  • backward construction
  • agentic retrieval

Is Agentic

true

Architectures

  • LLM-driven single agent

Optimization Features

Token Efficiency

  • fewer LLM calls at inference compared to heavy planning baselines

Model Optimization

  • LoRA

System Optimization

  • precompute and index trajectory examples for fast retrieval
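
One way to precompute such an index is an inverted token index built offline, as sketched here. This is an assumption for illustration: the source does not say which index structure is used, and an embedding index could fill the same role.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(examples: List[str]) -> Dict[str, Set[int]]:
    """Offline step: map each token to the ids of examples containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for i, text in enumerate(examples):
        for token in set(text.lower().split()):
            index[token].add(i)
    return index


def lookup(index: Dict[str, Set[int]], query: str, k: int = 3) -> List[int]:
    """Online step: rank example ids by how many query tokens they share."""
    counts: Dict[int, int] = defaultdict(int)
    for token in set(query.lower().split()):
        for i in index.get(token, ()):
            counts[i] += 1
    return sorted(counts, key=lambda i: (-counts[i], i))[:k]


examples = ["open file menu", "save file as pdf", "close window"]
index = build_index(examples)
hits = lookup(index, "save the file")
```

At inference only `lookup` runs, and it touches just the postings for the query tokens rather than scanning every stored example.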

Training Optimization

  • filtering with LLM committee to raise label quality
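
Committee filtering can be sketched as a vote over independent judges. The judges below are trivial rule-based placeholders for LLM calls, and the voting threshold (majority by default, unanimous as an option) is an assumption; the source does not specify the rule.

```python
from typing import Callable, Dict, List, Optional

Example = Dict[str, object]


def committee_filter(
    examples: List[Example],
    judges: List[Callable[[Example], bool]],
    min_votes: Optional[int] = None,
) -> List[Example]:
    """Keep examples that at least min_votes judges accept (default: majority)."""
    if min_votes is None:
        min_votes = len(judges) // 2 + 1
    kept = []
    for ex in examples:
        votes = sum(1 for judge in judges if judge(ex))
        if votes >= min_votes:
            kept.append(ex)
    return kept


# Placeholder judges; in practice each would be a separate LLM prompt or model.
judges = [
    lambda ex: bool(str(ex["instruction"]).strip()),
    lambda ex: len(ex["actions"]) > 0,
    lambda ex: len(ex["actions"]) <= 20,
]

candidates = [
    {"instruction": "Rename the file", "actions": ["right-click", "rename", "type"]},
    {"instruction": "", "actions": ["click"]},
    {"instruction": "", "actions": []},
]
majority = committee_filter(candidates, judges)
unanimous = committee_filter(candidates, judges, min_votes=len(judges))
```

A stricter threshold trades recall for precision: the unanimous setting keeps only the first candidate, while the majority setting also keeps the one with an empty instruction.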

Inference Optimization

  • reduce online planning by retrieving pre-synthesized examples

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Data synthesis and LLM committee filtering require many LLM calls and compute.
  • Method relies on availability and coverage of documentation or related resources for task generation.
  • Generated trajectories can still contain wrong actions; filtering mitigates but may not remove all noise.

When Not To Use

  • When you cannot afford large upfront LLM generation and filtering cost.
  • When environment documentation is missing or misleading.
  • When real interactions are extremely costly or dangerous (e.g., physical robots) without reliable simulators.

Failure Modes

  • LLM rollouts choose incorrect actions, producing misaligned trajectories before backward construction.
  • Trajectories that loop or detour lead to poor examples if filtering misses them.
  • Retrieval mismatch: retrieved example states may appear similar but require different subsequent actions.
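
As a cheap guard against the looping failure mode, a pre-filter can flag trajectories that repeat the same (observation, action) pair. This heuristic is not described in the source; it is one possible check to run before the LLM committee, and the names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def has_loop(trajectory: List[Tuple[str, str]], max_repeats: int = 1) -> bool:
    """Return True if any (observation, action) pair occurs more than max_repeats times."""
    seen: Counter = Counter()
    for pair in trajectory:
        seen[pair] += 1
        if seen[pair] > max_repeats:
            return True
    return False


looping = [
    ("home", "click menu"),
    ("menu", "click back"),
    ("home", "click menu"),
    ("menu", "click back"),
]
clean = [("home", "click menu"), ("menu", "click export")]

flags = (has_loop(looping), has_loop(clean))
```

Exact-match comparison only catches literal repeats; observations that differ in timestamps or other incidental details would need normalizing first.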

Core Entities

Models

  • Claude-3.5-sonnet
  • Gemini-1.5-pro
  • Codestral-22B
  • Codegemma-7B

Metrics

  • task success %
  • Accuracy
  • LLM judge fuzzy match

Datasets

  • SWE-bench
  • WebArena
  • OSWorld
  • Spider2-V

Benchmarks

  • SWE-bench
  • WebArena
  • OSWorld
  • Spider2-V