Improve LLM agents by having them learn from their own failed runs: collect failure trajectories, build failure-vs-success pairs, and fine-tune with DPO.

March 4, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

ETO uses simple offline preference fine-tuning, so it is practical and low-risk for improving SFT agents; it requires curated contrastive data and monitoring to avoid overfitting.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, Bill Yuchen Lin

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you run LLM agents in interactive tasks, adding a cheap offline loop that learns from the agent's failure cases can boost task reward and robustness to unseen variations without heavy online RL.

Who Should Care

Summary TLDR

ETO is a simple, iterative training loop for LLM agents: start from a behavioral-cloning (SFT) agent, let it explore to collect failed trajectories, pair those failures with expert successes, then fine-tune the model with a preference-based loss (DPO). Across three agent benchmarks (WebShop, ScienceWorld, ALFWorld) ETO consistently improves average reward and sample efficiency versus SFT and other baselines. ETO helps generalization to unseen task variants but needs a supply of diverse contrastive pairs and can overfit if iterated too many times.

Problem Statement

Open LLMs fine-tuned only on expert demonstrations (behavioral cloning) often fail to explore and generalize. The paper asks: can we improve agent policies by learning from the agent's own failed trajectories via contrastive preference learning?

Main Contribution

ETO algorithm: an iterative exploration + training loop that collects failure trajectories and learns from failure-success trajectory pairs using DPO preference loss.

Large-scale evaluation on three interactive agent datasets (WebShop, ScienceWorld, ALFWorld) showing consistent gains over SFT and strong baselines.
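The exploration-plus-training loop described above can be outlined as a small driver function. This is only a sketch: `eto_loop`, `explore`, and `train_dpo` are hypothetical names, the two callables stand in for your own environment rollout and DPO training code, and the default of two iterations is an assumption, not a value taken from the paper.

```python
from typing import Callable, List, Tuple

Trajectory = List[str]                # hypothetical: a trajectory as a list of action strings
Pair = Tuple[Trajectory, Trajectory]  # (success trajectory, failure trajectory)


def eto_loop(
    policy: object,
    explore: Callable[[object], List[Pair]],
    train_dpo: Callable[[object, List[Pair]], object],
    iterations: int = 2,
) -> object:
    """Alternate exploration and preference training for a fixed number of rounds."""
    for _ in range(iterations):
        # Roll out the current policy and pair its failures with expert successes.
        pairs = explore(policy)
        if not pairs:
            # No contrastive pairs left to learn from, so stop early.
            break
        # Offline preference update on the collected pairs.
        policy = train_dpo(policy, pairs)
    return policy
```

Passing the rollout and training steps in as callables keeps the loop independent of any particular environment or training library.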

Key Findings

ETO improves average reward over SFT across agent benchmarks.

Numbers: WebShop 63.1 → 67.4 avg reward (Table 2)

Practical Use: If you already SFT an LLM agent, adding ETO can raise task rewards by several absolute points with modest extra tuning.

Evidence Ref: Table 2

ETO substantially boosts out-of-distribution performance on ScienceWorld.

Numbers: ScienceWorld (Unseen) 53.0 → 65.0 avg reward (+12.0 abs, ~22.6% rel)

Practical Use: When you need better generalization to unseen variations, learning from failures yields large gains versus pure imitation.

Evidence Ref: Table 2; paper text reports ~22%

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Average Reward (WebShop) | SFT 63.1 → ETO 67.4 | SFT | +4.3 | WebShop test | Table 2; WebShop avg reward values | Table 2
Average Reward (ScienceWorld Seen) | SFT 67.4 → ETO 73.8 | SFT | +6.4 | ScienceWorld Seen | Table 2; ScienceWorld seen numbers | Table 2
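The deltas reported in this summary are plain absolute and relative differences between the SFT and ETO average rewards. A minimal sketch of that arithmetic, using only numbers already quoted in this summary (the helper name `deltas` is illustrative):

```python
from typing import Tuple


def deltas(baseline: float, improved: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain in percent) of improved over baseline."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# (name, SFT avg reward, ETO avg reward), as quoted in this summary.
rows = [
    ("WebShop", 63.1, 67.4),
    ("ScienceWorld Seen", 67.4, 73.8),
    ("ScienceWorld Unseen", 53.0, 65.0),
]
for name, sft, eto in rows:
    abs_gain, rel_gain = deltas(sft, eto)
    print(f"{name}: +{abs_gain:.1f} abs, {rel_gain:.1f}% rel")
```

The relative gain is measured against the SFT baseline, which is why the 12.0-point ScienceWorld Unseen gain corresponds to roughly 22.6%.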

What To Try In 7 Days

Train a base SFT agent from your expert traces.

Run the agent on training tasks and log failed trajectories.

Construct failure-vs-success trajectory pairs and fine-tune using DPO for 1–2 iterations and validate on held-out variants.
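The pairing step in the list above can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the `Trajectory` record, the `build_pairs` helper, and the `margin` threshold are all hypothetical, and each task is assumed to have at most one expert trace keyed by `task_id`.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Trajectory:
    task_id: str
    actions: List[str]
    reward: float  # final task reward, higher is better


def build_pairs(
    expert: Dict[str, Trajectory],
    explored: List[Trajectory],
    margin: float = 0.0,
) -> List[Tuple[Trajectory, Trajectory]]:
    """Return (chosen, rejected) pairs where the expert trace beats the agent's attempt."""
    pairs = []
    for attempt in explored:
        reference = expert.get(attempt.task_id)
        if reference is None:
            continue  # no expert trace for this task, so no contrast is available
        # Keep the pair only if the expert reward exceeds the attempt by more than margin.
        if reference.reward - attempt.reward > margin:
            pairs.append((reference, attempt))
    return pairs
```

Each resulting pair puts the expert trajectory in the preferred slot and the agent's lower-reward attempt in the rejected slot, which is the input shape a DPO-style trainer expects. Attempts that match or beat the expert are skipped because they carry no preference signal.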

Agent Features

Memory: Short-term trajectory context in prompts
Planning: ReAct-style planning with CoT (reasoning before actions)
Tool Use: Environment APIs (WebShop/ScienceWorld/ALFWorld)
Frameworks: LoRA
Is Agentic: Yes
Architectures: LLM-based policy (auto-regressive)

Optimization Features

Infra Optimization: Experiments run on 8× A100 80GB GPUs
Training Optimization: DPO preference loss (offline) instead of online RL
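For reference, the standard DPO objective for a single (chosen, rejected) pair can be written in a few lines. This sketch assumes you already have the summed log-probabilities of each trajectory under the trained policy and under a frozen reference model; the function name and the default `beta=0.1` are assumptions, not values confirmed by this summary.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    # Log-ratio of policy to reference for each trajectory.
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(logits)) computed as softplus(-logits) for numerical stability.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

When the policy equals the reference, the loss is log 2. It falls as the policy assigns relatively more probability to the chosen trajectory than the reference does.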

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

WebShop, ScienceWorld, ALFWorld (public datasets referenced in paper)

Risks & Boundaries

Limitations

ETO assumes reward differences suffice to mark entire trajectories as good/bad; action-wise blame is not identified by default.

Contrastive data diversity is limited by fixed expert traces and exploration scope, causing overfitting after several iterations.

When Not To Use

You lack any reliable reward signal or only have coarse binary rewards and cannot generate diverse contrasts.

You cannot provide at least a bootstrap policy (SFT or strong prior) before exploration.

Failure Modes

Overfitting to limited failure-success pairs across iterations, causing performance decline after 2–3 rounds.

Step-wise contrastive learning can be unstable because final reward may poorly reflect single action quality.
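One way to guard against the overfitting failure mode is to score each iteration's checkpoint on held-out tasks and keep the best one. The sketch below is an assumption about how such monitoring could look, not a procedure from the paper; `best_iteration` and `patience` are hypothetical names, and the rewards in the example are made up.

```python
from typing import List


def best_iteration(val_rewards: List[float], patience: int = 1) -> int:
    """Return the index of the checkpoint to keep.

    val_rewards[0] is the SFT baseline; val_rewards[i] is the held-out
    reward after i ETO iterations. Scanning stops after `patience`
    consecutive iterations without improvement.
    """
    best_idx = 0
    stale = 0
    for i in range(1, len(val_rewards)):
        if val_rewards[i] > val_rewards[best_idx]:
            best_idx, stale = i, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_idx


# Made-up held-out rewards: improvement for two rounds, then a decline.
keep = best_iteration([60.0, 64.0, 66.0, 65.0])
```

With `patience=1` the scan stops at the first iteration that fails to improve; a larger value tolerates a noisy dip before giving up.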

Core Entities

Models

Llama-2-7B-Chat, Llama-2-13B-Chat, Mistral-7B, GPT-4, GPT-3.5-Turbo

Metrics

Average Reward, Success Rate

Datasets

WebShop, ScienceWorld, ALFWorld

Benchmarks

ScienceWorld-Unseen, WebShop (standard split), ALFWorld (seen/unseen)