Teach LLM agents by learning from their failed runs: collect failure trajectories, make failure-vs-success pairs, and fine-tune via DPO.

March 4, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, Bill Yuchen Lin

Links

Abstract / PDF

Why It Matters For Business

If you run LLM agents in interactive tasks, adding a cheap offline loop that learns from the agent's failure cases can boost task reward and robustness to unseen variations without heavy online RL.

Summary TLDR

ETO is a simple, iterative training loop for LLM agents: start from a behavioral-cloning (SFT) agent, let it explore to collect failed trajectories, pair those failures with expert successes, then fine-tune the model with a preference-based loss (DPO). Across three agent benchmarks (WebShop, ScienceWorld, ALFWorld) ETO consistently improves average reward and sample efficiency versus SFT and other baselines. ETO helps generalization to unseen task variants but needs a supply of diverse contrastive pairs and can overfit if iterated too many times.
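
The loop described above can be sketched as a small driver function. This is a minimal illustration, not the paper's implementation: `explore`, `build_pairs`, and `dpo_update` are hypothetical callables you would supply, and the default of two iterations is an assumption based on the overfitting note in the summary.

```python
from typing import Any, Callable, List


def eto_loop(
    policy: Any,
    explore: Callable[[Any], List[Any]],        # hypothetical: roll out the policy, return trajectories
    build_pairs: Callable[[List[Any]], List[Any]],  # hypothetical: make (success, failure) pairs
    dpo_update: Callable[[Any, List[Any]], Any],    # hypothetical: preference fine-tuning step
    iterations: int = 2,                        # assumed default, not a value from the paper
) -> Any:
    """Run explore -> pair -> preference-update rounds, starting from an SFT policy."""
    for _ in range(iterations):
        rollouts = explore(policy)          # agent acts in the environment
        pairs = build_pairs(rollouts)       # contrast failures with expert successes
        if not pairs:                       # nothing left to learn from
            break
        policy = dpo_update(policy, pairs)  # offline update, no online RL
    return policy
```

Passing the three stages as callables keeps the sketch independent of any particular environment or training library.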

Problem Statement

Open LLMs fine-tuned only on expert demonstrations (behavioral cloning) often fail to explore and generalize. The paper asks: can we improve agent policies by learning from the agent's own failed trajectories via contrastive preference learning?

Main Contribution

ETO algorithm: an iterative exploration + training loop that collects failure trajectories and learns from failure-success trajectory pairs using DPO preference loss.

Large-scale evaluation on three interactive agent datasets (WebShop, ScienceWorld, ALFWorld) showing consistent gains over SFT and strong baselines.

Analysis: ETO improves out-of-distribution generalization and action efficiency, but benefits plateau or decline after a few iterations and can fail without initial SFT.

Key Findings

ETO improves average reward over SFT across agent benchmarks.

Numbers: WebShop: 63.1 → 67.4 avg reward (Table 2)

ETO substantially boosts out-of-distribution performance on ScienceWorld.

Numbers: ScienceWorld (Unseen): 53.0 → 65.0 avg reward (+12.0 abs, ~22.6% rel)

ETO improves action efficiency: reaches higher rewards in fewer steps.

Numbers: Case plots show ETO reaches full score earlier than SFT on ScienceWorld examples (Figure 3)

ETO fails without an initial SFT stage; when expert trajectories are unavailable, running RFT first and then ETO works best.

Numbers: Self-play: untuned Llama + ETO = 12.5 vs untuned Llama + RFT = 48.4; RFT+ETO = 51.2 (Table 5)

Multiple ETO iterations can help at first but then overfit.

Numbers: Performance improves over the first 1–2 iterations, then declines after the 3rd (Figure 4)

Results

Average Reward (WebShop)

Value: SFT 63.1 → ETO 67.4

Baseline: SFT

Average Reward (ScienceWorld Seen)

Value: SFT 67.4 → ETO 73.8

Baseline: SFT

Average Reward (ScienceWorld Unseen)

Value: SFT 53.0 → ETO 65.0

Baseline: SFT

Average Reward (ALFWorld Seen)

Value: SFT 60.0 → ETO 68.6

Baseline: SFT

Self-play without BC (WebShop)

Value: Untuned Llama + ETO 12.5; RFT 48.4; RFT+ETO 51.2

Baseline: Untuned Llama
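
To compare the SFT → ETO results on a common footing, the absolute and relative gains can be recomputed from the reported rewards. The short script below is only a convenience sketch; the helper name `gains` is an assumption, and the input numbers are the ones listed in this section.

```python
def gains(baseline: float, improved: float) -> tuple:
    """Return (absolute gain, relative gain in percent) of improved over baseline."""
    absolute = improved - baseline
    relative_pct = 100.0 * absolute / baseline
    return absolute, relative_pct


# (SFT reward, ETO reward) as reported in the Results section
results = {
    "WebShop": (63.1, 67.4),
    "ScienceWorld Seen": (67.4, 73.8),
    "ScienceWorld Unseen": (53.0, 65.0),
    "ALFWorld Seen": (60.0, 68.6),
}

for name, (sft, eto) in results.items():
    abs_gain, rel_gain = gains(sft, eto)
    print(f"{name}: +{abs_gain:.1f} abs, {rel_gain:.1f}% rel")
```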

Who Should Care

What To Try In 7 Days

Train a base SFT agent from your expert traces.

Run the agent on training tasks and log failed trajectories.

Construct failure-vs-success trajectory pairs and fine-tune using DPO for 1–2 iterations and validate on held-out variants.
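
The pairing step might look like the sketch below. It assumes one expert trajectory per task and treats an agent rollout as a failure whenever its reward is lower than the expert's; the `Trajectory` class, its fields, and `build_pairs` are hypothetical names introduced for illustration, not the paper's code.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Trajectory:
    task_id: str        # hypothetical identifier for the task instance
    steps: List[str]    # thought/action/observation text, in order
    reward: float       # final reward assigned by the environment


def build_pairs(
    expert: Dict[str, Trajectory],
    rollouts: List[Trajectory],
) -> List[Tuple[Trajectory, Trajectory]]:
    """Return (preferred, rejected) pairs for tasks where the agent scored below the expert."""
    pairs = []
    for rollout in rollouts:
        reference = expert.get(rollout.task_id)
        # Assumption: a rollout counts as a failure if it earns less reward than the expert.
        if reference is not None and rollout.reward < reference.reward:
            pairs.append((reference, rollout))
    return pairs
```

Rollouts that match or beat the expert reward produce no pair here, which is one reason the supply of contrastive data shrinks as the agent improves.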

Agent Features

Memory

  • Short-term trajectory context in prompts

Planning

  • ReAct-style planning with CoT (reasoning before actions)

Tool Use

  • Environment APIs (WebShop/ScienceWorld/ALFWorld)

Frameworks

  • LoRA

Is Agentic

true

Architectures

  • LLM-based policy (auto-regressive)

Optimization Features

Infra Optimization

  • Experiments run on 8× A100 80GB GPUs

Training Optimization

  • DPO preference loss (offline) instead of online RL
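
For reference, the standard DPO loss on a single preference pair can be written in a few lines of plain Python. This is a sketch of the general DPO objective, not the paper's training code: the inputs are assumed to be summed log-probabilities of each full trajectory under the policy and under a frozen reference model, and `beta=0.1` is an assumed default rather than a value taken from the paper.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    pi_chosen: float,     # log-prob of the preferred trajectory under the policy
    pi_rejected: float,   # log-prob of the rejected trajectory under the policy
    ref_chosen: float,    # log-prob of the preferred trajectory under the reference
    ref_rejected: float,  # log-prob of the rejected trajectory under the reference
    beta: float = 0.1,    # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for one (preferred, rejected) pair."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)
```

When the policy equals the reference the margin is zero and the loss is log 2; the loss falls as the policy shifts probability toward the preferred trajectory relative to the reference.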

Reproducibility

Data Urls

  • WebShop, ScienceWorld, ALFWorld (public datasets referenced in paper)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • ETO assumes reward differences suffice to mark entire trajectories as good/bad; action-wise blame is not identified by default.
  • Contrastive data diversity is limited by fixed expert traces and exploration scope, causing overfitting after several iterations.
  • Coarse / binary rewards (e.g., ALFWorld) limit how much useful contrastive signal ETO can extract.
  • ETO without an initial SFT base fails in experiments; it is not a drop-in replacement for expert demonstrations.

When Not To Use

  • You lack a reliable reward signal, or you have only coarse binary rewards and cannot generate diverse contrastive pairs.
  • You cannot provide at least a bootstrap policy (SFT or strong prior) before exploration.
  • You need a single-shot universal policy for many unrelated tasks without per-task contrastive data.

Failure Modes

  • Overfitting to limited failure-success pairs across iterations, causing performance decline after 2–3 rounds.
  • Step-wise contrastive learning can be unstable because the final reward may poorly reflect the quality of individual actions.
  • Applying ETO to an untuned base model (no SFT) can collapse performance (observed in self-play experiments).
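
One way to guard against the overfitting failure mode is to evaluate every iteration on held-out tasks and keep the best checkpoint. The helper below is a minimal sketch of that idea; `best_iteration` and its `patience` parameter are hypothetical, and index 0 is assumed to be the SFT base before any ETO round.

```python
from typing import Sequence


def best_iteration(val_rewards: Sequence[float], patience: int = 1) -> int:
    """Return the index of the checkpoint to keep.

    Scanning stops after `patience` consecutive rounds without improvement
    over the best validation reward seen so far.
    """
    if not val_rewards:
        raise ValueError("val_rewards must contain at least one value")
    best_idx, best = 0, val_rewards[0]
    stale = 0
    for i in range(1, len(val_rewards)):
        if val_rewards[i] > best:
            best_idx, best, stale = i, val_rewards[i], 0
        else:
            stale += 1
            if stale >= patience:
                break  # validation reward stopped improving
    return best_idx
```

With a hypothetical validation curve of `[60.0, 64.0, 66.0, 63.0]`, the function keeps the checkpoint at index 2 and would not train a further round.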

Core Entities

Models

  • Llama-2-7B-Chat
  • Llama-2-13B-Chat
  • Mistral-7B
  • GPT-4
  • GPT-3.5-Turbo

Metrics

  • Average Reward
  • Success Rate

Datasets

  • WebShop
  • ScienceWorld
  • ALFWorld

Benchmarks

  • ScienceWorld-Unseen
  • WebShop (standard split)
  • ALFWorld (seen/unseen)