Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
1
Why It Matters For Business
If you run LLM agents in interactive tasks, adding a cheap offline loop that learns from the agent's failure cases can boost task reward and robustness to unseen variations without heavy online RL.
Summary TLDR
ETO (Exploration-based Trajectory Optimization) is a simple, iterative training loop for LLM agents: start from a behavioral-cloning (SFT) agent, let it explore to collect failed trajectories, pair those failures with expert successes, then fine-tune the model with a preference-based loss (DPO). Across three agent benchmarks (WebShop, ScienceWorld, ALFWorld), ETO consistently improves average reward and sample efficiency over SFT and other baselines. ETO also helps generalization to unseen task variants, but it needs a supply of diverse contrastive pairs and can overfit if iterated too many times.
Problem Statement
Open LLMs fine-tuned only on expert demonstrations (behavioral cloning) often fail to explore and generalize. The paper asks: can we improve agent policies by learning from the agent's own failed trajectories via contrastive preference learning?
Main Contribution
ETO algorithm: an iterative exploration + training loop that collects failure trajectories and learns from failure-success trajectory pairs using DPO preference loss.
Large-scale evaluation on three interactive agent datasets (WebShop, ScienceWorld, ALFWorld) showing consistent gains over SFT and strong baselines.
Analysis: ETO improves out-of-distribution generalization and action efficiency, but benefits plateau or decline after a few iterations and can fail without initial SFT.
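The alternation between exploration and training described above can be sketched as a small driver loop. This is only an illustrative skeleton, not the paper's code: `eto_loop`, `collect_pairs`, `train_dpo`, and `max_iterations` are hypothetical names, and the policy is treated as an opaque object that the two injected callables know how to handle.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical type: one contrastive pair as (preferred, rejected) trajectories.
Pair = Tuple[Any, Any]


def eto_loop(
    policy: Any,
    collect_pairs: Callable[[Any], List[Pair]],
    train_dpo: Callable[[Any, List[Pair]], Any],
    max_iterations: int = 2,
) -> Any:
    """Alternate exploration and preference training, starting from an SFT policy.

    `collect_pairs` runs the current policy on the training tasks and returns
    failure-vs-success pairs; `train_dpo` returns an updated policy. Both are
    assumptions supplied by the caller, not part of any real library.
    """
    for _ in range(max_iterations):
        # Exploration phase: gather contrastive pairs with the current policy.
        pairs = collect_pairs(policy)
        if not pairs:
            # Nothing left to contrast against, so further training has no signal.
            break
        # Training phase: one round of preference optimization on those pairs.
        policy = train_dpo(policy, pairs)
    return policy
```

Keeping `max_iterations` small by default reflects the summary's observation that gains plateau or decline after a few rounds.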
Key Findings
ETO improves average reward over SFT across agent benchmarks.
ETO substantially boosts out-of-distribution performance on ScienceWorld.
ETO improves action efficiency: reaches higher rewards in fewer steps.
ETO without an initial SFT stage fails; when expert trajectories are unavailable, running RFT (rejection-sampling fine-tuning) first and then ETO works best.
Multiple ETO iterations can help at first but then overfit.
Results
Average Reward (WebShop)
Average Reward (ScienceWorld Seen)
Average Reward (ScienceWorld Unseen)
Average Reward (ALFWorld Seen)
Self-play without BC (WebShop)
Who Should Care
What To Try In 7 Days
Train a base SFT agent from your expert traces.
Run the agent on training tasks and log failed trajectories.
Construct failure-vs-success trajectory pairs, fine-tune with DPO for 1–2 iterations, and validate on held-out variants.
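The pair-construction step can be sketched as follows. This is a minimal sketch under stated assumptions: `LoggedTrajectory`, `build_preference_pairs`, and `min_gap` are hypothetical names, trajectories are assumed to be already serialized to text, and each task is assumed to have at most one expert trace. The `prompt_id`/`chosen`/`rejected` keys are an illustrative layout, not a required schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LoggedTrajectory:
    """Hypothetical log record: one full episode and its final reward."""
    task_id: str
    text: str
    reward: float


def build_preference_pairs(
    expert: Dict[str, LoggedTrajectory],
    explored: List[LoggedTrajectory],
    min_gap: float = 0.0,
) -> List[Dict[str, str]]:
    """Pair each lower-reward agent trajectory with the expert trace for the same task."""
    pairs = []
    for traj in explored:
        gold = expert.get(traj.task_id)
        if gold is None:
            continue  # no expert trace for this task, so no contrast is available
        # Keep only cases where the expert clearly beats the agent's attempt.
        if gold.reward - traj.reward > min_gap:
            pairs.append(
                {"prompt_id": traj.task_id, "chosen": gold.text, "rejected": traj.text}
            )
    return pairs
```

Raising `min_gap` above zero is one way to drop near-ties when rewards are continuous; with binary rewards it has no effect.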
Agent Features
Memory
- Short-term trajectory context in prompts
Planning
- ReAct-style planning with CoT (reasoning before actions)
Tool Use
- Environment APIs (WebShop/ScienceWorld/ALFWorld)
Frameworks
- LoRA
Is Agentic
true
Architectures
- LLM-based policy (auto-regressive)
Optimization Features
Infra Optimization
- Experiments run on 8× A100 80GB GPUs
Training Optimization
- DPO preference loss (offline) instead of online RL
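For reference, the standard DPO loss for a single preference pair can be written in a few lines. The sketch below assumes each argument is the summed log-probability of a whole trajectory (for example, over the agent's action tokens) under the trained policy or the frozen reference model; the function name `dpo_loss` and the default `beta=0.1` are illustrative choices, not values taken from the paper.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy matches the reference and falls as the policy shifts probability mass toward the chosen trajectory relative to the rejected one.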
Reproducibility
Code Urls
Data Urls
- WebShop, ScienceWorld, ALFWorld (public datasets referenced in paper)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- ETO assumes reward differences suffice to mark entire trajectories as good/bad; action-wise blame is not identified by default.
- Contrastive data diversity is limited by fixed expert traces and exploration scope, causing overfitting after several iterations.
- Coarse / binary rewards (e.g., ALFWorld) limit how much useful contrastive signal ETO can extract.
- ETO without an initial SFT base fails in experiments; it is not a drop-in replacement for expert demonstrations.
When Not To Use
- You lack a reliable reward signal, or you have only coarse binary rewards and cannot generate diverse contrastive pairs.
- You cannot provide at least a bootstrap policy (SFT or strong prior) before exploration.
- You need a single-shot universal policy for many unrelated tasks without per-task contrastive data.
Failure Modes
- Overfitting to limited failure-success pairs across iterations, causing performance decline after 2–3 rounds.
- Step-wise contrastive learning can be unstable because final reward may poorly reflect single action quality.
- Applying ETO to an untuned model with no SFT stage can collapse performance (observed in the self-play experiments).
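One simple guard against the overfitting failure mode is to score every checkpoint on held-out task variants and keep the best one, stopping once the score stops improving. The sketch below is an assumption about how that could be done, not a procedure from the paper; `select_iteration` and `patience` are hypothetical names, and index 0 is taken to be the SFT checkpoint.

```python
from typing import List


def select_iteration(val_rewards: List[float], patience: int = 1) -> int:
    """Return the index of the checkpoint to keep, given held-out rewards per iteration.

    Scanning stops after `patience` consecutive iterations without improvement.
    """
    if not val_rewards:
        raise ValueError("val_rewards must contain at least the SFT checkpoint score")
    best_idx, best_reward, stale = 0, val_rewards[0], 0
    for i in range(1, len(val_rewards)):
        if val_rewards[i] > best_reward:
            best_idx, best_reward, stale = i, val_rewards[i], 0
        else:
            stale += 1
            if stale >= patience:
                break  # held-out reward has stopped improving
    return best_idx


# Illustrative numbers only (not results from the paper):
# the SFT checkpoint followed by three ETO iterations.
best = select_iteration([0.60, 0.66, 0.68, 0.65])
```

Here `best` is 2, meaning the second ETO iteration is kept and the third, which scored lower on held-out variants, is discarded.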
Core Entities
Models
- Llama-2-7B-Chat
- Llama-2-13B-Chat
- Mistral-7B
- GPT-4
- GPT-3.5-Turbo
Metrics
- Average Reward
- Success Rate
Datasets
- WebShop
- ScienceWorld
- ALFWorld
Benchmarks
- ScienceWorld-Unseen
- WebShop (standard split)
- ALFWorld (seen/unseen)

