Improve LLM agents by training on their failed runs: collect failure trajectories, build failure-vs-success pairs, and fine-tune with DPO.

Overview

Decision Snapshot: Ready For Pilot

ETO uses simple offline preference fine-tuning, so it is practical and low-risk for improving SFT agents; it requires curated contrastive data and monitoring to avoid overfitting.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, Bill Yuchen Lin

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you run LLM agents in interactive tasks, adding a cheap offline loop that learns from the agent's failure cases can boost task reward and robustness to unseen variations without heavy online RL.

Who Should Care

ML Engineer, Product Manager, Engineering Lead, Data Scientist

Summary TLDR

ETO is a simple, iterative training loop for LLM agents: start from a behavioral-cloning (SFT) agent, let it explore to collect failed trajectories, pair those failures with expert successes, then fine-tune the model with a preference-based loss (DPO). Across three agent benchmarks (WebShop, ScienceWorld, ALFWorld) ETO consistently improves average reward and sample efficiency versus SFT and other baselines. ETO helps generalization to unseen task variants but needs a supply of diverse contrastive pairs and can overfit if iterated too many times.
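
The loop described above can be sketched as a small driver function. This is a minimal illustration, not the authors' implementation: `eto_loop`, the `Trajectory` tuple shape, and the three callables (`explore`, `make_pairs`, `dpo_update`) are hypothetical names standing in for whatever rollout, pairing, and training code you already have.

```python
from typing import Callable, List, Tuple

# Hypothetical shapes: (serialized trajectory text, final reward).
Trajectory = Tuple[str, float]
Pair = Tuple[Trajectory, Trajectory]  # (preferred, dispreferred)


def eto_loop(
    policy,
    explore: Callable[[object], List[Trajectory]],
    make_pairs: Callable[[List[Trajectory]], List[Pair]],
    dpo_update: Callable[[object, List[Pair]], object],
    iterations: int = 2,
):
    """Alternate exploration and preference training for a few rounds."""
    history = []  # number of contrastive pairs used in each round
    for _ in range(iterations):
        rollouts = explore(policy)          # run the current policy on training tasks
        pairs = make_pairs(rollouts)        # pair failures with expert successes
        if not pairs:                       # no failures left to contrast against
            break
        policy = dpo_update(policy, pairs)  # one offline preference-tuning pass
        history.append(len(pairs))
    return policy, history
```

Starting `policy` from the SFT checkpoint matches the setup described above; the default of two iterations is an assumption chosen to stay below the point where overfitting is reported.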

Problem Statement

Open LLMs fine-tuned only on expert demonstrations (behavioral cloning) often fail to explore and generalize. The paper asks: can we improve agent policies by learning from the agent's own failed trajectories via contrastive preference learning?

Main Contribution

ETO algorithm: an iterative exploration + training loop that collects failure trajectories and learns from failure-success trajectory pairs using DPO preference loss.

Large-scale evaluation on three interactive agent datasets (WebShop, ScienceWorld, ALFWorld) showing consistent gains over SFT and strong baselines.

Key Findings

ETO improves average reward over SFT across agent benchmarks.

Numbers: WebShop 63.1 → 67.4 avg reward (Table 2)

Practical Use: If you already SFT an LLM agent, adding ETO can raise task rewards by several absolute points with modest extra tuning.

Evidence Ref: Table 2

ETO substantially boosts out-of-distribution performance on ScienceWorld.

Numbers: ScienceWorld (Unseen) 53.0 → 65.0 avg reward (+12.0 abs, ~22.6% rel)

Practical Use: When you need better generalization to unseen variations, learning from failures yields large gains versus pure imitation.

Evidence Ref: Table 2; paper text reports ~22%

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average Reward (WebShop)	SFT 63.1 → ETO 67.4	SFT	+4.3	WebShop test	Table 2; WebShop avg reward values	Table 2
Average Reward (ScienceWorld Seen)	SFT 67.4 → ETO 73.8	SFT	+6.4	ScienceWorld Seen	Table 2; ScienceWorld seen numbers	Table 2
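
The absolute and relative deltas can be rechecked with a few lines of arithmetic. The sketch below uses only the SFT and ETO averages quoted in this summary; the helper name `deltas` is illustrative.

```python
def deltas(baseline: float, improved: float) -> tuple:
    """Return (absolute gain, relative gain in percent), rounded to one decimal."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)


# (SFT, ETO) average rewards as quoted in this summary.
results = {
    "WebShop": (63.1, 67.4),
    "ScienceWorld Seen": (67.4, 73.8),
    "ScienceWorld Unseen": (53.0, 65.0),
}

for name, (sft, eto) in results.items():
    print(name, deltas(sft, eto))
```

The unseen split gives (12.0, 22.6), which agrees with the ~22% relative improvement cited in the paper text.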

What To Try In 7 Days

Train a base SFT agent from your expert traces.

Run the agent on training tasks and log failed trajectories.

Construct failure-vs-success trajectory pairs, fine-tune using DPO for 1–2 iterations, and validate on held-out variants.
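
The pair-construction step above can be sketched as follows. This is an assumption-laden illustration rather than the released code: the `Trajectory` dataclass, `build_pairs`, and the `min_gap` threshold are hypothetical, and it assumes each explored trajectory can be matched to an expert trace by task ID.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Trajectory:
    task_id: str
    text: str      # serialized thought/action/observation history
    reward: float  # final task reward from the environment


def build_pairs(
    expert: Dict[str, Trajectory],
    explored: List[Trajectory],
    min_gap: float = 0.0,
) -> List[Tuple[Trajectory, Trajectory]]:
    """Return (chosen, rejected) pairs where the agent scored below the expert."""
    pairs = []
    for traj in explored:
        ref = expert.get(traj.task_id)
        if ref is None:
            continue  # no expert trace for this task, so no contrast is possible
        if ref.reward - traj.reward > min_gap:
            pairs.append((ref, traj))  # expert is preferred, agent run is rejected
    return pairs
```

Raising `min_gap` discards near-ties, which trades pair count for a cleaner preference signal; whether that helps is something to validate on your own tasks.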

Agent Features

Memory

Short-term trajectory context in prompts

Planning

ReAct-style planning with CoT (reasoning before actions)

Tool Use

Environment APIs (WebShop/ScienceWorld/ALFWorld)

Frameworks

LoRA

Is Agentic

Yes

Architectures

LLM-based policy (auto-regressive)

Optimization Features

Infra Optimization

Experiments run on 8× A100 80GB GPUs

Training Optimization

DPO preference loss (offline) instead of online RL
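
For reference, the standard DPO objective for a single pair can be written in a few lines. The sketch below assumes each argument is a trajectory-level log-probability (the sum of token log-probabilities over the agent's outputs) under either the policy being trained or the frozen reference model; `beta=0.1` is a commonly used default, not a value taken from this summary.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair of trajectory log-probabilities."""
    # How much more the policy favors the chosen trajectory than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)
```

When the policy equals the reference, the margin is zero and the loss is log 2; it falls as the policy shifts probability toward the successful trajectory and away from the failed one.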

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/Yifan-Song793/ETO

Data URLs

WebShop, ScienceWorld, ALFWorld (public datasets referenced in paper)

Risks & Boundaries

Limitations

ETO assumes reward differences suffice to mark entire trajectories as good/bad; action-wise blame is not identified by default.

Contrastive data diversity is limited by fixed expert traces and exploration scope, causing overfitting after several iterations.

When Not To Use

You lack a reliable reward signal, or you have only coarse binary rewards and cannot generate diverse contrasts.

You cannot provide at least a bootstrap policy (SFT or strong prior) before exploration.

Failure Modes

Overfitting to limited failure-success pairs across iterations, causing performance decline after 2–3 rounds.

Step-wise contrastive learning can be unstable because final reward may poorly reflect single action quality.
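
One way to guard against the overfitting failure mode is to score every iteration's checkpoint on held-out tasks and keep the best one. The helper below is a hypothetical sketch: `best_iteration` and `patience` are illustrative names, and index 0 is assumed to be the SFT baseline.

```python
from typing import List


def best_iteration(val_rewards: List[float], patience: int = 1) -> int:
    """Return the index of the checkpoint to keep, stopping once rewards stop improving."""
    if not val_rewards:
        raise ValueError("need at least the baseline score")
    best, best_idx, stale = val_rewards[0], 0, 0
    for i, reward in enumerate(val_rewards[1:], start=1):
        if reward > best:
            best, best_idx, stale = reward, i, 0
        else:
            stale += 1
            if stale >= patience:
                break  # further iterations are not helping on held-out tasks
    return best_idx
```

With `patience=1` the search stops at the first iteration that fails to improve, which is a conservative choice when each extra round costs a full exploration and training pass.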

Core Entities

Models

Llama-2-7B-Chat, Llama-2-13B-Chat, Mistral-7B, GPT-4, GPT-3.5-Turbo

Metrics

Average Reward, Success Rate

Datasets

WebShop, ScienceWorld, ALFWorld

Benchmarks

ScienceWorld-Unseen, WebShop (standard split), ALFWorld (seen/unseen)

