DPO + generated trajectories: train recommender RL agents with very little human data and short compute

August 28, 2024 · 7 min

Overview

Production Readiness

0.25

Novelty Score

0.45

Cost Impact Score

0.6

Citation Count

0

Authors

Shuang Feng, Grace Feng


Why It Matters For Business

You can reduce both costly human trajectory collection and training compute by combining DPO with synthetic rollouts, letting teams prototype RL-based recommenders faster and cheaper in simulation.

Summary TLDR

The authors adapt LLM-based policies (BERT/BART) to WebShop and compare Direct Preference Optimization (DPO) with PPO, showing that DPO trains faster and uses far less human data. A DPO agent trained for ~30–60 minutes (≈3000 steps) without images reached ~19% success on the WebShop simulator. Training on 100 machine-generated 'preferred' trajectories performed similarly to training on 1,200 human trajectories, suggesting that generated trajectories can reduce costly human data collection.

Problem Statement

Recommender systems trained with supervised models focus on short-term click signals and create feedback loops. This work asks: can we train RL recommenders more cheaply and quickly by (1) using LLM-based policies and (2) using preference-based learning with generated trajectories instead of large volumes of human trajectories?

Main Contribution

Implemented DPO and PPO fine-tuning on WebShop starting from imitation-learning checkpoints (BERT/BART).

Showed that Direct Preference Optimization (DPO) reaches a higher success rate and higher scores than PPO within a short training time (<1 hour) on the WebShop simulator.

Demonstrated that self-learning with generated trajectories (100 synthetic perfect rollouts) can match the performance of models trained on 1,200 human trajectories.

Released code branch and experimental scripts (GitHub link provided for updates).

Key Findings

DPO trains faster and achieves higher task performance than PPO in the WebShop simulator under short training budgets.

Numbers: DPO ~19% success after ~3000 steps (30–60 min) vs PPO ~15% after 2 hours

A small set of generated preferred trajectories can substitute many human trajectories for DPO training.

Numbers: DPO trained with 100 generated trajectories performed comparably to DPO trained with 1,200 human trajectories (3000-step training)

Results are measured in a simulator and exclude image inputs, limiting direct comparison to prior work that used images.

Numbers: All agents trained without image data; training limited to 3000 steps (<1 hour)

Results

success rate (simulator)

Value: DPO ≈ 19% after ~3000 steps (~30–60 min); PPO ≈ 15% after longer (2 h) runs

Baseline: PPO (same checkpoint)

task score (simulator)

Value: DPO yields higher average scores than PPO under the same short training budget

Baseline: PPO (same checkpoint)

data efficiency: synthetic vs human trajectories

Value: DPO trained with 100 generated preferred trajectories ≈ DPO trained with 1,200 human trajectories (3000-step training)

Baseline: DPO with 1,200 human trajectories
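
To make the data-efficiency result concrete, here is a minimal sketch of how generated "preferred" trajectories could be turned into DPO preference pairs. This summary does not say how the authors chose the rejected side of each pair, so the pairing rule here (take the lowest-reward policy rollout for the same instruction) is an assumption, and `Trajectory`, `build_preference_pairs`, and `min_gap` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    instruction: str   # the shopping instruction given to the agent
    actions: list      # sequence of actions taken in the simulator
    reward: float      # final episode reward, assumed to lie in [0, 1]


def build_preference_pairs(generated, policy_rollouts, min_gap=0.0):
    """Pair each generated 'preferred' trajectory with a worse policy rollout.

    ASSUMPTION: the rejected side is the lowest-reward rollout for the same
    instruction; the paper's actual pairing rule is not given in this summary.
    """
    by_instruction = {}
    for traj in policy_rollouts:
        by_instruction.setdefault(traj.instruction, []).append(traj)

    pairs = []
    for good in generated:
        candidates = by_instruction.get(good.instruction, [])
        # Keep only rollouts that are strictly worse by more than min_gap.
        worse = [t for t in candidates if good.reward - t.reward > min_gap]
        if worse:
            rejected = min(worse, key=lambda t: t.reward)
            pairs.append({
                "prompt": good.instruction,
                "chosen": good.actions,
                "rejected": rejected.actions,
            })
    return pairs
```

Instructions with no strictly worse rollout are skipped, so the number of pairs can be smaller than the number of generated trajectories.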

Who Should Care

What To Try In 7 Days

Start from your imitation-policy checkpoint and run a 1-hour DPO fine-tune in simulator to compare with existing policy.

Generate 100 high-quality simulated 'successful' rollouts and train DPO on those to evaluate data-cost tradeoffs.

Use Thompson sampling or a similar rollout-based evaluation in the simulator to estimate short-run success rates before production tests.
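
The evaluation step above can be sketched with Beta-Bernoulli Thompson sampling, which splits a fixed rollout budget across candidate policies while tracking a posterior over each one's success rate. This is a generic illustration, not the authors' evaluation code: `BetaArm`, `thompson_evaluate`, and the `run_episode` callback are hypothetical, and a uniform Beta(1, 1) prior is assumed.

```python
import random


class BetaArm:
    """Success/failure counts for one candidate policy (Beta(1, 1) prior)."""

    def __init__(self, name):
        self.name = name
        self.successes = 0
        self.failures = 0

    def sample(self, rng):
        # Draw a plausible success rate from the current posterior.
        return rng.betavariate(self.successes + 1, self.failures + 1)

    def update(self, success):
        if success:
            self.successes += 1
        else:
            self.failures += 1

    def posterior_mean(self):
        return (self.successes + 1) / (self.successes + self.failures + 2)


def thompson_evaluate(arms, run_episode, n_rollouts, seed=0):
    """Spend n_rollouts episodes, each on the arm with the highest sampled rate.

    run_episode(name, rng) is a hypothetical callback that runs one simulator
    episode with the named policy and returns True on success.
    """
    rng = random.Random(seed)
    for _ in range(n_rollouts):
        arm = max(arms, key=lambda a: a.sample(rng))
        arm.update(run_episode(arm.name, rng))
    return {a.name: a.posterior_mean() for a in arms}
```

For example, `thompson_evaluate([BetaArm("dpo"), BetaArm("ppo")], run_episode, 200)` would spend more of the 200 episodes on whichever policy looks stronger, so the estimate for the weaker policy rests on fewer rollouts and is less certain.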

Agent Features

Planning

  • sequential decision-making (MDP-style)

Tool Use

  • Thompson sampling (online rollouts)
  • Pyserini (BM25 retrieval index)
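
To show what a BM25 retrieval index computes when the agent issues a search, here is a small pure-Python scorer. It is not Pyserini's API and does not reproduce its tokenization or index format; `bm25_rank`, the whitespace tokenizer, and the `k1`/`b` defaults are illustrative assumptions.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=0.9, b=0.4):
    """Return document indices sorted by BM25 score, best first.

    Illustrative only: lowercased whitespace tokenization, no stemming.
    """
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scored = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            n_t = doc_freq[term]
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scored.append((score, idx))
    # Higher score first; ties broken by original order.
    return [idx for _, idx in sorted(scored, key=lambda s: (-s[0], s[1]))]
```

In a real setup the product catalogue would be indexed once and queried many times; this version rescans every document per query purely to keep the arithmetic visible.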

Frameworks

  • DPO
  • PPO

Is Agentic

true

Architectures

  • LLM-based policy (BERT/BART fine-tuned)

Optimization Features

Infra Optimization

  • short-training budgets on T4 GPUs

Training Optimization

  • Direct Preference Optimization (DPO)
  • contrastive pairwise preference training
  • PPO (baseline)
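
For reference, the pairwise DPO objective on a single preference pair can be sketched as follows. It follows the standard DPO formulation rather than the authors' code; `dpo_loss` and the `beta` default are illustrative assumptions, and each log-probability is taken to be the summed log-probability of a whole trajectory.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    policy_* are log-probabilities under the model being trained;
    ref_* are log-probabilities under the frozen reference model
    (here, that would be the imitation-learning checkpoint).
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy and reference agree, and falls as the policy raises the chosen trajectory's likelihood relative to the rejected one. No reward model or on-policy sampling loop is needed, which is consistent with the shorter training times reported against PPO.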

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments ran for few training steps (3000) and short wall-clock time; longer training may change the rankings.
  • No image data used — differs from prior WebShop results that include images.
  • Results are simulator-only; real-world user signals and offline-to-online gaps not evaluated.
  • Generated trajectories were few (100) and may not generalize to more complex tasks.

When Not To Use

  • When you must include multimodal inputs (images) — this work omits images.
  • When you require production-grade evaluation with real users — paper reports simulator metrics only.
  • When you can run long RL training where PPO's stability may be advantageous.

Failure Modes

  • Synthetic rollouts produce policy blind spots if generated trajectories are biased or unrealistic.
  • Short-run training may overfit imitation checkpoint artifacts and not generalize to novel instructions.
  • Sim-to-real gap: simulator success may not translate to online user engagement.

Core Entities

Models

  • BERT
  • BART
  • DPO
  • PPO
  • InstructGPT (reference)

Metrics

  • success rate
  • average score
  • purchase reward
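
As a rough guide to how these metrics relate, the sketch below computes success rate and average score from per-episode purchase rewards. It assumes the usual WebShop convention (reward in [0, 1], success means a reward of 1, score is 100 times the mean reward); this summary does not define the metrics, and `summarize_episodes` is a hypothetical helper.

```python
def summarize_episodes(rewards):
    """Aggregate final purchase rewards (each assumed in [0, 1]) into metrics.

    ASSUMPTION: success = reward of 1.0; score = 100 * mean reward.
    """
    if not rewards:
        raise ValueError("need at least one episode")
    n = len(rewards)
    success_rate = sum(1 for r in rewards if r >= 1.0) / n
    average_score = 100.0 * sum(rewards) / n
    return {"success_rate": success_rate, "average_score": average_score}
```

Under this convention an agent can earn a respectable average score from partially matching purchases while still having a low success rate, so the two numbers should be read together.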

Datasets

  • WebShop

Benchmarks

  • WebShop

Context Entities

Models

  • RNN/CNN (prior work)
  • T5 (related query-generation work)

Datasets

  • Virtual-Taobao
  • RecoGym