DPO + generated trajectories: train recommender RL agents with very little human data and short compute

August 28, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

Promising simulator results under tight compute/data budgets. Evidence is limited by short training time, lack of image inputs, and simulator-only evaluation, so production readiness is low without more scaling and online A/B tests.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.60

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 25%

Novelty: 45%

Authors

Shuang Feng, Grace Feng

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can cut expensive human trajectory collection and short-run compute by using DPO plus synthetic rollouts, letting teams prototype RL-based recommenders faster and cheaper in simulation.

Who Should Care

Summary TLDR

The authors adapt LLM-based policies (BERT/BART) to WebShop, compare Direct Preference Optimization (DPO) vs PPO, and show DPO trains faster and uses far less human data. Training a DPO agent for ~30–60 minutes (≈3000 steps) without images reached ~19% success on the WebShop simulator. Using 100 machine-generated 'preferred' trajectories produced similar performance to training on 1,200 human trajectories, suggesting generated trajectories can reduce costly human data collection.

Problem Statement

Recommender systems trained with supervised models focus on short-term click signals and create feedback loops. This work asks: can we train RL recommenders more cheaply and quickly by (1) using LLM-based policies and (2) using preference-based learning with generated trajectories instead of large volumes of human trajectories?

Main Contribution

Implemented DPO and PPO fine-tuning on WebShop starting from imitation-learning checkpoints (BERT/BART).

Showed that Direct Preference Optimization (DPO) reaches a higher success rate and higher scores than PPO within a short training time (<1 hour) on the WebShop simulator.
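For readers unfamiliar with the objective, the standard DPO loss for a single preference pair can be sketched as below. This is a minimal illustration of the published DPO formula, not the authors' training code: the function name `dpo_loss`, the default `beta=0.1`, and the idea of passing summed per-trajectory log-probabilities are assumptions made for the example.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) trajectory pair.

    Each argument is the summed log-probability of a whole trajectory under
    either the trainable policy or the frozen reference policy (assumed here
    to be the imitation-learning checkpoint).
    """
    # How much more the policy likes each trajectory than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) written as a numerically stable softplus(-logits).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# When the policy equals the reference, the loss is log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Shifting probability toward the chosen trajectory lowers the loss.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

Unlike PPO, this objective needs no reward model or on-policy sampling loop during the update, which is one plausible reason it fits short training budgets.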

Key Findings

DPO trains faster and achieves higher task performance than PPO in the WebShop simulator under short training budgets.

Numbers: DPO ~19% success after ~3000 steps (30–60 min) vs PPO ~15% after 2 hours

Practical Use: For quick prototyping or low-cost training runs, prefer DPO over PPO when you start from an imitation-policy checkpoint.

Evidence Ref: Abstract, Section 4, Figures 3-4

A small set of generated preferred trajectories can substitute many human trajectories for DPO training.

Numbers: DPO trained with 100 generated trajectories performed comparably to DPO trained with 1,200 human trajectories (3000-step

Practical Use: If human data is costly, generate high-quality synthetic rollouts and use DPO to cut data collection cost while preserving performance.

Evidence Ref: Section 4, Figures 5-6
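DPO needs pairs of preferred and dispreferred trajectories, and this summary does not say how the rejected side was obtained. The sketch below shows one possible way to build such pairs, assuming each generated trajectory is paired with a lower-reward trajectory for the same instruction. The `Trajectory` class, `build_preference_pairs`, and the `min_gap` parameter are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    instruction: str   # shopping instruction given to the agent
    actions: list      # sequence of actions taken (search, click, buy, ...)
    reward: float      # final reward for the episode

def build_preference_pairs(generated, baseline, min_gap=0.0):
    """Pair each generated trajectory with lower-reward baseline trajectories.

    Assumption: 'baseline' holds rollouts from the starting checkpoint, and a
    pair is kept only if the generated trajectory beats it by more than min_gap.
    """
    by_instruction = {}
    for traj in baseline:
        by_instruction.setdefault(traj.instruction, []).append(traj)

    pairs = []
    for chosen in generated:
        for rejected in by_instruction.get(chosen.instruction, []):
            if chosen.reward - rejected.reward > min_gap:
                pairs.append((chosen, rejected))
    return pairs

generated = [Trajectory("buy red shoes", ["search", "click", "buy"], 1.0)]
baseline = [
    Trajectory("buy red shoes", ["search", "buy"], 0.4),
    Trajectory("buy a hat", ["search", "buy"], 0.0),
]
pairs = build_preference_pairs(generated, baseline)
print(len(pairs))
```

Filtering by a reward gap is one simple guard against noisy pairs; biased or unrealistic generated trajectories would still pass through it.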

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
success rate (simulator) | DPO ≈ 19% after ~3000 steps (~30–60 min); PPO ≈ 15% after longer (2h) runs | PPO (same checkpoint) | +4 pp (absolute) in short-run comparison | WebShop simulator (no image inputs) | Abstract, Section 4, Figures 3-4 | Figures 3-4
task score (simulator) | DPO yields higher average scores than PPO under same short training budget | PPO (same checkpoint) | higher average scores (see Figures 3 and 5) | WebShop simulator (no image inputs) | Section 4, Figures 3 & 5 | Figures 3,5
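The deltas reported above can be reproduced with simple arithmetic. The snippet below is only a worked calculation on the approximate figures quoted in this summary (19%, 15%, 1,200 and 100 trajectories); the helper name `compare_rates` is an assumption for the example.

```python
def compare_rates(new_pct, base_pct):
    """Return (absolute delta in percentage points, relative improvement as a fraction)."""
    abs_pp = new_pct - base_pct
    rel = abs_pp / base_pct
    return abs_pp, rel

# Approximate success rates quoted in the summary (percent).
abs_pp, rel = compare_rates(19, 15)

# Human trajectories vs generated trajectories used for DPO training.
data_reduction = 1200 / 100

print(f"absolute delta: +{abs_pp} pp")
print(f"relative improvement: {rel:.1%}")
print(f"human-to-generated trajectory ratio: {data_reduction:.0f}x")
```

Note that the two success rates come from runs of different length (DPO at 30–60 minutes, PPO at about 2 hours), so the delta compares outcomes rather than equal compute budgets.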

What To Try In 7 Days

Start from your imitation-policy checkpoint and run a 1-hour DPO fine-tune in simulator to compare with existing policy.

Generate 100 high-quality simulated 'successful' rollouts and train DPO on those to evaluate data-cost tradeoffs.

Use Thompson sampling or similar offline rollout evaluation to estimate short-run success rates before production tests.
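One way to act on the last suggestion is a Beta-Bernoulli Thompson sampler that splits a fixed rollout budget between candidate checkpoints and tracks each one's estimated success rate. This is a generic sketch, not the paper's procedure: `thompson_select`, `run_budget`, `posterior_mean`, and the simulated policies (with success probabilities set to the 19% and 15% figures from this summary) are assumptions for illustration.

```python
import random

def thompson_select(stats, rng):
    """Pick the policy whose Beta(successes+1, failures+1) draw is largest."""
    draws = {name: rng.betavariate(s + 1, f + 1) for name, (s, f) in stats.items()}
    return max(draws, key=draws.get)

def run_budget(policies, n_rollouts, seed=0):
    """Spend n_rollouts across policies; each policy is a callable rng -> bool."""
    rng = random.Random(seed)
    stats = {name: [0, 0] for name in policies}  # [successes, failures]
    for _ in range(n_rollouts):
        name = thompson_select(stats, rng)
        success = policies[name](rng)  # stand-in for one simulator episode
        stats[name][0 if success else 1] += 1
    return stats

def posterior_mean(successes, failures):
    """Mean of the Beta posterior under a uniform prior."""
    return (successes + 1) / (successes + failures + 2)

# Simulated stand-ins for real simulator rollouts of two checkpoints.
policies = {
    "dpo": lambda rng: rng.random() < 0.19,
    "ppo": lambda rng: rng.random() < 0.15,
}
stats = run_budget(policies, n_rollouts=200, seed=0)
for name, (s, f) in stats.items():
    print(name, s + f, round(posterior_mean(s, f), 3))
```

In practice each callable would wrap a real simulator episode; with success rates this close, a few hundred rollouts may not separate the two policies reliably.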

Agent Features

Planning
sequential decision-making (MDP-style)
Tool Use
Thompson sampling (online rollouts), Pyserini (BM25 retrieval index)
Frameworks
DPO, PPO
Is Agentic

Yes

Architectures
LLM-based policy (BERT/BART fine-tuned)

Optimization Features

Infra Optimization
short-training budgets on T4 GPUs
Training Optimization
Direct Preference Optimization (DPO), contrastive pairwise preference training, PPO (baseline)

Risks & Boundaries

Limitations

Experiments ran for only 3000 steps and a short wall-clock time; longer training may change the rankings.

No image data used — differs from prior WebShop results that include images.

When Not To Use

When you must include multimodal inputs (images) — this work omits images.

When you require production-grade evaluation with real users — paper reports simulator metrics only.

Failure Modes

Synthetic rollouts produce policy blind spots if generated trajectories are biased or unrealistic.

Short-run training may overfit imitation checkpoint artifacts and not generalize to novel instructions.

Core Entities

Models

BERT, BART, DPO, PPO, InstructGPT (reference)

Metrics

success rate, average score, purchase reward

Datasets

WebShop

Benchmarks

WebShop

Context Entities

Models

RNN/CNN (prior work), T5 (related query-generation work)

Datasets

Virtual-Taobao, RecoGym