Use LLMs to auto-generate trajectory preferences and rebuild rewards so RL learns faster with fewer experts

June 28, 2024 · 6 min

Overview

Production Readiness

0.35

Novelty Score

0.65

Cost Impact Score

0.45

Citation Count

1

Authors

Zichao Shen, Tianchen Zhu, Qingyun Sun, Shiqi Gao, Jianxin Li

Why It Matters For Business

LLM4PG can reduce the time and expert hours spent on reward engineering by turning natural-language constraints into learned rewards, which sped up RL training on the sparse-reward and constrained MiniGrid tasks tested.

Summary TLDR

The paper introduces LLM4PG, a two-stage pipeline: a large language model (LLM) acts as a judge and ranks pairs of agent trajectories, and a compact reward predictor trained on those preferences then supplies the reward signal for downstream RL (PPO). On MiniGrid tasks with sparse or constrained rewards, LLM4PG converged faster (~50k steps vs >100k with the original rewards), handled natural-language constraints (e.g., "drop key 3 times"), and outperformed Lagrange PPO and manual reward shaping in the cases presented. Experiments are limited to MiniGrid and use the Mixtral and QWen LLMs; LLM query costs, judge bias, and the need for step-wise rewards are not fully explored.

Problem Statement

Designing good reward functions for complex, constrained environments is hard and expensive. Human preference labels help, but collecting them is slow and costly. Can LLMs automatically generate accurate trajectory preferences and produce reward signals that speed up RL training without heavy expert involvement?

Main Contribution

LLM4PG: a practical pipeline that turns LLM pairwise rankings of natural-language trajectory summaries into a trainable reward predictor.

Demonstration that LLM-derived rewards speed up RL learning and handle language-expressed constraints in MiniGrid tasks.

Empirical comparison showing LLM4PG often beats original rewards, naive shaping, and Lagrange PPO on the tested MiniGrid setups.
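This summary does not state which loss is used to fit the reward predictor. A common choice in preference-based RL, assumed here purely for illustration, is the Bradley-Terry model. The symbols are illustrative: $\hat{r}_\theta$ is the reward predictor applied to a trajectory summary $\tau$, and $y = 1$ when the LLM judge prefers $\tau_A$ over $\tau_B$ (otherwise $y = 0$).

```latex
P_\theta(\tau_A \succ \tau_B)
  = \frac{\exp \hat{r}_\theta(\tau_A)}{\exp \hat{r}_\theta(\tau_A) + \exp \hat{r}_\theta(\tau_B)}
  = \sigma\big(\hat{r}_\theta(\tau_A) - \hat{r}_\theta(\tau_B)\big)

\mathcal{L}(\theta)
  = -\,\mathbb{E}_{(\tau_A, \tau_B, y)}\Big[\, y \log P_\theta(\tau_A \succ \tau_B)
    + (1 - y) \log\big(1 - P_\theta(\tau_A \succ \tau_B)\big) \Big]
```

Under this assumption, minimizing $\mathcal{L}$ pushes the predictor to score the preferred trajectory higher, and the learned $\hat{r}_\theta$ can then stand in for the environment reward during PPO training.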

Key Findings

LLM-derived reward predictors produce faster RL convergence on MiniGrid-Unlock.

Numbers: Convergence at ~50,000 steps vs >100,000 steps with original rewards

LLM4PG can encode natural-language constraints and train policies that satisfy them.

LLM4PG outperforms Lagrange PPO and manual shaping on tested constrained tasks.

Results

steps-to-converge

Value: ~50,000 steps (LLM4PG)

Baseline: >100,000 steps (original rewards)

constraint satisfaction (drop-key=3)

Value: achieved target drop count while maintaining high reward

Baseline: Lagrange PPO and PPO

task success rate / training progress

Value: LLM4PG > original/manual shaping in reported runs

Baseline: original rewards, manually shaped rewards

Who Should Care

What To Try In 7 Days

Reproduce LLM4PG on MiniGrid using an LLM API (Mixtral/QWen) and PPO, and check whether convergence is faster than with the original rewards.

Build a simple language interpreter that converts key state features into short text summaries.

Collect pairwise trajectory comparisons via the LLM and train a small feedforward reward predictor from them.
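The second and third steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `EpisodeStats` fields, the prompt wording, and the `A`/`B` answer format are all assumptions, and the actual LLM call is left out because it depends on the API you choose.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class EpisodeStats:
    """Hypothetical per-episode features; the field names are illustrative."""
    steps: int
    picked_up_key: bool
    key_drops: int
    door_opened: bool


def summarize(ep: EpisodeStats) -> str:
    """Turn key state features into a short natural-language summary."""
    key = "picked up the key" if ep.picked_up_key else "never picked up the key"
    door = "opened the door" if ep.door_opened else "did not open the door"
    return (f"The agent took {ep.steps} steps, {key}, "
            f"dropped the key {ep.key_drops} times, and {door}.")


def build_prompt(task: str, summary_a: str, summary_b: str) -> str:
    """Assemble a pairwise-comparison prompt for an LLM judge (assumed wording)."""
    return (f"Task: {task}\n"
            f"Trajectory A: {summary_a}\n"
            f"Trajectory B: {summary_b}\n"
            "Which trajectory better fulfils the task? Answer with A or B.")


def parse_preference(reply: str) -> Optional[float]:
    """Map the judge's reply to 1.0 (A preferred), 0.0 (B preferred), or None."""
    match = re.match(r"\s*([AB])\b", reply.upper())
    if match is None:
        return None  # unparseable reply; the pair should be dropped
    return 1.0 if match.group(1) == "A" else 0.0


# Toy usage with made-up episodes.
good = EpisodeStats(steps=40, picked_up_key=True, key_drops=3, door_opened=True)
bad = EpisodeStats(steps=120, picked_up_key=False, key_drops=0, door_opened=False)
prompt = build_prompt("Open the door and drop the key 3 times.",
                      summarize(good), summarize(bad))
```

The `prompt` string would be sent to the LLM, and `parse_preference` applied to its reply yields the label for that trajectory pair.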

Agent Features

Architectures

  • PPO
  • reward predictor (2-layer FC)

Optimization Features

Training Optimization

  • train small reward predictor from LLM preferences
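As a rough sketch of what training a small reward predictor from preferences can look like, the block below fits a 2-layer fully connected network with a Bradley-Terry style cross-entropy loss in plain Python. Only the "2-layer FC" shape comes from this summary; the tanh activation, hidden size of 8, loss, learning rate, and the two toy features (door opened, normalized distance of the drop count from its target) are assumptions made for illustration.

```python
import math
import random


def init_params(n_in, n_hidden, rng):
    """Two fully connected layers: n_in -> n_hidden (tanh) -> 1."""
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    return {"w1": w1, "b1": [0.0] * n_hidden, "w2": w2, "b2": 0.0}


def forward(p, x):
    """Return (predicted reward, hidden activations) for one feature vector."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(p["w1"], p["b1"])]
    return sum(w * hi for w, hi in zip(p["w2"], h)) + p["b2"], h


def sgd_step(p, items, lr):
    """One SGD step; items is a list of (x, h, g) with g = dLoss/dReward."""
    # Hidden-layer gradients are computed first so that every term
    # uses the weights from before this update.
    hidden_grads = [[g * w2j * (1.0 - hj * hj) for w2j, hj in zip(p["w2"], h)]
                    for _, h, g in items]
    for (x, h, g), gh in zip(items, hidden_grads):
        for j, hj in enumerate(h):
            p["w2"][j] -= lr * g * hj
            p["b1"][j] -= lr * gh[j]
            for i, xi in enumerate(x):
                p["w1"][j][i] -= lr * gh[j] * xi
        p["b2"] -= lr * g


def train(pairs, n_in, epochs=200, lr=0.05, seed=0):
    """pairs: list of (features_a, features_b, y), y = 1.0 if A is preferred."""
    p = init_params(n_in, 8, random.Random(seed))
    for _ in range(epochs):
        for xa, xb, y in pairs:
            ra, ha = forward(p, xa)
            rb, hb = forward(p, xb)
            prob_a = 1.0 / (1.0 + math.exp(rb - ra))  # sigmoid(ra - rb)
            g = prob_a - y  # dLoss/d(ra); the gradient w.r.t. rb is -g
            sgd_step(p, [(xa, ha, g), (xb, hb, -g)], lr)
    return p


# Toy preferences over [door_opened, distance_from_target_drops] features.
pairs = [
    ([1.0, 0.0], [0.0, 1.0], 1.0),
    ([0.0, 1.0], [1.0, 0.0], 0.0),
    ([1.0, 0.0], [1.0, 1.0], 1.0),
    ([0.0, 0.0], [1.0, 0.0], 0.0),
]
params = train(pairs, n_in=2)
```

After training, `forward(params, features)[0]` gives a scalar score that can be used in place of the environment reward. In practice a framework such as PyTorch would replace the hand-written gradients.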

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments limited to MiniGrid; unclear transfer to larger or real-world tasks.
  • LLM query cost and latency are not measured or optimized.
  • Reward predictor uses episodic feedback only; no per-step real-time reward shown.
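To make the last limitation concrete, here is a minimal sketch of what episodic-only feedback could look like when fed to an RL algorithm: the predictor scores the whole episode once, and that score is placed on the final step while all earlier steps receive zero. The function name, the feature vector, and this placement are assumptions for illustration; the summary does not describe how the paper delivers the reward to PPO.

```python
from typing import Callable, List, Sequence


def episodic_rewards(n_steps: int,
                     episode_features: Sequence[float],
                     predictor: Callable[[Sequence[float]], float]) -> List[float]:
    """Return per-step rewards with the whole-episode score on the final step."""
    if n_steps <= 0:
        return []
    rewards = [0.0] * n_steps
    rewards[-1] = float(predictor(episode_features))
    return rewards


# Toy usage with a stand-in predictor (hypothetical linear score).
toy_rewards = episodic_rewards(4, [1.0, 0.0], lambda f: 2.0 * f[0] - f[1])
```

With this layout the agent still faces a sparse signal within each episode, which is why tasks needing dense per-step feedback fall outside what was shown.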

When Not To Use

  • Safety-critical systems where LLM hallucination risk is unacceptable
  • Tasks that require per-step real-time rewards or tight latency constraints
  • Settings where LLM query cost makes scaling prohibitive

Failure Modes

  • LLM produces inconsistent or biased preferences, leading to misaligned reward predictors
  • Reward predictor overfits LLM judgments and enables reward hacking by the agent
  • Method fails to scale beyond simple grid environments
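One inexpensive guard against the first failure mode is to query the judge twice with the two summaries swapped and keep the label only when both answers agree. The sketch below is an assumption-laden illustration rather than something the paper reports: `judge` is a hypothetical callable that wraps the LLM and returns `"first"` or `"second"`.

```python
from typing import Callable, Optional


def consistent_preference(judge: Callable[[str, str], str],
                          a: str, b: str) -> Optional[float]:
    """Ask the judge in both orders; return a label only if the answers agree.

    Returns 1.0 if `a` is preferred, 0.0 if `b` is preferred, and None when
    the verdict changes with presentation order.
    """
    original_order = judge(a, b)
    swapped_order = judge(b, a)
    if original_order == "first" and swapped_order == "second":
        return 1.0
    if original_order == "second" and swapped_order == "first":
        return 0.0
    return None  # order-dependent answer; discard the pair


# Stand-in judges for demonstration only.
def position_biased(first: str, second: str) -> str:
    return "first"  # always picks whatever is shown first


def prefers_shorter(first: str, second: str) -> str:
    return "first" if len(first) <= len(second) else "second"
```

Discarding order-dependent pairs doubles the number of LLM queries, so it trades cost for cleaner preference labels.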

Core Entities

Models

  • Mixtral 8x7B
  • QWenmax
  • PPO

Metrics

  • success rate
  • steps-to-converge
  • episode reward
  • constraint cost (squared diff)
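The summary names the constraint cost only as a "squared diff". A plausible reading, assumed here, is the squared difference between the observed count of the constrained event (for example, key drops) and its target; the function and parameter names are illustrative.

```python
def constraint_cost(observed_count: int, target_count: int = 3) -> int:
    """Squared difference between the observed and target event counts."""
    return (observed_count - target_count) ** 2
```

Under this reading the cost is zero only when the agent drops the key exactly the target number of times, and it grows quadratically as the count moves away from the target in either direction.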

Datasets

  • MiniGrid-Unlock-v0
  • MiniGrid-LavaGapS7-v0

Benchmarks

  • MiniGrid