KLQ: a token-level Q-learning alternative to PPO that matches reward performance and wins LLM-as-a-judge tests

August 23, 2025 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

0

Authors

Jason R Brown, Lennie Wells, Edward James Young, Sergio Bacallado

Why It Matters For Business

KLQ offers a theoretically grounded alternative to PPO that matches PPO on compute cost and reward performance and can produce outputs preferred by LLM judges, potentially improving product output quality without added infrastructure cost.

Summary TLDR

The authors introduce KLQ, an on-policy token-level Q-learning method for RLHF that uses a KL-regularised action-value decomposition and λ-returns. KLQ has a clearer theoretical motivation than PPO and is shown analytically to be equivalent to a modified PPO. Empirically, on TL;DR summarisation and Anthropic-HH single-turn dialogue, KLQ matches PPO on reward metrics and per-update compute and yields higher win-rates in LLM-as-a-judge comparisons, though the experiments are small-scale and noisy.

Problem Statement

PPO is the de-facto algorithm for LM-RLHF but handles the KL constraint heuristically. We need a token-level action-value method that (1) fits the KL-regularised LM setting, (2) lets you initialise from SFT policies, and (3) offers clearer theory and comparable or better empirical outcomes.

Main Contribution

KLQ: a token-level, KL-regularised Q-learning algorithm for LM-RLHF that trains an action-value function parametrised by the policy and a value head.

A λ-return based value estimator adapted to action-value learning to propagate sparse final reward signals across tokens.

An analytic proof that KLQ updates are equivalent to updates from a modified PPO (reverse-KL PPO-penalty) under a mapping between Q and (π,V).

Empirical evaluation on TL;DR summarisation and Anthropic-HH dialogue showing parity with PPO on reward and better LLM-as-a-judge win-rates.
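
The mapping between Q and (π,V) referred to above can be sketched with the standard identities of KL-regularised RL. This is an assumption about the form involved, not a transcription of the paper: τ (the KL coefficient) and π_ref (the SFT reference policy) are notation introduced here for illustration.

```latex
\begin{aligned}
\pi(a \mid s) &= \pi_{\mathrm{ref}}(a \mid s)\,
  \exp\!\left(\frac{Q(s,a) - V(s)}{\tau}\right), \\
V(s) &= \tau \log \sum_{a} \pi_{\mathrm{ref}}(a \mid s)\,
  \exp\!\left(\frac{Q(s,a)}{\tau}\right), \\
Q(s,a) &= V(s) + \tau \log \frac{\pi(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)}.
\end{aligned}
```

The third line is the first one rearranged. Under this reading, setting π = π_ref gives Q(s,a) = V(s) for every token, which is one way to see how training could start from an SFT policy plus a value head.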

Key Findings

KLQ matches PPO on the LM-RLHF reward objective and uses about the same compute per run.

Numbers: 75,000 episodes per run; training ~5 hours on 4 A100 GPUs; near-identical final validation rewards (figures shown).

KLQ's final policies are consistently preferred by an LLM judge compared to PPO across tested KL penalties.

Numbers: Evaluations used 32 validation prompts and both orderings (64 comparisons) per KL coefficient; KLQ had the higher win-rate.

Optimising KLQ's ℓ2 regression on λ-returns is analytically equivalent to a modified PPO update under a mapping between Q and (π,V).

Results

Final RLHF reward (validation)

Value: KLQ ≈ PPO (similar final validation reward curves)

Baseline: PPO

Compute per run (wall-clock)

Value: ≈5 hours on 4 A100 GPUs

Baseline: PPO

LLM-as-a-judge win-rate

Value: KLQ > PPO across tested KL penalties

Baseline: PPO

Reward vs KL trade-off (Pareto frontier)

Value: Similar trade-offs for KLQ and PPO; differences mostly noisy

Baseline: PPO

Who Should Care

What To Try In 7 Days

Run a small-scale KLQ finetune from your SFT model on a representative dataset to compare reward and judged quality.

Use the same hyperparameters and compute budget as your PPO runs to isolate algorithmic differences.

Run an LLM-as-a-judge comparison on ~30 prompts to get a quick signal of qualitative preference.
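
A quick judged comparison along these lines might be scored as in the sketch below. It queries the judge with both orderings of each pair, as in the evaluation described under Key Findings, so that a position-biased judge cannot inflate either side. The `judge` callable and its `"first"`/`"second"` return convention are hypothetical placeholders for whatever LLM call you use; they are not part of the paper or of any library.

```python
def win_rate_both_orderings(prompts, outputs_a, outputs_b, judge):
    """Fraction of comparisons won by system A, using both presentation orders.

    `judge(prompt, first, second)` is a hypothetical callable that returns
    the string "first" or "second" for the preferred response.
    """
    wins_a = 0
    total = 0
    for prompt, a, b in zip(prompts, outputs_a, outputs_b):
        # Ordering 1: A is shown first, so A wins if the judge says "first".
        wins_a += judge(prompt, a, b) == "first"
        # Ordering 2: B is shown first, so A wins if the judge says "second".
        wins_a += judge(prompt, b, a) == "second"
        total += 2
    return wins_a / total if total else float("nan")
```

With about 30 prompts this yields about 60 comparisons. A judge that always prefers whichever response comes first scores exactly 0.5 here, which is the point of swapping the order.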

Optimization Features

Token Efficiency

  • token-level (per-token) reward attribution via KL regularisation
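
One common way to get per-token rewards from a completion-level reward model is to charge a KL penalty at every token and add the reward-model score on the last token. The sketch below shows that convention; it is an assumption for illustration, and the paper's exact formulation may differ. The function name and the `beta` parameter are hypothetical.

```python
def token_rewards(logp_policy, logp_ref, final_reward, beta):
    """Per-token rewards for one completion.

    Each token t receives -beta * (log pi(a_t|s_t) - log pi_ref(a_t|s_t));
    the sparse reward-model score is added to the last token only.
    """
    if len(logp_policy) != len(logp_ref):
        raise ValueError("log-prob sequences must have the same length")
    if not logp_policy:
        raise ValueError("completion must contain at least one token")
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += final_reward
    return rewards


# Toy three-token completion with made-up log-probabilities.
example = token_rewards([-1.0, -0.5, -2.0], [-1.5, -0.5, -1.0],
                        final_reward=2.0, beta=0.1)
```

The per-token rewards sum to the reward-model score minus `beta` times the sampled log-ratio of the whole completion, so the sequence-level objective is unchanged while every token carries some signal.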

Training Optimization

  • λ-returns for action-values (propagates sparse final rewards)
  • action-value decomposition linking Q to policy and value head
  • on-policy minibatch gradient updates (multiple epochs per batch)
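
The λ-return idea can be illustrated with the textbook backward recursion below. It is a generic TD(λ) sketch, not the paper's estimator: how KLQ adapts it to action-values is not spelled out in this summary, and the defaults `gamma=1.0` and `lam=0.95` are assumptions chosen for the example.

```python
def lambda_returns(rewards, values, gamma=1.0, lam=0.95):
    """Compute lambda-returns with the recursion
    G_t = r_t + gamma * ((1 - lam) * V_{t+1} + lam * G_{t+1}).

    values[t] is the value estimate of the state before token t.
    The value after the final token is taken to be 0 (episode ends).
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    returns = [0.0] * len(rewards)
    next_return = 0.0  # G_{t+1}; zero past the end of the episode
    next_value = 0.0   # V_{t+1}; zero past the end of the episode
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + gamma * (
            (1.0 - lam) * next_value + lam * next_return
        )
        next_return = returns[t]
        next_value = values[t]
    return returns
```

With `lam=1.0` every token's target is the full Monte Carlo return, so a reward that arrives only at the last token reaches the first token directly. With `lam=0.0` each target bootstraps from the next value estimate alone. In an action-value setting these returns would act as ℓ2 regression targets for Q at each (state, token) pair.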

Reproducibility

Data Urls

  • TL;DR dataset
  • Anthropic-HH dataset
  • HuggingFace model hub (used SFT and reward models)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments are small-scale: 75k episodes per run and ~5 hours on 4 A100 GPUs; few repeats.
  • Used off-the-shelf SFT and reward models from HuggingFace; reward model quality and biases were not controlled.
  • LLM-as-a-judge results have wide confidence intervals due to limited comparisons.
  • No large hyperparameter sweep or many independent seeds to quantify variance.
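
To see why 64 comparisons give wide intervals, a Wilson score interval can be computed as sketched below. The choice of the Wilson interval and the example count of 38 wins are assumptions for illustration; the summary does not state which interval the authors used or the actual win counts.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical outcome: 38 wins out of 64 comparisons (about 59%).
low, high = wilson_interval(38, 64)
```

For this hypothetical 59% win-rate the interval still contains 0.5, so a result of that size would not by itself separate the two methods. The two orderings of one prompt are also not independent samples, so an interval that treats all 64 comparisons as independent is, if anything, too narrow.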

When Not To Use

  • If you need thoroughly validated large-scale RLHF results; KLQ currently has only preliminary-scale evidence.
  • If you cannot add a token-level value head or prefer completion-level grouped-rollout methods without value heads.

Failure Modes

  • Training noise can dominate small differences between KLQ and PPO, leading to ambiguous outcomes.
  • Poor reward models or SFT policies may bias both methods and hide algorithmic benefits.
  • Un-tuned hyperparameters might make KLQ underperform PPO despite theoretical advantages.

Core Entities

Models

  • SFT policy models (off-the-shelf, from HuggingFace)
  • Pythia-1B based reward model (TL;DR)
  • GPT2-large based reward model (Anthropic-HH)
  • GPT-4o mini (LLM-as-a-judge)
  • TRL (custom fork) training library

Metrics

  • RLHF reward model score
  • SFT
  • LLM-as-a-judge win-rate
  • wall-clock training time

Datasets

  • TL;DR (summarisation)
  • Anthropic-HH (single-turn dialogue)