Fixes RND's 'bonus inconsistency' by distilling many random targets to produce pseudo-counts for better exploration and offline conservatism

January 18, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

3

Authors

Kai Yang, Jian Tao, Jiafei Lyu, Xiu Li

Links

Abstract / PDF

Why It Matters For Business

DRND improves exploration bonuses while adding negligible compute, so teams can get better exploration or safer offline policies without infrastructure changes or major engineering effort.

Summary TLDR

RND (random network distillation) gives noisy and inconsistent novelty bonuses. DRND learns a predictor that fits a distribution of fixed random target networks, then uses the predictor's statistics as two bonus terms: an initial uniformizing novelty bonus and a later pseudo-count-like bonus that scales ~1/√n. DRND adds little compute, plugs into PPO or SAC, improves exploration on Atari and robotic tasks, and serves as an effective anti-exploration penalty in offline D4RL, improving average ensemble-free offline scores.
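
A minimal NumPy sketch of how the two bonus terms might be computed from network outputs. It assumes the forms b1 = ||f(x) - mu(x)||^2 and b2 = sqrt((f(x)^2 - mu(x)^2) / (B2(x) - mu(x)^2)), where mu and B2 are the first and second moments of the N target outputs and b2 is averaged over output dimensions. The function name `drnd_bonus`, the clipping to [0, 1], and the `eps` guard are illustrative assumptions, not the authors' code.

```python
import numpy as np

def drnd_bonus(pred, targets, alpha=0.9, eps=1e-8):
    """Combine the two bonus terms (illustrative sketch).

    pred:    (batch, d) predictor outputs
    targets: (N, batch, d) outputs of the N fixed random target networks
    """
    mu = targets.mean(axis=0)                    # first moment of the targets
    second_moment = (targets ** 2).mean(axis=0)  # second moment of the targets

    # First bonus: squared distance between the predictor and the target mean.
    b1 = ((pred - mu) ** 2).sum(axis=1)

    # Second bonus: per-dimension ratio that estimates 1/n, then a square root.
    ratio = (pred ** 2 - mu ** 2) / (second_moment - mu ** 2 + eps)
    ratio = np.clip(ratio, 0.0, 1.0)             # assumption: keep the estimate in [0, 1]
    b2 = np.sqrt(ratio.mean(axis=1))

    return alpha * b1 + (1.0 - alpha) * b2
```

With a single target (N=1) and `alpha=1.0`, this reduces to the plain RND squared prediction error.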

Problem Statement

Random Network Distillation (RND) uses prediction error as a novelty bonus but produces inconsistent bonuses: extreme, non-uniform bonuses at initialization and poor alignment with state visitation counts after training. This 'bonus inconsistency' weakens deep exploration and limits RND's use as an anti-exploration penalty in offline RL.

Main Contribution

Identify 'bonus inconsistency' in RND and separate it into initial and final inconsistency.

Propose DRND: distill a distribution of N fixed random target networks into one predictor and derive two bonus terms: an early-stage high-pass novelty term and a late-stage pseudo-count estimate.

Show theoretical links: analytic expressions under linear models and an unbiased statistic that estimates 1/n (pseudo-count).
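
The 1/n estimate can be motivated by a short moment calculation. This is a sketch under the assumption that the predictor converges to the empirical mean of the n targets sampled for a state x, each drawn uniformly and independently from the N fixed targets; the symbols \mu, B_2 and y are notation chosen here for illustration.

```latex
\begin{aligned}
f^*(x) &= \frac{1}{n}\sum_{i=1}^{n} c_i(x), \qquad
  c_i(x) \sim \mathrm{Uniform}\{\bar f_1(x),\dots,\bar f_N(x)\},\\
\mu(x) &= \frac{1}{N}\sum_{j=1}^{N}\bar f_j(x), \qquad
  B_2(x) = \frac{1}{N}\sum_{j=1}^{N}\bar f_j(x)^2,\\
\mathbb{E}\big[f^*(x)^2\big] &= \mu(x)^2 + \frac{B_2(x)-\mu(x)^2}{n},\\
y(x) &= \frac{f^*(x)^2-\mu(x)^2}{B_2(x)-\mu(x)^2}
  \;\Rightarrow\; \mathbb{E}\big[y(x)\big] = \frac{1}{n}.
\end{aligned}
```

Taking the square root of y(x) then gives a bonus on the order of 1/√n.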

Empirically demonstrate better bonus distributions, faster/better learning on Atari and robotics tasks, and stronger anti-exploration performance in D4RL offline benchmarks with minimal extra compute.

Key Findings

DRND produces a much more uniform initial bonus than RND

Numbers: D_KL(P||U): RND 0.0377±0.0248 vs DRND 0.0070±0.0063 (before training)

After training DRND's bonus aligns better with a 1/√n pseudo-count target than RND

Numbers: D_KL(P||1/√n): RND 0.0946±0.0409 vs DRND 0.0476±0.0389 (after training)
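
The divergences in these findings can be reproduced in spirit by normalizing per-state bonuses into a distribution P and comparing it with a reference distribution. A minimal sketch, assuming a discrete state set with known visit counts n; the helper names `normalize` and `kl_divergence` and the toy numbers are illustrative and not taken from the paper.

```python
import numpy as np

def normalize(values):
    """Scale non-negative values so they sum to 1."""
    values = np.asarray(values, dtype=float)
    return values / values.sum()

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL(P || Q) in nats; eps guards against log(0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

counts = np.array([1, 4, 9, 16, 25])            # hypothetical visit counts per state
bonus = np.array([0.9, 0.55, 0.3, 0.27, 0.2])   # hypothetical per-state bonuses

p = normalize(bonus)
uniform = np.full(len(counts), 1.0 / len(counts))
pseudo_count_ref = normalize(1.0 / np.sqrt(counts))

kl_to_uniform = kl_divergence(p, uniform)                 # used before training
kl_to_pseudo_count = kl_divergence(p, pseudo_count_ref)   # used after training
```

Lower values mean the bonus distribution is closer to the reference; the uniform reference applies to an untrained predictor, the 1/√n reference to a trained one.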

SAC-DRND improves average normalized offline scores vs strong ensemble-free baselines

Numbers: average score (ensemble-free): SAC-DRND 86.0 vs SAC-RND 82.6 (final-step averages)

DRND adds negligible runtime overhead compared to RND

Numbers: updates/sec similar; runtime slightly lower for DRND in D4RL medium tests

Results

Initial bonus uniformity

Value: D_KL(P||U) before training

Baseline: RND

Alignment with 1/√n after training

Value: D_KL(P||1/√n) after training

Baseline: RND

Average normalized D4RL score (ensemble-free set)

Value: 86.0 (SAC-DRND)

Baseline: SAC-RND 82.6

Compute throughput

Value: updates/sec comparable

Baseline: RND

Who Should Care

What To Try In 7 Days

Replace RND with DRND (N=10, α=0.9) in an existing PPO or SAC pipeline and compare returns.

Run the mini-dataset bonus diagnostic from the paper to check RND bonus inconsistency in your task.

Use SAC-DRND as an anti-exploration penalty on a small D4RL-like offline dataset and measure value conservatism.

Agent Features

Memory

  • implicit pseudo-count in predictor weights (estimates 1/n)

Tool Use

  • intrinsic reward module (plug-in for PPO or SAC)

Frameworks

  • PPO integration
  • SAC integration

Is Agentic

true

Architectures

  • predictor network + N fixed random target networks (no training of targets)
  • bilinear first layer + FiLM for offline tasks (implementation detail)

Optimization Features

Infra Optimization

  • works with PyTorch or Jax; no special hardware required

System Optimization

  • keeps runtime similar to RND; authors report comparable updates/sec and slightly lower runtime in D4RL medium tests

Training Optimization

  • only predictor updated; target networks fixed (no extra target backprop)
  • uses fixed N targets to reduce extreme initial bonuses
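
A toy illustration of this training scheme: the targets stay frozen, and each update regresses the predictor onto one target sampled per input. A linear predictor on one-hot states with a hand-written gradient stands in for the neural network; uniform target sampling, the sizes, the learning rate, and all names are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, out_dim, num_targets = 5, 3, 10

# Fixed random targets: one output table per target, never updated.
target_tables = rng.normal(size=(num_targets, num_states, out_dim))
# Linear predictor on one-hot states, so row s of W is the prediction for state s.
W = np.zeros((num_states, out_dim))

def train_step(W, states, lr=0.05):
    """One SGD step on ||predictor(x) - c(x)||^2, with c sampled uniformly per input."""
    x = np.eye(num_states)[states]                      # (batch, num_states) one-hot inputs
    idx = rng.integers(num_targets, size=len(states))   # one target index per sample
    c = target_tables[idx, states]                      # (batch, out_dim) sampled targets
    err = x @ W - c
    grad = 2.0 * x.T @ err / len(states)                # gradient w.r.t. W only
    return W - lr * grad

states = rng.integers(num_states, size=20_000)
for start in range(0, len(states), 32):
    W = train_step(W, states[start:start + 32])

mu = target_tables.mean(axis=0)          # per-state mean over the N targets
gap_to_mean = np.abs(W - mu).max()       # shrinks as states are visited more often
```

Under this setup the predictor drifts toward the per-state mean of the targets, which is what lets its residual deviation carry count information.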

Inference Optimization

  • intrinsic bonus computation is simple (means and moments) and batched

Reproducibility

Data Urls

  • D4RL (public)
  • Atari (public)
  • Adroit, Fetch (public)

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Second bonus (pseudo-count estimate) has high variance for small n, so estimates are noisy on tiny datasets.
  • Performance gains are environment-dependent; little or no improvement in simple tasks (e.g., 'Pen' Adroit).
  • Requires selecting N and α; authors recommend N≈10 and α≈0.9 but sensitivity exists.
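
The spread of the pseudo-count statistic at small n can be checked with a quick simulation. This is a sketch under the assumption that the converged predictor equals the mean of n targets sampled with replacement from the N fixed targets, using scalar outputs; `simulate_y` and the trial count are illustrative choices.

```python
import numpy as np

def simulate_y(n, target_values, trials=200_000, seed=0):
    """Sample y = (m^2 - mu^2) / (B2 - mu^2), where m is the mean of n sampled targets."""
    rng = np.random.default_rng(seed)
    mu = target_values.mean()
    second_moment = (target_values ** 2).mean()
    # Each row holds the n targets a state would have been regressed onto.
    draws = rng.choice(target_values, size=(trials, n), replace=True)
    m = draws.mean(axis=1)
    return (m ** 2 - mu ** 2) / (second_moment - mu ** 2)

target_values = np.random.default_rng(1).normal(size=10)  # N = 10 fixed scalar targets

for n in (1, 4, 16):
    y = simulate_y(n, target_values)
    print(f"n={n:2d}  mean(y)={y.mean():.4f}  1/n={1 / n:.4f}  std(y)={y.std():.4f}")
```

The sample mean should track 1/n, while individual draws scatter around it and can even be negative, which is why a single-state estimate is unreliable when data is scarce.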

When Not To Use

  • When the task does not require deep exploration or has rich dense rewards.
  • When dataset sizes are tiny and pseudo-count variance dominates.
  • If you cannot tune α or N for your domain.

Failure Modes

  • Over-reliance on the first bonus term (b1) early in training could overweight novelty in stochastic or noisy states.
  • Poor α/N settings can reduce benefits or revert to RND behavior (α=1, N=1).
  • May mis-estimate counts in very high-dimensional states without feature or architecture adjustments.

Core Entities

Models

  • DRND
  • RND
  • SAC-DRND
  • SAC-RND
  • PPO-DRND (PPO+DRND)
  • CFN
  • ICM
  • PPO
  • SAC

Metrics

  • KL divergence (bonus vs target distributions)
  • mean episodic return
  • average normalized D4RL score
  • success rate (AntMaze)
  • updates per second / runtime

Datasets

  • D4RL (MuJoCo, AntMaze)
  • Atari (Montezuma's Revenge, Gravitar, Venture)
  • Adroit
  • Fetch manipulation
  • custom mini one-hot dataset for inconsistency tests

Benchmarks

  • D4RL
  • Adroit
  • Fetch
  • Expected Online Performance (EOP)