Train a small code LLM to write and refine programs using hindsight relabeling and prioritized replay

February 7, 2024 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

1

Authors

Natasha Butt, Blazej Manczak, Auke Wiggers, Corrado Rainone, David W. Zhang, Michaël Defferrard, Taco Cohen

Links

Abstract / PDF

Why It Matters For Business

CodeIt shows small, open code LMs can be iteratively improved to solve nontrivial program synthesis tasks, lowering dependency on expensive huge-LM APIs and enabling automated DSL-based tools for constrained domains.

Summary TLDR

CodeIt is an expert-iteration method that trains a code-focused LLM to write programs for ARC tasks by alternating between program sampling (with hindsight relabeling) and learning from a prioritized replay buffer. Using a 220M-parameter CodeT5+ model and a domain-specific language (DSL), CodeIt achieves state-of-the-art performance on the full ARC evaluation set (59/400 tasks solved). Ablations show that hindsight relabeling and prioritized sampling drive most of the gains, while the approach still requires a DSL, an interpreter, and seed programs.

Problem Statement

ARC tasks give few input-output examples and produce extremely sparse rewards for program search. Existing neural and symbolic methods either do not learn across tasks or are sample inefficient. The challenge is to bootstrap program synthesis and inter-task generalization with limited data and rare positive signals.

Main Contribution

Code Iteration (CodeIt): alternate sampling programs, hindsight relabeling, and prioritized experience replay.

Use of a small pretrained code LLM (CodeT5+) with a sparse text grid representation to scale to the full ARC eval set.

State-of-the-art ARC results (59/400 tasks solved) and analyses showing program refinement and primitive learning patterns.

Systematic ablations highlighting the roles of relabeling, prioritized replay, pretraining, and mutation augmentation.
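
The alternation described in the first contribution can be sketched as a plain control loop. This is a minimal illustration, not the authors' implementation: the function name `code_iteration`, the callable parameters, and the `Experience` tuple layout are all assumptions, and the demo callables are toy stand-ins for the language model.

```python
from typing import Any, Callable, List, Tuple

# (demo inputs, demo outputs, program text); this field layout is an assumption.
Experience = Tuple[Any, Any, str]


def code_iteration(
    tasks: List[Any],
    sample_and_relabel: Callable[[Any], List[Experience]],
    sample_replay: Callable[[List[Experience]], List[Experience]],
    finetune: Callable[[List[Experience]], None],
    n_meta_iters: int,
) -> List[Experience]:
    """Alternate a sampling stage and a learning stage for n_meta_iters rounds."""
    buffer: List[Experience] = []
    for _ in range(n_meta_iters):
        # Sampling stage: query the current policy on every task and store
        # the (relabeled) experiences it produces.
        for task in tasks:
            buffer.extend(sample_and_relabel(task))
        # Learning stage: draw a batch from the buffer and update the policy.
        finetune(sample_replay(buffer))
    return buffer


# Toy stand-ins so the sketch runs end to end; a real setup would call the LM here.
finetune_calls: List[int] = []


def demo_sample(task: int) -> List[Experience]:
    return [(task, task * 2, f"double({task})")]


def demo_replay(buffer: List[Experience]) -> List[Experience]:
    return buffer[-2:]  # placeholder for prioritized sampling


def demo_finetune(batch: List[Experience]) -> None:
    finetune_calls.append(len(batch))  # placeholder for a gradient step


demo_buffer = code_iteration(
    [1, 2, 3], demo_sample, demo_replay, demo_finetune, n_meta_iters=2
)
```

Passing the three stages in as callables keeps the loop independent of any particular model, DSL, or buffer policy, which makes each stage easy to swap out in an ablation.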

Key Findings

CodeIt solves more ARC tasks than prior methods.

Numbers: 59/400 tasks solved (pass@3, ARC Eval)

Hindsight relabeling substantially improves sample efficiency.

Numbers: Ablation A2 cumulative performance 42/400 vs. CodeIt 59/400 (CodeIt is ~40% higher)

Prioritized sampling reduces forgetting and helps policy performance.

Numbers: Policy performance drops from 49/400 to 38/400 without prioritization

Pretraining gives a strong head-start.

Numbers: Cumulative performance 59/400 (pretrained) vs. 35/400 (random init)

CodeIt refines solutions over time.

Numbers: Shorter solutions found later for 53% of solved tasks

Performance increases with model size but has diminishing returns.

Numbers: 220M outperforms 60M; 770M gives a small further gain; Mistral-7B is slower and overfits in LoRA tests

Results

ARC evaluation solved tasks (pass@3)

Value: 59/400

Baseline: Ainooson et al. 26/400; Ferré 23/400

Policy-only performance

Value: 49/400

Baseline: CodeIt cumulative 59/400

Effect of removing pretraining

Value: 35/400 cumulative

Baseline: 59/400 (full method)

Effect of removing priority

Value: policy performance 38/400; cumulative 58/400

Baseline: policy performance 49/400; cumulative 59/400

Who Should Care

What To Try In 7 Days

Implement hindsight relabeling: collect syntactically valid program outputs and relabel goals with realized outputs.

Add prioritized replay keyed by percent-demo-match to keep solved behaviors during finetuning.

Bootstrap: seed a small code LM (e.g., CodeT5+ 220M) with a handful of DSL example programs, then run a few meta-iterations of sampling and training.
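
The relabeling step in the first item above can be sketched as follows. This is a minimal illustration under stated assumptions: the three-primitive grid DSL, the function names `run_program` and `hindsight_relabel`, and the dictionary layout of an experience are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = Tuple[Tuple[int, ...], ...]

# Toy three-primitive DSL for illustration only; not the DSL used in the paper.
PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "hflip": lambda g: tuple(row[::-1] for row in g),
    "vflip": lambda g: g[::-1],
    "transpose": lambda g: tuple(zip(*g)),
}


def run_program(program: List[str], grid: Grid) -> Optional[Grid]:
    """Apply each primitive in order; return None if a name is unknown."""
    for name in program:
        if name not in PRIMITIVES:
            return None
        grid = PRIMITIVES[name](grid)
    return grid


def hindsight_relabel(
    program: List[str], demo_inputs: List[Grid], target_outputs: List[Grid]
) -> Optional[dict]:
    """Turn a sampled program into a training example whose goal is what it actually did."""
    realized = [run_program(program, g) for g in demo_inputs]
    if not realized or any(out is None for out in realized):
        return None  # invalid programs are discarded
    matches = sum(out == tgt for out, tgt in zip(realized, target_outputs))
    return {
        "inputs": demo_inputs,
        "outputs": realized,  # relabeled goal: the realized outputs, not the targets
        "program": program,
        "demo_match": matches / len(realized),  # fraction of demos that hit the real target
    }


grid = ((1, 2), (3, 4))
target = [((2, 1), (4, 3))]  # the task actually asks for a horizontal flip
missed = hindsight_relabel(["vflip"], [grid], target)
hit = hindsight_relabel(["hflip"], [grid], target)
invalid = hindsight_relabel(["rotate"], [grid], target)
```

Even the `missed` sample is a correct (inputs, outputs, program) triple for a different task, which is why relabeling produces useful training data when true solutions are rare.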

Agent Features

Memory

  • prioritized replay buffer (experience memory)
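
A prioritized buffer of this kind might look like the sketch below. It assumes the priority is the fraction of demonstration outputs an experience matches (as suggested elsewhere on this page) and uses simple weighted sampling with a small floor so unsolved experiences are still occasionally replayed. The class name, the `floor` parameter, and the weighting scheme are illustrative assumptions, not the paper's exact rule.

```python
import random
from typing import Any, List


class PrioritizedReplayBuffer:
    """Stores experiences and samples them in proportion to their priority."""

    def __init__(self, floor: float = 0.05) -> None:
        self.items: List[Any] = []
        self.priorities: List[float] = []
        self.floor = floor  # minimum weight given to low-priority experiences

    def add(self, experience: Any, demo_match: float) -> None:
        self.items.append(experience)
        self.priorities.append(max(demo_match, self.floor))

    def sample(self, k: int, rng: random.Random) -> List[Any]:
        # Weighted sampling with replacement; higher priority is drawn more often.
        return rng.choices(self.items, weights=self.priorities, k=k)

    def __len__(self) -> int:
        return len(self.items)


# With floor=0.0, experiences that match no demonstrations are never replayed.
strict = PrioritizedReplayBuffer(floor=0.0)
strict.add("solves_task", demo_match=1.0)
strict.add("random_program", demo_match=0.0)
batch = strict.sample(k=20, rng=random.Random(0))
```

Setting `floor` above zero trades some replay of solved behaviors for diversity; the value that works best would need to be tuned.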

Tool Use

  • interpreter execution of sampled programs

Frameworks

  • Expert Iteration (ExIt)
  • Hindsight Experience Replay
  • Prioritized Experience Replay

Architectures

  • encoder-decoder LLM policy (CodeT5+)

Optimization Features

Token Efficiency

  • sparse object-centric grid text representation to reduce token use
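
To illustrate why a sparse representation saves tokens, the sketch below lists only non-background cells, grouped by color. The exact text format here (`size=`, `bg=`, `color:row,col`) is a hypothetical stand-in, and grouping by color is a simplification of a true object-centric encoding; the paper's format may differ.

```python
from collections import Counter
from typing import Dict, List, Sequence


def sparse_encode(grid: Sequence[Sequence[int]]) -> str:
    """Encode a grid as its size, background color, and non-background cells."""
    height, width = len(grid), len(grid[0])
    counts = Counter(color for row in grid for color in row)
    background = counts.most_common(1)[0][0]  # most frequent color is treated as background
    cells_by_color: Dict[int, List[str]] = {}
    for r, row in enumerate(grid):
        for c, color in enumerate(row):
            if color != background:
                cells_by_color.setdefault(color, []).append(f"{r},{c}")
    parts = [f"size={height}x{width}", f"bg={background}"]
    for color in sorted(cells_by_color):
        parts.append(f"{color}:" + " ".join(cells_by_color[color]))
    return " | ".join(parts)


def dense_encode(grid: Sequence[Sequence[int]]) -> str:
    """Baseline: write out every cell, one row per line."""
    return "\n".join(" ".join(str(color) for color in row) for row in grid)


small = [[0, 0, 0], [0, 3, 0], [0, 0, 5]]
small_text = sparse_encode(small)

# A mostly empty 10x10 grid, where the sparse form pays off.
large = [[0] * 10 for _ in range(10)]
large[1][1] = 3
large_sparse, large_dense = sparse_encode(large), dense_encode(large)
```

Character counts are only a proxy for token counts, and on very small or very busy grids the sparse form can be longer than the dense one.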

Training Optimization

  • prioritized sampling from replay
  • likelihood finetuning on synthetic experiences
  • mutation-based data augmentation
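
Mutation-based augmentation can be illustrated with a toy version that swaps one primitive in a seed program for a different one. This is a simplified sketch: programs are assumed to be flat lists of primitive names, and `mutate_program` is a hypothetical helper. A real DSL would also need to respect argument types when substituting calls.

```python
import random
from typing import List, Sequence


def mutate_program(
    program: Sequence[str], primitives: Sequence[str], rng: random.Random
) -> List[str]:
    """Return a copy of `program` with exactly one primitive replaced."""
    if not program:
        raise ValueError("cannot mutate an empty program")
    idx = rng.randrange(len(program))
    alternatives = [p for p in primitives if p != program[idx]]
    if not alternatives:
        raise ValueError("need at least one alternative primitive")
    mutated = list(program)  # copy so the seed program is left untouched
    mutated[idx] = rng.choice(alternatives)
    return mutated


seed_program = ["hflip", "transpose"]
vocabulary = ["hflip", "vflip", "transpose"]
mutated = mutate_program(seed_program, vocabulary, random.Random(0))
```

Running each mutated program through the interpreter on existing task inputs would then yield new (inputs, outputs, program) examples for the buffer.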

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires a hand-designed DSL and an interpreter to evaluate candidate programs.
  • Needs an initial set of seed programs; learning tabula rasa is slower.
  • Struggles to find solutions when ground-truth programs exceed ~11 lines.
  • Performs worse on numerical or pure-logic tasks compared to object-interaction tasks.

When Not To Use

  • If you lack a deterministic interpreter to run candidate programs.
  • When the domain has no compact DSL or symbolic semantics.
  • For tasks that require very long programs or deep multi-step plans without intermediate supervision.

Failure Modes

  • Catastrophic forgetting when replay is uniform (no prioritization).
  • Stagnation if hindsight relabeling is disabled (too few positives).
  • Overfitting large LMs with simple LoRA finetuning on small synthetic corpora.

Core Entities

Models

  • CodeT5+ (220M)
  • CodeT5 60M
  • CodeT5 770M
  • Mistral-7B (tested qualitatively)

Metrics

  • pass@3
  • pass@1
  • solved tasks (count)
  • demonstration performance (percent solved demos)
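
The counting metrics above can be computed as in the sketch below. It assumes that a task counts as solved at k if any of its first k candidate predictions equals the target, which is a simple attempt-based reading of pass@k and not the unbiased pass@k estimator used in some code-generation benchmarks. The function names and data layout are illustrative.

```python
from typing import Any, Dict, Sequence


def solved_at_k(
    candidates: Dict[str, Sequence[Any]], targets: Dict[str, Any], k: int = 3
) -> int:
    """Count tasks where any of the first k candidate predictions equals the target."""
    return sum(
        any(pred == targets[task_id] for pred in preds[:k])
        for task_id, preds in candidates.items()
    )


def demo_performance(realized: Sequence[Any], expected: Sequence[Any]) -> float:
    """Fraction of demonstration outputs that a program reproduces exactly."""
    matches = sum(r == e for r, e in zip(realized, expected))
    return matches / len(expected)


predictions = {
    "t1": ["a", "b", "c", "x"],  # correct answer is the third attempt
    "t2": ["x", "y", "z", "b"],  # correct answer only appears as the fourth attempt
}
answers = {"t1": "c", "t2": "b"}
n_solved = solved_at_k(predictions, answers, k=3)
partial = demo_performance(["a", "b", "c"], ["a", "x", "c"])
```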

Datasets

  • ARC (train 400, eval 400 / eval412)
  • ConceptARC

Benchmarks

  • ARC evaluation set
  • ConceptARC

Context Entities

Models

  • GPT-4 (comparison baseline)
  • text-davinci-003 (comparison)

Datasets

  • ARC hidden/competition splits (discussed)