Train a small code LLM to write and refine programs using hindsight relabeling and prioritized replay

February 7, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The method is practical for DSL-backed program synthesis and small-model deployments, but it depends on a hand-crafted DSL, an interpreter, and seed programs; expect moderate engineering effort to adapt to new domains.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 70%

Authors

Natasha Butt, Blazej Manczak, Auke Wiggers, Corrado Rainone, David W. Zhang, Michaël Defferrard, Taco Cohen

Why It Matters For Business

CodeIt shows small, open code LMs can be iteratively improved to solve nontrivial program synthesis tasks, lowering dependency on expensive huge-LM APIs and enabling automated DSL-based tools for constrained domains.

Summary TLDR

CodeIt is an expert-iteration method that trains a code-focused LLM to write programs for ARC tasks by alternating between program sampling (with hindsight relabeling) and learning from a prioritized replay buffer. Using a 220M-parameter CodeT5+ model and a domain-specific language (DSL), CodeIt achieves state-of-the-art performance on the full ARC evaluation set (59/400 tasks solved, pass@3). Ablations show that hindsight relabeling and prioritized sampling drive most of the gains, but the approach still needs a DSL, an interpreter, and seed programs.

Problem Statement

ARC tasks give only a few input-output examples and produce extremely sparse rewards for program search. Existing neural and symbolic methods either do not learn across tasks or are sample-inefficient. The challenge is to bootstrap program synthesis and inter-task generalization from limited data and rare positive signals.

Main Contribution

Code Iteration (CodeIt): an expert-iteration loop that alternates between program sampling with hindsight relabeling and learning from prioritized experience replay.

Use of a small pretrained code LLM (CodeT5+) with a sparse text grid representation to scale to the full ARC eval set.
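The alternation between the two stages can be sketched in a few lines. This is a minimal illustration under loud assumptions, not the authors' implementation: the three-primitive DSL, the `run` interpreter, the random `sample_program` stand-in for the LLM policy, and the `1.0 + match` priority are all hypothetical, and the fine-tuning step is omitted, so the stand-in policy never improves.

```python
import random

# Hypothetical toy DSL: a program is a tuple of primitive names applied in order.
PRIMITIVES = {
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: [list(row) for row in g[::-1]],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}

def run(program, grid):
    """Toy interpreter: apply each primitive in sequence."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

def sample_program(rng, max_len=2):
    """Stand-in for the LLM policy: a random primitive sequence."""
    names = sorted(PRIMITIVES)
    return tuple(rng.choice(names) for _ in range(rng.randint(1, max_len)))

def codeit_sketch(tasks, meta_iterations=3, samples_per_task=4, batch_size=8, seed=0):
    rng = random.Random(seed)
    buffer, solved, batches = [], set(), []
    for _ in range(meta_iterations):
        # Sampling stage: sample programs, execute them, relabel with realized outputs.
        for task_id, (inputs, targets) in tasks.items():
            for _ in range(samples_per_task):
                program = sample_program(rng)
                realized = [run(program, g) for g in inputs]
                match = sum(r == t for r, t in zip(realized, targets)) / len(targets)
                buffer.append({"inputs": inputs, "outputs": realized,
                               "program": program, "priority": 1.0 + match})
                if match == 1.0:
                    solved.add(task_id)
        # Learning stage: draw a prioritized batch. A real system would
        # fine-tune the policy on it; that step is omitted here.
        weights = [e["priority"] for e in buffer]
        batches.append(rng.choices(buffer, weights=weights, k=batch_size))
    return buffer, solved, batches

# One task with a single demonstration pair (a horizontal flip).
tasks = {"t1": ([[[1, 2], [3, 4]]], [[[2, 1], [4, 3]]])}
buffer, solved, batches = codeit_sketch(tasks)
```

Every sampled program contributes one buffer entry whose stored outputs are exactly what the program produced, so the buffer grows even in iterations where no task is solved.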

Key Findings

CodeIt solves more ARC tasks than prior methods.

Numbers: 59/400 tasks solved (pass@3, ARC Eval)

Practical Use: With a DSL and an interpreter, CodeIt can find correct programs for about 15% of ARC evaluation tasks (59/400) without human labels.

Evidence Ref: Table 1, Sec. 3.2
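To make the headline number concrete, the sketch below scores a task as solved when any of up to three candidate outputs equals the expected test grid, then converts the solved count into a rate. The function names are hypothetical, and treating each task as having a single test grid is a simplifying assumption.

```python
def solved_at_k(candidate_outputs, expected_output, k=3):
    """A task counts as solved if any of the first k candidates matches exactly."""
    return any(out == expected_output for out in candidate_outputs[:k])

def solve_rate(solved_flags):
    """Return (number solved, fraction solved) over all tasks."""
    solved = sum(solved_flags)
    return solved, solved / len(solved_flags)

expected = [[1, 1], [0, 0]]
candidates = [
    [[0, 0], [1, 1]],  # wrong
    [[1, 1], [0, 0]],  # correct, within the first three attempts
    [[1, 0], [1, 0]],  # wrong
]
print(solved_at_k(candidates, expected))  # True

# Reproduce the reported rate: 59 solved tasks out of 400.
flags = [True] * 59 + [False] * 341
count, rate = solve_rate(flags)
print(count, round(100 * rate, 2))  # 59 14.75
```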

Hindsight relabeling substantially improves sample efficiency.

Numbers: Ablation A2: cumulative performance 42/400 vs. CodeIt 59/400 (about +40%)

Practical Use: Relabeling every syntactically valid program with its realized output yields more usable training data than filtering only correct programs.

Evidence Ref: Table 2, Sec. 3.3, Conclusion
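The contrast between relabeling and filtering can be shown with a toy example. This is a sketch under stated assumptions: the two-primitive DSL, the `interpret` function, and the experience dictionary layout are hypothetical and stand in for whatever DSL and interpreter a real setup uses.

```python
# Hypothetical one-primitive-per-program DSL, for illustration only.
PRIMITIVES = {
    "identity": lambda g: [list(row) for row in g],
    "flip_h": lambda g: [row[::-1] for row in g],
}

def interpret(program, grid):
    """Toy interpreter; raises KeyError for programs outside the DSL."""
    return PRIMITIVES[program](grid)

def hindsight_relabel(inputs, targets, programs):
    """Keep every program that executes, paired with the outputs it actually produced."""
    experiences = []
    for program in programs:
        try:
            realized = [interpret(program, g) for g in inputs]
        except KeyError:
            continue  # invalid program: nothing to relabel
        matches = sum(r == t for r, t in zip(realized, targets))
        experiences.append({
            "inputs": inputs,
            "outputs": realized,  # the realized outputs become the goal
            "program": program,
            "demo_match": matches / len(targets),
        })
    return experiences

inputs = [[[1, 0], [0, 0]]]
targets = [[[0, 0], [0, 1]]]  # needs a 180-degree rotation, which the toy DSL lacks
programs = ["identity", "flip_h", "not_a_primitive"]

relabeled = hindsight_relabel(inputs, targets, programs)
filtered = [e for e in relabeled if e["demo_match"] == 1.0]
print(len(relabeled), len(filtered))  # 2 0
```

Neither valid program solves the task, so filtering for correct programs yields no training data, while relabeling yields two valid (input, output, program) triples.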

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| ARC evaluation solved tasks (pass@3) | 59/400 | Ainooson et al. 26/400; Ferré 23/400 | better than prior SOTA on full eval set | ARC Eval (400) | Table 1 shows CodeIt 59/400 vs prior symbolic/neural baselines | Table 1, Sec. 3.2 |
| Policy-only performance | 49/400 | CodeIt cumulative 59/400 | policy produces correct program in current meta-iteration for 49 tasks | ARC Eval | Table 2 (policy only perf.) | Table 2, Sec. 3.3 |
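The two rows measure different things: cumulative performance counts a task once it has been solved in any meta-iteration, while policy-only performance counts only tasks solved in the current one. A small sketch of that bookkeeping, with a hypothetical function name and made-up task IDs:

```python
def cumulative_and_policy_only(solved_per_iteration):
    """solved_per_iteration: list of sets of task IDs solved in each meta-iteration."""
    if not solved_per_iteration:
        return 0, 0
    cumulative = set().union(*solved_per_iteration)  # solved at any point so far
    policy_only = set(solved_per_iteration[-1])      # solved in the current iteration
    return len(cumulative), len(policy_only)

history = [{"a", "b"}, {"b", "c"}, {"c"}]
print(cumulative_and_policy_only(history))  # (3, 1)
```

The gap between the two counts reflects tasks the policy solved at some earlier point but does not solve in the current meta-iteration.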

What To Try In 7 Days

Implement hindsight relabeling: collect syntactically valid program outputs and relabel goals with realized outputs.

Add prioritized replay keyed by percent-demo-match to keep solved behaviors during finetuning.

Bootstrap: seed a small code LM (e.g., CodeT5+ 220M) with a handful of DSL examples and run a few meta-iterations of sampling and training.
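For the replay step, one simple starting point is a buffer that samples experiences with probability proportional to a priority derived from the fraction of demonstrations matched. The class below is a hypothetical sketch: the `base_weight + demo_match` weighting is an assumption chosen for illustration, not the weighting used in the paper.

```python
import random

class PrioritizedReplayBuffer:
    """Samples experiences in proportion to base_weight + demo_match (assumed scheme)."""

    def __init__(self, base_weight=1.0, seed=0):
        self.items = []
        self.priorities = []
        self.base_weight = base_weight  # keeps unsolved experiences sampleable
        self.rng = random.Random(seed)

    def add(self, experience, demo_match):
        """demo_match: fraction of task demonstrations the program reproduces (0.0 to 1.0)."""
        self.items.append(experience)
        self.priorities.append(self.base_weight + demo_match)

    def sample(self, k):
        """Draw k experiences with replacement, weighted by priority."""
        return self.rng.choices(self.items, weights=self.priorities, k=k)

buf = PrioritizedReplayBuffer(base_weight=0.01, seed=0)
buf.add("solves all demos", demo_match=1.0)
buf.add("solves no demos", demo_match=0.0)
draws = buf.sample(1000)
print(draws.count("solves all demos") > draws.count("solves no demos"))  # True
```

A small `base_weight` concentrates fine-tuning batches on experiences that solve real tasks, which is the property that helps retain solved behaviors. A larger value moves the buffer toward uniform sampling.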

Agent Features

Memory: prioritized replay buffer (experience memory)
Tool Use: interpreter execution of sampled programs
Frameworks: Expert Iteration (ExIt), Hindsight Experience Replay, Prioritized Experience Replay
Architectures: encoder-decoder LLM policy (CodeT5+)

Optimization Features

Token Efficiency: sparse object-centric grid text representation to reduce token use
Training Optimization: prioritized sampling from replay, likelihood finetuning on synthetic experiences, mutation-based data augmentation
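A sparse grid encoding can be illustrated as follows. The text format here (`shape`, `bg`, then one group of coordinates per color) is a hypothetical layout chosen for readability and is not the exact representation used in the paper. Character count is used only as a rough proxy for token count.

```python
from collections import Counter

def sparse_grid_text(grid):
    """Encode a grid as shape, background color, and only the non-background cells."""
    height, width = len(grid), len(grid[0])
    counts = Counter(cell for row in grid for cell in row)
    background = counts.most_common(1)[0][0]  # most frequent color
    cells_by_color = {}
    for r, row in enumerate(grid):
        for c, color in enumerate(row):
            if color != background:
                cells_by_color.setdefault(color, []).append(f"{r},{c}")
    parts = [f"shape={height}x{width}", f"bg={background}"]
    for color in sorted(cells_by_color):
        parts.append(f"{color}:" + " ".join(cells_by_color[color]))
    return " | ".join(parts)

small = [[0, 0, 0], [0, 2, 0], [0, 0, 3]]
print(sparse_grid_text(small))  # shape=3x3 | bg=0 | 2:1,1 | 3:2,2

# On a larger, mostly empty grid the sparse text is far shorter than a dense dump.
large = [[0] * 10 for _ in range(10)]
large[3][4] = 5
dense = "\n".join("".join(str(cell) for cell in row) for row in large)
print(len(sparse_grid_text(large)) < len(dense))  # True
```

The saving depends on how empty the grid is. For very small or densely colored grids, a dense encoding can be just as short.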

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Requires a hand-designed DSL and an interpreter to evaluate candidate programs.

Needs an initial set of seed programs; learning tabula rasa is slower.

When Not To Use

If you lack a deterministic interpreter to run candidate programs.

When the domain has no compact DSL or symbolic semantics.

Failure Modes

Catastrophic forgetting when replay is uniform (no prioritization).

Stagnation if hindsight relabeling is disabled (too few positives).

Core Entities

Models

CodeT5+ (220M), CodeT5 60M, CodeT5 770M, Mistral-7B (tested qualitatively)

Metrics

pass@3, pass@1, solved tasks (count), demonstration performance (percent solved demos)

Datasets

ARC (train 400, eval 400 / eval412), ConceptARC

Benchmarks

ARC evaluation set, ConceptARC

Context Entities

Models

GPT-4 (comparison baseline), text-davinci-003 (comparison)

Datasets

ARC hidden/competition splits (discussed)