Train a small code LLM to write and refine programs using hindsight relabeling and prioritized replay

Overview

Decision Snapshot: Needs Validation

The method is practical for DSL-backed program synthesis and small-model deployments, but it depends on a hand-crafted DSL, an interpreter, and seed programs; expect moderate engineering effort to adapt to new domains.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 70%

Authors

Natasha Butt, Blazej Manczak, Auke Wiggers, Corrado Rainone, David W. Zhang, Michaël Defferrard, Taco Cohen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

CodeIt shows small, open code LMs can be iteratively improved to solve nontrivial program synthesis tasks, lowering dependency on expensive huge-LM APIs and enabling automated DSL-based tools for constrained domains.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist

Summary TLDR

CodeIt is an expert-iteration method that trains a code-focused LLM to write programs for ARC tasks by alternating between program sampling (with hindsight relabeling) and learning from a prioritized replay buffer. Using a 220M CodeT5+ model and a domain-specific language (DSL), CodeIt achieves state-of-the-art performance on the full ARC evaluation set (59/400 tasks solved). Ablations show that hindsight relabeling and prioritized sampling drive most of the gains, while the approach still needs a DSL, an interpreter, and seed programs.

Problem Statement

ARC tasks give few input-output examples and produce extremely sparse rewards for program search. Existing neural and symbolic methods either do not learn across tasks or are sample inefficient. The challenge is to bootstrap program synthesis and inter-task generalization with limited data and rare positive signals.

Main Contribution

Code Iteration (CodeIt): alternate sampling programs, hindsight relabeling, and prioritized experience replay.

Use of a small pretrained code LLM (CodeT5+) with a sparse text grid representation to scale to the full ARC eval set.
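The alternation described above can be sketched as a short control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_program`, `run_program`, and `train_step` are hypothetical callables standing in for the policy, the DSL interpreter, and a finetuning step, and the two priority values are arbitrary placeholders.

```python
import random
from typing import Callable, List, Tuple

# (task input, goal output, program text, replay priority)
Experience = Tuple[object, object, str, float]


def code_iteration(
    tasks: List[Tuple[object, object]],
    sample_program: Callable[[object, object], str],   # hypothetical policy
    run_program: Callable[[str, object], object],      # hypothetical interpreter
    train_step: Callable[[List[Experience]], None],    # hypothetical finetuning step
    meta_iterations: int = 2,
    batch_size: int = 4,
    seed: int = 0,
) -> List[Experience]:
    """Alternate a sampling stage and a learning stage; return the buffer."""
    rng = random.Random(seed)
    buffer: List[Experience] = []
    for _ in range(meta_iterations):
        # Sampling stage: query the policy once per task and execute the result.
        for task_input, target in tasks:
            program = sample_program(task_input, target)
            try:
                realized = run_program(program, task_input)
            except Exception:
                continue  # programs that fail to execute are discarded
            # Hindsight relabeling: the realized output becomes the stored goal.
            # The priority values are arbitrary placeholders for this sketch.
            priority = 1.0 if realized == target else 0.1
            buffer.append((task_input, realized, program, priority))
        # Learning stage: draw a priority-weighted batch and update the policy.
        if buffer:
            weights = [exp[3] for exp in buffer]
            batch = rng.choices(buffer, weights=weights, k=batch_size)
            train_step(batch)
    return buffer
```

Passing the policy, interpreter, and training step in as callables keeps the sketch independent of any particular model or DSL; in practice each would wrap a real component.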

Key Findings

CodeIt solves more ARC tasks than prior methods.

Numbers: 59/400 tasks solved (pass@3, ARC Eval)

Practical Use: If you use a DSL and interpreter, CodeIt can find correct programs for ~15% of ARC tasks without human labels.

Evidence Ref: Table 1, Sec. 3.2
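To make the pass@3 count concrete, here is a small sketch of how such a metric is commonly scored. It assumes that a task counts as solved when any of up to three submitted programs reproduces every test output; `run_program` is a hypothetical interpreter callable, and both function names are illustrative.

```python
from typing import Callable, List, Sequence, Tuple


def solved_at_k(
    candidates: Sequence[str],
    run_program: Callable[[str, object], object],  # hypothetical interpreter
    test_pairs: List[Tuple[object, object]],
    k: int = 3,
) -> bool:
    """True if any of the first k candidates reproduces every test output."""
    for program in candidates[:k]:
        try:
            if all(run_program(program, x) == y for x, y in test_pairs):
                return True
        except Exception:
            continue  # a crashing candidate simply uses up one attempt
    return False


def solve_rate(solved: int, total: int) -> float:
    """Fraction of tasks solved, e.g. 59 of 400."""
    return solved / total


print(f"{solve_rate(59, 400):.2%}")  # the ~15% figure quoted above
```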

Hindsight relabeling substantially improves sample efficiency.

Numbers: Ablation A2: cumulative performance 42/400 vs CodeIt 59/400 (~+40%)

Practical Use: Relabeling every syntactically valid program with its realized output yields more usable training data than filtering only correct programs.

Evidence Ref: Table 2, Sec. 3.3, Conclusion
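The relabeling idea can be shown with a toy interpreter. This is a sketch under assumptions: the three primitives below are invented for illustration and are not the arc-dsl primitives, and a program is simply a whitespace-separated list of primitive names.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = Tuple[Tuple[int, ...], ...]

# Toy DSL: hypothetical primitives, not the ones from arc-dsl.
PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_h": lambda g: tuple(row[::-1] for row in g),
    "flip_v": lambda g: g[::-1],
    "transpose": lambda g: tuple(zip(*g)),
}


def run_program(program: str, grid: Grid) -> Grid:
    """Apply whitespace-separated primitive names from left to right."""
    for name in program.split():
        grid = PRIMITIVES[name](grid)  # a KeyError marks an invalid program
    return grid


def hindsight_relabel(
    program: str, inputs: List[Grid]
) -> Optional[List[Tuple[Grid, Grid]]]:
    """Return (input, realized output) pairs, or None if the program is invalid.

    The original target outputs are never consulted: whatever the program
    actually produced becomes the goal it is credited with solving.
    """
    try:
        return [(g, run_program(program, g)) for g in inputs]
    except (KeyError, IndexError, TypeError):
        return None
```

Every valid program therefore yields a training example, which is why relabeling produces far more usable data than keeping only programs that hit the original target.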

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ARC evaluation solved tasks (pass@3)	59/400	Ainooson et al. 26/400; Ferré 23/400	better than prior SOTA on full eval set	ARC Eval (400)	Table 1 shows CodeIt 59/400 vs prior symbolic/neural baselines	Table 1, Sec. 3.2
Policy-only performance	49/400	CodeIt cumulative 59/400	policy produces correct program in current meta-iteration for 49 tasks	ARC Eval	Table 2 (policy only perf.)	Table 2, Sec. 3.3

What To Try In 7 Days

Implement hindsight relabeling: collect syntactically valid program outputs and relabel goals with realized outputs.

Add prioritized replay keyed by percent-demo-match to keep solved behaviors during finetuning.

Bootstrapping: seed a small code LM (e.g., CodeT5+ 220M) with a handful of DSL examples and run a few meta-iterations of sampling and training.
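For the prioritized-replay step, one possible starting point is a buffer whose sampling weights come from the fraction of demonstration outputs a program matched. This is a sketch, assuming a simple weight-proportional scheme; the class name, the `floor` parameter, and its default value are hypothetical choices rather than details from the paper.

```python
import random
from typing import List, Sequence, Tuple


def demo_match_fraction(predicted: Sequence[object], targets: Sequence[object]) -> float:
    """Fraction of demonstration outputs that the program reproduced exactly."""
    if not targets:
        return 0.0
    hits = sum(p == t for p, t in zip(predicted, targets))
    return hits / len(targets)


class PrioritizedReplay:
    """Replay buffer that samples programs in proportion to their demo match."""

    def __init__(self, floor: float = 0.05, seed: int = 0) -> None:
        self.items: List[Tuple[str, float]] = []
        self.floor = floor  # assumed minimum weight so no entry is fully starved
        self.rng = random.Random(seed)

    def add(self, program: str, match_fraction: float) -> None:
        self.items.append((program, max(match_fraction, self.floor)))

    def sample(self, k: int) -> List[str]:
        # Sampling is with replacement, weighted by the stored priority.
        weights = [w for _, w in self.items]
        picks = self.rng.choices(self.items, weights=weights, k=k)
        return [program for program, _ in picks]
```

Programs that fully solve their demonstrations are drawn most often, which is the mechanism meant to keep solved behaviors from being forgotten during finetuning.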

Agent Features

Memory

prioritized replay buffer (experience memory)

Tool Use

interpreter execution of sampled programs

Frameworks

Expert Iteration (ExIt)

Hindsight Experience Replay

Prioritized Experience Replay

Architectures

encoder-decoder LLM policy (CodeT5+)

Optimization Features

Token Efficiency

sparse object-centric grid text representation to reduce token use
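A sparse text encoding of a grid might look like the following. The exact serialization used by CodeIt is not given here, so the `size=... bg=... value@row,col` format below is an assumed, illustrative one; the point is only that listing non-background cells is shorter than writing out every cell.

```python
from typing import Sequence


def encode_sparse(grid: Sequence[Sequence[int]], background: int = 0) -> str:
    """List only non-background cells as value@row,col (hypothetical format)."""
    height = len(grid)
    width = len(grid[0]) if height else 0
    parts = [f"size={height}x{width}", f"bg={background}"]
    for row, line in enumerate(grid):
        for col, value in enumerate(line):
            if value != background:
                parts.append(f"{value}@{row},{col}")
    return " ".join(parts)


def encode_dense(grid: Sequence[Sequence[int]]) -> str:
    """Baseline encoding that writes out every cell."""
    return "\n".join(" ".join(str(v) for v in line) for line in grid)


grid = [[0, 0, 0], [0, 5, 0], [0, 0, 2]]
print(encode_sparse(grid))  # size=3x3 bg=0 5@1,1 2@2,2
```

The saving grows with grid size and with the share of background cells; for grids that are mostly non-background, a dense encoding could be shorter.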

Training Optimization

prioritized sampling from replay

likelihood finetuning on synthetic experiences

mutation-based data augmentation

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/Qualcomm-AI-research/codeit

https://github.com/michaelhodel/arc-dsl

Data URLs

https://github.com/michaelhodel/arc-dsl (DSL and solvers)

ARC dataset (public benchmark)

Risks & Boundaries

Limitations

Requires a hand-designed DSL and an interpreter to evaluate candidate programs.

Needs an initial set of seed programs; learning tabula rasa is slower.

When Not To Use

If you lack a deterministic interpreter to run candidate programs.

When the domain has no compact DSL or symbolic semantics.

Failure Modes

Catastrophic forgetting when replay is uniform (no prioritization).

Stagnation if hindsight relabeling is disabled (too few positives).

Core Entities

Models

CodeT5+ (220M)

CodeT5 60M

CodeT5 770M

Mistral-7B (tested qualitatively)

Metrics

pass@3

pass@1

solved tasks (count)

demonstration performance (percent solved demos)

Datasets

ARC (train 400, eval 400 / eval412)

ConceptARC

Benchmarks

ARC evaluation set

ConceptARC

Context Entities

Models

GPT-4 (comparison baseline)

text-davinci-003 (comparison)

Datasets

ARC hidden/competition splits (discussed)

