Overview
Production Readiness
0.4
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
CodeIt shows that small, open code LMs can be improved iteratively to solve nontrivial program synthesis tasks. This reduces dependence on expensive large-LM APIs and enables automated DSL-based tools for constrained domains.
Summary TLDR
CodeIt is an expert-iteration method that trains a code-focused LLM to write programs for ARC tasks by alternating between program sampling (with hindsight relabeling) and learning from a prioritized replay buffer. Using a 220M CodeT5+ model and a domain-specific language (DSL), CodeIt achieves state-of-the-art performance on the full ARC evaluation set (59/400 tasks solved). Ablations show that hindsight relabeling and prioritized sampling drive most of the gains, while the approach still needs a DSL, an interpreter, and seed programs.
Problem Statement
ARC tasks provide only a few input-output examples each, so program search receives extremely sparse rewards. Existing neural and symbolic methods either do not learn across tasks or are sample-inefficient. The challenge is to bootstrap program synthesis and inter-task generalization from limited data and rare positive signals.
Main Contribution
Code Iteration (CodeIt): alternates between program sampling with hindsight relabeling and learning from prioritized experience replay.
Use of a small pretrained code LLM (CodeT5+) with a sparse text grid representation to scale to the full ARC eval set.
State-of-the-art ARC results (59/400 tasks solved) and analyses showing program refinement and primitive learning patterns.
Systematic ablations highlighting the roles of relabeling, prioritized replay, pretraining, and mutation augmentation.
Key Findings
CodeIt solves more ARC evaluation tasks (59/400) than prior methods.
Hindsight relabeling substantially improves sample efficiency.
Prioritized sampling reduces forgetting and helps policy performance.
Pretraining gives a strong head-start.
CodeIt refines solutions over time.
Performance increases with model size but has diminishing returns.
Results
ARC evaluation solved tasks (pass@3)
Policy-only performance
Effect of removing pretraining
Effect of removing priority
Who Should Care
What To Try In 7 Days
Implement hindsight relabeling: execute every syntactically valid sampled program, then relabel the goal with the outputs the program actually produced.
Add prioritized replay keyed by percent-demo-match to keep solved behaviors during finetuning.
Bootstrap: seed a small code LM (e.g., CodeT5+ 220M) with a handful of DSL examples and run a few meta-iterations of sampling and training.
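The first two steps above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: programs are stood in for by plain Python callables rather than DSL source text, and the names `Experience`, `hindsight_relabel`, `sample_prioritized`, and the `floor` parameter are hypothetical. The priority is assumed to be the fraction of demonstration outputs a program reproduces.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Grid = Tuple[Tuple[int, ...], ...]


@dataclass
class Experience:
    inputs: List[Grid]
    outputs: List[Grid]              # realized outputs, used as the new goal
    program: Callable[[Grid], Grid]  # stand-in for DSL program text
    priority: float                  # fraction of original targets matched


def hindsight_relabel(program, inputs, targets) -> Optional[Experience]:
    """Run a sampled program and keep it with its realized outputs as the goal."""
    realized = []
    for grid in inputs:
        try:
            realized.append(program(grid))
        except Exception:
            return None  # programs that fail to execute are discarded
    matches = sum(r == t for r, t in zip(realized, targets))
    return Experience(inputs, realized, program, matches / len(targets))


def sample_prioritized(buffer, k, rng, floor=0.05):
    """Sample k experiences with probability proportional to priority + floor."""
    weights = [e.priority + floor for e in buffer]
    return rng.choices(buffer, weights=weights, k=k)


def flip_rows(grid):
    return tuple(reversed(grid))


inputs = [((1, 0), (0, 0)), ((2, 2), (0, 1))]
targets = [((0, 0), (1, 0)), ((9, 9), (9, 9))]
exp = hindsight_relabel(flip_rows, inputs, targets)
batch = sample_prioritized([exp], k=2, rng=random.Random(0))
```

Even though `flip_rows` reproduces only one of the two original targets, the relabeled experience is a correct (inputs, outputs, program) triple and can be used for training. The `floor` term is an assumed design choice that keeps zero-match experiences from never being sampled.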
Agent Features
Memory
- prioritized replay buffer (experience memory)
Tool Use
- interpreter execution of sampled programs
Frameworks
- Expert Iteration (ExIt)
- Hindsight Experience Replay
- Prioritized Experience Replay
Architectures
- encoder-decoder LLM policy (CodeT5+)
Optimization Features
Token Efficiency
- sparse object-centric grid text representation to reduce token use
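To illustrate why a sparse representation saves tokens, the sketch below serializes only the non-background cells of a grid. The exact text format used by CodeIt is not given here, so the layout (`shape=`, `bg=`, and `color:row,col` entries) and the function name `grid_to_sparse_text` are assumptions made for illustration; the background is assumed to be the most frequent color.

```python
from collections import Counter


def grid_to_sparse_text(grid):
    """Serialize a grid as shape, background color, and non-background cells."""
    height, width = len(grid), len(grid[0])
    counts = Counter(cell for row in grid for cell in row)
    background = counts.most_common(1)[0][0]  # assumed: most frequent color

    by_color = {}
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell != background:
                by_color.setdefault(cell, []).append(f"{r},{c}")

    parts = [f"shape={height}x{width}", f"bg={background}"]
    for color in sorted(by_color):
        parts.append(f"{color}:" + " ".join(by_color[color]))
    return " | ".join(parts)


text = grid_to_sparse_text([[0, 0, 0], [0, 3, 0], [0, 0, 5]])
# text == "shape=3x3 | bg=0 | 3:1,1 | 5:2,2"
```

A dense encoding of the same grid needs one symbol per cell, while this form grows only with the number of non-background cells, which matters for mostly empty ARC grids.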
Training Optimization
- prioritized sampling from replay
- likelihood finetuning on synthetic experiences
- mutation-based data augmentation
Reproducibility
Data Urls
- https://github.com/michaelhodel/arc-dsl (DSL and solvers)
- ARC dataset (public benchmark)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires a hand-designed DSL and an interpreter to evaluate candidate programs.
- Needs an initial set of seed programs; learning tabula rasa is slower.
- Struggles to find solutions when ground-truth programs exceed ~11 lines.
- Performs worse on numerical or pure-logic tasks compared to object-interaction tasks.
When Not To Use
- If you lack a deterministic interpreter to run candidate programs.
- When the domain has no compact DSL or symbolic semantics.
- For tasks that require very long programs or deep multi-step plans without intermediate supervision.
Failure Modes
- Catastrophic forgetting when replay is uniform (no prioritization).
- Stagnation if hindsight relabeling is disabled (too few positives).
- Overfitting large LMs with simple LoRA finetuning on small synthetic corpora.
Core Entities
Models
- CodeT5+ (220M)
- CodeT5 60M
- CodeT5 770M
- Mistral-7B (tested qualitatively)
Metrics
- pass@3
- pass@1
- solved tasks (count)
- demonstration performance (percent solved demos)
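The metrics above can be made concrete with a small sketch. It assumes that a task counts as solved at k if any of the first k candidate programs reproduces every held-out test output, and that demonstration performance is the fraction of demonstration pairs a program reproduces. How candidates are ranked is not specified here, so the ordering of `candidates` is left to the caller; both function names are hypothetical.

```python
def demo_match_fraction(program, pairs):
    """Fraction of (input, output) pairs the program reproduces exactly."""
    hits = 0
    for grid_in, grid_out in pairs:
        try:
            hits += program(grid_in) == grid_out
        except Exception:
            pass  # a crash counts as a miss
    return hits / len(pairs)


def solved_at_k(candidates, test_pairs, k=3):
    """True if any of the first k candidates solves every test pair."""
    return any(
        demo_match_fraction(program, test_pairs) == 1.0
        for program in candidates[:k]
    )


def identity(grid):
    return grid


def transpose(grid):
    return tuple(zip(*grid))


test_pairs = [(((1, 2), (3, 4)), ((1, 3), (2, 4)))]
solved = solved_at_k([identity, transpose], test_pairs, k=3)
```

Under these assumptions, pass@1 is the same check with `k=1`, and the solved-task count is the number of tasks for which `solved_at_k` returns true.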
Datasets
- ARC (train 400, eval 400 / eval412)
- ConceptARC
Benchmarks
- ARC evaluation set
- ConceptARC
Context Entities
Models
- GPT-4 (comparison baseline)
- text-davinci-003 (comparison)
Datasets
- ARC hidden/competition splits (discussed)

