Overview
The method is practical for DSL-backed program synthesis and small-model deployments, but it depends on a hand-crafted DSL, an interpreter, and seed programs; expect moderate engineering effort to adapt to new domains.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 70%
Why It Matters For Business
CodeIt shows small, open code LMs can be iteratively improved to solve nontrivial program synthesis tasks, lowering dependency on expensive huge-LM APIs and enabling automated DSL-based tools for constrained domains.
Summary TLDR
CodeIt is an expert-iteration method that trains a code-focused LLM to write programs for ARC tasks by alternating between program sampling (with hindsight relabeling) and learning from a prioritized replay buffer. Using a 220M-parameter CodeT5+ model and a domain-specific language (DSL), CodeIt achieves state-of-the-art results on the full ARC evaluation set (59/400 tasks solved). Ablations show that hindsight relabeling and prioritized sampling drive most of the gains, while the approach still needs a DSL, an interpreter, and seed programs.
Problem Statement
ARC tasks give few input-output examples and produce extremely sparse rewards for program search. Existing neural and symbolic methods either do not learn across tasks or are sample inefficient. The challenge is to bootstrap program synthesis and inter-task generalization with limited data and rare positive signals.
Main Contribution
Code Iteration (CodeIt): alternate sampling programs, hindsight relabeling, and prioritized experience replay.
Use of a small pretrained code LLM (CodeT5+) with a sparse text grid representation to scale to the full ARC eval set.
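To make the idea of a sparse text grid representation concrete, here is a minimal sketch. The exact encoding used by CodeIt is not given in this summary, so the format below (`<rows>x<cols> bg=<color> | <color>@<row>,<col> ...`) and the function name `grid_to_sparse_text` are assumptions chosen for illustration only. The point it shows is that listing only non-background cells can shorten the prompt for mostly empty grids.

```python
from typing import List

Grid = List[List[int]]


def grid_to_sparse_text(grid: Grid, background: int = 0) -> str:
    """Encode a grid as text, listing only the non-background cells.

    Hypothetical format (not taken from the paper):
        "<rows>x<cols> bg=<color> | <color>@<row>,<col> ..."
    """
    rows = len(grid)
    cols = len(grid[0]) if grid else 0
    cells = [
        f"{color}@{r},{c}"
        for r, row in enumerate(grid)
        for c, color in enumerate(row)
        if color != background
    ]
    return f"{rows}x{cols} bg={background} | " + " ".join(cells)


def grid_to_dense_text(grid: Grid) -> str:
    """Baseline for comparison: every cell is written out, row by row."""
    return " / ".join(" ".join(str(color) for color in row) for row in grid)


if __name__ == "__main__":
    example = [[0] * 10 for _ in range(10)]
    example[2][3] = 5
    example[7][1] = 3
    print(grid_to_sparse_text(example))
    print(len(grid_to_sparse_text(example)), "chars sparse vs",
          len(grid_to_dense_text(example)), "chars dense")
```

The saving depends on how many cells differ from the background; a dense, multi-colored grid could come out longer in this encoding than in the plain one.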
Key Findings
CodeIt solves more ARC evaluation tasks than prior methods: 59/400, versus 26/400 (Ainooson et al.) and 23/400 (Ferré).
Hindsight relabeling substantially improves sample efficiency.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| ARC evaluation solved tasks (pass@3) | 59/400 | Ainooson et al. 26/400; Ferré 23/400 | +33 tasks vs. Ainooson et al.; +36 tasks vs. Ferré | ARC Eval (400) | Table 1 shows CodeIt 59/400 vs prior symbolic/neural baselines | Table 1, Sec. 3.2 |
| Policy-only performance | 49/400 | CodeIt cumulative 59/400 | -10 tasks vs. cumulative; the policy produces a correct program in the current meta-iteration for 49 tasks | ARC Eval | Table 2 (policy only perf.) | Table 2, Sec. 3.3 |
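The counts in the table can be turned into solve rates and gaps with a few lines of arithmetic. This sketch uses only the numbers reported above; the dictionary keys and the helper name `solve_rate` are illustrative.

```python
TOTAL_TASKS = 400  # size of the ARC evaluation set, as reported in the table

# Solved-task counts copied from the results table.
solved = {
    "CodeIt (cumulative)": 59,
    "CodeIt (policy only)": 49,
    "Ainooson et al.": 26,
    "Ferré": 23,
}


def solve_rate(n_solved: int, total: int = TOTAL_TASKS) -> float:
    """Return the percentage of tasks solved."""
    return 100.0 * n_solved / total


reference = solved["CodeIt (cumulative)"]
# Gap, in tasks, between cumulative CodeIt and each other row.
deltas = {
    name: reference - count
    for name, count in solved.items()
    if name != "CodeIt (cumulative)"
}

if __name__ == "__main__":
    for name, count in solved.items():
        print(f"{name}: {count}/{TOTAL_TASKS} = {solve_rate(count):.2f}%")
    for name, gap in deltas.items():
        print(f"CodeIt (cumulative) leads {name} by {gap} tasks")
```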
What To Try In 7 Days
Implement hindsight relabeling: collect syntactically valid program outputs and relabel goals with realized outputs.
Add prioritized replay keyed by percent-demo-match to keep solved behaviors during finetuning.
Bootstrap: seed a small code LM (e.g., CodeT5+ 220M) with a handful of DSL examples and run a few meta-iterations of sampling and training.
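The hindsight relabeling step in the list above can be sketched as follows. This is a toy illustration, not the CodeIt implementation: the three-primitive DSL, the `run_program` interpreter, and the record layout (including the `demo_match` field) are all hypothetical stand-ins. The idea it demonstrates is that a program which runs but misses the target is still kept as a correct example for the outputs it actually produced.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = Tuple[Tuple[int, ...], ...]
Demo = Tuple[Grid, Grid]  # (input grid, target output grid)

# Hypothetical toy DSL: each primitive maps a grid to a grid.
PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_h": lambda g: tuple(row[::-1] for row in g),
    "flip_v": lambda g: g[::-1],
    "transpose": lambda g: tuple(zip(*g)),
}


def run_program(program: List[str], grid: Grid) -> Optional[Grid]:
    """Toy interpreter: apply primitives in order; None if a name is unknown."""
    for name in program:
        fn = PRIMITIVES.get(name)
        if fn is None:
            return None
        grid = fn(grid)
    return grid


def hindsight_relabel(program: List[str], demos: List[Demo]) -> Optional[dict]:
    """Turn a sampled program into a training record, or None if it is invalid.

    The realized outputs replace the original targets as the goal, so the
    record is correct by construction even when the task was not solved.
    """
    if not demos:
        return None
    outputs = [run_program(program, x) for x, _ in demos]
    if any(out is None for out in outputs):
        return None  # program does not run under the interpreter: discard
    matches = sum(out == target for out, (_, target) in zip(outputs, demos))
    return {
        "inputs": [x for x, _ in demos],
        "outputs": outputs,                   # relabeled goal
        "program": program,
        "demo_match": matches / len(demos),   # fraction of original targets hit
    }


if __name__ == "__main__":
    # Toy task whose true solution is a vertical flip.
    demos = [(((1, 2), (3, 4)), ((3, 4), (1, 2)))]
    for candidate in (["flip_h"], ["flip_v"], ["rotate"]):
        print(candidate, hindsight_relabel(candidate, demos))
```

In a real setup the candidates would come from the language model rather than a fixed list, and the resulting records would be written to the replay buffer.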
Risks & Boundaries
Limitations
Requires a hand-designed DSL and an interpreter to evaluate candidate programs.
Needs an initial set of seed programs; learning tabula rasa is slower.
When Not To Use
If you lack a deterministic interpreter to run candidate programs.
When the domain has no compact DSL or symbolic semantics.
Failure Modes
Catastrophic forgetting when replay is uniform (no prioritization).
Stagnation if hindsight relabeling is disabled (too few positives).
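To illustrate why prioritization guards against forgetting, here is a minimal sketch of priority-weighted sampling from a replay buffer. It assumes each buffer entry is a dict with a `demo_match` field in [0, 1] (the fraction of a task's demonstration outputs the stored program reproduces). The weighting rule and the `floor` constant are assumptions for illustration, not the paper's exact scheme.

```python
import random
from typing import Dict, List


def sample_batch(
    buffer: List[Dict],
    batch_size: int,
    rng: random.Random,
    floor: float = 0.05,
) -> List[Dict]:
    """Sample with replacement, weighting entries by their demo-match fraction.

    `floor` is an assumed small constant that keeps zero-match entries
    (for example, hindsight-relabeled misses) drawable at a low rate.
    """
    weights = [entry["demo_match"] + floor for entry in buffer]
    return rng.choices(buffer, weights=weights, k=batch_size)


if __name__ == "__main__":
    # One solved task among nine unsolved ones.
    buffer = [{"id": 0, "demo_match": 1.0}] + [
        {"id": i, "demo_match": 0.0} for i in range(1, 10)
    ]
    rng = random.Random(0)
    batch = sample_batch(buffer, 10_000, rng)
    share = sum(entry["id"] == 0 for entry in batch) / len(batch)
    print(f"solved entry share with prioritization: {share:.2f}")
    print(f"solved entry share under uniform replay: {1 / len(buffer):.2f}")
```

With these toy weights the solved entry has an expected share of 1.05 / 1.50 = 0.70 of each batch, compared with 0.10 under uniform replay, so solved behaviors are revisited far more often during finetuning.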

