LLM agents that iteratively teach themselves to write ML library code for new hardware languages

February 4, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

The method is practically useful for functional code generation under scarce examples; it needs a reliable verifier and domain adaptation before production deployment.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Genghan Zhang, Weixin Liang, Olivia Hsu, Kunle Olukotun

Why It Matters For Business

This system can speed up early ML library development for new hardware by automating functional implementations from few examples, lowering time-to-ship and expert-hours spent.

Summary TLDR

The paper presents an agentic system and an adaptive self-improvement loop that lets LLMs generate ML library code for a new architecture-specific language (STeP) with very little example data. Key pieces: proposer + guardian LLM agents, a structural intermediate representation (IR), a fast functional simulator/verifier, and an adaptive curriculum that prioritizes hard-earned successful examples. On a 26-task STeP benchmark the system solves up to 96% of tasks and yields up to 3.9× improvement vs a baseline single LLM. The method focuses on functional correctness under limited examples, not low-level performance tuning.

Problem Statement

Writing high-performance ML libraries for new domain-specific hardware requires deep expertise in both ML and architecture-specific programming languages (ASPLs). Examples are scarce during early hardware design, so automated code generation must reason deeply from limited data to produce correct library implementations.

Main Contribution

An adaptive self-improvement learning algorithm that iteratively collects and prioritizes high-quality, self-generated examples to evolve LLM agent performance.

A practical agentic system design (proposer + guardian + code generator + verifiers + structural IR) tuned to generate ML operators in STeP, the target ASPL.
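
The self-improvement loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_proposer`, `toy_verifier`, `TASK_BASE_RATE`, and the difficulty measure (fraction of failed samples) are hypothetical stand-ins for the LLM agents, the functional simulator, and the paper's actual selection rule.

```python
import random
from dataclasses import dataclass

# Hypothetical per-task base success rates; a real system has no such table.
TASK_BASE_RATE = {"softmax": 0.6, "attention": 0.1, "layernorm": 0.3}


@dataclass
class Experience:
    task: str
    solution: str
    difficulty: float  # fraction of samples that failed when this was earned


def toy_proposer(task, demos, rng):
    """Stand-in for an LLM call: success odds grow with the number of demos."""
    p_success = min(0.95, TASK_BASE_RATE[task] + 0.15 * len(demos))
    return f"solution-for-{task}" if rng.random() < p_success else "wrong"


def toy_verifier(task, candidate):
    """Stand-in for the functional simulator/verifier."""
    return candidate == f"solution-for-{task}"


def self_improve(tasks, rounds=5, samples=8, max_demos=3, seed=0):
    rng = random.Random(seed)
    buffer = []    # earned experiences (verified solutions only)
    solved = set()
    for _ in range(rounds):
        # Prioritize hard-earned examples as demonstrations for this round.
        demos = sorted(buffer, key=lambda e: e.difficulty, reverse=True)[:max_demos]
        for task in tasks:
            if task in solved:
                continue
            candidates = [toy_proposer(task, demos, rng) for _ in range(samples)]
            passed = [c for c in candidates if toy_verifier(task, c)]
            if passed:
                difficulty = 1.0 - len(passed) / samples
                buffer.append(Experience(task, passed[0], difficulty))
                solved.add(task)
    return solved, buffer
```

Calling `self_improve(list(TASK_BASE_RATE))` returns the set of solved tasks and the experience buffer. The point of the sketch is the feedback path: only verified solutions enter the buffer, and the hardest-won ones are reused first as demonstrations.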

Key Findings

Adaptive agentic system solved most benchmark tasks and greatly increased pass rates.

Numbers: Pass@n up to 0.96 (96%) on the benchmark

Practical Use: Use the self-improvement loop to raise functional completion for new ASPL library tasks; expect most tasks to become solvable where single LLMs fail.

Evidence Ref: Figure 2; Table 4; Section 6

Self-improvement yields large gains over single LLM baselines.

Numbers: Up to 3.9× improvement vs single LLM baseline

Practical Use: If a single LLM gives low pass@k, run the adaptive self-improvement pipeline before heavier options like large-scale finetuning.

Evidence Ref: Abstract; Figure 2; Table 5
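
To make the two headline numbers concrete, here is a small sketch of how a Pass@n score and an improvement ratio could be computed. It assumes Pass@n means "fraction of tasks with at least one passing sample among n", which may differ from the paper's exact definition; the function names and the toy data are hypothetical.

```python
def pass_at_n(results):
    """Fraction of tasks with at least one passing sample.

    results: dict mapping task name -> list of booleans, one per sample.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    solved = sum(1 for outcomes in results.values() if any(outcomes))
    return solved / len(results)


def improvement_ratio(improved, baseline):
    """Multiplicative gain of an improved score over a nonzero baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return improved / baseline


# Toy data (hypothetical): two tasks, two samples each.
toy = {"matmul": [False, True], "attention": [False, False]}
print(pass_at_n(toy))                              # 0.5
print(round(improvement_ratio(0.81, 0.23), 2))     # 3.52
```

The second call uses the GPT-4o figures reported in the Results section (0.23 to 0.81), which correspond to roughly a 3.5× gain; the 3.9× headline is the maximum reported across configurations.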

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Pass@n (Claude-3-5-sonnet, agentic self-improved) | 0.885 avg; up to 1.0 on many groups | Claude single model lower (see Table 4) | From 0.73 to 0.96 (Table 5) | 26-task STeP benchmark (avg over groups) | Table 4; Table 5 | Table 4; Table 5
Pass@n improvement (GPT-4o with self-improvement) | 0.23 → 0.81 | GPT-4o single | ≈ +0.58 absolute | 26-task STeP benchmark | Table 5 (Section A.4) | Table 5

What To Try In 7 Days

Prototype a fast functional verifier for your target domain (unit tests + static checks).

Define a compact structural IR that captures tasks and reduces prompt tokens.

Run a 2-agent setup: a proposer LLM plus a verifier/guardian that checks global constraints.
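
As a starting point for the first item, a functional verifier can be as small as a randomized comparison against a trusted reference. The sketch below is an assumption-laden illustration: `reference_relu`, `verify`, and the choice of ReLU as the operator are hypothetical, and a real verifier for an ASPL would run a simulator rather than `exec` Python source.

```python
import random


def reference_relu(xs):
    """Trusted reference implementation of the operator under test."""
    return [max(0.0, x) for x in xs]


def verify(candidate_src, func_name, reference, trials=50, seed=0):
    """Return True if the candidate matches the reference on random inputs."""
    namespace = {}
    try:
        # NOTE: exec on untrusted model output must be sandboxed in real use.
        exec(candidate_src, namespace)
        candidate = namespace[func_name]
    except Exception:
        return False  # syntax error or missing function counts as a failure
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.uniform(-5.0, 5.0) for _ in range(rng.randint(1, 16))]
        try:
            if candidate(list(xs)) != reference(xs):
                return False
        except Exception:
            return False  # runtime error counts as a failure
    return True


good = "def relu(xs):\n    return [x if x > 0 else 0.0 for x in xs]\n"
bad = "def relu(xs):\n    return [abs(x) for x in xs]\n"
print(verify(good, "relu", reference_relu))  # True
print(verify(bad, "relu", reference_relu))   # False
```

A fixed seed keeps the verdict reproducible, which matters when the verifier's output decides which examples enter the experience buffer.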

Agent Features

Memory: earned-experience buffer D (stores filtered correct solutions); stratified demonstration selection (difficulty bins)
Planning: adaptive self-improvement loop (iterative sampling and selection); difficulty-stratified curriculum (hard→mixed→easy)
Tool Use: verifier/simulator (functional tests); code generator to pytestable Python; AST-based filtering and grouping
Frameworks: structural intermediate representation (IR); AST isomorphism for deduplication
Is Agentic: Yes
Architectures: proposer + guardian (two-agent organization); structural-IR-mediated pipeline
Collaboration: multi-agent coordination between proposer and guardian; central controller orchestrates sampling and demonstration selection
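
The stratified demonstration selection listed under Memory can be illustrated with a short sketch. The binning rule, the hardest-first round-robin order, and the name `stratified_select` are assumptions made for illustration; the paper's exact selection policy may differ.

```python
import random
from collections import defaultdict


def stratified_select(experiences, k, num_bins=3, seed=0):
    """Pick up to k demonstrations, drawing across difficulty bins.

    experiences: list of (solution, difficulty) with difficulty in [0, 1].
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for solution, difficulty in experiences:
        idx = min(int(difficulty * num_bins), num_bins - 1)
        bins[idx].append(solution)
    for members in bins.values():
        rng.shuffle(members)  # avoid always picking the same item in a bin

    selected = []
    order = sorted(bins, reverse=True)  # hardest bin first
    while len(selected) < k and any(bins[i] for i in order):
        for i in order:
            if bins[i] and len(selected) < k:
                selected.append(bins[i].pop())
    return selected


demos = [("e1", 0.1), ("e2", 0.2), ("m1", 0.5), ("h1", 0.9)]
print(stratified_select(demos, 3)[:2])  # ['h1', 'm1']
```

Round-robin across bins keeps the prompt from being dominated by easy examples, which tend to be the most numerous in the buffer.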

Optimization Features

Token Efficiency: structural IR reduces prompt redundancy; adaptive granularity m=3 saved 1.07× tokens vs m=4
System Optimization: parallel sampling of candidates to collect experiences quickly; filtering to keep compact representative solutions
Training Optimization: self-generated dataset to improve model behavior instead of full finetuning; task-level reward and selection instead of token-level reward
Inference Optimization: adaptive test-time compute (more samples for harder tasks)
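
One simple way to realize "more samples for harder tasks" is to split a fixed sampling budget in proportion to each task's observed failure rate. The sketch below is a hypothetical allocation rule, not the paper's; `allocate_samples` and its parameters are illustrative names.

```python
def allocate_samples(pass_rates, total_budget, min_samples=1):
    """Split a sampling budget across tasks, favoring low pass rates.

    pass_rates: dict mapping task -> observed pass rate in [0, 1].
    """
    n = len(pass_rates)
    if total_budget < n * min_samples:
        raise ValueError("budget too small for the per-task minimum")
    hardness = {t: 1.0 - r for t, r in pass_rates.items()}
    if sum(hardness.values()) <= 0:
        hardness = {t: 1.0 for t in pass_rates}  # all solved: split evenly
    total_hardness = sum(hardness.values())

    remaining = total_budget - n * min_samples
    alloc = {t: min_samples + int(remaining * h / total_hardness)
             for t, h in hardness.items()}

    # Rounding down leaves a few samples over; give them to the hardest tasks.
    leftover = total_budget - sum(alloc.values())
    for t in sorted(hardness, key=hardness.get, reverse=True):
        if leftover <= 0:
            break
        alloc[t] += 1
        leftover -= 1
    return alloc


print(allocate_samples({"easy": 0.9, "mid": 0.5, "hard": 0.1}, 30))
```

Keeping a per-task minimum ensures that tasks which look easy are still re-checked occasionally instead of being starved of samples.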

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Focuses on functional correctness; it does not guarantee hardware-level performance or optimizations.

Requires a fast, accurate verifier/simulator to score candidates, and building one can be costly.

When Not To Use

When no reliable automatic verifier or unit tests exist for the target task.

When the primary goal is low-level performance tuning rather than functional correctness.

Failure Modes

Guardian fixes can modify correct solutions, reducing single-sample success (guardian corruption).

AST-based deduplication may drop semantically useful variants or chain-of-thought hints.
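
The second failure mode is easy to reproduce with a toy structural deduplicator. The sketch below uses Python's standard `ast` module as a stand-in for the paper's structural representation, which is an assumption; `structure_key` and `dedupe` are hypothetical names. Two solutions that differ only in variable names and comments collapse to one entry, so any hint carried in a comment is lost.

```python
import ast


class _Normalize(ast.NodeTransformer):
    """Rename variables and arguments to v0, v1, ... in order of appearance."""

    def __init__(self):
        self.names = {}

    def _canon(self, name):
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        node.id = self._canon(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        return node


def structure_key(src):
    """Structural fingerprint: comments and variable names do not affect it."""
    tree = _Normalize().visit(ast.parse(src))
    return ast.dump(tree)  # line/column attributes are excluded by default


def dedupe(solutions):
    """Keep the first solution seen for each structural fingerprint."""
    seen = {}
    for src in solutions:
        seen.setdefault(structure_key(src), src)
    return list(seen.values())


a = "def f(x, y):\n    return x + y\n"
b = "def f(a, b):\n    # hint: add the inputs elementwise\n    return a + b\n"
c = "def f(x, y):\n    return x * y\n"
print(len(dedupe([a, b, c])))  # 2
```

Here `b` is dropped as a duplicate of `a` even though its comment might have been useful as a demonstration, which is the trade-off the failure mode describes.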

Core Entities

Models

Claude-3-5-sonnet, GPT-4o, DeepSeek-V3, Llama-3.1-405B, Qwen2.5-Coder-32B, OpenAI-o1

Metrics

Pass@1, Pass@n, input token count, Maintainability Index (MI and MIwoc)

Datasets

STeP 26-task ML-operator benchmark (constructed by authors); AIME-2024 (small generalization test)

Benchmarks

Pass@k on 26-task STeP benchmark