LLM agents that iteratively teach themselves to write ML library code for new hardware languages

February 4, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

1

Authors

Genghan Zhang, Weixin Liang, Olivia Hsu, Kunle Olukotun

Why It Matters For Business

This system can speed up early ML library development for new hardware by automating functional implementations from few examples, lowering time-to-ship and expert-hours spent.

Summary TLDR

The paper presents an agentic system and an adaptive self-improvement loop that lets LLMs generate ML library code for a new architecture-specific language (STeP) with very little example data. Key pieces: proposer + guardian LLM agents, a structural intermediate representation (IR), a fast functional simulator/verifier, and an adaptive curriculum that prioritizes hard-earned successful examples. On a 26-task STeP benchmark the system solves up to 96% of tasks and yields up to 3.9× improvement vs a baseline single LLM. The method focuses on functional correctness under limited examples, not low-level performance tuning.

Problem Statement

Writing high-performance ML libraries for new domain-specific hardware requires deep expertise in both ML and the hardware's architecture-specific programming language (ASPL). Examples are scarce during early hardware design, so automated code generation must reason deeply from limited data to produce correct library implementations.

Main Contribution

An adaptive self-improvement learning algorithm that iteratively collects and prioritizes high-quality, self-generated examples to evolve LLM agent performance.

A practical agentic system design (proposer + guardian + code generator + verifiers + structural IR) tuned to generate STeP ASPL ML operators.

A realistic 26-task benchmark in STeP and empirical results showing large gains in task completion and token efficiency vs single-model baselines.

Key Findings

Adaptive agentic system solved most benchmark tasks and greatly increased pass rates.

Numbers: Pass@n up to 0.96 (96%) on the benchmark

Self-improvement yields large gains over single LLM baselines.

Numbers: Up to 3.9× improvement vs single LLM baseline

Adaptive granularity m=3 balanced token cost and performance.

Numbers: m=3 saves 1.07× tokens vs m=4 and gives 1.5× the performance of m=1

Self-improvement outperforms supervised finetuning (SFT) on these limited-data tasks.

Numbers: SFT final Pass@n 0.62 vs self-improved 0.81

Results

Pass@n (Claude-3-5-sonnet, agentic self-improved)

Value: 0.885 avg; up to 1.0 on many groups

Baseline: Claude single model lower (see Table 4)

Pass@n improvement (GPT-4o with self-improvement)

Value: 0.23 → 0.81

Baseline: GPT-4o single

SFT

Value: Finetuned final 0.62

Baseline: Self-improved final 0.81

Average wall time per task

Value: < 10 minutes

Who Should Care

What To Try In 7 Days

Prototype a fast functional verifier for your target domain (unit tests + static checks).

Define a compact structural IR that captures tasks and reduces prompt tokens.

Run a 2-agent setup: a proposer LLM plus a verifier/guardian that checks global constraints.
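
The three steps above can be wired together in a few lines. The following is a minimal sketch, not the paper's implementation: `solve_task`, `toy_propose`, `toy_guard`, and `toy_verify` are hypothetical names, and the toy callables stand in for real LLM calls and a real functional simulator.

```python
from typing import Callable, Optional


def solve_task(
    task: str,
    propose: Callable[[str, int], str],
    guard: Callable[[str, str], str],
    verify: Callable[[str, str], bool],
    max_samples: int = 4,
) -> Optional[str]:
    """Return the first candidate that passes the verifier, else None."""
    for attempt in range(max_samples):
        candidate = propose(task, attempt)   # proposer drafts code
        checked = guard(task, candidate)     # guardian may rewrite it
        if verify(task, checked):            # functional test decides
            return checked
    return None


# Toy stand-ins (assumptions for illustration only).
def toy_propose(task: str, attempt: int) -> str:
    # The first draft is wrong on purpose; later drafts are right.
    op = "-" if attempt == 0 else "+"
    return f"def add(a, b): return a {op} b"


def toy_guard(task: str, code: str) -> str:
    # A real guardian would check global constraints; this one only trims.
    return code.strip()


def toy_verify(task: str, code: str) -> bool:
    # Toy functional check; never exec untrusted model output unsandboxed.
    namespace: dict = {}
    exec(code, namespace)
    return namespace["add"](2, 3) == 5


if __name__ == "__main__":
    print(solve_task("add", toy_propose, toy_guard, toy_verify))
```

Keeping the verifier as the only pass/fail authority means a guardian rewrite that breaks a correct draft is caught rather than silently accepted.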

Agent Features

Memory

  • earned-experience buffer D (stores filtered correct solutions)
  • stratified demonstration selection (difficulty bins)
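
One way to picture stratified demonstration selection from the buffer is sketched below. It assumes difficulty is measured by the fraction of samples that passed for a task (`pass_rate`), which is an assumption of this sketch; the `Experience` class and both function names are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Experience:
    task: str
    solution: str
    pass_rate: float  # assumed difficulty proxy: lower means harder-earned


def stratify(buffer: List[Experience], n_bins: int = 3) -> List[List[Experience]]:
    """Split the buffer into difficulty bins, hardest bin first."""
    ordered = sorted(buffer, key=lambda e: e.pass_rate)
    bins: List[List[Experience]] = [[] for _ in range(n_bins)]
    for i, exp in enumerate(ordered):
        bins[i * n_bins // len(ordered)].append(exp)
    return bins


def select_demos(buffer: List[Experience], k: int, n_bins: int = 3,
                 seed: int = 0) -> List[Experience]:
    """Take up to k demonstrations, draining harder bins before easier ones."""
    rng = random.Random(seed)
    demos: List[Experience] = []
    for difficulty_bin in stratify(buffer, n_bins):
        rng.shuffle(difficulty_bin)  # random order within a bin
        demos.extend(difficulty_bin)
    return demos[:k]
```

Shuffling inside each bin keeps the prompt from always showing the same examples while still favoring hard-earned ones.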

Planning

  • adaptive self-improvement loop (iterative sampling and selection)
  • difficulty-stratified curriculum (hard→mixed→easy)
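
The loop can be sketched as follows, under simplifying assumptions: `attempt` is a hypothetical callable that wraps sampling plus verification, and "hard-earned" is approximated by the number of failed samples before the first success. This illustrates the idea and is not the authors' algorithm.

```python
from typing import Callable, Dict, List, Optional, Tuple


def self_improve(
    tasks: List[str],
    attempt: Callable[[str, List[str]], Optional[str]],
    rounds: int = 3,
    samples: int = 4,
) -> Dict[str, Tuple[str, int]]:
    """Iteratively solve tasks, reusing verified solutions as demonstrations."""
    buffer: Dict[str, Tuple[str, int]] = {}  # task -> (solution, failures)
    for _ in range(rounds):
        # Solutions that needed more failed samples come first in the prompt.
        demos = [sol for sol, _ in
                 sorted(buffer.values(), key=lambda entry: -entry[1])]
        progressed = False
        for task in tasks:
            if task in buffer:
                continue  # already solved in an earlier round
            for failures in range(samples):
                solution = attempt(task, demos)
                if solution is not None:
                    buffer[task] = (solution, failures)
                    progressed = True
                    break
        if not progressed:
            break  # no new experience was earned; stop early
    return buffer


# Toy attempt function (assumption): harder tasks need more demonstrations.
def toy_attempt(task: str, demos: List[str]) -> Optional[str]:
    needed = {"a": 0, "b": 1, "c": 2}[task]
    return f"solution-{task}" if len(demos) >= needed else None
```

In the toy setup, task "c" only becomes solvable in the third round, once solutions to "a" and "b" are available as demonstrations.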

Tool Use

  • verifier/simulator (functional tests)
  • code generator to pytestable Python
  • AST-based filtering and grouping

Frameworks

  • structural intermediate representation (IR)
  • AST isomorphism for deduplication
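
As a rough illustration of AST-based deduplication, the sketch below treats two Python solutions as duplicates when their syntax trees match after identifiers are renamed in order of first appearance. This is a simple approximation of isomorphism using the standard `ast` module; the helper names are hypothetical, and the paper's actual check may differ.

```python
import ast
from typing import Dict, List


class _Canonicalize(ast.NodeTransformer):
    """Rename identifiers to v0, v1, ... in order of first appearance."""

    def __init__(self) -> None:
        self.names: Dict[str, str] = {}

    def _canon(self, name: str) -> str:
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.AST:
        node.name = self._canon(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node: ast.arg) -> ast.AST:
        node.arg = self._canon(node.arg)
        return node

    def visit_Name(self, node: ast.Name) -> ast.AST:
        node.id = self._canon(node.id)
        return node


def structure_key(source: str) -> str:
    """Return a string that is equal for structurally identical programs."""
    tree = _Canonicalize().visit(ast.parse(source))
    return ast.dump(tree)  # line numbers are omitted by default


def dedupe(solutions: List[str]) -> List[str]:
    """Keep the first solution seen for each distinct structure."""
    seen, kept = set(), []
    for source in solutions:
        key = structure_key(source)
        if key not in seen:
            seen.add(key)
            kept.append(source)
    return kept
```

Because `ast.parse` discards comments, two solutions that differ only in comments collapse into one, which is the same effect that can drop useful reasoning hints.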

Is Agentic

true

Architectures

  • proposer + guardian (two-agent organization)
  • structural IR mediated pipeline

Collaboration

  • multi-agent coordination between proposer and guardian
  • central controller orchestrates sampling and demonstration selection

Optimization Features

Token Efficiency

  • structural IR reduces prompt redundancy
  • adaptive granularity m=3 saved 1.07× tokens vs m=4

System Optimization

  • parallel sampling of candidates to collect experiences quickly
  • filtering to keep compact representative solutions

Training Optimization

  • self-generated dataset to improve model behavior instead of full finetuning
  • task-level reward and selection instead of token-level reward

Inference Optimization

  • adaptive test-time compute (more samples for harder tasks)
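
A simple way to spend more samples on harder tasks is to sample in growing batches and stop at the first verified success. The sketch below assumes a hypothetical `sample_and_verify` callable and a doubling schedule; both are illustrative choices rather than details taken from the paper.

```python
from typing import Callable, Optional, Tuple


def adaptive_sample(
    task: str,
    sample_and_verify: Callable[[str], Optional[str]],
    initial: int = 2,
    max_total: int = 16,
) -> Tuple[Optional[str], int]:
    """Return (solution or None, number of samples actually used)."""
    used, batch = 0, initial
    while used < max_total:
        batch = min(batch, max_total - used)  # never exceed the budget
        for _ in range(batch):
            used += 1
            result = sample_and_verify(task)
            if result is not None:
                return result, used  # easy tasks exit after few samples
        batch *= 2  # still unsolved: widen the next batch
    return None, used
```

Easy tasks return after one or two samples, so the remaining budget is effectively reserved for tasks that keep failing.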

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Focuses on functional correctness; it does not guarantee hardware-level performance or optimizations.
  • Requires a fast, accurate verifier/simulator to score candidates; building one can be costly.
  • Guardian agent can sometimes corrupt correct proposer outputs (noted decrease in Pass@1 in some splits).
  • Filtering heuristics (e.g., minimal code length) can strip explanatory comments or useful reasoning traces.

When Not To Use

  • When no reliable automatic verifier or unit tests exist for the target task.
  • When the primary goal is low-level performance tuning rather than functional correctness.
  • When abundant labeled data exists and supervised finetuning is already feasible.

Failure Modes

  • Guardian fixes can modify correct solutions, reducing single-sample success (guardian corruption).
  • AST-based deduplication may drop semantically useful variants or chain-of-thought hints.
  • Simulator mismatch: unit-test shapes may not catch bugs that appear on real hardware.
  • Overfitting to oracle test inputs and shapes if verifier uses limited input distributions.

Core Entities

Models

  • Claude-3-5-sonnet
  • GPT-4o
  • DeepSeek-V3
  • Llama-3.1-405B
  • Qwen2.5-Coder-32B
  • OpenAI-o1

Metrics

  • Pass@1
  • Pass@n
  • Input token count
  • Maintainability index (MI and MIwoc)
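
For reference, Pass@1 and Pass@n can both be computed with the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn for a task and c the number that pass. The sketch below assumes the summary's metrics follow this standard definition; the function names and the sample counts in the example are made up.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(per_task: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks given (n samples, c correct) per task."""
    return sum(pass_at_k(n, c, k) for n, c in per_task) / len(per_task)


if __name__ == "__main__":
    # Hypothetical results for four tasks with four samples each.
    results = [(4, 2), (4, 0), (4, 4), (4, 1)]
    print(benchmark_pass_at_k(results, k=1))  # Pass@1
    print(benchmark_pass_at_k(results, k=4))  # Pass@n with n = 4
```

With k = 1 the estimator reduces to the fraction of passing samples, and with k = n it reduces to whether any sample passed, which is why Pass@n is always at least Pass@1.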

Datasets

  • STeP 26-task ML-operator benchmark (constructed by authors)
  • AIME-2024 (small generalization test)

Benchmarks

  • Pass@k on 26-task STeP benchmark