Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
This system can speed up early ML library development for new hardware by automating functional implementations from few examples, lowering time-to-ship and expert-hours spent.
Summary TLDR
The paper presents an agentic system and an adaptive self-improvement loop that lets LLMs generate ML library code for a new architecture-specific language (STeP) with very little example data. Key pieces: proposer + guardian LLM agents, a structural intermediate representation (IR), a fast functional simulator/verifier, and an adaptive curriculum that prioritizes hard-earned successful examples. On a 26-task STeP benchmark the system solves up to 96% of tasks and yields up to 3.9× improvement vs a baseline single LLM. The method focuses on functional correctness under limited examples, not low-level performance tuning.
Problem Statement
Writing high-performance ML libraries in an architecture-specific programming language (ASPL) for new hardware requires deep expertise in both ML and the ASPL itself. Examples are scarce during early hardware design, so automated code generation must reason deeply from limited data to produce correct library implementations.
Main Contribution
An adaptive self-improvement learning algorithm that iteratively collects and prioritizes high-quality, self-generated examples to evolve LLM agent performance.
A practical agentic system design (proposer + guardian + code generator + verifiers + structural IR) tuned to generate STeP ASPL ML operators.
A realistic 26-task benchmark in STeP and empirical results showing large gains in task completion and token efficiency vs single-model baselines.
Key Findings
The adaptive agentic system solved up to 96% of the 26 benchmark tasks.
The self-improving system yields up to a 3.9× improvement over a single-LLM baseline.
Adaptive granularity m=3 balanced token cost and performance.
Self-improvement outperforms supervised finetuning on these limited-data tasks.
Results
Pass@n (Claude-3-5-sonnet, agentic self-improved)
Pass@n improvement (GPT-4o with self-improvement)
SFT
Average wall time per task
Who Should Care
What To Try In 7 Days
Prototype a fast functional verifier for your target domain (unit tests + static checks).
Define a compact structural IR that captures tasks and reduces prompt tokens.
Run a 2-agent setup: a proposer LLM plus a verifier/guardian that checks global constraints.
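A fast functional verifier of the kind suggested above can be sketched as a differential test: run a candidate against a trusted reference on randomized inputs and reject it on any mismatch or crash. This is a minimal illustration in plain Python, not the STeP simulator; `verify`, `gen_vector`, and the ReLU functions are hypothetical names chosen for the example.

```python
import random
from typing import Callable, Sequence


def verify(candidate: Callable, reference: Callable,
           input_gen: Callable[[random.Random], Sequence],
           trials: int = 50, seed: int = 0) -> bool:
    """Return True if `candidate` matches `reference` on `trials` random inputs."""
    rng = random.Random(seed)  # fixed seed keeps verification reproducible
    for _ in range(trials):
        args = input_gen(rng)
        try:
            if candidate(*args) != reference(*args):
                return False
        except Exception:
            return False  # a crash counts as a functional failure
    return True


def gen_vector(rng: random.Random) -> tuple:
    """Hypothetical input generator: one integer vector of random length."""
    n = rng.randint(1, 8)  # vary the shape so a single fixed size cannot hide bugs
    return ([rng.randint(-5, 5) for _ in range(n)],)


def reference_relu(xs):
    return [max(0, x) for x in xs]


def good_relu(xs):
    return [x if x > 0 else 0 for x in xs]


def buggy_relu(xs):
    return [abs(x) for x in xs]  # wrong for negative inputs
```

Varying input shapes and values in the generator is one way to reduce the risk of a candidate passing only because the tests cover a narrow input distribution.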
Agent Features
Memory
- earned-experience buffer D (stores filtered correct solutions)
- stratified demonstration selection (difficulty bins)
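The buffer and difficulty bins listed above might look roughly like the following sketch. It assumes difficulty is measured by a task's pass rate among sampled candidates (lower means harder) and that demonstrations are drawn from the hardest bins first; both choices, and the names `Experience`, `ExperienceBuffer`, and `difficulty_bin`, are assumptions for illustration rather than details taken from the paper.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Experience:
    task: str
    solution: str
    pass_rate: float  # fraction of sampled candidates that passed for this task


def difficulty_bin(pass_rate: float, num_bins: int = 3) -> int:
    """Map a pass rate in [0, 1] to a bin index; bin 0 holds the hardest tasks."""
    return min(int(pass_rate * num_bins), num_bins - 1)


class ExperienceBuffer:
    """Stores verified solutions, grouped by how hard they were to obtain."""

    def __init__(self, num_bins: int = 3):
        self.num_bins = num_bins
        self.bins = defaultdict(list)

    def add(self, exp: Experience) -> None:
        self.bins[difficulty_bin(exp.pass_rate, self.num_bins)].append(exp)

    def select(self, k: int, rng: random.Random) -> list:
        """Pick up to k demonstrations, drawing from harder bins first."""
        chosen = []
        for b in range(self.num_bins):
            pool = list(self.bins.get(b, []))
            rng.shuffle(pool)  # random choice within a bin
            chosen.extend(pool[: k - len(chosen)])
            if len(chosen) >= k:
                break
        return chosen
```

Shuffling within a bin while ordering across bins keeps the prompt biased toward hard-earned examples without always reusing the same ones.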
Planning
- adaptive self-improvement loop (iterative sampling and selection)
- difficulty-stratified curriculum (hard→mixed→easy)
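As a rough sketch of the loop described above: each round samples candidates for unsolved tasks, keeps the ones the verifier accepts, and adds them to the demonstrations used in the next round. The `propose` and `verify` callables stand in for the LLM and the simulator, and the toy versions below are purely hypothetical; the ordering by pass rate is an assumed way to prioritize hard-earned examples.

```python
import random


def self_improve(tasks, propose, verify, rounds=3, samples=4, seed=0):
    """Iteratively solve tasks, feeding verified solutions back as demonstrations."""
    rng = random.Random(seed)
    demos = []   # (task, solution, pass_rate) tuples, hardest-earned first
    solved = {}
    for _ in range(rounds):
        earned = []
        for task in tasks:
            if task in solved:
                continue
            candidates = [propose(task, demos, rng) for _ in range(samples)]
            passing = [c for c in candidates if verify(task, c)]
            if passing:
                solved[task] = passing[0]
                earned.append((task, passing[0], len(passing) / samples))
        # New experiences become visible only in the next round.
        demos.extend(earned)
        demos.sort(key=lambda d: d[2])  # low pass rate means hard-earned
    return solved, demos


# Toy stand-ins: a task becomes solvable once enough demonstrations exist.
REQUIRED_DEMOS = {"add": 0, "matmul": 1, "attention": 2}


def toy_propose(task, demos, rng):
    return "ok" if len(demos) >= REQUIRED_DEMOS[task] else "bad"


def toy_verify(task, candidate):
    return candidate == "ok"
```

With the toy stand-ins, one round solves only `add`, while three rounds solve all three tasks, which mirrors how earlier successes unlock harder tasks.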
Tool Use
- verifier/simulator (functional tests)
- code generator that emits pytest-compatible Python
- AST-based filtering and grouping
Frameworks
- structural intermediate representation (IR)
- AST isomorphism for deduplication
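AST-based deduplication can be illustrated with Python's standard `ast` module: two snippets are treated as duplicates when their syntax trees match after locally bound identifiers are renamed to canonical placeholders. This is a simplified stand-in for the isomorphism check, not the system's own implementation; `structure_key` and `dedupe` are hypothetical helpers, and keeping the shortest member of each group is an assumption based on the minimal-code-length heuristic mentioned under Limitations.

```python
import ast


def structure_key(source: str) -> str:
    """Return a string that is equal for snippets differing only in
    locally bound names, comments, and formatting."""
    tree = ast.parse(source)
    mapping = {}

    def canon(name: str) -> str:
        return mapping.setdefault(name, f"v{len(mapping)}")

    # Pass 1: collect names the snippet binds itself (functions, args, targets).
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            canon(node.name)
        elif isinstance(node, ast.arg):
            canon(node.arg)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            canon(node.id)

    # Pass 2: rename only those names; builtins such as max/min stay intact.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            node.name = mapping[node.name]
        elif isinstance(node, ast.arg):
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]
    return ast.dump(tree)


def dedupe(solutions):
    """Keep the shortest source text from each structurally identical group."""
    groups = {}
    for src in solutions:
        key = structure_key(src)
        if key not in groups or len(src) < len(groups[key]):
            groups[key] = src
    return list(groups.values())
```

Because comments never reach the syntax tree, this kind of grouping also discards explanatory comments, which is consistent with the limitation noted for the filtering heuristics.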
Is Agentic
true
Architectures
- proposer + guardian (two-agent organization)
- structural IR mediated pipeline
Collaboration
- multi-agent coordination between proposer and guardian
- central controller orchestrates sampling and demonstration selection
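The proposer and guardian coordination can be pictured as a small controller function. In this hedged sketch the three callables are placeholders for the LLM agents and the verifier; the fallback to the proposer's draft when the guardian's revision fails is an assumption added for illustration (motivated by the guardian-corruption issue noted under Limitations), not a documented part of the system.

```python
from typing import Callable, Optional


def run_pipeline(task: str,
                 proposer: Callable[[str], str],
                 guardian: Callable[[str, str], str],
                 verify: Callable[[str, str], bool]) -> Optional[str]:
    """Propose, let the guardian revise, then verify; return a passing solution or None."""
    draft = proposer(task)
    revised = guardian(task, draft)
    if verify(task, revised):
        return revised
    # Assumed safeguard: the guardian may have broken a correct draft.
    if verify(task, draft):
        return draft
    return None


# Hypothetical stand-ins used only to exercise the control flow.
def toy_proposer(task):
    return "a + b"


def corrupting_guardian(task, draft):
    return draft.replace("+", "-")


def exact_verify(task, code):
    return code == "a + b"
```

For example, `run_pipeline("add", toy_proposer, corrupting_guardian, exact_verify)` returns the original draft because the guardian's revision fails verification while the draft passes.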
Optimization Features
Token Efficiency
- structural IR reduces prompt redundancy
- adaptive granularity m=3 reduced token usage by a factor of 1.07 relative to m=4
System Optimization
- parallel sampling of candidates to collect experiences quickly
- filtering to keep compact representative solutions
Training Optimization
- self-generated dataset to improve model behavior instead of full finetuning
- task-level reward and selection instead of token-level reward
Inference Optimization
- adaptive test-time compute (more samples for harder tasks)
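One simple way to spend more samples on harder tasks is to sample in growing batches and stop at the first verified success. The sketch below assumes a doubling schedule and a fixed budget; `adaptive_sample`, `make_proposer`, and the schedule itself are illustrative assumptions, not the paper's exact policy.

```python
import itertools


def adaptive_sample(task, propose, verify, max_samples=32, first_batch=2):
    """Sample in doubling batches until one candidate passes or the budget runs out.

    Returns (solution_or_None, number_of_samples_used).
    """
    used, batch = 0, first_batch
    while used < max_samples:
        batch = min(batch, max_samples - used)  # never exceed the budget
        # Each batch could be generated in parallel in a real system.
        candidates = [propose(task) for _ in range(batch)]
        used += batch
        passing = [c for c in candidates if verify(task, c)]
        if passing:
            return passing[0], used
        batch *= 2  # harder task: widen the next batch
    return None, used


def make_proposer(succeed_on: int):
    """Hypothetical proposer that starts returning 'ok' on its n-th call."""
    calls = itertools.count(1)
    return lambda task: "ok" if next(calls) >= succeed_on else "bad"


def is_ok(task, candidate):
    return candidate == "ok"
```

An easy task (success on the first call) costs one batch of 2 samples, while a task that first succeeds on the fifth call costs 6 samples across two batches.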
Reproducibility
Code Urls
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Focuses on functional correctness; it does not guarantee hardware-level performance or optimizations.
- Requires a fast, accurate verifier/simulator to score candidates—building that can be costly.
- Guardian agent can sometimes corrupt correct proposer outputs (noted decrease in Pass@1 in some splits).
- Filtering heuristics (e.g., minimal code length) can strip explanatory comments or useful reasoning traces.
When Not To Use
- When no reliable automatic verifier or unit tests exist for the target task.
- When the primary goal is low-level performance tuning rather than functional correctness.
- When abundant labeled data exists and supervised finetuning is already feasible.
Failure Modes
- Guardian fixes can modify correct solutions, reducing single-sample success (guardian corruption).
- AST-based deduplication may drop semantically useful variants or chain-of-thought hints.
- Simulator mismatch: unit-test shapes may not catch bugs that appear on real hardware.
- Overfitting to oracle test inputs and shapes if verifier uses limited input distributions.
Core Entities
Models
- Claude-3-5-sonnet
- gpt-4o
- DeepSeek-V3
- Llama-3.1-405B
- Qwen2.5-Coder-32B
- OpenAI-o1
Metrics
- Pass@1
- Pass@n
- Input token count
- Maintainability index (MI, and MIwoc: MI without comments)
Datasets
- STeP 26-task ML-operator benchmark (constructed by authors)
- AIME-2024 (small generalization test)
Benchmarks
- Pass@k on 26-task STeP benchmark
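For readers unfamiliar with Pass@k, the sketch below uses the unbiased estimator that is common in the code-generation literature: with n samples per task of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). Whether the paper computes its Pass@1 and Pass@n numbers exactly this way is not stated in this summary, so treat the functions as an assumption-laden illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n samples containing c correct ones, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Average pass@k over tasks; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k equal to n, the estimator reduces to an indicator of whether any sample passed, so the benchmark average becomes the fraction of tasks solved at least once.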

