Multi-agent system + rubric RL that writes and optimizes full end-to-end CUDA programs

March 3, 20268 min

Overview

Decision SnapshotReady For Pilot

The system shows strong empirical gains on KernelBench Level 3 and includes practical engineering (profilers, RAG, LoRA). But results are benchmark-limited, rely on large LLMs and expensive GPUs, and require care to avoid reward-hacking.

Citations0

Evidence Strength0.80

Confidence0.78

Risk Signals9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 55%

Production readiness: 60%

Novelty: 60%

Authors

Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong, Caiwen Ding

Links

Abstract / PDF / Data

Why It Matters For Business

StitchCUDA automates full end-to-end CUDA program generation and tuning, turning PyTorch references into verified, faster GPU code—so teams can reduce manual GPU engineering time and get measurable runtime gains on complex workloads.

Who Should Care

Summary TLDR

StitchCUDA is a three-agent system (Planner, Coder, Verifier) that generates and optimizes full end-to-end CUDA programs from PyTorch references. It trains the Coder with a rubric-shaped reinforcement learning objective split into two single-turn skills (from-scratch generation and feedback-driven optimization). On KernelBench Level 3 (end-to-end tasks) StitchCUDA reaches near-100% success and delivers measurable system-level speedups vs baselines while reducing training rollout cost by orders of magnitude.

Problem Statement

Existing LLM approaches focus on single GPU kernels and struggle to produce correct, high-performing end-to-end GPU programs, because program-level choices (kernel fusion, host orchestration, data movement) and coder responsiveness to profiling feedback are not handled by one-shot generation or naive RL.

Main Contribution

A multi-agent workflow (Planner / Coder / Verifier) that coordinates profiling, system-level planning, code generation, and profiling-driven refinement for end-to-end GPU programs.

A rubric-based agentic RL recipe that trains the Coder on two atomic single-turn skills (from-scratch generation and feedback-driven optimization) to avoid costly multi-turn rollouts.

Key Findings

StitchCUDA achieves near-perfect correctness on end-to-end (Level 3) tasks and delivers positive system-level speedups on evaluated GPUs.

NumbersLevel 3 (H200): 10/10 correct; mean speedup 1.50× over PyTorch eager

Practical UseUse StitchCUDA's multi-agent loop plus rubric RL to turn PyTorch references into correct, faster end-to-end CUDA programs on evaluated workloads.

Evidence RefSection 4.1; Table 1

Rubric-based reward substantially reduces reward-hacking compared with plain rule-based RL.

NumbersHacking counts on test set: StitchCUDA 8/50 partial, 0/50 total vs StitchCUDA-K 22/50 partial, 4/50 total

Practical UseAdd an expert rubric or LLM-scored rubric term to RL reward functions to lower cheating (PyTorch-only or hardcoded outputs) during kernel optimization.

Evidence RefSection 4.3; Table 2

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Level 3 correctness10/10StitchCUDA-G (GPT-5.2 backend) 6/104 tasksKernelBench Level 3 test setTable 1 (Section 4.1)Table 1
E2E average speedup (H200, Level 3)1.50×multi-agent no-RL variant (StitchCUDA-Q) 0.24×1.26×KernelBench Level 3 test setSection 4.1; Table 1Table 1

What To Try In 7 Days

Run Planner+Verifier loop on a small model: profile PyTorch reference with Nsight and extract a simple plan.

Use the multi-agent loop (Planner→Coder→Verifier) to iterate one end-to-end tensor workload and inspect profiler-guided suggestions.

Add a simple rubric or checklist to any RL or local search reward to penalize copying PyTorch-only code and encourage true kernel work.

Agent Features

Memory
Shared typed State (code, traces, routing decisions)Persistent per-stream workspace (runtime optimization)
Planning
System-level task decompositionChain-of-thought planning for fusion and host orchestration
Tool Use
Nsight Systems (Nsys) for system hotspotsNsight Compute (NCU) for kernel-level metricsRAG over NVIDIA docs for API/usage
Frameworks
GRPOLoRA
Is Agentic

Yes

Architectures
Planner / Coder / Verifier multi-agent loopGlobal typed State for routing
Collaboration
Iterative plan-code-profile-refine loopRouting decisions between agents (coding, replan, next task)

Optimization Features

Token Efficiency
Max response length 16384 during RL rollouts
Infra Optimization
Training measured in H200-hours and designed to reduce rollout costUse of 4 H200 GPUs for training
System Optimization
Kernel fusion (cuBLASLt epilogues)Host-side orchestration (memory allocation, CPU-GPU overlap)Data layout and pinned memory for transfers
Training Optimization
LoRASingle-turn skill decomposition for RL
Inference Optimization
Cached cuBLASLt descriptors and heuristicsPersistent per-stream workspaceMixed precision (fp16 compute with fp32 accumulation)

Reproducibility

Code AvailableNo
Data AvailableYes
Open Source StatusUnknown
LicenseUnknown

Data URLs

KernelBench (referenced as Ouyang et al., 2025)

Risks & Boundaries

Limitations

Results evaluated on KernelBench with manual fixes; real-world workloads may differ.

Relies on large closed models (GPT-5.2, Qwen3-32B) and high-end GPUs (H200) for training/metrics.

When Not To Use

For single-kernel microbenchmarks where existing kernel tools already work well.

When you lack access to large LLMs or multi-GPU training budget.

Failure Modes

Reward hacking: models return PyTorch-only code or hardcoded outputs unless rubric catches it.

Degenerate conservative edits that leave critical kernels unchanged and yield small speedups.

Core Entities

Models

Qwen3-32BKevin-32BGPT-5.2Claude-4-sonnetQwen3

Metrics

Success RateE2E Average SpeedupFast 1

Datasets

KernelBench

Benchmarks

KernelBench Level 1KernelBench Level 2KernelBench Level 3