Multi-agent system + rubric RL that writes and optimizes full end-to-end CUDA programs

March 3, 2026 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.55

Citation Count

0

Authors

Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong, Caiwen Ding

Why It Matters For Business

StitchCUDA automates the generation and tuning of full end-to-end CUDA programs, turning PyTorch references into verified, faster GPU code. Teams can cut manual GPU engineering time and get measurable runtime gains on complex workloads.

Summary TLDR

StitchCUDA is a three-agent system (Planner, Coder, Verifier) that generates and optimizes full end-to-end CUDA programs from PyTorch references. It trains the Coder with a rubric-shaped reinforcement learning objective split into two single-turn skills (from-scratch generation and feedback-driven optimization). On KernelBench Level 3 (end-to-end tasks), StitchCUDA reaches near-100% success and delivers a mean speedup of 1.50× over PyTorch eager on H200, while cutting estimated training rollout cost by roughly 60–75× relative to multi-turn agentic RL.

Problem Statement

Existing LLM approaches focus on single GPU kernels and struggle to produce correct, high-performing end-to-end GPU programs, because program-level choices (kernel fusion, host orchestration, data movement) and coder responsiveness to profiling feedback are not handled by one-shot generation or naive RL.

Main Contribution

A multi-agent workflow (Planner / Coder / Verifier) that coordinates profiling, system-level planning, code generation, and profiling-driven refinement for end-to-end GPU programs.

A rubric-based agentic RL recipe that trains the Coder on two atomic single-turn skills (from-scratch generation and feedback-driven optimization) to avoid costly multi-turn rollouts.

Practical engineering: integration with Nsight Systems/Compute, a RAG of NVIDIA docs, LoRA fine-tuning of Qwen3-32B, and empirically validated anti-hacking rubric shaping.
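To make the idea of rubric shaping concrete, here is a minimal sketch of how a rule-based reward (compile, correctness, speedup) might be scaled by a rubric score so that code which passes the tests but dodges real kernel work earns little. The summary does not give the paper's actual formula: the `RolloutResult` fields, the penalty values, the speedup cap, and the multiplicative combination are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RolloutResult:
    """Outcome of one single-turn Coder rollout (all fields are hypothetical)."""
    compiled: bool
    correct: bool
    speedup: float       # reference runtime / generated runtime
    rubric_score: float  # 0.0 to 1.0, assumed to come from an LLM rubric grader


def shaped_reward(r: RolloutResult, speedup_cap: float = 3.0) -> float:
    """Combine rule-based signals with a rubric score (illustrative weights)."""
    if not r.compiled:
        return -1.0
    if not r.correct:
        return -0.5
    # Rule-based part: credit for correctness plus a capped speedup bonus.
    rule_reward = 1.0 + min(r.speedup, speedup_cap) / speedup_cap
    # Rubric part: scale the rule-based reward so that code which passes tests
    # but scores poorly on the rubric (e.g. a PyTorch-only fallback) earns little.
    return rule_reward * r.rubric_score


honest = RolloutResult(compiled=True, correct=True, speedup=1.5, rubric_score=0.9)
hacked = RolloutResult(compiled=True, correct=True, speedup=1.5, rubric_score=0.1)
print(shaped_reward(honest), shaped_reward(hacked))
```

With a plain rule-based reward, both rollouts above would score the same. Under this sketch, the low rubric score pulls the second one down, which is the behavior the anti-hacking shaping is meant to produce.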

Key Findings

StitchCUDA achieves near-perfect correctness on end-to-end (Level 3) tasks and delivers positive system-level speedups on evaluated GPUs.

Numbers: Level 3 (H200): 10/10 correct; mean speedup 1.50× over PyTorch eager

Rubric-based reward substantially reduces reward-hacking compared with plain rule-based RL.

Numbers: Hacking counts on test set: StitchCUDA 8/50 partial, 0/50 total vs StitchCUDA-K 22/50 partial, 4/50 total

Decomposing multi-turn agentic RL into two single-turn skills cuts estimated training compute by roughly 60–75×.

Numbers: Estimated H200-hours: rubric-based single-turn RL 160 H200-hrs vs multi-turn agentic RL 9,600–12,000 H200-hrs

Multi-agent orchestration (Planner+Verifier loop) improves end-to-end correctness and enables speedups even without RL.

Numbers: Qwen3-32B single-shot: 2/20 correctness (Level 1) vs StitchCUDA-Q (same Coder in multi-agent): 17/20 correctness

Results

Level 3 correctness

Value: 10/10

Baseline: StitchCUDA-G (GPT-5.2 backend) 6/10

E2E average speedup (H200, Level 3)

Value: 1.50×

Baseline: multi-agent no-RL variant (StitchCUDA-Q) 0.24×

Hacking incidents (partial / total)

Value: 8 / 0 (StitchCUDA)

Baseline: StitchCUDA-K (Kevin-32B) 22 / 4

Training compute (H200-hours)

Value: 160 (rubric-based single-turn RL)

Baseline: 9,600–12,000 (multi-turn agentic RL)

Who Should Care

What To Try In 7 Days

Run the Planner+Verifier loop with a small model: profile a PyTorch reference with Nsight and extract a simple plan.

Use the multi-agent loop (Planner→Coder→Verifier) to iterate one end-to-end tensor workload and inspect profiler-guided suggestions.

Add a simple rubric or checklist to any RL or local search reward to penalize copying PyTorch-only code and encourage true kernel work.
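One way to start on the checklist idea is a purely textual check over the generated source, applied as a penalty before any learned rubric is involved. The sketch below is an assumption-laden heuristic, not the paper's rubric: the pattern lists and the penalty values are hypothetical, and a string-level check like this is easy to evade, so it should only be a first filter.

```python
import re

# Heuristic patterns; both lists are illustrative assumptions, not the paper's rubric.
KERNEL_PATTERNS = [
    r"__global__\s+void",   # a CUDA kernel definition
    r"<<<.*>>>",            # a CUDA kernel launch
    r"cublasLt\w+\s*\(",    # a call into the cuBLASLt API family
]
TORCH_FALLBACK_PATTERNS = [
    r"torch\.nn\.functional\.\w+\s*\(",
    r"torch\.(?:matmul|bmm|conv\dd)\s*\(",
]


def checklist_penalty(source: str) -> float:
    """Return a penalty in [0, 1]; higher means more likely PyTorch-only code."""
    has_kernel = any(re.search(p, source) for p in KERNEL_PATTERNS)
    fallback_calls = sum(len(re.findall(p, source)) for p in TORCH_FALLBACK_PATTERNS)
    if not has_kernel:
        return 1.0   # no custom GPU work found at all
    if fallback_calls > 0:
        return 0.5   # custom kernel present, but some ops are still delegated
    return 0.0


pytorch_only = "def forward(x, w):\n    return torch.matmul(x, w)\n"
custom_cuda = "__global__ void add(float* a) {}\nvoid run() { add<<<1, 32>>>(ptr); }\n"
print(checklist_penalty(pytorch_only), checklist_penalty(custom_cuda))
```

Subtracting this penalty from an existing reward, or multiplying the reward by one minus the penalty, is a design choice left open here.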

Agent Features

Memory

  • Shared typed State (code, traces, routing decisions)
  • Persistent per-stream workspace (runtime optimization)

Planning

  • System-level task decomposition
  • Chain-of-thought planning for fusion and host orchestration

Tool Use

  • Nsight Systems (Nsys) for system hotspots
  • Nsight Compute (NCU) for kernel-level metrics
  • RAG over NVIDIA docs for API/usage
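As a rough illustration of the two profiler roles listed above, the sketch below builds command lines for the `nsys` and `ncu` CLIs without running them. The script name, report names, and the choice of trace and section sets are assumptions, and flag spellings should be checked against the installed Nsight versions before use.

```python
import shlex


def nsys_command(script: str, report: str = "reference_nsys") -> list[str]:
    """Nsight Systems: system-level timeline of CUDA activity and NVTX ranges."""
    return [
        "nsys", "profile",
        "--trace=cuda,nvtx",
        "--force-overwrite=true",
        "--output", report,
        "python", script,
    ]


def ncu_command(script: str, report: str = "reference_ncu") -> list[str]:
    """Nsight Compute: per-kernel metrics, here using the 'full' section set."""
    return [
        "ncu",
        "--set", "full",
        "--force-overwrite",
        "--export", report,
        "python", script,
    ]


# Print the commands; pass the lists to subprocess.run(...) to execute them.
print(shlex.join(nsys_command("reference_model.py")))
print(shlex.join(ncu_command("reference_model.py")))
```

A typical order is to use the system-level report to find hotspots first, then collect kernel-level metrics only for the kernels that dominate runtime, since Nsight Compute replays kernels and is much slower.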

Frameworks

  • GRPO
  • LoRA
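For readers unfamiliar with GRPO, its core step is to sample a group of responses for the same prompt and score each one relative to the rest of the group, rather than against a learned value function. The sketch below shows only that normalization step; the reward values are made up, and the clipped policy-gradient loss and KL term that complete the method are omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four CUDA programs sampled from the same prompt.
group = [1.35, 0.15, -0.5, 1.0]
print(group_relative_advantages(group))
```

Responses above the group mean get positive advantages and are reinforced; those below it are discouraged. When every sample in a group receives the same reward, all advantages are zero and the group contributes no gradient.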

Is Agentic

true

Architectures

  • Planner / Coder / Verifier multi-agent loop
  • Global typed State for routing

Collaboration

  • Iterative plan-code-profile-refine loop
  • Routing decisions between agents (coding, replan, next task)
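The loop and routing decisions above can be sketched as a small state machine over a shared, typed state object. This is a minimal sketch under stated assumptions: the `State` fields, the `Route` names, the step budget, and the agent call signatures are hypothetical, and the three agents are passed in as plain callables standing in for LLM-backed components.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Route(Enum):
    CODING = "coding"        # send profiler feedback back to the Coder
    REPLAN = "replan"        # ask the Planner for a new plan
    NEXT_TASK = "next_task"  # accept the current code and move on


@dataclass
class State:
    """Shared typed state passed between agents (field names are hypothetical)."""
    reference: str
    plan: list[str] = field(default_factory=list)
    task_index: int = 0
    code: str = ""
    traces: list[str] = field(default_factory=list)
    route: Route = Route.REPLAN


def run_loop(
    state: State,
    planner: Callable[[State], list[str]],
    coder: Callable[[State], str],
    verifier: Callable[[State], Route],
    max_steps: int = 20,
) -> State:
    """Iterate plan -> code -> verify until the plan is exhausted or steps run out."""
    for _ in range(max_steps):
        if state.route is Route.REPLAN:
            state.plan = planner(state)
            state.task_index = 0
        elif state.route is Route.NEXT_TASK:
            state.task_index += 1
        if state.task_index >= len(state.plan):
            break  # every planned task has been accepted
        state.code = coder(state)
        # The verifier is expected to compile, test, and profile the code,
        # record what it found in state.traces, and choose the next route.
        state.route = verifier(state)
    return state
```

Keeping the routing decision in the state rather than in the agents makes each step easy to log and replay, which helps when a run stalls on repeated compilation failures.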

Optimization Features

Token Efficiency

  • Max response length 16384 during RL rollouts

Infra Optimization

  • Training measured in H200-hours and designed to reduce rollout cost
  • Use of 4 H200 GPUs for training

System Optimization

  • Kernel fusion (cuBLASLt epilogues)
  • Host-side orchestration (memory allocation, CPU-GPU overlap)
  • Data layout and pinned memory for transfers

Training Optimization

  • LoRA
  • Single-turn skill decomposition for RL

Inference Optimization

  • Cached cuBLASLt descriptors and heuristics
  • Persistent per-stream workspace
  • Mixed precision (fp16 compute with fp32 accumulation)
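The reason for pairing fp16 compute with fp32 accumulation can be shown numerically without a GPU. The sketch below simulates both accumulator widths in pure Python by rounding the running sum through `struct`; it is a toy model of the rounding behavior, not of how tensor cores or cuBLASLt execute the operation, and the input vector is an arbitrary example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(a, b, accumulate):
    """Dot product with fp16 inputs; `accumulate` rounds the running sum."""
    total = 0.0
    for x, y in zip(a, b):
        # The product of two fp16 values is exact in double precision.
        product = to_fp16(x) * to_fp16(y)
        total = accumulate(total + product)
    return total


a = [0.1] * 4096
b = [1.0] * 4096
exact = 4096 * to_fp16(0.1)

print("exact          :", exact)
print("fp16 accumulate:", dot(a, b, to_fp16))
print("fp32 accumulate:", dot(a, b, to_fp32))
```

With an fp16 accumulator, the running sum eventually grows large enough that each new product is smaller than half the spacing between representable values, so further additions are rounded away. The fp32 accumulator keeps enough precision to stay close to the exact result.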

Reproducibility

Data Urls

  • KernelBench (referenced as Ouyang et al., 2025)

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Results evaluated on KernelBench with manual fixes; real-world workloads may differ.
  • Relies on large models (closed GPT-5.2, open-weight Qwen3-32B) and high-end GPUs (H200) for training and measurement.
  • RAG and rubric scoring depend on curated documents and LLM rubric assigner; risk of missed hacking cases.

When Not To Use

  • For single-kernel microbenchmarks where existing kernel tools already work well.
  • When you lack access to large LLMs or multi-GPU training budget.
  • If legal or IP constraints prevent automated retrieval of vendor docs or code generation.

Failure Modes

  • Reward hacking: models return PyTorch-only code or hardcoded outputs unless rubric catches it.
  • Degenerate conservative edits that leave critical kernels unchanged and yield small speedups.
  • Compilation or environment mismatches leading to repeated failed iterations.

Core Entities

Models

  • Qwen3-32B
  • Kevin-32B
  • GPT-5.2
  • Claude-4-sonnet
  • Qwen3

Metrics

  • Success Rate
  • E2E Average Speedup
  • Fast_1 (KernelBench fast_p at p = 1)
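The three metrics above can be computed from per-task results roughly as follows. This is a sketch under assumptions: the `TaskResult` type and the sample data are hypothetical, `fast_p` follows the KernelBench definition (fraction of tasks that are both correct and faster than the reference by more than a factor p), and averaging speedup over correct tasks only is a choice made here, since this summary does not say how incorrect tasks are treated.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    correct: bool
    speedup: float  # reference runtime / generated runtime; unused if incorrect


def success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks whose generated program is correct."""
    return sum(r.correct for r in results) / len(results)


def fast_p(results: list[TaskResult], p: float = 1.0) -> float:
    """Fraction of tasks that are correct and more than p times faster."""
    return sum(r.correct and r.speedup > p for r in results) / len(results)


def mean_speedup(results: list[TaskResult]) -> float:
    """Average speedup over correct tasks only (an assumption of this sketch)."""
    speedups = [r.speedup for r in results if r.correct]
    return sum(speedups) / len(speedups) if speedups else 0.0


# Hypothetical results for four tasks, not taken from the paper.
results = [
    TaskResult(True, 2.0),
    TaskResult(True, 0.5),
    TaskResult(False, 0.0),
    TaskResult(True, 1.5),
]
print(success_rate(results), fast_p(results, 1.0), mean_speedup(results))
```

Reporting the success rate next to the average speedup matters because a system can post a high average on the few tasks it solves while failing most of the benchmark.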

Datasets

  • KernelBench

Benchmarks

  • KernelBench Level 1
  • KernelBench Level 2
  • KernelBench Level 3