Multi-agent system + rubric RL that writes and optimizes full end-to-end CUDA programs

Overview

Decision Snapshot: Ready For Pilot

The system shows strong empirical gains on KernelBench Level 3 and includes practical engineering (profilers, RAG, LoRA). But results are benchmark-limited, rely on large LLMs and expensive GPUs, and require care to avoid reward-hacking.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 55%

Production readiness: 60%

Novelty: 60%

Authors

Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong, Caiwen Ding

Links

Abstract / PDF / Data

Why It Matters For Business

StitchCUDA automates full end-to-end CUDA program generation and tuning, turning PyTorch references into verified, faster GPU code—so teams can reduce manual GPU engineering time and get measurable runtime gains on complex workloads.

Who Should Care

ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

StitchCUDA is a three-agent system (Planner, Coder, Verifier) that generates and optimizes full end-to-end CUDA programs from PyTorch references. It trains the Coder with a rubric-shaped reinforcement learning objective split into two single-turn skills (from-scratch generation and feedback-driven optimization). On KernelBench Level 3 (end-to-end tasks) StitchCUDA reaches near-100% success and delivers measurable system-level speedups vs baselines while reducing training rollout cost by orders of magnitude.

Problem Statement

Existing LLM approaches focus on single GPU kernels and struggle to produce correct, high-performing end-to-end GPU programs, because program-level choices (kernel fusion, host orchestration, data movement) and coder responsiveness to profiling feedback are not handled by one-shot generation or naive RL.

Main Contribution

A multi-agent workflow (Planner / Coder / Verifier) that coordinates profiling, system-level planning, code generation, and profiling-driven refinement for end-to-end GPU programs.

A rubric-based agentic RL recipe that trains the Coder on two atomic single-turn skills (from-scratch generation and feedback-driven optimization) to avoid costly multi-turn rollouts.

Key Findings

StitchCUDA achieves near-perfect correctness on end-to-end (Level 3) tasks and delivers positive system-level speedups on evaluated GPUs.

Numbers: Level 3 (H200): 10/10 correct; mean speedup 1.50× over PyTorch eager

Practical Use: Use StitchCUDA's multi-agent loop plus rubric RL to turn PyTorch references into correct, faster end-to-end CUDA programs on evaluated workloads.

Evidence Ref: Section 4.1; Table 1

Rubric-based reward substantially reduces reward-hacking compared with plain rule-based RL.

Numbers: Hacking counts on test set: StitchCUDA 8/50 partial, 0/50 total vs StitchCUDA-K 22/50 partial, 4/50 total

Practical Use: Add an expert rubric or LLM-scored rubric term to RL reward functions to lower cheating (PyTorch-only or hardcoded outputs) during kernel optimization.

Evidence Ref: Section 4.3; Table 2
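One way to picture a rubric-shaped reward is a rule-based term (correctness and speedup) combined with a rubric score and a penalty for detected hacking. The sketch below is illustrative only: the weights, the speedup cap, the `RolloutResult` fields, and the function name `rubric_reward` are assumptions and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class RolloutResult:
    """Hypothetical summary of one Coder rollout (field names are assumptions)."""
    compiled: bool        # did the generated program build?
    correct: bool         # did outputs match the PyTorch reference?
    speedup: float        # reference_time / candidate_time
    rubric_score: float   # expert or LLM-scored rubric, normalized to [0, 1]
    hacked: bool          # e.g. PyTorch-only code or hardcoded outputs


def rubric_reward(r: RolloutResult,
                  rubric_weight: float = 0.5,
                  speedup_cap: float = 3.0,
                  hack_penalty: float = 1.0,
                  partial_credit: float = 0.5) -> float:
    """Combine a rule-based term with a rubric term (illustrative weights)."""
    if r.hacked:
        # Detected cheating is penalized regardless of measured speed.
        return -hack_penalty
    if not r.compiled or not r.correct:
        # A failed attempt earns only discounted rubric credit.
        return partial_credit * rubric_weight * r.rubric_score
    # Rule-based term: 1.0 for correctness plus a capped, normalized speedup bonus.
    rule_term = 1.0 + min(r.speedup, speedup_cap) / speedup_cap
    return rule_term + rubric_weight * r.rubric_score
```

The design choice worth noting is that the hacking check short-circuits everything else, so a fast but cheating rollout can never outscore an honest one.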

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Level 3 correctness	10/10	StitchCUDA-G (GPT-5.2 backend) 6/10	↑ 4 tasks	KernelBench Level 3 test set	Table 1 (Section 4.1)	Table 1
E2E average speedup (H200, Level 3)	1.50×	multi-agent no-RL variant (StitchCUDA-Q) 0.24×	↑ 1.26×	KernelBench Level 3 test set	Section 4.1; Table 1	Table 1
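
The metrics in the table can be computed from per-task results along the following lines. This is a sketch under assumptions: it uses the arithmetic mean for the average speedup, counts an incorrect program as 0×, and takes Fast 1 to mean the fraction of tasks that are both correct and faster than the baseline. The source does not spell out these conventions, and the per-task numbers below are made up for illustration.

```python
from statistics import mean


def summarize(results):
    """results: list of (correct: bool, speedup: float) pairs, one per task."""
    n = len(results)
    success_rate = sum(1 for ok, _ in results if ok) / n
    # Assumption: an incorrect program contributes 0x to the average.
    avg_speedup = mean(s if ok else 0.0 for ok, s in results)
    # Assumption: Fast 1 = fraction of tasks that are correct AND faster than 1x.
    fast_1 = sum(1 for ok, s in results if ok and s > 1.0) / n
    return success_rate, avg_speedup, fast_1


# Made-up per-task data, not from the paper.
toy = [(True, 2.0), (True, 0.5), (False, 3.0), (True, 1.5)]
success_rate, avg_speedup, fast_1 = summarize(toy)
```

The Delta column in the table is the absolute difference between the two averages (1.50 − 0.24 = 1.26), not a ratio.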

What To Try In 7 Days

Run Planner+Verifier loop on a small model: profile PyTorch reference with Nsight and extract a simple plan.

Use the multi-agent loop (Planner→Coder→Verifier) to iterate one end-to-end tensor workload and inspect profiler-guided suggestions.

Add a simple rubric or checklist to any RL or local search reward to penalize copying PyTorch-only code and encourage true kernel work.
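
A checklist of this kind can start as a plain static check on the generated source. The sketch below is an illustrative heuristic, not the paper's rubric: it flags a submission as PyTorch-only when it contains no `__global__` kernel definition, no `<<<...>>>` launch, and no cuBLASLt call. The function names and the pattern list are assumptions, and a real rubric would need stronger checks (for instance, confirming that the custom kernel is actually invoked).

```python
import re

# Patterns that suggest real CUDA work (an illustrative, incomplete list).
CUDA_EVIDENCE = [
    re.compile(r"__global__\s+void\s+\w+"),   # kernel definition
    re.compile(r"<<<.*?>>>", re.S),            # kernel launch
    re.compile(r"\bcublasLt\w+\s*\("),         # cuBLASLt API call
]


def looks_pytorch_only(source: str) -> bool:
    """Return True when no CUDA evidence pattern matches the source."""
    return not any(p.search(source) for p in CUDA_EVIDENCE)


def checklist_penalty(source: str, penalty: float = 1.0) -> float:
    """Penalty to subtract from a reward or search score (assumed scale)."""
    return penalty if looks_pytorch_only(source) else 0.0
```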

Agent Features

Memory

Shared typed State (code, traces, routing decisions)

Persistent per-stream workspace (runtime optimization)

Planning

System-level task decomposition

Chain-of-thought planning for fusion and host orchestration

Tool Use

Nsight Systems (Nsys) for system hotspots

Nsight Compute (NCU) for kernel-level metrics

RAG over NVIDIA docs for API/usage

Frameworks

GRPO

LoRA

Is Agentic

Yes

Architectures

Planner / Coder / Verifier multi-agent loop

Global typed State for routing

Collaboration

Iterative plan-code-profile-refine loop

Routing decisions between agents (coding, replan, next task)
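
The loop and routing listed above can be sketched as a small state machine over a shared typed state. Everything below is hypothetical scaffolding: the `State` fields, the `Route` values, and the callable agents stand in for LLM calls and profiler runs, and none of the names come from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Route(Enum):
    CODING = "coding"        # send feedback back to the Coder
    REPLAN = "replan"        # ask the Planner for a new plan
    NEXT_TASK = "next_task"  # accept the result and move on


@dataclass
class State:
    """Shared state passed between agents (field names are assumptions)."""
    reference: str
    plan: str = ""
    code: str = ""
    traces: List[str] = field(default_factory=list)
    route: Route = Route.REPLAN


def run_loop(state: State,
             planner: Callable[[State], str],
             coder: Callable[[State], str],
             verifier: Callable[[State], Route],
             max_steps: int = 8) -> State:
    """Iterate plan -> code -> verify until the Verifier routes to NEXT_TASK."""
    for _ in range(max_steps):
        if state.route is Route.REPLAN:
            state.plan = planner(state)
        state.code = coder(state)
        state.route = verifier(state)
        state.traces.append(state.route.value)
        if state.route is Route.NEXT_TASK:
            break
    return state
```

Passing the agents in as callables keeps the routing logic testable with stubs before any model or profiler is wired in.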

Optimization Features

Token Efficiency

Max response length 16384 during RL rollouts

Infra Optimization

Training measured in H200-hours and designed to reduce rollout cost

Use of 4 H200 GPUs for training

System Optimization

Kernel fusion (cuBLASLt epilogues)

Host-side orchestration (memory allocation, CPU-GPU overlap)

Data layout and pinned memory for transfers

Training Optimization

LoRA

Single-turn skill decomposition for RL

Inference Optimization

Cached cuBLASLt descriptors and heuristics

Persistent per-stream workspace

Mixed precision (fp16 compute with fp32 accumulation)
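
To see why mixed-precision kernels pair fp16 inputs with an fp32 accumulator, the following CPU-only sketch emulates both accumulator widths using the standard `struct` module. It is a numerical illustration under assumptions (the helper names are hypothetical, and it does not model how cuBLASLt or tensor cores are implemented).

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(a, b, accumulate):
    """Dot product with fp16 inputs and a chosen accumulator rounding."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_fp16(x) * to_fp16(y)   # inputs rounded to fp16
        acc = accumulate(acc + product)      # running sum rounded each step
    return acc


ones = [1.0] * 4096
fp16_acc = dot(ones, ones, to_fp16)  # accumulator kept in fp16
fp32_acc = dot(ones, ones, to_fp32)  # accumulator kept in fp32
```

With these inputs the fp16 accumulator stops growing at 2048, where adjacent half-precision values are 2 apart, while the fp32 accumulator reaches 4096.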

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

KernelBench (referenced as Ouyang et al., 2025)

Risks & Boundaries

Limitations

Results evaluated on KernelBench with manual fixes; real-world workloads may differ.

Relies on large models (the closed GPT-5.2 and the open-weight Qwen3-32B) and high-end GPUs (H200) for training and evaluation.

When Not To Use

For single-kernel microbenchmarks where existing kernel tools already work well.

When you lack access to large LLMs or multi-GPU training budget.

Failure Modes

Reward hacking: models return PyTorch-only code or hardcoded outputs unless rubric catches it.

Degenerate conservative edits that leave critical kernels unchanged and yield small speedups.

Core Entities

Models

Qwen3-32B

Kevin-32B

GPT-5.2

Claude-4-sonnet

Qwen3

Metrics

Success Rate

E2E Average Speedup

Fast 1

Datasets

KernelBench

Benchmarks

KernelBench Level 1

KernelBench Level 2

KernelBench Level 3
