Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.55
Citation Count
0
Why It Matters For Business
StitchCUDA automates end-to-end CUDA program generation and tuning, turning PyTorch references into verified, faster GPU code, so teams can reduce manual GPU engineering time and get measurable runtime gains on complex workloads.
Summary TLDR
StitchCUDA is a three-agent system (Planner, Coder, Verifier) that generates and optimizes full end-to-end CUDA programs from PyTorch references. It trains the Coder with a rubric-shaped reinforcement learning objective split into two single-turn skills (from-scratch generation and feedback-driven optimization). On KernelBench Level 3 (end-to-end tasks) StitchCUDA reaches near-100% success and delivers measurable system-level speedups vs baselines while reducing training rollout cost by orders of magnitude.
Problem Statement
Existing LLM approaches focus on single GPU kernels and struggle to produce correct, high-performing end-to-end GPU programs. Two things are not handled by one-shot generation or naive RL: program-level choices (kernel fusion, host orchestration, data movement) and the coder's responsiveness to profiling feedback.
Main Contribution
A multi-agent workflow (Planner / Coder / Verifier) that coordinates profiling, system-level planning, code generation, and profiling-driven refinement for end-to-end GPU programs.
A rubric-based agentic RL recipe that trains the Coder on two atomic single-turn skills (from-scratch generation and feedback-driven optimization) to avoid costly multi-turn rollouts.
Practical engineering: integration with Nsight Systems/Compute, retrieval-augmented generation (RAG) over NVIDIA docs, LoRA fine-tuning of Qwen3-32B, and empirically validated anti-hacking rubric shaping.
Key Findings
StitchCUDA achieves near-perfect correctness on end-to-end (Level 3) tasks and delivers positive system-level speedups on evaluated GPUs.
Rubric-based reward substantially reduces reward-hacking compared with plain rule-based RL.
Decomposing multi-turn agentic RL into two single-turn skills cuts training compute by orders of magnitude.
Multi-agent orchestration (Planner+Verifier loop) improves end-to-end correctness and enables speedups even without RL.
Results
Level 3 correctness
E2E average speedup (H200, Level 3)
Hacking incidents (partial / total)
Training compute (H200-hours)
Who Should Care
What To Try In 7 Days
Run the Planner+Verifier loop with a small model: profile a PyTorch reference with Nsight and extract a simple plan.
Use the multi-agent loop (Planner→Coder→Verifier) to iterate one end-to-end tensor workload and inspect profiler-guided suggestions.
Add a simple rubric or checklist to any RL or local search reward to penalize copying PyTorch-only code and encourage true kernel work.
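A checklist of this kind can start as a static scan of the generated source. The sketch below is an illustrative assumption, not the paper's rubric (which the summary describes as LLM-assigned): the regular expressions, the penalty values, and the names `rubric_penalty` and `shaped_reward` are all hypothetical.

```python
import re

# Hypothetical markers for "real kernel work" vs. "PyTorch-only" code.
KERNEL_DEF = re.compile(r"__global__\s+void\s+\w+")
KERNEL_LAUNCH = re.compile(r"<<<.*?>>>", re.DOTALL)
TORCH_FALLBACK = re.compile(r"\btorch\.(matmul|mm|bmm|addmm|nn\.functional)\b")


def rubric_penalty(source: str) -> float:
    """Return a penalty in [0, 1]; higher means more likely reward hacking."""
    has_kernel = bool(KERNEL_DEF.search(source))
    has_launch = bool(KERNEL_LAUNCH.search(source))
    if not has_kernel and not has_launch:
        return 1.0   # no custom kernel at all: PyTorch-only submission
    if not (has_kernel and has_launch):
        return 0.75  # kernel defined but never launched, or vice versa
    if TORCH_FALLBACK.search(source):
        return 0.5   # custom kernel present, but heavy ops still go to PyTorch
    return 0.0


def shaped_reward(base_reward: float, source: str) -> float:
    """Scale a correctness/speedup reward down by the rubric penalty."""
    return base_reward * (1.0 - rubric_penalty(source))
```

A string scan like this is easy to evade, so it is best treated as a cheap first filter in front of a stronger check rather than a complete defense.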
Agent Features
Memory
- Shared typed State (code, traces, routing decisions)
- Persistent per-stream workspace (runtime optimization)
Planning
- System-level task decomposition
- Chain-of-thought planning for fusion and host orchestration
Tool Use
- Nsight Systems (Nsys) for system hotspots
- Nsight Compute (NCU) for kernel-level metrics
- RAG over NVIDIA docs for API/usage
Frameworks
- GRPO
- LoRA
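For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses drawn for the same prompt. The following is a minimal sketch of that group-relative advantage only. The function name, the use of population standard deviation, and the `eps` value are assumptions; the paper's exact training configuration is not given in this summary.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    are compared against their siblings instead of a learned value baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for one task, scored by some reward function.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every sample in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason reward shaping that separates near-identical outcomes can matter.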
Is Agentic
true
Architectures
- Planner / Coder / Verifier multi-agent loop
- Global typed State for routing
Collaboration
- Iterative plan-code-profile-refine loop
- Routing decisions between agents (coding, replan, next task)
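The plan-code-profile-refine loop and its routing decisions can be sketched with a shared state object and three callables. This is a hypothetical skeleton under stated assumptions: the `State` fields, the route names, and the signatures of `planner`, `coder`, and `verifier` are invented for illustration and are not taken from the StitchCUDA implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class State:
    """Shared typed state passed between agents (field names are assumed)."""
    tasks: List[str]
    code: str = ""
    traces: List[str] = field(default_factory=list)
    route: str = "coding"  # "coding" | "replan" | "next_task" | "done"
    task_index: int = 0


def run_loop(
    reference: str,
    planner: Callable[[str, List[str]], List[str]],
    coder: Callable[[str, str, List[str]], str],
    verifier: Callable[[str, str], Tuple[str, str]],
    max_steps: int = 20,
) -> State:
    state = State(tasks=planner(reference, []))
    for _ in range(max_steps):
        if state.task_index >= len(state.tasks):
            state.route = "done"
            return state
        task = state.tasks[state.task_index]
        # Coder sees the current task, the code so far, and prior feedback.
        state.code = coder(task, state.code, state.traces)
        # Verifier compiles/profiles and decides where to route next.
        state.route, feedback = verifier(state.code, task)
        state.traces.append(feedback)
        if state.route == "next_task":
            state.task_index += 1
        elif state.route == "replan":
            state.tasks = planner(reference, state.traces)
            state.task_index = 0
        # route == "coding": retry the same task with the new feedback
    if state.task_index >= len(state.tasks):
        state.route = "done"
    return state
```

In practice each callable would wrap an LLM call or a profiler run; keeping them as plain functions makes the routing logic testable with stubs before any model is attached.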
Optimization Features
Token Efficiency
- Max response length 16384 during RL rollouts
Infra Optimization
- Training measured in H200-hours and designed to reduce rollout cost
- Use of 4 H200 GPUs for training
System Optimization
- Kernel fusion (cuBLASLt epilogues)
- Host-side orchestration (memory allocation, CPU-GPU overlap)
- Data layout and pinned memory for transfers
Training Optimization
- LoRA
- Single-turn skill decomposition for RL
Inference Optimization
- Cached cuBLASLt descriptors and heuristics
- Persistent per-stream workspace
- Mixed precision (fp16 compute with fp32 accumulation)
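The reason for pairing fp16 compute with fp32 accumulation can be seen in a small numerical experiment. The sketch below emulates the two accumulator widths on the CPU using the standard library's half- and single-precision packing. It is an illustration of the rounding behavior only, not GPU code and not the paper's kernels, and the helper names are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(a, b, round_acc):
    """Dot product with fp16 inputs and an accumulator rounded by round_acc."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp16(x) * to_fp16(y)  # inputs are stored in fp16
        acc = round_acc(acc + prod)     # accumulator precision under test
    return acc


if __name__ == "__main__":
    n = 4096
    a, b = [0.1] * n, [1.0] * n
    print("fp16 accumulator:", dot(a, b, to_fp16))
    print("fp32 accumulator:", dot(a, b, to_fp32))
```

With an fp16 accumulator the running sum stops growing once the addend falls below half the spacing between representable values, so long reductions such as GEMM inner products lose most of their result, while the fp32 accumulator stays close to the expected sum.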
Reproducibility
Data Urls
- KernelBench (referenced as Ouyang et al., 2025)
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Results evaluated on KernelBench with manual fixes; real-world workloads may differ.
- Relies on large models (closed GPT-5.2, open-weight Qwen3-32B) and high-end GPUs (H200) for training/metrics.
- RAG and rubric scoring depend on curated documents and an LLM rubric assigner, so some hacking cases may be missed.
When Not To Use
- For single-kernel microbenchmarks where existing kernel tools already work well.
- When you lack access to large LLMs or multi-GPU training budget.
- If legal or IP constraints prevent automated retrieval of vendor docs or code generation.
Failure Modes
- Reward hacking: models return PyTorch-only code or hardcoded outputs unless rubric catches it.
- Degenerate conservative edits that leave critical kernels unchanged and yield small speedups.
- Compilation or environment mismatches leading to repeated failed iterations.
Core Entities
Models
- Qwen3-32B
- Kevin-32B
- GPT-5.2
- Claude-4-sonnet
- Qwen3
Metrics
- Success Rate
- E2E Average Speedup
- Fast_1 (KernelBench fast_p with p = 1)
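The three metrics above can be computed from per-task results as in the sketch below. The fast_p definition follows KernelBench (the fraction of all tasks that are both correct and faster than the reference by more than a factor p). Averaging speedup as an arithmetic mean over correct tasks is an assumption, since this summary does not state how the paper aggregates it, and the function and field names are hypothetical.

```python
def summarize(results, p=1.0):
    """Compute success rate, fast_p, and mean speedup.

    results: list of (correct, ref_ms, gen_ms) tuples, where ref_ms is the
    PyTorch reference runtime and gen_ms is the generated program's runtime.
    """
    n = len(results)
    speedups = [ref / gen for correct, ref, gen in results if correct]
    return {
        "success_rate": len(speedups) / n,
        # Correct AND faster than the reference by more than p.
        "fast_p": sum(1 for s in speedups if s > p) / n,
        # Assumption: arithmetic mean over correct tasks only.
        "avg_speedup": sum(speedups) / len(speedups) if speedups else 0.0,
    }


if __name__ == "__main__":
    runs = [(True, 10.0, 5.0), (True, 10.0, 20.0),
            (False, 10.0, 1.0), (True, 8.0, 4.0)]
    print(summarize(runs))
```

Note that an incorrect program contributes nothing to fast_p even if it runs quickly, which is what makes the metric resistant to "fast but wrong" submissions.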
Datasets
- KernelBench
Benchmarks
- KernelBench Level 1
- KernelBench Level 2
- KernelBench Level 3

