Overview
The system shows strong empirical gains on KernelBench Level 3 and includes practical engineering (profilers, RAG, LoRA). However, the evidence is limited to a single benchmark, the method depends on large LLMs and expensive GPUs, and training requires care to avoid reward hacking.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.78
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 55%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
StitchCUDA automates end-to-end CUDA program generation and tuning, turning PyTorch references into verified, faster GPU code. Teams can reduce manual GPU engineering time and get measurable runtime gains on complex workloads.
Summary TLDR
StitchCUDA is a three-agent system (Planner, Coder, Verifier) that generates and optimizes full end-to-end CUDA programs from PyTorch references. It trains the Coder with a rubric-shaped reinforcement learning objective split into two single-turn skills (from-scratch generation and feedback-driven optimization). On KernelBench Level 3 (end-to-end tasks) StitchCUDA reaches near-100% success and delivers measurable system-level speedups vs baselines while reducing training rollout cost by orders of magnitude.
Problem Statement
Existing LLM approaches focus on single GPU kernels and struggle to produce correct, high-performing end-to-end GPU programs, because program-level choices (kernel fusion, host orchestration, data movement) and coder responsiveness to profiling feedback are not handled by one-shot generation or naive RL.
Main Contribution
A multi-agent workflow (Planner / Coder / Verifier) that coordinates profiling, system-level planning, code generation, and profiling-driven refinement for end-to-end GPU programs.
A rubric-based agentic RL recipe that trains the Coder on two atomic single-turn skills (from-scratch generation and feedback-driven optimization) to avoid costly multi-turn rollouts.
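The Planner / Coder / Verifier loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Verdict` structure, the `refine` function, the callable signatures, and the toy stubs are all hypothetical names introduced here. The sketch assumes the Coder is called once per round, from scratch on the first round and with the previous code plus profiler feedback afterwards, which mirrors the two single-turn skills named in the second contribution.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Verdict:
    """Result of one verification pass (hypothetical structure)."""
    correct: bool
    speedup: float   # assumed: reference runtime / candidate runtime
    feedback: str    # profiler-derived hints for the next round


def refine(
    reference: str,
    planner: Callable[[str], str],
    coder: Callable[[str, Optional[str], Optional[str]], str],
    verifier: Callable[[str, str], Verdict],
    max_rounds: int = 3,
) -> Tuple[Optional[str], float]:
    """Return the fastest correct candidate found and its speedup."""
    plan = planner(reference)
    best_code, best_speedup = None, 0.0
    code, feedback = None, None
    for _ in range(max_rounds):
        # Round 1: from-scratch generation (code and feedback are None).
        # Later rounds: feedback-driven optimization of the previous code.
        code = coder(plan, code, feedback)
        verdict = verifier(reference, code)
        if verdict.correct and verdict.speedup > best_speedup:
            best_code, best_speedup = code, verdict.speedup
        feedback = verdict.feedback
    return best_code, best_speedup


# Toy stand-ins for the three agents, only to make the loop executable.
def toy_planner(reference: str) -> str:
    return f"plan for {reference}"


def toy_coder(plan: str, prev: Optional[str], feedback: Optional[str]) -> str:
    return "v1" if prev is None else prev + "+opt"


def toy_verifier(reference: str, code: str) -> Verdict:
    n_opts = code.count("+opt")
    return Verdict(correct=True, speedup=1.0 + 0.25 * n_opts, feedback="fuse more")


best_code, best_speedup = refine("model.py", toy_planner, toy_coder, toy_verifier)
```

In a real setup the stubs would be replaced by LLM calls and a compile, run, and profile step. Keeping the best verified candidate, rather than the last one, guards against a later round regressing.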
Key Findings
StitchCUDA achieves near-perfect correctness on end-to-end (Level 3) tasks (10/10 on the test set in Table 1) and delivers system-level speedups on the evaluated GPUs (1.50× average on H200).
Rubric-based reward substantially reduces reward-hacking compared with plain rule-based RL.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Level 3 correctness | 10/10 | StitchCUDA-G (GPT-5.2 backend) 6/10 | ↑ 4 tasks | KernelBench Level 3 test set | Table 1 (Section 4.1) | Table 1 |
| E2E average speedup (H200, Level 3) | 1.50× | multi-agent no-RL variant (StitchCUDA-Q) 0.24× | ↑ 1.26× | KernelBench Level 3 test set | Section 4.1; Table 1 | Table 1 |
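
The Delta column for the speedup row is an absolute difference between two speedup factors, not a ratio. A short worked check of the arithmetic, using only the values in the table, is shown below. The variable names are illustrative, and the slowdown line assumes speedup is defined as reference runtime divided by generated-code runtime, so a value below 1× means slower than the PyTorch reference.

```python
stitch_speedup = 1.50    # StitchCUDA, H200, Level 3 (from the table)
baseline_speedup = 0.24  # StitchCUDA-Q, multi-agent no-RL variant (from the table)

# Absolute difference, as reported in the Delta column.
absolute_gain = round(stitch_speedup - baseline_speedup, 2)

# Ratio of the two speedups, for comparison.
relative_gain = round(stitch_speedup / baseline_speedup, 2)

# Under the assumed definition, 0.24x means the baseline output runs
# roughly this many times slower than the PyTorch reference.
baseline_slowdown = round(1 / baseline_speedup, 2)

print(absolute_gain, relative_gain, baseline_slowdown)
```

Read this way, the no-RL variant is slower than the reference it is meant to replace, while StitchCUDA is faster.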
What To Try In 7 Days
Run a Planner+Verifier loop on a small model: profile the PyTorch reference with Nsight and extract a simple plan.
Use the multi-agent loop (Planner→Coder→Verifier) to iterate one end-to-end tensor workload and inspect profiler-guided suggestions.
Add a simple rubric or checklist to any RL or local search reward to penalize copying PyTorch-only code and encourage true kernel work.
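A rubric check of the kind suggested in the last item could look like the sketch below. It is an assumption-laden illustration, not the rubric used in the paper: the function name `rubric_reward`, the two regex heuristics, the penalty value, and the speedup cap are all hypothetical. The idea is to gate a plain rule-based reward (correctness plus speedup) on evidence that the candidate defines and launches its own CUDA kernel.

```python
import re

# Hypothetical rubric items; the regexes are illustrative heuristics only.
KERNEL_DEF = re.compile(r"__global__\s+void\s+\w+")   # a custom kernel is defined
KERNEL_LAUNCH = re.compile(r"<<<.*?>>>", re.S)        # and it is actually launched

PYTORCH_ONLY_PENALTY = -1.0  # assumed penalty for code with no real kernel work
SPEEDUP_CAP = 3.0            # assumed cap to limit the payoff of outliers


def rubric_reward(code: str, correct: bool, speedup: float) -> float:
    """Rule-based reward gated by a two-item rubric (sketch)."""
    if not correct:
        return 0.0
    if not (KERNEL_DEF.search(code) and KERNEL_LAUNCH.search(code)):
        # Correct but PyTorch-only: penalize instead of rewarding.
        return PYTORCH_ONLY_PENALTY
    return min(speedup, SPEEDUP_CAP)


pytorch_only = "def forward(x):\n    return torch.relu(x)\n"
custom = (
    "__global__ void relu_kernel(float* x, int n) { /* ... */ }\n"
    "void launch(float* x, int n) { relu_kernel<<<(n + 255) / 256, 256>>>(x, n); }\n"
)

print(rubric_reward(pytorch_only, True, 1.0))  # penalized
print(rubric_reward(custom, True, 1.5))        # rewarded by speedup
```

Text matching like this is easy to game and does not detect hardcoded outputs. A practical rubric would likely combine it with randomized inputs in the correctness check and with profiler evidence that the custom kernel actually ran.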
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Results evaluated on KernelBench with manual fixes; real-world workloads may differ.
Relies on large models (the closed GPT-5.2 and the open-weight Qwen3-32B) and high-end GPUs (H200) for training and measurement.
When Not To Use
For single-kernel microbenchmarks where existing kernel tools already work well.
When you lack access to large LLMs or a multi-GPU training budget.
Failure Modes
Reward hacking: models return PyTorch-only code or hardcoded outputs unless the rubric catches them.
Degenerate conservative edits that leave critical kernels unchanged and yield small speedups.

