Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
12
Why It Matters For Business
RTLLM gives a standardized, automated way to test LLMs on RTL generation, including practical power, performance, and area (PPA) outcomes. A cheap prompt tweak (self-planning) can cut error rates and save engineering time compared with manual fixes.
Summary TLDR
RTLLM is an open benchmark of 30 real digital designs (Verilog reference RTL, natural-language specs, and testbenches). It evaluates generated RTL on three staged goals: syntax, functionality, and design quality (PPA). The paper benchmarks GPT-3.5, GPT-4, and two academic models, and shows that a simple two-step prompt method called self-planning notably improves GPT-3.5 (syntax: 55% → 73%; functionality: 10/30 → 14/30), bringing its results close to GPT-4. GPT-4 still leads in overall correctness and PPA.
Problem Statement
Existing LLM-for-RTL work used small, author-crafted examples and focused only on correctness. That makes fair comparison hard and omits practical design quality (power, timing, area). RTLLM fills this gap with a larger, standardized benchmark and automated PPA evaluation.
Main Contribution
- RTLLM: an open benchmark of 30 diverse digital designs with natural-language specs, testbenches, and human Verilog references.
- A three-stage automatic evaluation pipeline: syntax checking (synthesizability), functionality (testbench pass), and design quality (post-synthesis PPA).
- A simple two-step prompt method, self-planning, that asks the LLM to plan first then generate code; this reduces syntax and functional errors.
- Systematic comparison of GPT-3.5, GPT-4, two academic models, and self-planning variants using the benchmark and Synopsys tools.
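The self-planning idea can be sketched as two chained model calls. This is a minimal illustration, not the paper's implementation: the prompt wording, the `self_plan_generate` name, and the `llm` callable (any function that maps a prompt string to a completion string) are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical prompt wording; the paper's exact prompts are not reproduced here.
PLAN_PROMPT = (
    "You are designing a Verilog module. Before writing any code, list the "
    "ports, internal signals, and the step-by-step logic needed.\n\n"
    "Specification:\n{spec}\n"
)
CODE_PROMPT = (
    "Specification:\n{spec}\n\n"
    "Plan:\n{plan}\n\n"
    "Following the plan, write the complete Verilog module."
)


def self_plan_generate(spec: str, llm: Callable[[str], str]) -> tuple[str, str]:
    """Two-step generation: request a plan first, then request code given that plan."""
    # Step 1: the model reasons about structure without emitting RTL.
    plan = llm(PLAN_PROMPT.format(spec=spec))
    # Step 2: the plan is fed back alongside the original specification.
    code = llm(CODE_PROMPT.format(spec=spec, plan=plan))
    return plan, code
```

Passing the model as a plain callable keeps the sketch independent of any particular client library; wrapping a real API call in a one-argument function is enough to plug it in.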
Key Findings
- GPT-4 produced syntactically valid RTL in 81% of trials and passed functionality on 15 of 30 designs.
- Self-planning raised GPT-3.5 syntax correctness from 55% to 73% and functionality from 10/30 to 14/30.
- Academic open models (Thakur et al., StarCoder) lagged: syntax rates of about 40% and 27% respectively, with 5/30 functional successes each.
- For design quality (PPA), GPT-4 and GPT-3.5+self-planning sometimes outperformed human-crafted references on individual metrics and won the most 'best' counts.
- RTLLM covers larger designs than prior datasets: 30 designs with up to 518 HDL lines and up to 2435 cells in the netlist (e.g., RISC CPU).
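One way to tally 'best' counts across candidates is sketched below. The tie-handling rule, the `count_best` name, and the metric keys are assumptions for illustration; the summary does not say how the paper scores ties. The direction of each metric is standard: area and power are minimized, while WNS (worst negative slack) is better when larger.

```python
from collections import Counter

# For each metric, True means a larger value is better.
# Area and power are minimized; WNS is better when larger (less negative).
HIGHER_IS_BETTER = {"area": False, "power": False, "wns": True}


def count_best(results: dict[str, dict[str, float]]) -> Counter:
    """Count, per candidate, how many metrics it wins.

    `results` maps a candidate name to its metric values for one design.
    Ties credit every tied candidate (an assumption of this sketch).
    """
    wins = Counter({name: 0 for name in results})
    for metric, higher in HIGHER_IS_BETTER.items():
        values = {name: metrics[metric] for name, metrics in results.items()}
        best = max(values.values()) if higher else min(values.values())
        for name, value in values.items():
            if value == best:
                wins[name] += 1
    return wins
```

Summing the returned counters over all designs gives a per-candidate total that can be compared against the human reference.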
Results
Syntax correctness (GPT-4): 81%
Functionality successes (GPT-4): 15/30
GPT-3.5 + self-planning (syntax/functionality): 73% / 14/30
Best-quality counts (PPA wins)
Who Should Care
What To Try In 7 Days
- Clone RTLLM and run a few target designs through your LLM to measure syntax and function rates.
- Apply the self-planning two-step prompt: ask the model to plan then generate code, and compare results.
- Synthesize any passing generated RTL to check area/power/timing before trusting it in a flow.
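The steps above follow the benchmark's staged order: syntax, then functionality, then PPA. A minimal harness for that ordering might look like the sketch below. The three checker callables are hypothetical placeholders for whatever tools you use (for example a synthesis syntax check, a simulator run against the testbench, and a synthesis report parser); they are not part of RTLLM's published interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StageResult:
    syntax_ok: bool
    function_ok: bool
    ppa: Optional[dict]  # filled only when both earlier stages pass


def evaluate_staged(
    rtl: str,
    check_syntax: Callable[[str], bool],
    run_testbench: Callable[[str], bool],
    synthesize: Callable[[str], dict],
) -> StageResult:
    """Run syntax, functionality, then PPA, stopping at the first failing stage."""
    if not check_syntax(rtl):
        return StageResult(False, False, None)
    if not run_testbench(rtl):
        return StageResult(True, False, None)
    # Only RTL that is both synthesizable and functionally correct gets a PPA report.
    return StageResult(True, True, synthesize(rtl))
```

Stopping early avoids spending synthesis time on RTL that would be rejected anyway, and it makes the failing stage explicit in the result.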
Agent Features
Planning
- self-planning (two-step prompt to produce a plan, then code)
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Testbenches sample cases; passing them is not formal correctness for all inputs.
- Benchmark centers on Verilog reference designs; VHDL/Chisel support is claimed but not demonstrated in experiments.
- Synthesis and PPA use specific Design Compiler settings, which affect absolute numbers.
- LLM outputs are stochastic; reported success counts come from five runs per design.
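Because results come from repeated runs, the way runs are aggregated matters. The sketch below computes a syntax rate over all runs and counts a design as functionally successful if at least one run passes. That "at least one passing run" rule is an assumption for illustration; this summary does not state how per-design success is derived from the five runs. The `summarize_trials` name and input shape are likewise hypothetical.

```python
def summarize_trials(trials: dict[str, list[tuple[bool, bool]]]) -> tuple[float, int]:
    """Aggregate repeated runs per design.

    `trials` maps a design name to a list of (syntax_ok, function_ok) pairs,
    one pair per run. Returns:
      - the fraction of all runs with valid syntax, and
      - the number of designs with at least one functionally passing run
        (assumed aggregation rule, see the note above).
    """
    all_runs = [run for runs in trials.values() for run in runs]
    if not all_runs:
        raise ValueError("no trials recorded")
    syntax_rate = sum(syntax_ok for syntax_ok, _ in all_runs) / len(all_runs)
    functional_designs = sum(
        any(function_ok for _, function_ok in runs) for runs in trials.values()
    )
    return syntax_rate, functional_designs
```

Reporting both numbers side by side makes run-to-run variability visible: a design can count as a functional success while most of its individual runs still fail.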
When Not To Use
- When formal verification or exhaustive proof of correctness is required.
- For analog or custom SRAM/memory compiler flows where RTL synthesis is not representative.
- If you need single-shot guarantees for mission-critical hardware designs.
Failure Modes
- Syntax errors that prevent synthesis (common for smaller LLMs).
- Functional bugs that slip past the limited testbenches and only surface on unseen inputs.
- High variability across runs; some designs only succeed in a subset of trials.
- Large/complex modules (e.g., RISC CPU) often remain unsynthesizable or fail functionality.
Core Entities
Models
- GPT-3.5
- GPT-4
- Thakur et al. (CodeGen fine-tuned, 16B)
- StarCoder
Metrics
- syntax correctness
- functionality correctness
- area
- power
- timing (WNS)
Datasets
- RTLLM
Benchmarks
- RTLLM
Context Entities
Models
- CodeGen
Metrics
- PPA comparisons used in synthesis
Datasets
- prior small RTL datasets (Thakur et al., Chip-Chat, Chip-GPT)
Benchmarks
- Chip-Chat
- Chip-GPT

