RTLLM: a 30-design benchmark for generating and evaluating RTL from natural language, plus a 'self-planning' prompt trick

August 10, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

12

Authors

Yao Lu, Shang Liu, Qijun Zhang, Zhiyao Xie

Why It Matters For Business

RTLLM gives a standardized, automated way to test LLMs for RTL generation including practical PPA outcomes; a cheap prompt tweak (self-planning) can cut error rates and save engineering time compared with manual fixes.

Summary TLDR

RTLLM is an open benchmark of 30 real digital designs, each with a natural-language spec, a testbench, and a human-written Verilog reference. It evaluates generated RTL against three staged goals: syntax, functionality, and design quality (PPA). The paper benchmarks GPT-3.5, GPT-4, and two academic models, and shows that a simple two-step prompt method called self-planning notably improves GPT-3.5 (syntax: 55% → 73%; functionality: 10/30 → 14/30), bringing it close to GPT-4 (syntax 81%; functionality 15/30). GPT-4 still leads in overall correctness and PPA.

Problem Statement

Existing LLM-for-RTL work used small, author-crafted examples and focused only on correctness. That makes fair comparison hard and omits practical design quality (power, timing, area). RTLLM fills this gap with a larger, standardized benchmark and automated PPA evaluation.

Main Contribution

RTLLM: an open benchmark of 30 diverse digital designs with natural-language specs, testbenches, and human Verilog references.

A three-stage automatic evaluation pipeline: syntax checking (synthesizability), functionality (testbench pass), and design quality (post-synthesis PPA).

A simple two-step prompt method, self-planning, that asks the LLM to plan first then generate code; this reduces syntax and functional errors.

Systematic comparison of GPT-3.5, GPT-4, two academic models, and self-planning variants using the benchmark and Synopsys tools.
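
The three-stage evaluation can be pictured as a short-circuiting pipeline in which each stage gates the next. The sketch below is a minimal Python illustration under that assumption, not the benchmark's actual harness. The checker callables (`check_syntax`, `check_function`, `measure_ppa`) and the `StagedResult` type are hypothetical stand-ins for the synthesis tool, the testbench simulation, and the PPA report parser.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StagedResult:
    syntax_ok: bool
    function_ok: bool
    ppa: Optional[dict]  # e.g. {"area": ..., "power": ..., "wns": ...}


def evaluate_staged(
    rtl: str,
    check_syntax: Callable[[str], bool],
    check_function: Callable[[str], bool],
    measure_ppa: Callable[[str], dict],
) -> StagedResult:
    """Run syntax -> functionality -> PPA, stopping at the first failure."""
    if not check_syntax(rtl):
        return StagedResult(False, False, None)
    if not check_function(rtl):
        return StagedResult(True, False, None)
    return StagedResult(True, True, measure_ppa(rtl))


# Toy stand-ins (hypothetical) so the sketch runs without any EDA tools.
def toy_syntax(rtl):
    return "module" in rtl and "endmodule" in rtl


def toy_function(rtl):
    return "assign" in rtl


def toy_ppa(rtl):
    return {"area": float(len(rtl)), "power": 0.0, "wns": 0.0}


good = evaluate_staged(
    "module m; assign y = a; endmodule", toy_syntax, toy_function, toy_ppa
)
bad = evaluate_staged("modul m;", toy_syntax, toy_function, toy_ppa)
```

In a real flow the three callables would wrap the synthesis run, the testbench simulation, and the report parsing; keeping them injectable makes it easy to swap tools without touching the staging logic.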

Key Findings

GPT-4 produced syntactically valid RTL in 81% of trials and passed functionality on 15 of 30 designs.

Numbers: syntax 81%; functionality 15/30

Self-planning raised GPT-3.5 syntax from 55% to 73% and functionality from 10/30 to 14/30.

Numbers: syntax 55% → 73%; functionality 10/30 → 14/30

Academic open models (Thakur et al. and StarCoder) lagged, with syntax rates of about 40% and 27% respectively and 5/30 functional successes each.

Numbers: Thakur et al. syntax 40%; StarCoder syntax 27%; functionality 5/30 each

For design quality (PPA), GPT-4 and GPT-3.5+self-planning sometimes beat the human-written references on individual metrics, and they recorded the highest counts of 'best' results.

Numbers: GPT-4 had the highest best-quality counts; GPT-3.5+SP ranked second (Area 5, Power 7, Timing 5)

RTLLM covers larger designs than prior datasets: 30 designs with up to 518 lines of HDL and up to 2435 cells in the netlist (e.g., the RISC CPU).

Numbers: 30 designs; max HDL lines 518; max cells 2435
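
A 'best' count of the kind quoted for PPA can be tallied by checking, for each design and metric, which candidate has the best value. The sketch below is one way to do that, not the paper's procedure. The designs, candidate names, and numbers are invented for illustration, and it assumes lower is better for area and power. Timing is left out because its direction depends on how the slack is reported.

```python
from collections import Counter

# Hypothetical post-synthesis results: design -> candidate -> metrics.
# These values are made up for illustration; they are NOT from the paper.
results = {
    "adder": {
        "reference": {"area": 120.0, "power": 9.0},
        "model_a": {"area": 110.0, "power": 9.5},
        "model_b": {"area": 130.0, "power": 8.0},
    },
    "counter": {
        "reference": {"area": 60.0, "power": 4.0},
        "model_a": {"area": 58.0, "power": 3.9},
        # model_b failed an earlier stage, so it has no PPA entry here.
    },
}


def best_counts(results, metrics=("area", "power")):
    """Count, per metric, how often each candidate has the lowest value."""
    counts = {m: Counter() for m in metrics}
    for per_candidate in results.values():
        for m in metrics:
            # Ties go to the first candidate listed.
            winner = min(per_candidate, key=lambda c: per_candidate[c][m])
            counts[m][winner] += 1
    return counts


counts = best_counts(results)
```

Candidates that never reach synthesis simply have no entry for a design, so a model with few functional successes has fewer chances to win.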

Results

Syntax correctness (GPT-4)

Value: 81%

Baseline: GPT-3.5 55%

Functionality successes (GPT-4)

Value: 15/30 designs

Baseline: GPT-3.5 10/30

GPT-3.5 + self-planning (syntax/functionality)

Value: 73% / 14/30

Baseline: GPT-3.5 55% / 10/30

Best-quality counts (PPA wins)

Value: GPT-4 top; GPT-3.5+SP second

Baseline: designer references included
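
The headline numbers are easier to compare once the functionality counts are put on the same percentage scale as the syntax rates. The short calculation below uses only the figures quoted above; the helper name `pct` is just for illustration.

```python
def pct(passed, total=30):
    """Convert a pass count over `total` designs to a percentage."""
    return 100.0 * passed / total


syntax = {"GPT-3.5": 55, "GPT-3.5+SP": 73, "GPT-4": 81}      # percent
functional = {"GPT-3.5": 10, "GPT-3.5+SP": 14, "GPT-4": 15}  # designs out of 30

# Gains from self-planning, in percentage points.
syntax_gain = syntax["GPT-3.5+SP"] - syntax["GPT-3.5"]
func_gain = pct(functional["GPT-3.5+SP"]) - pct(functional["GPT-3.5"])

# Remaining functionality gap to GPT-4, in designs.
gap_to_gpt4 = functional["GPT-4"] - functional["GPT-3.5+SP"]

print(f"syntax gain: {syntax_gain} points")
print(f"functionality gain: {func_gain:.1f} points")
print(f"functionality gap to GPT-4: {gap_to_gpt4} design(s)")
```

Self-planning is worth 18 points on syntax and roughly 13 points on functionality, which leaves GPT-3.5 with self-planning one design short of GPT-4.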

What To Try In 7 Days

Clone RTLLM and run a few target designs through your LLM to measure syntax and function rates.

Apply the self-planning two-step prompt: ask the model to plan then generate code, and compare results.

Synthesize any passing generated RTL to check area/power/timing before trusting it in a flow.
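
A two-step self-planning call could look like the sketch below. The prompt wording, the function name `self_planning_generate`, and the `llm` callable are assumptions for illustration, not the paper's exact prompts or any particular vendor API. Here `llm` is any function that takes a prompt string and returns text.

```python
from typing import Callable


def self_planning_generate(spec: str, llm: Callable[[str], str]) -> dict:
    """Two-step prompting: ask for a plan first, then for code that follows it."""
    # Step 1: planning only, no code. (Prompt text is a hypothetical example.)
    plan_prompt = (
        "You are designing a Verilog module from the specification below.\n"
        "Do not write code yet. List the implementation steps and the "
        "syntax pitfalls to avoid.\n\n"
        f"Specification:\n{spec}"
    )
    plan = llm(plan_prompt)

    # Step 2: code generation, with the plan fed back in as context.
    code_prompt = (
        "Write the complete Verilog module for the specification below, "
        "following the plan.\n\n"
        f"Specification:\n{spec}\n\nPlan:\n{plan}"
    )
    rtl = llm(code_prompt)
    return {"plan": plan, "rtl": rtl}


# Stub model so the sketch runs offline; replace with a real client call.
def fake_llm(prompt: str) -> str:
    if "Do not write code yet" in prompt:
        return "1. Declare ports. 2. Add the two inputs."
    return (
        "module adder(input a, input b, output [1:0] s); "
        "assign s = a + b; endmodule"
    )


out = self_planning_generate("A 1-bit adder with a 2-bit sum output.", fake_llm)
```

Running the same specs with and without the planning step, and scoring both sets with the same checks, gives a like-for-like comparison.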

Agent Features

Planning

  • self-planning (two-step prompt to produce a plan, then code)

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Testbenches exercise only sampled cases; passing them does not prove correctness for all inputs.
  • Benchmark centers on Verilog reference designs; VHDL/Chisel support is claimed but not demonstrated in experiments.
  • Synthesis and PPA use specific Design Compiler settings, which affect absolute numbers.
  • LLM outputs are stochastic; reported success counts come from five runs per design.
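
Because each design is sampled several times, per-trial outcomes have to be aggregated before they can be reported. The sketch below shows one plausible convention, which is an assumption here rather than something this summary states: syntax is reported as the fraction of all trials that pass, and a design counts as functionally successful if at least one trial passes. The trial data is invented.

```python
# Hypothetical outcomes: design -> list of (syntax_ok, function_ok) per trial.
# Invented for illustration; five trials per design, as in the summary.
trials = {
    "adder": [(True, True), (True, False), (False, False), (True, True), (True, True)],
    "fsm": [(True, False), (False, False), (True, False), (True, False), (False, False)],
    "counter": [(False, False), (True, True), (False, False), (False, False), (True, False)],
}


def aggregate(trials):
    """Return (syntax pass rate over all trials, designs with >= 1 functional pass)."""
    all_runs = [run for runs in trials.values() for run in runs]
    syntax_rate = sum(s for s, _ in all_runs) / len(all_runs)
    functional_designs = sum(any(f for _, f in runs) for runs in trials.values())
    return syntax_rate, functional_designs


syntax_rate, functional_designs = aggregate(trials)
```

Requiring every trial to pass instead of any trial would give a lower functionality count on the same data, so it matters which convention a report uses.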

When Not To Use

  • When formal verification or exhaustive proof of correctness is required.
  • For analog or custom SRAM/memory compiler flows where RTL synthesis is not representative.
  • If you need single-shot guarantees for mission-critical hardware designs.

Failure Modes

  • Syntax errors that prevent synthesis (common for smaller LLMs).
  • Functional bugs that slip past the limited testbenches and only surface on unseen inputs.
  • High variability across runs; some designs only succeed in a subset of trials.
  • Large/complex modules (e.g., RISC CPU) often remain unsynthesizable or fail functionality.

Core Entities

Models

  • GPT-3.5
  • GPT-4
  • Thakur et al. (CodeGen fine-tuned, 16B)
  • StarCoder

Metrics

  • syntax correctness
  • functionality correctness
  • area
  • power
  • timing (WNS)

Datasets

  • RTLLM

Benchmarks

  • RTLLM

Context Entities

Models

  • CodeGen

Metrics

  • PPA comparisons used in synthesis

Datasets

  • prior small RTL datasets (Thakur et al., Chip-Chat, Chip-GPT)

Benchmarks

  • Chip-Chat
  • Chip-GPT