RTLLM: a 30-design benchmark for generating and evaluating RTL from natural language, plus a 'self-planning' prompt trick

August 10, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark and experiments give concrete model comparisons and PPA (power, performance, area) numbers, but the results are limited to Verilog, specific synthesis settings, and 30 designs; expect variability across LLM runs.

Citations: 12

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Yao Lu, Shang Liu, Qijun Zhang, Zhiyao Xie

Links

Abstract / PDF / Code / Data

Why It Matters For Business

RTLLM gives a standardized, automated way to test LLMs for RTL generation including practical PPA outcomes; a cheap prompt tweak (self-planning) can cut error rates and save engineering time compared with manual fixes.

Summary TLDR

RTLLM is an open benchmark of 30 real digital designs, each with Verilog reference RTL, a natural-language spec, and a testbench. It evaluates generated RTL against three staged goals: syntax, functionality, and design quality (PPA). The paper benchmarks GPT-3.5, GPT-4, and two academic models, and shows that a simple two-step prompt method called self-planning notably improves GPT-3.5 (syntax: 55% → 73%; functionality: 10/30 → 14/30), bringing its results close to GPT-4. GPT-4 still leads in overall correctness and PPA.

Problem Statement

Existing LLM-for-RTL work used small, author-crafted examples and focused only on correctness. That makes fair comparison hard and omits practical design quality (power, timing, area). RTLLM fills this gap with a larger, standardized benchmark and automated PPA evaluation.

Main Contribution

RTLLM: an open benchmark of 30 diverse digital designs with natural-language specs, testbenches, and human-written Verilog references.

A three-stage automatic evaluation pipeline: syntax checking (synthesizability), functionality (testbench pass), and design quality (post-synthesis PPA).
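The staged structure can be sketched in a few lines of Python, assuming each stage gates the next (a design that fails synthesis is not simulated, and PPA is only collected for designs that pass the testbench). The names `evaluate`, `StageResult`, and the three callables are hypothetical; in a real flow they would wrap whatever synthesis and simulation tools you use, which this summary does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StageResult:
    syntax_ok: bool
    function_ok: bool = False
    ppa: Optional[dict] = None  # e.g. {"area": ..., "power": ..., "wns": ...}


def evaluate(
    rtl: str,
    check_syntax: Callable[[str], bool],   # hypothetical: True if the RTL synthesizes
    run_testbench: Callable[[str], bool],  # hypothetical: True if the testbench passes
    measure_ppa: Callable[[str], dict],    # hypothetical: post-synthesis PPA report
) -> StageResult:
    """Run the three stages in order, stopping at the first failure."""
    if not check_syntax(rtl):
        return StageResult(syntax_ok=False)
    if not run_testbench(rtl):
        return StageResult(syntax_ok=True, function_ok=False)
    return StageResult(syntax_ok=True, function_ok=True, ppa=measure_ppa(rtl))


# Stub tools so the sketch runs without any EDA software installed.
demo = evaluate(
    "module m; endmodule",
    check_syntax=lambda rtl: "endmodule" in rtl,
    run_testbench=lambda rtl: True,
    measure_ppa=lambda rtl: {"area": 1.0, "power": 1.0, "wns": 0.0},
)
print(demo)
```

Passing the tools in as callables keeps the gating logic independent of any particular synthesizer or simulator.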

Key Findings

GPT-4 produced syntactically valid RTL in 81% of trials and passed functionality on 15 of 30 designs.

Numbers: syntax 81%; functionality 15/30

Practical Use: Use GPT-4 for the highest out-of-the-box chance of usable RTL; expect about half of the designs to be functionally correct on this benchmark.

Evidence Ref: Table III, Section V-B

Self-planning raised GPT-3.5 syntax from 55% to 73% and functionality from 10/30 to 14/30.

Numbers: syntax 55% → 73%; functionality 10/30 → 14/30

Practical Use: If you rely on the cheaper GPT-3.5, add the two-step self-planning prompt to get near-GPT-4 performance without model fine-tuning.

Evidence Ref: Table III, Section V-B
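The two headline metrics use different denominators: syntax is a rate over individual trials, while functionality is a count over designs. The sketch below shows one way to aggregate per-trial outcomes into both numbers. It assumes a design counts as functionally correct if at least one of its trials passes; this summary does not state the exact rule, so treat that as an assumption. The `trials` data is made up for illustration.

```python
from typing import Dict, List, Tuple

# design name -> list of (syntax_ok, function_ok) per trial; illustrative data only
Trials = Dict[str, List[Tuple[bool, bool]]]


def syntax_rate(trials: Trials) -> float:
    """Fraction of all trials, across all designs, that pass the syntax check."""
    outcomes = [syn for runs in trials.values() for syn, _ in runs]
    return sum(outcomes) / len(outcomes)


def functional_designs(trials: Trials) -> int:
    """Number of designs with at least one functionally correct trial (assumed rule)."""
    return sum(any(fn for _, fn in runs) for runs in trials.values())


trials: Trials = {
    "adder": [(True, True), (True, False)],
    "fsm": [(True, False), (False, False)],
}
print(f"syntax: {syntax_rate(trials):.0%}")                          # syntax: 75%
print(f"functionality: {functional_designs(trials)}/{len(trials)}")  # functionality: 1/2
```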

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Syntax correctness (GPT-4) | 81% | GPT-3.5 55% | +26pp | RTLLM (30 designs, 5 runs each) | Table III | Section V-B |
| Functionality successes (GPT-4) | 15/30 designs | GPT-3.5 10/30 | +5 designs | RTLLM | Table III | Section V-B |

What To Try In 7 Days

Clone RTLLM and run a few target designs through your LLM to measure syntax and function rates.

Apply the self-planning two-step prompt: ask the model to plan then generate code, and compare results.

Synthesize any passing generated RTL to check area/power/timing before trusting it in a flow.
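For the last step, a generated design is most usefully judged relative to the human reference rather than in absolute terms. A rough sketch of such a comparison follows, using made-up numbers and a hypothetical `Ppa` record. Lower area and power are better; for worst negative slack (WNS), a higher value is better and a negative value means timing is violated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ppa:
    area: float   # e.g. um^2
    power: float  # e.g. uW
    wns: float    # worst negative slack in ns; negative means timing is violated


def compare(generated: Ppa, reference: Ppa) -> dict:
    """Relative area/power change and absolute slack change versus the reference."""
    return {
        "area_change": (generated.area - reference.area) / reference.area,
        "power_change": (generated.power - reference.power) / reference.power,
        "wns_change_ns": generated.wns - reference.wns,
        "meets_timing": generated.wns >= 0.0,
    }


reference = Ppa(area=200.0, power=50.0, wns=0.10)   # illustrative values only
generated = Ppa(area=250.0, power=45.0, wns=-0.05)  # illustrative values only
report = compare(generated, reference)
print(report)
```

In this made-up case the generated design is 25% larger, uses 10% less power, and misses timing, which is the kind of trade-off a pass/fail testbench result would hide.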

Agent Features

Planning
self-planning (two-step prompt to produce a plan, then code)
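A minimal sketch of the two-step idea, assuming the model is reachable through a plain `llm(prompt) -> str` callable. The prompt wording, the template names, and `self_plan` are illustrative and not taken from the paper; the stub model exists only so the example runs offline.

```python
from typing import Callable

PLAN_TEMPLATE = (
    "You will implement the following hardware design in Verilog.\n"
    "Before writing any code, list the steps you will take and the "
    "mistakes you need to avoid.\n\nSpecification:\n{spec}"
)
CODE_TEMPLATE = (
    "Specification:\n{spec}\n\nPlan:\n{plan}\n\n"
    "Following the plan above, write the complete Verilog module."
)


def self_plan(spec: str, llm: Callable[[str], str]) -> str:
    """Two calls: first ask for a plan, then ask for code conditioned on it."""
    plan = llm(PLAN_TEMPLATE.format(spec=spec))
    return llm(CODE_TEMPLATE.format(spec=spec, plan=plan))


# Stub model that records prompts, so the sketch runs without an API key.
calls = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return "1. declare ports" if len(calls) == 1 else "module counter; endmodule"


rtl = self_plan("A 4-bit up counter with synchronous reset.", fake_llm)
print(rtl)
```

To compare against a single-step baseline, call `llm` once with the spec alone and run both outputs through the same syntax and testbench checks.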

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Testbenches exercise only sampled cases; passing them does not prove correctness for all inputs.

Benchmark centers on Verilog reference designs; VHDL/Chisel support is claimed but not demonstrated in experiments.

When Not To Use

When formal verification or exhaustive proof of correctness is required.

For analog or custom SRAM/memory compiler flows where RTL synthesis is not representative.

Failure Modes

Syntax errors that prevent synthesis (common for smaller LLMs).

Functional bugs that slip past the limited testbenches and only show up on unseen inputs.
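The second failure mode is easy to reproduce in miniature. The sketch below models a hypothetical 4-bit adder whose carry-out is wrong only when both operands are 15: a handful of sampled vectors pass, while an exhaustive sweep exposes the bug. It is a Python stand-in for illustration, not RTL from the benchmark.

```python
def golden_add4(a: int, b: int) -> tuple:
    """Reference 4-bit adder: returns (sum, carry_out)."""
    total = a + b
    return total & 0xF, total >> 4


def buggy_add4(a: int, b: int) -> tuple:
    """Same adder with an injected bug: carry is dropped when a == b == 15."""
    s, c = golden_add4(a, b)
    return (s, 0) if (a == 15 and b == 15) else (s, c)


def passes(dut, vectors) -> bool:
    """True if the design under test matches the reference on every vector."""
    return all(dut(a, b) == golden_add4(a, b) for a, b in vectors)


sampled = [(0, 0), (1, 2), (7, 8), (15, 1), (9, 9)]
exhaustive = [(a, b) for a in range(16) for b in range(16)]

print(passes(buggy_add4, sampled))     # True
print(passes(buggy_add4, exhaustive))  # False
```

An exhaustive sweep is only feasible here because the input space has 256 points; for realistic designs this gap is what formal verification is meant to close.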

Core Entities

Models

GPT-3.5, GPT-4, Thakur et al. (CodeGen fine-tuned, 16B), StarCoder

Metrics

syntax correctness, functionality correctness, area, power, timing (WNS)

Datasets

RTLLM

Benchmarks

RTLLM

Context Entities

Models

CodeGen

Metrics

PPA comparisons used in synthesis

Datasets

prior small RTL datasets (Thakur et al., Chip-Chat, Chip-GPT)

Benchmarks

Chip-Chat, Chip-GPT