RTLLM: a 30-design benchmark for generating and evaluating RTL from natural language, plus a 'self-planning' prompt trick

Overview

Decision Snapshot: Needs Validation

The benchmark and experiments give concrete comparisons and PPA numbers, but results are limited to Verilog, specific synthesis settings, and 30 designs; expect variability across LLM runs.

Citations: 12

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Yao Lu, Shang Liu, Qijun Zhang, Zhiyao Xie

Links

Abstract / PDF / Code / Data

Why It Matters For Business

RTLLM gives a standardized, automated way to test LLMs for RTL generation including practical PPA outcomes; a cheap prompt tweak (self-planning) can cut error rates and save engineering time compared with manual fixes.

Who Should Care

CTO, Engineering Lead, ML Engineer, Product Manager, Data Scientist

Summary TLDR

RTLLM is an open benchmark of 30 real digital designs, each with Verilog reference RTL, a natural-language spec, and a testbench. It evaluates generated RTL on three staged goals: syntax, functionality, and design quality (PPA). The paper benchmarks GPT-3.5, GPT-4, and two academic models, and shows that a simple two-step prompt method called self-planning notably improves GPT-3.5 (syntax: 55% → 73%; functionality: 10/30 → 14/30), bringing its results close to GPT-4. GPT-4 still leads in overall correctness and PPA.

Problem Statement

Existing LLM-for-RTL work used small, author-crafted examples and focused only on correctness. That makes fair comparison hard and omits practical design quality (power, timing, area). RTLLM fills this gap with a larger, standardized benchmark and automated PPA evaluation.

Main Contribution

RTLLM: an open benchmark of 30 diverse digital designs with natural-language specs, testbenches, and human Verilog references.

A three-stage automatic evaluation pipeline: syntax checking (synthesizability), functionality (testbench pass), and design quality (post-synthesis PPA).
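The staged pipeline can be sketched as a short function. This is a minimal illustration, not the benchmark's own scripts: the three checker callables (`synthesizes`, `passes_testbench`, `measure_ppa`) are hypothetical stand-ins for whatever synthesis and simulation tools you run, and the sketch assumes each stage gates the next, which the word "staged" suggests but the summary does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StageResult:
    syntax_ok: bool
    function_ok: bool
    ppa: Optional[dict]  # None when the PPA stage was not reached


def evaluate(
    rtl: str,
    synthesizes: Callable[[str], bool],       # hypothetical: wraps your synthesis tool
    passes_testbench: Callable[[str], bool],  # hypothetical: wraps your simulator
    measure_ppa: Callable[[str], dict],       # hypothetical: parses synthesis reports
) -> StageResult:
    """Run syntax, functionality, and PPA checks in order, stopping at the first failure."""
    if not synthesizes(rtl):
        return StageResult(syntax_ok=False, function_ok=False, ppa=None)
    if not passes_testbench(rtl):
        return StageResult(syntax_ok=True, function_ok=False, ppa=None)
    return StageResult(syntax_ok=True, function_ok=True, ppa=measure_ppa(rtl))


# Usage with stub checkers: the design "synthesizes" but fails its testbench.
result = evaluate(
    "module m; endmodule",
    synthesizes=lambda src: "endmodule" in src,
    passes_testbench=lambda src: False,
    measure_ppa=lambda src: {},
)
print(result)
```

Passing the checkers in as callables keeps the sketch independent of any particular EDA tool or command line.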

Key Findings

GPT-4 produced syntactically valid RTL in 81% of trials and passed functionality on 15 of 30 designs.

Numbers: syntax 81%; functionality 15/30

Practical Use: Use GPT-4 for the highest out-of-the-box chance of usable RTL; expect about half of designs to be functionally correct on this benchmark.

Evidence Ref: Table III, Section V-B

Self-planning raised GPT-3.5 syntax from 55% to 73% and functionality from 10/30 to 14/30.

Numbers: syntax 55%→73%; functionality 10/30→14/30

Practical Use: If you rely on cheaper GPT-3.5, add the two-step self-planning prompt to get near-GPT-4 performance without model fine-tuning.

Evidence Ref: Table III, Section V-B

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Syntax correctness (GPT-4)	81%	GPT-3.5 55%	+26pp	RTLLM (30 designs, 5 runs each)	Table III	Section V-B
Functionality successes (GPT-4)	15/30 designs	GPT-3.5 10/30	+5 designs	RTLLM	Table III	Section V-B
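The Delta column reports absolute differences (percentage points and design counts), which are easy to misread as relative gains. The short sketch below reproduces the two deltas from the table and shows the corresponding relative improvement; the helper names are illustrative only.

```python
def pp_delta(new_pct: float, base_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return new_pct - base_pct


def relative_gain_pct(new: float, base: float) -> float:
    """Improvement relative to the baseline, in percent."""
    return 100.0 * (new - base) / base


print(pp_delta(81, 55))                     # 26   (GPT-4 vs GPT-3.5 syntax, in pp)
print(round(relative_gain_pct(81, 55), 1))  # 47.3 (same gap, relative to GPT-3.5)
print(15 - 10)                              # 5    (additional functional designs)
print(relative_gain_pct(15, 10))            # 50.0 (same gap, relative to GPT-3.5)
```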

What To Try In 7 Days

Clone RTLLM and run a few target designs through your LLM to measure syntax and function rates.

Apply the self-planning two-step prompt: ask the model to plan then generate code, and compare results.

Synthesize any passing generated RTL to check area/power/timing before trusting it in a flow.
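When measuring syntax and function rates over repeated runs, a small tally helper keeps the bookkeeping consistent. The sketch below assumes syntax is scored over all runs and that a design counts as functional if at least one of its runs passes the testbench; that counting rule is an assumption here, so check it against the paper before comparing your numbers with Table III. The sample data is made up for illustration.

```python
def summarize(runs: dict[str, list[tuple[bool, bool]]]) -> tuple[float, int]:
    """Aggregate per-run results.

    `runs` maps a design name to a list of (syntax_ok, function_ok) pairs,
    one pair per LLM run. Returns the syntax pass rate over all runs (percent)
    and the number of designs with at least one functionally correct run.
    """
    total_runs = sum(len(results) for results in runs.values())
    if total_runs == 0:
        raise ValueError("no runs to summarize")
    syntax_hits = sum(syntax_ok for results in runs.values() for syntax_ok, _ in results)
    functional_designs = sum(
        any(function_ok for _, function_ok in results) for results in runs.values()
    )
    return 100.0 * syntax_hits / total_runs, functional_designs


# Made-up example: two designs, two runs each.
sample = {
    "adder": [(True, True), (True, False)],
    "fsm": [(False, False), (True, False)],
}
print(summarize(sample))  # (75.0, 1)
```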

Agent Features

Planning

self-planning (two-step prompt to produce a plan, then code)
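A minimal sketch of the two-step flow follows. The prompt wording is an assumption written for this example, not the paper's prompts, and `complete` is a hypothetical callable that sends one prompt to whatever LLM client you use and returns its text reply.

```python
from typing import Callable

# Illustrative prompt templates; the paper's exact wording may differ.
PLAN_PROMPT = (
    "Read the design description below. Do not write code yet. "
    "List the steps you would take to implement it in Verilog.\n\n"
    "Description:\n{spec}"
)
CODE_PROMPT = (
    "Description:\n{spec}\n\n"
    "Plan:\n{plan}\n\n"
    "Following the plan, write the complete Verilog module."
)


def self_plan_generate(spec: str, complete: Callable[[str], str]) -> str:
    """Step 1: ask for a plan. Step 2: ask for code, with the plan included in the prompt."""
    plan = complete(PLAN_PROMPT.format(spec=spec))
    return complete(CODE_PROMPT.format(spec=spec, plan=plan))


# Usage with a fake model that records the prompts it receives.
calls: list[str] = []


def fake_complete(prompt: str) -> str:
    calls.append(prompt)
    return "1. declare ports" if len(calls) == 1 else "module counter; endmodule"


rtl = self_plan_generate("2-bit counter", fake_complete)
print(rtl)
```

To compare against direct generation, run the same spec through a single prompt and score both outputs with the same syntax and testbench checks.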

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/hkust-zhiyao/RTLLM

Data URLs

https://github.com/hkust-zhiyao/RTLLM

Risks & Boundaries

Limitations

Testbenches only sample input cases; passing them does not establish correctness for all inputs.

Benchmark centers on Verilog reference designs; VHDL/Chisel support is claimed but not demonstrated in experiments.

When Not To Use

When formal verification or exhaustive proof of correctness is required.

For analog or custom SRAM/memory compiler flows where RTL synthesis is not representative.

Failure Modes

Syntax errors that prevent synthesis (common for smaller LLMs).

Functional bugs that slip past the limited testbenches and only surface on unseen inputs.

Core Entities

Models

GPT-3.5

GPT-4

Thakur et al. (CodeGen fine-tuned, 16B)

StarCoder

Metrics

syntax correctness

functionality correctness

area

power

timing (WNS)

Datasets

RTLLM

Benchmarks

RTLLM

Context Entities

Models

CodeGen

Metrics

PPA comparisons used in synthesis

Datasets

prior small RTL datasets (Thakur et al., Chip-Chat, Chip-GPT)

Benchmarks

Chip-Chat

Chip-GPT

