VerilogEval: an automated sandbox and 156-problem benchmark to test LLMs on Verilog code and to study synthetic fine-tuning

Overview

Decision Snapshot: Needs Validation

The benchmark and SFT pipeline are useful for small and educational Verilog tasks, but they do not evaluate synthesizability, module instantiation, or PPA trade-offs, limiting production use.

Citations: 16

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 60%

Authors

Mingjie Liu, Nathaniel Pinckney, Brucek Khailany, Haoxing Ren

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Automated functional tests let teams measure whether LLMs actually produce correct Verilog behavior; synthetic SFT can raise single-shot correctness and make open models competitive with closed APIs for small HDL tasks.

Who Should Care

ML Engineer, Product Manager, Engineering Lead, Founder

Summary TLDR

This paper introduces VerilogEval, an open-source benchmark and sandbox to test LLMs on Verilog HDL tasks. It provides 156 human-curated problems, an automated simulator-based pass@k correctness test, and 8,502 synthetic problem-code pairs used for supervised fine-tuning (SFT). SFT on synthetic data improves pass@1 for several CodeGen variants and can match GPT-3.5 on this benchmark, but it can also overfit and reduce sampling diversity. The repo and tools are released for reproducible evaluation.
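The pass@k metric mentioned above can be sketched as follows. This is a minimal illustration that assumes the unbiased estimator popularized with HumanEval (sample n completions per problem, count the c that pass, then estimate the chance that at least one of k draws passes); the summary here does not spell out the exact formula, and the function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Assumed formula: 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimate reduces to the fraction of passing samples, c / n, which is why pass@1 reads as single-shot correctness.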

Problem Statement

There is no standard, automated benchmark that tests LLMs on Verilog (hardware) code generation using functional simulation. Verilog tasks often include images or diagrams, which text-only LLMs cannot read. The paper builds a text-only, automatically testable benchmark and studies whether LLM-generated synthetic training pairs can improve Verilog code generation.

Main Contribution

VerilogEval: an open-source, text-only Verilog benchmark with 156 human-curated problems and a machine-generated subset.

A sandboxed, automated testing pipeline that uses Icarus Verilog to check functional correctness by simulation and reports pass@k metrics.
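A simulation-based check of this kind could look roughly like the sketch below. It is not the repository's harness: the file names, the helper names, and the assumption that the testbench prints a line such as `Mismatches: 0 in 100 samples` are all illustrative. It also provides only a temporary directory and a timeout, not real isolation. The `iverilog` and `vvp` commands must be installed for `run_candidate` to work.

```python
import re
import subprocess
import tempfile
from pathlib import Path

# Assumed testbench output format; adjust to whatever your testbench prints.
MISMATCH_RE = re.compile(r"Mismatches:\s*(\d+)")


def compile_cmd(sources, out_path):
    """Build the Icarus Verilog compile command for the given source files."""
    return ["iverilog", "-g2012", "-o", str(out_path), *map(str, sources)]


def passed(sim_stdout: str) -> bool:
    """True only if the simulator output reports zero mismatches."""
    match = MISMATCH_RE.search(sim_stdout)
    return match is not None and int(match.group(1)) == 0


def run_candidate(candidate_src: str, testbench_src: str,
                  timeout_s: float = 30.0) -> bool:
    """Compile and simulate one candidate; raises FileNotFoundError if
    iverilog or vvp is not on PATH."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        dut, tb, out = work / "dut.sv", work / "tb.sv", work / "sim.vvp"
        dut.write_text(candidate_src)
        tb.write_text(testbench_src)
        try:
            build = subprocess.run(compile_cmd([tb, dut], out),
                                   capture_output=True, text=True,
                                   timeout=timeout_s)
            if build.returncode != 0:
                return False  # syntax or elaboration error counts as a fail
            sim = subprocess.run(["vvp", str(out)], capture_output=True,
                                 text=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return False  # hung compile or simulation counts as a fail
        return sim.returncode == 0 and passed(sim.stdout)
```

Treating compile errors and timeouts as failures keeps the pass/fail signal binary, which is what a pass@k computation needs.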

Key Findings

Synthetic SFT boosts single-sample correctness (pass@1) on Verilog tasks.

Numbers: codegen-2B-verilog pass@1 20.1% -> 35.9% after SFT

Practical Use: Fine-tune code models with verified problem–code pairs to raise single-shot success on small Verilog tasks by roughly 1.8x (20.1% to 35.9% here).

Evidence Ref: Table IV

GPT models outperform base codegen variants on many tasks, but SFT can match GPT-3.5.

Numbers: gpt-3.5 pass@1 (machine) 46.7%; codegen-16B-verilog-sft ~46.2%

Practical Use: If you need GPT-level Verilog snippets, SFT on domain pairs can make large open models competitive.

Evidence Ref: Table II

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
pass@1 (VerilogEval-machine)	gpt-3.5 46.7%; gpt-4 60.0%; codegen-16B-verilog-sft 46.2%	—	—	VerilogEval-machine	Table II	Table II
pass@1 (codegen-2B-verilog)	35.9% after SFT	20.1%	+15.8 pp	VerilogEval-machine	Table IV	Table IV
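The delta column is an absolute difference in percentage points, not a relative gain. The short sketch below recomputes both from the figures in the table; the helper names are hypothetical.

```python
def delta_pp(before_pct: float, after_pct: float) -> float:
    """Absolute change in percentage points, rounded to one decimal."""
    return round(after_pct - before_pct, 1)


def relative_gain(before_pct: float, after_pct: float) -> float:
    """Ratio of the new score to the old score."""
    return after_pct / before_pct


# codegen-2B-verilog, pass@1 before and after SFT
print(delta_pp(20.1, 35.9))                 # 15.8
print(round(relative_gain(20.1, 35.9), 2))  # 1.79

# codegen-16B-verilog-sft versus gpt-3.5, pass@1
print(delta_pp(46.7, 46.2))                 # -0.5
```

Read this way, SFT gives the 2B model a gain of about 1.8x, and the 16B SFT model trails gpt-3.5 by half a percentage point.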

What To Try In 7 Days

Run VerilogEval on your preferred LLM, using the repo, to baseline pass@1, pass@5, and pass@10.

Create a small set (50–200) of verified problem–code pairs and fine-tune a code model; track pass@1 and pass@10.

Audit synthetic pairs for correctness before SFT and use early stopping to avoid overfitting.
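One way to act on the early-stopping advice is to evaluate each checkpoint on both pass@1 and pass@10 and stop once diversity starts to collapse. The sketch below is an assumed selection rule, not the paper's procedure; the `Checkpoint` type, the thresholds, and the example scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    step: int
    pass1: float   # pass@1 on a held-out set, as a fraction
    pass10: float  # pass@10 on the same set, as a fraction


def pick_checkpoint(history, max_pass10_drop=0.02, patience=2):
    """Pick the checkpoint with the best pass@1 among those whose pass@10
    stays within max_pass10_drop of the best pass@10 seen so far.
    Stop scanning after `patience` consecutive non-improving evaluations."""
    best = None
    best_pass10 = float("-inf")
    stale = 0
    for ckpt in history:
        best_pass10 = max(best_pass10, ckpt.pass10)
        diverse_enough = ckpt.pass10 >= best_pass10 - max_pass10_drop
        if diverse_enough and (best is None or ckpt.pass1 > best.pass1):
            best, stale = ckpt, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best


# Hypothetical evaluation history: pass@1 keeps rising while pass@10 falls.
history = [
    Checkpoint(100, 0.20, 0.40),
    Checkpoint(200, 0.30, 0.45),
    Checkpoint(300, 0.34, 0.44),
    Checkpoint(400, 0.36, 0.38),
    Checkpoint(500, 0.37, 0.35),
]
print(pick_checkpoint(history).step)  # 300
```

The rule deliberately passes over later checkpoints with a higher pass@1, because their pass@10 has dropped, which is the overfitting pattern described under Failure Modes.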

Optimization Features

Infra Optimization

single DGX node with 8x A100 for experiments

Training Optimization

SFT: early stopping to avoid overfitting

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/NVlabs/verilog-eval

Data URLs

https://github.com/NVlabs/verilog-eval

Risks & Boundaries

Limitations

Benchmark covers self-contained modules only; no module instantiation or system-level design.

Evaluates functional correctness by simulation but not synthesizability or performance (PPA).

When Not To Use

When you need synthesizable RTL ready for fabrication or PPA-optimized code.

For multi-module or system integration tasks requiring hierarchical instantiation.

Failure Modes

LLM hallucination in problem descriptions leading to incorrect SFT pairs.

Overfitting during SFT that increases pass@1 but reduces diversity (lower pass@10).

Core Entities

Models

codegen-16B-verilog, codegen-16B-multi, codegen-16B-nl, codegen-2B-verilog, codegen-multi, codegen-nl, gpt-3.5-turbo, gpt-4

Metrics

pass@k, BLEU

Datasets

HDLBits, VerilogEval-human (156 problems), VerilogEval-machine (143 valid problems), SFT, GitHub Verilog corpus (filtered)

Benchmarks

VerilogEval, HumanEval

Context Entities

Models

CodeGen family, OpenAI GPT family

Metrics

pass@1, pass@5, pass@10

Datasets

The Pile, BigQuery multilingual code dataset

Benchmarks

HumanEval, MBPP, APPS
