VerilogEval: an automated sandbox and 156-problem benchmark to test LLMs on Verilog code and to study synthetic fine-tuning

September 14, 2023 · 6 min

Overview

Decision Snapshot: Needs Validation

The benchmark and SFT pipeline are useful for small and educational Verilog tasks, but they do not evaluate synthesizability, module instantiation, or PPA trade-offs, limiting production use.

Citations: 16

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 60%

Authors

Mingjie Liu, Nathaniel Pinckney, Brucek Khailany, Haoxing Ren


Why It Matters For Business

Automated functional tests let teams measure whether LLMs actually produce correct Verilog behavior; synthetic SFT can raise single-shot correctness and make open models competitive with closed APIs for small HDL tasks.

Who Should Care

Summary TLDR

This paper introduces VerilogEval, an open-source benchmark and sandbox to test LLMs on Verilog HDL tasks. It provides 156 human-curated problems, an automated simulator-based pass@k correctness test, and 8,502 synthetic problem-code pairs used for supervised fine-tuning (SFT). SFT on synthetic data improves pass@1 for several CodeGen variants and can match GPT-3.5 on this benchmark, but it can also overfit and reduce sampling diversity. The repo and tools are released for reproducible evaluation.

Problem Statement

There is no standard, automated benchmark that tests LLMs on Verilog (hardware) code generation using functional simulation. Verilog tasks often include images or diagrams, which text-only LLMs cannot read. The paper builds a text-only, automatically testable benchmark and studies whether LLM-generated synthetic training pairs can improve Verilog code generation.

Main Contribution

VerilogEval: an open-source, text-only Verilog benchmark with 156 human-curated problems and a machine-generated subset.

A sandboxed automated testing pipeline using Icarus Verilog to check functional correctness by simulation and pass@k metrics.
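The simulation step can be sketched as follows. This is a minimal illustration, not the repo's actual harness: it assumes `iverilog` and `vvp` are on the PATH, and the function names, file names, and the `ALL TESTS PASSED` marker are hypothetical. A temporary directory and a timeout provide only light isolation; a real sandbox would need stronger containment.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_commands(design: Path, testbench: Path, binary: Path) -> list[list[str]]:
    """Return the Icarus Verilog compile and run commands for one candidate."""
    compile_cmd = ["iverilog", "-o", str(binary), str(testbench), str(design)]
    run_cmd = ["vvp", str(binary)]
    return [compile_cmd, run_cmd]


def simulate(design_src: str, testbench_src: str, timeout_s: float = 30.0) -> tuple[bool, str]:
    """Compile and simulate in a throwaway directory; return (ran_ok, output)."""
    if shutil.which("iverilog") is None or shutil.which("vvp") is None:
        raise RuntimeError("Icarus Verilog (iverilog, vvp) not found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        design = work / "design.v"        # hypothetical file names
        testbench = work / "tb.v"
        design.write_text(design_src)
        testbench.write_text(testbench_src)
        commands = build_commands(design, testbench, work / "sim.out")
        try:
            for cmd in commands:
                result = subprocess.run(
                    cmd, cwd=work, capture_output=True, text=True, timeout=timeout_s
                )
                if result.returncode != 0:
                    return False, result.stderr
        except subprocess.TimeoutExpired:
            return False, "timeout"
        return True, result.stdout


def passed(stdout: str, marker: str = "ALL TESTS PASSED") -> bool:
    """Hypothetical pass criterion: the testbench prints `marker` on success."""
    return marker in stdout
```

A successful `vvp` exit does not by itself mean the design is correct, so the verdict here comes from what the testbench prints, not from the return code.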

Key Findings

Synthetic SFT boosts single-sample correctness (pass@1) on Verilog tasks.

Numbers: codegen-2B-verilog pass@1 20.1% -> 35.9% after SFT

Practical Use: Fine-tune code models with verified problem–code pairs to substantially raise single-shot success on small Verilog tasks (roughly 1.8x for codegen-2B-verilog).

Evidence Ref: Table IV

GPT models outperform base codegen variants on many tasks but SFT can match GPT-3.5.

Numbers: gpt-3.5 pass@1 (machine) 46.7%; codegen-16B-verilog-sft ~46.2%

Practical Use: If you need GPT-level Verilog snippets, SFT on domain pairs can make large open models competitive.

Evidence Ref: Table II

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
pass@1 (VerilogEval-machine) | gpt-3.5 46.7%; gpt-4 60.0%; codegen-16B-verilog-sft 46.2% | - | - | Table II (VerilogEval-machine) | Table II | Table II
pass@1 (codegen-2B-verilog) | 20.1% -> 35.9% after SFT | 20.1% | +15.8pp | VerilogEval-machine | Table IV | Table IV
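The delta for codegen-2B-verilog can be checked with a few lines of arithmetic. This is only a worked example using the two pass@1 values reported above; the helper names are hypothetical.

```python
def delta_pp(baseline_pct: float, value_pct: float) -> float:
    """Absolute change in percentage points, rounded to one decimal."""
    return round(value_pct - baseline_pct, 1)


def relative_gain(baseline_pct: float, value_pct: float) -> float:
    """Ratio of the new value to the baseline (1.0 means no change)."""
    return value_pct / baseline_pct


print(delta_pp(20.1, 35.9))                  # 15.8
print(round(relative_gain(20.1, 35.9), 2))   # 1.79
```

The gain is large in absolute terms but falls short of a full doubling of pass@1.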

What To Try In 7 Days

Run VerilogEval on your preferred LLM using the repo to establish baseline pass@1, pass@5, and pass@10.

Create a small set (50–200) of verified problem–code pairs and fine-tune a code model; track pass@1 and pass@10.

Audit synthetic pairs for correctness before SFT and use early stopping to avoid overfitting.
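For tracking pass@1, pass@5, and pass@10, one common choice is the unbiased pass@k estimator used with HumanEval-style benchmarks: with n samples per problem of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator; whether the repo computes it exactly this way is an assumption, and the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, `benchmark_pass_at_k([(20, 5), (20, 0)], k=1)` averages 0.25 and 0.0 to give 0.125. Drawing more samples per problem than the largest k of interest keeps the estimate less noisy.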

Optimization Features

Infra Optimization
single DGX node with 8x A100 for experiments
Training Optimization
SFT, early stopping to avoid overfitting

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Benchmark covers self-contained modules only; no module instantiation or system-level design.

Evaluates functional correctness by simulation but not synthesizability or performance (PPA).

When Not To Use

When you need synthesizable RTL ready for fabrication or PPA-optimized code.

For multi-module or system integration tasks requiring hierarchical instantiation.

Failure Modes

LLM hallucination in problem descriptions leading to incorrect SFT pairs.

Overfitting during SFT that increases pass@1 but reduces diversity (lower pass@10).
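One way to guard against the overfitting failure mode is to monitor pass@10 on a held-out set during SFT and stop once it stops improving. The sketch below is a generic patience-based early stopper, not the paper's procedure; the class name, the checkpoint steps, and the scores are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class EarlyStopper:
    patience: int = 2          # evaluations without improvement before stopping
    min_delta: float = 0.0     # minimum gain that counts as an improvement
    best: float = float("-inf")
    best_step: int = -1
    bad_evals: int = 0

    def update(self, step: int, score: float) -> bool:
        """Record a validation score; return True when training should stop."""
        if score > self.best + self.min_delta:
            self.best, self.best_step, self.bad_evals = score, step, 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Hypothetical (step, held-out pass@10) history; the values are illustrative only.
history = [(100, 0.40), (200, 0.45), (300, 0.44), (400, 0.43), (500, 0.47)]
stopper = EarlyStopper(patience=2)
for step, p10 in history:
    if stopper.update(step, p10):
        break

print(step, stopper.best_step)  # 400 200
```

Here training stops at step 400 and the checkpoint from step 200 is kept. A larger `patience` would have tolerated the dip and reached step 500, so the setting trades compute against the risk of stopping too early.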

Core Entities

Models

codegen-16B-verilog, codegen-16B-multi, codegen-16B-nl, codegen-2B-verilog, codegen-multi, codegen-nl, gpt-3.5-turbo, gpt-4

Metrics

pass@k, BLEU

Datasets

HDLBits, VerilogEval-human (156 problems), VerilogEval-machine (143 valid problems), SFT, GitHub Verilog corpus (filtered)

Benchmarks

VerilogEval, HumanEval

Context Entities

Models

CodeGen family, OpenAI GPT family

Metrics

pass@1, pass@5, pass@10

Datasets

The Pile, BigQuery multilingual code dataset

Benchmarks

HumanEval, MBPP, APPS