VerilogEval: an automated sandbox and 156-problem benchmark to test LLMs on Verilog code and to study synthetic fine-tuning

September 14, 2023 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

16

Authors

Mingjie Liu, Nathaniel Pinckney, Brucek Khailany, Haoxing Ren

Why It Matters For Business

Automated functional tests let teams measure whether LLMs actually produce correct Verilog behavior; synthetic SFT can raise single-shot correctness and make open models competitive with closed APIs for small HDL tasks.

Summary TLDR

This paper introduces VerilogEval, an open-source benchmark and sandbox to test LLMs on Verilog HDL tasks. It provides 156 human-curated problems, an automated simulator-based pass@k correctness test, and 8,502 synthetic problem-code pairs used for supervised fine-tuning (SFT). SFT on synthetic data improves pass@1 for several CodeGen variants and can match GPT-3.5 on this benchmark, but it can also overfit and reduce sampling diversity. The repo and tools are released for reproducible evaluation.

Problem Statement

There is no standard, automated benchmark that tests LLMs on Verilog (hardware) code generation using functional simulation. Verilog problem statements often rely on images or diagrams, which text-only LLMs cannot read. The paper builds a text-only, automatically testable benchmark and studies whether LLM-generated synthetic training pairs can improve Verilog code generation.

Main Contribution

VerilogEval: an open-source, text-only Verilog benchmark with 156 human-curated problems (VerilogEval-human) and a machine-generated set of 143 valid problems (VerilogEval-machine).

A sandboxed automated testing pipeline using Icarus Verilog to check functional correctness by simulation and pass@k metrics.
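The compile-and-simulate step of such a pipeline can be sketched roughly as follows. This is a minimal illustration, not the repo's actual harness: the function names, the `ALL_TESTS_PASSED` marker, and the timeout value are assumptions introduced here, and a timeout alone does not provide real sandbox isolation. Only the `iverilog` and `vvp` commands themselves come from Icarus Verilog.

```python
import subprocess
from pathlib import Path

# Assumption: the testbench prints this marker when every checked sample matches.
# The marker string is hypothetical, not taken from the VerilogEval repo.
PASS_MARKER = "ALL_TESTS_PASSED"


def build_commands(design: str, testbench: str, binary: str = "sim.out") -> list[list[str]]:
    """Return the compile command and the run command for one candidate."""
    # -g2012 enables SystemVerilog-2012 syntax; drop it for plain Verilog sources.
    compile_cmd = ["iverilog", "-g2012", "-o", binary, testbench, design]
    run_cmd = ["vvp", binary]
    return [compile_cmd, run_cmd]


def is_pass(returncode: int, stdout: str) -> bool:
    """A candidate passes only if simulation exits cleanly and prints the marker."""
    return returncode == 0 and PASS_MARKER in stdout.splitlines()


def run_candidate(design: str, testbench: str, workdir: str, timeout_s: float = 30.0) -> bool:
    """Compile and simulate one candidate; any failure or timeout counts as a miss."""
    binary = str(Path(workdir) / "sim.out")
    compile_cmd, run_cmd = build_commands(design, testbench, binary)
    try:
        compiled = subprocess.run(compile_cmd, capture_output=True, text=True, timeout=timeout_s)
        if compiled.returncode != 0:
            return False  # syntax or elaboration error
        result = subprocess.run(run_cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False  # e.g. a generated design that never finishes simulating
    return is_pass(result.returncode, result.stdout)
```

Calling `run_candidate("top.v", "tb.v", "/tmp/work")` once per sampled completion gives the per-problem count of correct samples that pass@k is computed from.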

A procedure to create 8,502 synthetic problem–code pairs via LLMs and experiments showing supervised fine-tuning (SFT) on that data improves model pass@1 scores.

Key Findings

Synthetic SFT boosts single-sample correctness (pass@1) on Verilog tasks.

Numbers: codegen-2B-verilog pass@1 20.1% -> 35.9% after SFT

GPT models outperform base CodeGen variants on many tasks, but a CodeGen model fine-tuned with SFT can match GPT-3.5.

Numbers: gpt-3.5 pass@1 (machine) 46.7%; codegen-16B-verilog-sft ~46.2%

SFT increases pass@1 while reducing pass@5 and pass@10 with continued epochs (overfitting/reduced diversity).

Numbers: pass@1 rises with epochs; pass@5/10 drop after more epochs (Fig. 8)
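To see how pass@1 can rise while pass@10 falls, it helps to compute pass@k directly. The sketch below uses the standard unbiased estimator introduced with HumanEval, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c the number that pass. The two "checkpoints" are made-up counts for a two-problem toy benchmark, not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over all problems in a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Made-up correct-sample counts (out of n=10) for two problems.
early_checkpoint = [4, 3]   # diverse: some correct samples on both problems
late_checkpoint = [10, 0]   # collapsed: one problem always right, one never

for name, counts in [("early", early_checkpoint), ("late", late_checkpoint)]:
    p1 = benchmark_pass_at_k(counts, n=10, k=1)
    p10 = benchmark_pass_at_k(counts, n=10, k=10)
    print(f"{name}: pass@1={p1:.2f} pass@10={p10:.2f}")
```

In this toy case the later checkpoint has the higher pass@1 (about 0.50 versus 0.35) but the lower pass@10 (0.5 versus 1.0), because it no longer produces any correct sample for the second problem.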

Poor-quality SFT data hurts performance.

Numbers: codegen-2B-verilog-sft-error pass@1 21.4% vs sft 35.9%
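One way to act on this finding is to screen synthetic pairs before fine-tuning. The sketch below is an illustration under stated assumptions: `filter_pairs` and `has_balanced_module` are hypothetical helpers, and the structural check is a crude stand-in for a real compiler or simulator check, which would be passed in through the same `check` argument.

```python
from typing import Callable, Iterable

Pair = tuple[str, str]  # (problem description, Verilog solution)


def filter_pairs(
    pairs: Iterable[Pair], check: Callable[[str], bool]
) -> tuple[list[Pair], list[Pair]]:
    """Split pairs into (kept, rejected) according to a code-level check."""
    kept: list[Pair] = []
    rejected: list[Pair] = []
    for description, code in pairs:
        (kept if check(code) else rejected).append((description, code))
    return kept, rejected


def has_balanced_module(code: str) -> bool:
    """Crude stand-in check: exactly one `module` and one `endmodule` token.

    This catches truncated generations only; it is not a syntax check.
    """
    tokens = code.split()
    return tokens.count("module") == 1 and tokens.count("endmodule") == 1


pairs = [
    ("Pass input a to output b.", "module m(input a, output b);\n  assign b = a;\nendmodule"),
    ("Invert input a.", "module m(input a, output b);\n  assign b = ~a;"),  # truncated
]
kept, rejected = filter_pairs(pairs, has_balanced_module)
print(len(kept), "kept;", len(rejected), "rejected")
```

Keeping the check as a parameter lets the same loop run with a stricter test later, for example one that compiles each solution.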

LLM-generated text descriptions yield valid problems for most candidates, but they differ from human-written descriptions.

Numbers: Generated 143 valid machine descriptions out of 156 candidates

Results

pass@1 (VerilogEval-machine)

Value: gpt-3.5 46.7%; gpt-4 60.0%; codegen-16B-verilog-sft 46.2%

pass@1 (codegen-2B-verilog)

Value: 20.1% -> 35.9% after SFT

Baseline: 20.1%

SFT

Value: pass@1 increases; pass@5 and pass@10 decrease after many epochs

Base model choice effect

Value: codegen-16B-nl-sft pass@1 33.9% vs codegen-16B-multi-sft 37.1%

Baseline: codegen-16B-nl-sft 33.9%

What To Try In 7 Days

Run VerilogEval on your preferred LLM, using the repo, to establish baseline pass@1, pass@5, and pass@10 scores.

Create a small set (50–200) of verified problem–code pairs and fine-tune a code model; track pass@1 and pass@10.

Audit synthetic pairs for correctness before SFT and use early stopping to avoid overfitting.
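The early-stopping advice can be made concrete with a simple checkpoint-selection rule: among epochs whose pass@10 stays close to the best pass@10 observed, keep the one with the highest pass@1. This is a sketch of one possible rule, not the paper's procedure. The `EpochScore` type, the `max_pass10_drop` tolerance, and the scores in the example are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpochScore:
    epoch: int
    pass_at_1: float
    pass_at_10: float


def select_checkpoint(history: list[EpochScore], max_pass10_drop: float = 0.02) -> EpochScore:
    """Pick the best-pass@1 epoch whose pass@10 is within a tolerance of the best pass@10."""
    if not history:
        raise ValueError("history must contain at least one epoch")
    best_p10 = max(score.pass_at_10 for score in history)
    eligible = [s for s in history if best_p10 - s.pass_at_10 <= max_pass10_drop]
    return max(eligible, key=lambda s: s.pass_at_1)


# Made-up validation scores, not results from the paper.
history = [
    EpochScore(1, 0.25, 0.50),
    EpochScore(2, 0.30, 0.52),
    EpochScore(3, 0.34, 0.51),
    EpochScore(4, 0.36, 0.45),  # pass@1 still rising, pass@10 falling
    EpochScore(5, 0.37, 0.40),
]
chosen = select_checkpoint(history)
print("keep checkpoint from epoch", chosen.epoch)
```

With these numbers the rule keeps epoch 3: later epochs have slightly higher pass@1, but their pass@10 has dropped by more than the tolerance.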

Optimization Features

Infra Optimization

  • single DGX node with 8x A100 for experiments

Training Optimization

  • SFT
  • early stopping to avoid overfitting

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Benchmark covers self-contained modules only; no module instantiation or system-level design.
  • Evaluates functional correctness by simulation but not synthesizability or performance (PPA).
  • Problems are small-scale; results may not transfer to complex chip designs.
  • Machine-generated descriptions can include low-level implementation bias or ambiguity.

When Not To Use

  • When you need synthesizable RTL ready for fabrication or PPA-optimized code.
  • For multi-module or system integration tasks requiring hierarchical instantiation.
  • As the sole measure of model quality for large hardware projects.

Failure Modes

  • LLM hallucination in problem descriptions leading to incorrect SFT pairs.
  • Overfitting during SFT that increases pass@1 but reduces diversity (lower pass@10).
  • Simulator incompatibility: Icarus Verilog may not support full IEEE-1364 features used by some code.

Core Entities

Models

  • codegen-16B-verilog
  • codegen-16B-multi
  • codegen-16B-nl
  • codegen-2B-verilog
  • codegen-multi
  • codegen-nl
  • gpt-3.5-turbo
  • gpt-4

Metrics

  • pass@k
  • BLEU

Datasets

  • HDLBits
  • VerilogEval-human (156 problems)
  • VerilogEval-machine (143 valid problems)
  • Synthetic SFT set (8,502 problem–code pairs)
  • GitHub Verilog corpus (filtered)

Benchmarks

  • VerilogEval
  • HumanEval

Context Entities

Models

  • CodeGen family
  • OpenAI GPT family

Metrics

  • pass@1
  • pass@5
  • pass@10

Datasets

  • The Pile
  • BigQuery multilingual code dataset

Benchmarks

  • HumanEval
  • MBPP
  • APPS