Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
16
Why It Matters For Business
Automated functional tests let teams measure whether LLMs actually produce functionally correct Verilog; supervised fine-tuning (SFT) on synthetic data can raise single-shot correctness and make open models competitive with closed APIs for small HDL tasks.
Summary TLDR
This paper introduces VerilogEval, an open-source benchmark and sandbox to test LLMs on Verilog HDL tasks. It provides 156 human-curated problems, an automated simulator-based pass@k correctness test, and 8,502 synthetic problem-code pairs used for supervised fine-tuning (SFT). SFT on synthetic data improves pass@1 for several CodeGen variants and can match GPT-3.5 on this benchmark, but it can also overfit and reduce sampling diversity. The repo and tools are released for reproducible evaluation.
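As a rough illustration of the pass@k metric mentioned above, the sketch below assumes the standard unbiased estimator popularized by HumanEval: for a problem with n sampled completions of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). The function names are illustrative and are not taken from the VerilogEval repo.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of completions sampled, c: number that passed, k: budget.
    Assumes the HumanEval-style estimator 1 - C(n-c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples contain a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimate reduces to the fraction of passing samples, c / n, which is why pass@1 is read as single-sample correctness.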
Problem Statement
There is no standard, automated benchmark that tests LLMs on Verilog (hardware) code generation using functional simulation. Verilog tasks often include images or diagrams, which text-only LLMs cannot process. The paper builds a text-only, automatically testable benchmark and studies whether LLM-generated synthetic training pairs can improve Verilog code generation.
Main Contribution
VerilogEval: an open-source, text-only Verilog benchmark with 156 human-curated problems and a machine-generated subset.
A sandboxed automated testing pipeline using Icarus Verilog to check functional correctness by simulation and pass@k metrics.
A procedure to create 8,502 synthetic problem–code pairs via LLMs and experiments showing supervised fine-tuning (SFT) on that data improves model pass@1 scores.
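A simulation-based check of this kind can be sketched as follows. This is a minimal sketch, not the paper's harness: it assumes `iverilog` and `vvp` are on the PATH, that the testbench prints a hypothetical marker string (`PASS` here) on success, and it provides no real sandboxing beyond a timeout.

```python
import subprocess
from pathlib import Path


def build_commands(design: Path, testbench: Path, out: Path) -> list[list[str]]:
    """Return the compile and simulate commands for Icarus Verilog."""
    return [
        ["iverilog", "-o", str(out), str(design), str(testbench)],
        ["vvp", str(out)],
    ]


def run_testbench(design: Path, testbench: Path, workdir: Path,
                  timeout_s: float = 30.0, pass_marker: str = "PASS") -> bool:
    """Compile and simulate one candidate; True only if the marker is printed.

    pass_marker is an assumption about how the testbench reports success.
    """
    compile_cmd, sim_cmd = build_commands(design, testbench, workdir / "sim.vvp")
    try:
        compiled = subprocess.run(compile_cmd, capture_output=True,
                                  text=True, timeout=timeout_s)
        if compiled.returncode != 0:
            return False  # syntax or elaboration error
        sim = subprocess.run(sim_cmd, capture_output=True,
                             text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False  # treat hangs (e.g. a testbench that never finishes) as failures
    return sim.returncode == 0 and pass_marker in sim.stdout
```

Counting how many of n sampled completions return True for a problem gives the per-problem pass count that a pass@k computation needs.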
Key Findings
Synthetic SFT boosts single-sample correctness (pass@1) on Verilog tasks.
GPT models outperform the base CodeGen variants on many tasks, but SFT-tuned CodeGen models can match GPT-3.5 on this benchmark.
With continued training epochs, SFT raises pass@1 but lowers pass@5 and pass@10, a sign of overfitting and reduced sampling diversity.
Poor-quality SFT data hurts performance.
Machine-generated text-only descriptions yield many valid problems (143 in VerilogEval-machine) but differ from human-written descriptions.
Results
pass@1 (VerilogEval-machine)
pass@1 (codegen-2B-verilog)
SFT
base model choice effect
Who Should Care
What To Try In 7 Days
Run VerilogEval on your preferred LLM to baseline pass@1, pass@5, and pass@10 using the released repo.
Create a small set (50–200) of verified problem–code pairs and fine-tune a code model; track pass@1 and pass@10.
Audit synthetic pairs for correctness before SFT and use early stopping to avoid overfitting.
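The audit step above could look something like the following. It is a sketch under stated assumptions: `is_valid` is a hypothetical callable supplied by the user (for example, a wrapper that compiles and simulates the code), and duplicates are detected by whitespace-normalized code text only.

```python
from typing import Callable, Iterable


def audit_pairs(pairs: Iterable[tuple[str, str]],
                is_valid: Callable[[str, str], bool]) -> list[tuple[str, str]]:
    """Keep (description, code) pairs that are non-empty, unique, and valid."""
    seen: set[str] = set()
    kept: list[tuple[str, str]] = []
    for description, code in pairs:
        key = " ".join(code.split())  # collapse whitespace before comparing
        if not description.strip() or key in seen:
            continue  # drop empty descriptions and duplicate code
        if is_valid(description, code):
            seen.add(key)
            kept.append((description, code))
    return kept
```

Passing the validator in as a parameter keeps the filter independent of any particular simulator or review process.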
Optimization Features
Infra Optimization
- single DGX node with 8x A100 for experiments
Training Optimization
- SFT
- early stopping to avoid overfitting
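One common way to apply early stopping is a patience rule on a validation score, sketched below. The choice of score (for example, pass@10 on held-out problems), the patience value, and the function name are assumptions for illustration, not settings reported by the paper.

```python
def best_epoch(scores: list[float], patience: int = 2,
               min_delta: float = 0.0) -> int:
    """Return the index of the checkpoint to keep.

    Scans per-epoch validation scores in order and stops once `patience`
    consecutive epochs fail to beat the best score by more than min_delta.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    best_idx, best, stale = 0, scores[0], 0
    for i in range(1, len(scores)):
        if scores[i] > best + min_delta:
            best_idx, best, stale = i, scores[i], 0
        else:
            stale += 1
            if stale >= patience:
                break  # later epochs are never evaluated
    return best_idx
```

Monitoring a multi-sample score rather than pass@1 alone is one way to catch the loss of diversity that shows up as falling pass@5 and pass@10.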
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Benchmark covers self-contained modules only; no module instantiation or system-level design.
- Evaluates functional correctness by simulation but not synthesizability or performance (PPA).
- Problems are small-scale; results may not transfer to complex chip designs.
- Machine-generated descriptions can include low-level implementation bias or ambiguity.
When Not To Use
- When you need synthesizable RTL ready for fabrication or PPA-optimized code.
- For multi-module or system integration tasks requiring hierarchical instantiation.
- As the sole measure of model quality for large hardware projects.
Failure Modes
- LLM hallucination in problem descriptions leading to incorrect SFT pairs.
- Overfitting during SFT that increases pass@1 but reduces diversity (lower pass@10).
- Simulator incompatibility: Icarus Verilog may not support every IEEE 1364 feature used by generated code.
Core Entities
Models
- codegen-16B-verilog
- codegen-16B-multi
- codegen-16B-nl
- codegen-2B-verilog
- codegen-multi
- codegen-nl
- gpt-3.5-turbo
- gpt-4
Metrics
- pass@k
- BLEU
Datasets
- HDLBits
- VerilogEval-human (156 problems)
- VerilogEval-machine (143 valid problems)
- Synthetic SFT set (8,502 problem–code pairs)
- GitHub Verilog corpus (filtered)
Benchmarks
- VerilogEval
- HumanEval
Context Entities
Models
- CodeGen family
- OpenAI GPT family
Metrics
- pass@1
- pass@5
- pass@10
Datasets
- The Pile
- BigQuery multilingual code dataset
Benchmarks
- HumanEval
- MBPP
- APPS

