Overview
The benchmark and SFT pipeline are useful for small and educational Verilog tasks, but they do not evaluate synthesizability, module instantiation, or PPA trade-offs, limiting production use.
Citations: 16
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 60%
Why It Matters For Business
Automated functional tests let teams measure whether LLMs actually produce correct Verilog behavior; synthetic SFT can raise single-shot correctness and make open models competitive with closed APIs for small HDL tasks.
Summary TLDR
This paper introduces VerilogEval, an open-source benchmark and sandbox to test LLMs on Verilog HDL tasks. It provides 156 human-curated problems, an automated simulator-based pass@k correctness test, and 8,502 synthetic problem-code pairs used for supervised fine-tuning (SFT). SFT on synthetic data improves pass@1 for several CodeGen variants and can match GPT-3.5 on this benchmark, but it can also overfit and reduce sampling diversity. The repo and tools are released for reproducible evaluation.
Problem Statement
There is no standard, automated benchmark that tests LLMs on Verilog (hardware) code generation using functional simulation. Verilog tasks often rely on images or diagrams, which text-only LLMs cannot read. The paper builds a text-only, automatically testable benchmark and studies whether LLM-generated synthetic training pairs can improve Verilog code generation.
Main Contribution
VerilogEval: an open-source, text-only Verilog benchmark with 156 human-curated problems and a machine-generated subset.
A sandboxed automated testing pipeline using Icarus Verilog to check functional correctness by simulation and pass@k metrics.
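A simulation-based check of this kind can be sketched as follows. This is a minimal illustration and not the benchmark's own harness: it assumes `iverilog` and `vvp` are installed and on the PATH, and it assumes a hypothetical testbench convention where the testbench prints `ALL TESTS PASSED` on success. A temporary directory and a timeout are not real isolation, so a production setup would still need a proper sandbox.

```python
import subprocess
import tempfile
from pathlib import Path

# Hypothetical convention: our own testbench prints this marker on success.
PASS_MARKER = "ALL TESTS PASSED"


def build_commands(design, testbench, image):
    """Return the Icarus Verilog compile command and the vvp run command."""
    compile_cmd = ["iverilog", "-o", str(image), str(design), str(testbench)]
    run_cmd = ["vvp", str(image)]
    return compile_cmd, run_cmd


def passes_simulation(candidate_src, testbench_src, timeout_s=30.0):
    """Compile and simulate one candidate; True only if the marker is printed."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        design = tmp_path / "candidate.v"
        testbench = tmp_path / "tb.v"
        image = tmp_path / "sim.vvp"
        design.write_text(candidate_src)
        testbench.write_text(testbench_src)
        compile_cmd, run_cmd = build_commands(design, testbench, image)
        try:
            compiled = subprocess.run(
                compile_cmd, capture_output=True, text=True, timeout=timeout_s
            )
            if compiled.returncode != 0:
                return False  # syntax or elaboration error counts as a failure
            ran = subprocess.run(
                run_cmd, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            return False  # runaway simulations count as failures
        return ran.returncode == 0 and PASS_MARKER in ran.stdout
```

Each generated sample becomes one pass/fail outcome, and those outcomes are what a pass@k metric aggregates per problem.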
Key Findings
Synthetic SFT boosts single-sample correctness (pass@1) on Verilog tasks.
GPT models outperform base CodeGen variants on many tasks, but SFT can bring CodeGen close to GPT-3.5 (46.2% vs. 46.7% pass@1 on VerilogEval-machine).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| pass@1 (VerilogEval-machine) | gpt-3.5 46.7%; gpt-4 60.0%; codegen-16B-verilog-sft 46.2% | — | — | VerilogEval-machine | Table II | Table II |
| pass@1 (codegen-2B-verilog) | 20.1% -> 35.9% after SFT | 20.1% | +15.8pp | VerilogEval-machine | Table IV | Table IV |
What To Try In 7 Days
Run VerilogEval on your preferred LLM, using the repo, to establish baseline pass@1, pass@5, and pass@10.
Create a small set (50–200) of verified problem–code pairs and fine-tune a code model; track pass@1 and pass@10.
Audit synthetic pairs for correctness before SFT and use early stopping to avoid overfitting.
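To track pass@1 and pass@10 from a fixed number of samples per problem, one option is the unbiased pass@k estimator widely used in code-generation benchmarks, pass@k = 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn and c the number that pass. The summary above does not say which estimator the paper uses, so treat this as an assumption; the function names are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Drawing more samples than the largest k of interest (for example n = 20 when reporting pass@10) lowers the variance of the estimate.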
Risks & Boundaries
Limitations
Benchmark covers self-contained modules only; no module instantiation or system-level design.
Evaluates functional correctness by simulation, but not synthesizability or power, performance, and area (PPA).
When Not To Use
When you need synthesizable RTL ready for fabrication or PPA-optimized code.
For multi-module or system integration tasks requiring hierarchical instantiation.
Failure Modes
LLM hallucination in problem descriptions leading to incorrect SFT pairs.
Overfitting during SFT that increases pass@1 but reduces diversity (lower pass@10).
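One way to guard against the overfitting failure mode is to pick the SFT checkpoint using both metrics rather than pass@1 alone. The sketch below is a hypothetical heuristic, not something the paper prescribes: it keeps only checkpoints whose pass@10 is within a tolerance of the best pass@10 seen, then takes the one with the highest pass@1. The `Checkpoint` type, the function name, and the default tolerance are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    step: int
    pass_at_1: float
    pass_at_10: float


def select_checkpoint(history, max_pass10_drop=0.02):
    """Best pass@1 among checkpoints that keep pass@10 near its peak.

    max_pass10_drop is an assumed tolerance; tune it for your own runs.
    """
    if not history:
        raise ValueError("history must contain at least one checkpoint")
    best_pass10 = max(c.pass_at_10 for c in history)
    # The checkpoint holding the best pass@10 always qualifies, so this is non-empty.
    eligible = [c for c in history if best_pass10 - c.pass_at_10 <= max_pass10_drop]
    return max(eligible, key=lambda c: c.pass_at_1)
```

With this rule, a late checkpoint that gains pass@1 at the cost of a large pass@10 drop is rejected in favor of an earlier one that preserves sampling diversity.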

