Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.45
Citation Count
20
Why It Matters For Business
LLMs can speed drafting of hardware security checks by turning comments into assertions, cutting expert time, but outputs need automated validation and human review before deployment.
Summary TLDR
The authors build an end-to-end pipeline and a 10-case benchmark to test whether off-the-shelf code LLMs (notably OpenAI Codex / code-davinci-002) can turn natural-language comments and small context into SystemVerilog security assertions. They generated 226,800 candidate assertions across 2,268 prompt variants. Average correctness (matching a hand-written reference under exhaustive simulation) was 26.54% across all prompt types, but carefully crafted prompts (detailed comment + example) reached up to 93.55% correctness. Common failures include syntax, wrong signal names, and timing errors. The tool can speed up assertion drafting but still needs verification and expert oversight.
Problem Statement
Writing security-focused hardware assertions is hard, needs design/security expertise, and does not scale; the paper asks whether large code LLMs can automatically generate useful SystemVerilog security assertions from natural-language comments and small design context.
Main Contribution
A reproducible evaluation pipeline that prompts LLMs, post-processes outputs, simulates assertions with Modelsim, and auto-scores equivalence to a golden reference.
A benchmark suite of 10 realistic hardware modules (Hack@DAC and OpenTitan derived) and 10 golden reference assertions targeting common hardware CWEs.
A large-scale study of prompt variants (2,268 prompt types) and four LLMs, showing how prompt design and hyperparameters affect correctness.
Open-sourcing the framework and benchmarks to enable follow-on work (authors state they will release code and data).
Key Findings
LLMs can produce correct security assertions but quality varies widely.
Prompt detail strongly drives success.
Common failure modes are syntactic and signal/timing errors.
Different LLMs vary in capability; stronger code models work better.
Results
total_assertions_generated
average_correctness
best_prompt_correctness
model_coverage_best_prompt
Who Should Care
What To Try In 7 Days
Run the authors' framework on one small RTL module and compare generated assertions to existing checks.
Experiment with detailed natural-language comments plus a worked example to see correctness improve substantially.
Automate simple syntax fixes and run the generated assertions under an exhaustive testbench or formal check before trusting them.
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Benchmark set covers 10 cases and selected CWEs; broader vulnerabilities may behave differently.
- Framework uses trimmed source (<100 LOC); large multi-module designs may reduce LLM inference accuracy.
- Reference assertions are one reasonable encoding of intent and could mark alternative correct assertions as incorrect.
- Generated assertions often need human review for logic and timing correctness despite passing syntax fixes.
When Not To Use
- When you lack engineers who can review and validate generated assertions.
- For final releases of safety-critical hardware without formal proof or exhaustive validation.
- When prompts cannot include sufficiently detailed comments or examples.
Failure Modes
- Invalid SystemVerilog syntax produced by the LLM.
- Use of signals or indices not present in module interface.
- Incorrect timing or logical condition (triggers on wrong input sets).
- Extra or conflicting assertions in the same output causing incorrect behavior.
Core Entities
Models
- code-davinci-002
- code-cushman-001
- codegen-2b-ft
- ChatGPT (Jan 9 2023)
- OpenAI Codex
Metrics
- correct_simulated_assertions (%)
- compilation_rate (%)
- simulation_reachability / hangs
- Accuracy
Datasets
- Benchmark suite BM1..BM10 (Hack@DAC, OpenTitan derived)
Benchmarks
- BM1 Lockable Register
- BM2 Traffic signal controller
- BM3 JTAG password controller
- BM4 Bus access control
- BM5 AES IP (data output before completion)
- BM6 AES IP (clear keys on debug)
- BM7 CVA6 register controller privilege
- BM8 Register lock initialization
- BM9 OpenTitan ADC controller
- BM10 OpenTitan reset manager
Context Entities
Models
- GPT-family description (background)
- Codex training on GitHub code (cited)
Metrics
- exhaustive simulation over parametrized testbench
Datasets
- Trimmed design source files (<100 LOC) for each benchmark
Benchmarks
- Hack@DAC designs
- OpenTitan SoC modules

