LLMs can generate SystemVerilog security assertions from natural-language comments

June 24, 20238 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.45

Citation Count

20

Authors

Rahul Kande, Hammond Pearce, Benjamin Tan, Brendan Dolan-Gavitt, Shailja Thakur, Ramesh Karri, Jeyavijayan Rajendran

Links

Abstract / PDF

Why It Matters For Business

LLMs can speed drafting of hardware security checks by turning comments into assertions, cutting expert time, but outputs need automated validation and human review before deployment.

Summary TLDR

The authors build an end-to-end pipeline and a 10-case benchmark to test whether off-the-shelf code LLMs (notably OpenAI Codex / code-davinci-002) can turn natural-language comments and small context into SystemVerilog security assertions. They generated 226,800 candidate assertions across 2,268 prompt variants. Average correctness (matching a hand-written reference under exhaustive simulation) was 26.54% across all prompt types, but carefully crafted prompts (detailed comment + example) reached up to 93.55% correctness. Common failures include syntax, wrong signal names, and timing errors. The tool can speed up assertion drafting but still needs verification and expert oversight.

Problem Statement

Writing security-focused hardware assertions is hard, needs design/security expertise, and does not scale; the paper asks whether large code LLMs can automatically generate useful SystemVerilog security assertions from natural-language comments and small design context.

Main Contribution

A reproducible evaluation pipeline that prompts LLMs, post-processes outputs, simulates assertions with Modelsim, and auto-scores equivalence to a golden reference.

A benchmark suite of 10 realistic hardware modules (Hack@DAC and OpenTitan derived) and 10 golden reference assertions targeting common hardware CWEs.

A large-scale study of prompt variants (2,268 prompt types) and four LLMs, showing how prompt design and hyperparameters affect correctness.

Open-sourcing the framework and benchmarks to enable follow-on work (authors state they will release code and data).

Key Findings

LLMs can produce correct security assertions but quality varies widely.

NumbersAverage correctness 26.54% across 2,268 prompt types; best prompt 93.55%

Prompt detail strongly drives success.

NumbersDetailed comment + detailed example (DetailedCom+DetailedEx) ≈72.16% average; top single config 93.55%

Common failure modes are syntactic and signal/timing errors.

NumbersFrequent errors: invalid Verilog syntax, wrong/unused variables, incorrect timing (examples in Listing 9)

Different LLMs vary in capability; stronger code models work better.

NumbersBest-prompt run: code-davinci-002 correct on 9/10 benchmarks; code-cushman-001 7/10; codegen-2b-ft 4/10; ChatGPT 5/10

Results

total_assertions_generated

Value226,800

average_correctness

Value26.54%

best_prompt_correctness

Value93.55%

model_coverage_best_prompt

Valuecode-davinci-002: 9/10 benchmarks

Who Should Care

What To Try In 7 Days

Run the authors' framework on one small RTL module and compare generated assertions to existing checks.

Experiment with detailed natural-language comments plus a worked example to see correctness improve substantially.

Automate simple syntax fixes and run the generated assertions under an exhaustive testbench or formal check before trusting them.

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Benchmark set covers 10 cases and selected CWEs; broader vulnerabilities may behave differently.
  • Framework uses trimmed source (<100 LOC); large multi-module designs may reduce LLM inference accuracy.
  • Reference assertions are one reasonable encoding of intent and could mark alternative correct assertions as incorrect.
  • Generated assertions often need human review for logic and timing correctness despite passing syntax fixes.

When Not To Use

  • When you lack engineers who can review and validate generated assertions.
  • For final releases of safety-critical hardware without formal proof or exhaustive validation.
  • When prompts cannot include sufficiently detailed comments or examples.

Failure Modes

  • Invalid SystemVerilog syntax produced by the LLM.
  • Use of signals or indices not present in module interface.
  • Incorrect timing or logical condition (triggers on wrong input sets).
  • Extra or conflicting assertions in the same output causing incorrect behavior.

Core Entities

Models

  • code-davinci-002
  • code-cushman-001
  • codegen-2b-ft
  • ChatGPT (Jan 9 2023)
  • OpenAI Codex

Metrics

  • correct_simulated_assertions (%)
  • compilation_rate (%)
  • simulation_reachability / hangs
  • Accuracy

Datasets

  • Benchmark suite BM1..BM10 (Hack@DAC, OpenTitan derived)

Benchmarks

  • BM1 Lockable Register
  • BM2 Traffic signal controller
  • BM3 JTAG password controller
  • BM4 Bus access control
  • BM5 AES IP (data output before completion)
  • BM6 AES IP (clear keys on debug)
  • BM7 CVA6 register controller privilege
  • BM8 Register lock initialization
  • BM9 OpenTitan ADC controller
  • BM10 OpenTitan reset manager

Context Entities

Models

  • GPT-family description (background)
  • Codex training on GitHub code (cited)

Metrics

  • exhaustive simulation over parametrized testbench

Datasets

  • Trimmed design source files (<100 LOC) for each benchmark

Benchmarks

  • Hack@DAC designs
  • OpenTitan SoC modules