LLMs can generate SystemVerilog security assertions from natural-language comments

June 24, 20238 min

Overview

Decision SnapshotNeeds Validation

The paper demonstrates a working pipeline and large-scale experiments, but real-world use needs additional validation, domain prompts, and error-correction layers.

Citations20

Evidence Strength0.90

Confidence0.88

Risk Signals11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 60%

Authors

Rahul Kande, Hammond Pearce, Benjamin Tan, Brendan Dolan-Gavitt, Shailja Thakur, Ramesh Karri, Jeyavijayan Rajendran

Links

Abstract / PDF

Why It Matters For Business

LLMs can speed drafting of hardware security checks by turning comments into assertions, cutting expert time, but outputs need automated validation and human review before deployment.

Who Should Care

Summary TLDR

The authors build an end-to-end pipeline and a 10-case benchmark to test whether off-the-shelf code LLMs (notably OpenAI Codex / code-davinci-002) can turn natural-language comments and small context into SystemVerilog security assertions. They generated 226,800 candidate assertions across 2,268 prompt variants. Average correctness (matching a hand-written reference under exhaustive simulation) was 26.54% across all prompt types, but carefully crafted prompts (detailed comment + example) reached up to 93.55% correctness. Common failures include syntax, wrong signal names, and timing errors. The tool can speed up assertion drafting but still needs verification and expert oversight.

Problem Statement

Writing security-focused hardware assertions is hard, needs design/security expertise, and does not scale; the paper asks whether large code LLMs can automatically generate useful SystemVerilog security assertions from natural-language comments and small design context.

Main Contribution

A reproducible evaluation pipeline that prompts LLMs, post-processes outputs, simulates assertions with Modelsim, and auto-scores equivalence to a golden reference.

A benchmark suite of 10 realistic hardware modules (Hack@DAC and OpenTitan derived) and 10 golden reference assertions targeting common hardware CWEs.

Key Findings

LLMs can produce correct security assertions but quality varies widely.

NumbersAverage correctness 26.54% across 2,268 prompt types; best prompt 93.55%

Practical UseUse LLMs to draft assertions but expect heavy variation; rely on the best prompt patterns and validate each generated assertion with simulation or formal checks.

Evidence RefSec IV-C, Fig 3, Table IV, Conclusion

Prompt detail strongly drives success.

NumbersDetailed comment + detailed example (DetailedCom+DetailedEx) ≈72.16% average; top single config 93.55%

Practical UseWhen using LLMs, include a detailed natural-language comment and a similar example assertion to maximize correct outputs.

Evidence RefSec IV-D (1), Fig 4, Fig 11, Table IV

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
total_assertions_generated226,80010 benchmarks × 2,268 prompts × n=10Sec IV-C, ConclusionSec IV-C, Conclusion
average_correctness26.54%average across 2,268 prompt types and 10 benchmarksSec IV-C, Fig 3, Table IIISec IV-C

What To Try In 7 Days

Run the authors' framework on one small RTL module and compare generated assertions to existing checks.

Experiment with detailed natural-language comments plus a worked example to see correctness improve substantially.

Automate simple syntax fixes and run the generated assertions under an exhaustive testbench or formal check before trusting them.

Reproducibility

Code AvailableYes
Data AvailableYes
Open Source StatusPartial
LicenseUnknown

Risks & Boundaries

Limitations

Benchmark set covers 10 cases and selected CWEs; broader vulnerabilities may behave differently.

Framework uses trimmed source (<100 LOC); large multi-module designs may reduce LLM inference accuracy.

When Not To Use

When you lack engineers who can review and validate generated assertions.

For final releases of safety-critical hardware without formal proof or exhaustive validation.

Failure Modes

Invalid SystemVerilog syntax produced by the LLM.

Use of signals or indices not present in module interface.

Core Entities

Models

code-davinci-002code-cushman-001codegen-2b-ftChatGPT (Jan 9 2023)OpenAI Codex

Metrics

correct_simulated_assertions (%)compilation_rate (%)simulation_reachability / hangsAccuracy

Datasets

Benchmark suite BM1..BM10 (Hack@DAC, OpenTitan derived)

Benchmarks

BM1 Lockable RegisterBM2 Traffic signal controllerBM3 JTAG password controllerBM4 Bus access controlBM5 AES IP (data output before completion)BM6 AES IP (clear keys on debug)BM7 CVA6 register controller privilegeBM8 Register lock initializationBM9 OpenTitan ADC controllerBM10 OpenTitan reset manager

Context Entities

Models

GPT-family description (background)Codex training on GitHub code (cited)

Metrics

exhaustive simulation over parametrized testbench

Datasets

Trimmed design source files (<100 LOC) for each benchmark

Benchmarks

Hack@DAC designsOpenTitan SoC modules