LLMs can generate SystemVerilog security assertions from natural-language comments

Overview

Decision SnapshotNeeds Validation

The paper demonstrates a working pipeline and large-scale experiments, but real-world use needs additional validation, domain prompts, and error-correction layers.

Citations20

Evidence Strength0.90

Confidence0.88

Risk Signals11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 60%

Authors

Rahul Kande, Hammond Pearce, Benjamin Tan, Brendan Dolan-Gavitt, Shailja Thakur, Ramesh Karri, Jeyavijayan Rajendran

Links

Abstract / PDF

Why It Matters For Business

LLMs can speed drafting of hardware security checks by turning comments into assertions, cutting expert time, but outputs need automated validation and human review before deployment.

Who Should Care

ML Engineer Engineering Lead Product Manager Data Scientist

Summary TLDR

The authors build an end-to-end pipeline and a 10-case benchmark to test whether off-the-shelf code LLMs (notably OpenAI Codex / code-davinci-002) can turn natural-language comments and small context into SystemVerilog security assertions. They generated 226,800 candidate assertions across 2,268 prompt variants. Average correctness (matching a hand-written reference under exhaustive simulation) was 26.54% across all prompt types, but carefully crafted prompts (detailed comment + example) reached up to 93.55% correctness. Common failures include syntax, wrong signal names, and timing errors. The tool can speed up assertion drafting but still needs verification and expert oversight.

Problem Statement

Writing security-focused hardware assertions is hard, needs design/security expertise, and does not scale; the paper asks whether large code LLMs can automatically generate useful SystemVerilog security assertions from natural-language comments and small design context.

Main Contribution

A reproducible evaluation pipeline that prompts LLMs, post-processes outputs, simulates assertions with Modelsim, and auto-scores equivalence to a golden reference.

A benchmark suite of 10 realistic hardware modules (Hack@DAC and OpenTitan derived) and 10 golden reference assertions targeting common hardware CWEs.

Key Findings

LLMs can produce correct security assertions but quality varies widely.

NumbersAverage correctness 26.54% across 2,268 prompt types; best prompt 93.55%

Practical UseUse LLMs to draft assertions but expect heavy variation; rely on the best prompt patterns and validate each generated assertion with simulation or formal checks.

Evidence RefSec IV-C, Fig 3, Table IV, Conclusion

Prompt detail strongly drives success.

NumbersDetailed comment + detailed example (DetailedCom+DetailedEx) ≈72.16% average; top single config 93.55%

Practical UseWhen using LLMs, include a detailed natural-language comment and a similar example assertion to maximize correct outputs.

Evidence RefSec IV-D (1), Fig 4, Fig 11, Table IV

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
total_assertions_generated	226,800	—	—	10 benchmarks × 2,268 prompts × n=10	Sec IV-C, Conclusion	Sec IV-C, Conclusion
average_correctness	26.54%	—	—	average across 2,268 prompt types and 10 benchmarks	Sec IV-C, Fig 3, Table III	Sec IV-C

What To Try In 7 Days

Run the authors' framework on one small RTL module and compare generated assertions to existing checks.

Experiment with detailed natural-language comments plus a worked example to see correctness improve substantially.

Automate simple syntax fixes and run the generated assertions under an exhaustive testbench or formal check before trusting them.

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Risks & Boundaries

Limitations

Benchmark set covers 10 cases and selected CWEs; broader vulnerabilities may behave differently.

Framework uses trimmed source (<100 LOC); large multi-module designs may reduce LLM inference accuracy.

When Not To Use

When you lack engineers who can review and validate generated assertions.

For final releases of safety-critical hardware without formal proof or exhaustive validation.

Failure Modes

Invalid SystemVerilog syntax produced by the LLM.

Use of signals or indices not present in module interface.

Core Entities

Models

code-davinci-002code-cushman-001codegen-2b-ftChatGPT (Jan 9 2023)OpenAI Codex

Metrics

correct_simulated_assertions (%)compilation_rate (%)simulation_reachability / hangsAccuracy

Datasets

Benchmark suite BM1..BM10 (Hack@DAC, OpenTitan derived)

Benchmarks

BM1 Lockable RegisterBM2 Traffic signal controllerBM3 JTAG password controllerBM4 Bus access controlBM5 AES IP (data output before completion)BM6 AES IP (clear keys on debug)BM7 CVA6 register controller privilegeBM8 Register lock initializationBM9 OpenTitan ADC controllerBM10 OpenTitan reset manager

Context Entities

Models

GPT-family description (background)Codex training on GitHub code (cited)

Metrics

exhaustive simulation over parametrized testbench

Datasets

Trimmed design source files (<100 LOC) for each benchmark

Benchmarks

Hack@DAC designsOpenTitan SoC modules

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

LLMs can produce correct security assertions but quality varies widely.

Prompt detail strongly drives success.

Results

What To Try In 7 Days

Reproducibility

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

Context Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

Make LLM judges reliable and cheaper: EIF and tuned PPI give tighter, valid evaluation with few labels

Key finding

Model rankings from LLM judges become less biased and get valid confidence intervals when you down-weight unreliable judges

Key finding

Large Llama models beat smaller ones on iCliniq medical QA; LLM judges help measure clinical quality

Key finding

A judge-free, n-gram benchmark that approximates GPT-4 judging for Japanese Q&A

Key finding

MCTS-Judge: Use Monte Carlo Tree Search at test time to double LLM judge accuracy on code tasks

Key finding