Use LLMs to auto-generate hardware test inputs and recover coverage that random testing misses

October 6, 2023 | 7 min

Overview

Decision Snapshot: Needs Validation

Results show promise on several DUTs, but coverage varies widely across models and prompts; more DUTs and open datasets are needed to generalize.

Citations: 9

Evidence Strength: 0.60

Confidence: 0.75

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 45%

Production readiness: 55%

Novelty: 65%

Authors

Zixi Zhang, Balint Szekely, Pedro Gimenes, Greg Chadwick, Hugo McNally, Jianyi Cheng, Robert Mullins, Yiren Zhao

Why It Matters For Business

LLM-driven stimulus generation can cut manual effort in hardware verification and replace inefficient random testing for many components, but it needs prompt tuning and careful model selection.

Summary TLDR

This paper presents LLM4DV, an open benchmarking framework that uses prompted large language models (LLMs) to generate test stimuli for hardware design verification. The authors test six LLMs on eight hardware modules and introduce six prompting improvements. On the evaluated modules, LLM-based generation matched or outperformed naive constrained-random testing and reached 89.7%–100% coverage on many targets when prompts were optimized. Results vary widely by model and DUT; careful prompt design and supplying the DUT code help most.

Problem Statement

Hardware verification needs many targeted test inputs (stimuli). Creating those stimuli takes most of a chip project's engineering time and expertise. Random testing misses hard-to-hit states. The paper asks: can LLMs reduce human effort by reasoning about coverage plans and proposing targeted stimuli?

Main Contribution

LLM4DV: an open framework that orchestrates LLMs to produce hardware test stimuli and measure coverage.

Six practical prompting enhancements (e.g., missed-bin sampling, best-iterative-message sampling, dialogue restarting, few-shot examples, and including the DUT code in the prompt).
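
Missed-bin sampling can be pictured as follows. This is a minimal sketch, not the LLM4DV implementation: the function names `sample_missed_bins` and `build_prompt`, the bin names, and the prompt wording are all hypothetical, and it assumes the idea is to show the model only a small sample of the still-uncovered bins on each turn.

```python
import random

def sample_missed_bins(all_bins, hit_bins, k, rng):
    """Pick up to k not-yet-covered bins to mention in the next prompt."""
    missed = sorted(set(all_bins) - set(hit_bins))  # sorted for reproducibility
    return rng.sample(missed, min(k, len(missed)))

def build_prompt(missed_sample):
    """Hypothetical prompt text; the wording used by LLM4DV may differ."""
    listing = "\n".join(f"- {name}" for name in missed_sample)
    return (
        "These coverage bins are still unreached:\n"
        + listing
        + "\nPropose new stimuli that reach them."
    )

# Illustrative coverage plan with made-up bin names.
all_bins = [f"bin_{i}" for i in range(10)]
hit_bins = {"bin_0", "bin_3", "bin_7"}

rng = random.Random(0)
sample = sample_missed_bins(all_bins, hit_bins, k=3, rng=rng)
prompt = build_prompt(sample)
print(prompt)
```

Sampling a few bins per turn keeps the prompt short, which matters when the coverage plan has thousands of bins.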

Key Findings

For several modules, LLMs reached full (100%) coverage on the evaluated coverage plans.

Numbers: 100% coverage on Asynchronous FIFO and AMPLE Weight Bank (Table III)

Practical Use: For straightforward components, try LLM-driven stimulus generation first; it can hit all tracked bins and save manual test design.

Evidence Ref: Table III

Performance varies widely by model on harder modules.

Numbers: Primitive Data Prefetcher Core coverage ranged 7.93%–98.84% across models (Table III)

Practical Use: Expect large model-to-model variation; validate multiple LLMs and pick the one that performs best for your DUT.

Evidence Ref: Table III

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Coverage range on evaluated DUTs | ≈89.7%–100% on many DUTs with optimized prompts | naive constrained-random testing (CRT) | CRT failed on hard DUTs where LLMs reached high coverage | 8 DUTs, 3883 coverage bins total | Abstract; Table III | Abstract; Table III
Primitive Data Prefetcher Core coverage (example) | 7.93%–98.84% (model dependent) | CRT 0% | up to +98.84 percentage points vs CRT | Primitive Data Prefetcher Core (one DUT) | Table III | Table III
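
The Delta figures are differences in percentage points, not relative improvements; with a CRT baseline of 0% a relative change would be undefined. A small sketch of that arithmetic, using only the figures quoted in the table (the helper name `delta_pp` is illustrative):

```python
def delta_pp(coverage_pct, baseline_pct):
    """Absolute difference in percentage points between two coverage rates."""
    return round(coverage_pct - baseline_pct, 2)

crt_baseline = 0.0                        # CRT coverage on the Prefetcher Core
best = delta_pp(98.84, crt_baseline)      # strongest model in the quoted range
worst = delta_pp(7.93, crt_baseline)      # weakest model in the quoted range
print(best, worst)  # prints: 98.84 7.93
```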

What To Try In 7 Days

Run LLM4DV on one simple DUT to compare against your CRT baseline.

Test 2–3 LLMs and measure coverage and messages-to-converge.

Add a few-shot example and, if the DUT is small enough to fit, include its source in the prompt to see whether coverage improves.
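
One way to tabulate such a comparison is sketched below. It assumes you log cumulative coverage after every LLM message; the traces shown are placeholders that exist only to exercise the code, not measured results, and `messages_to_converge` and `summarize` are hypothetical helper names.

```python
def messages_to_converge(trace, target):
    """1-based index of the first message whose cumulative coverage
    reaches `target`, or None if the run never gets there."""
    for i, cov in enumerate(trace, start=1):
        if cov >= target:
            return i
    return None

def summarize(runs, target):
    """Rank runs by final coverage, breaking ties by fewer messages."""
    rows = [
        (name, trace[-1], messages_to_converge(trace, target))
        for name, trace in runs.items()
    ]
    return sorted(
        rows,
        key=lambda r: (-r[1], r[2] if r[2] is not None else float("inf")),
    )

# Placeholder traces (cumulative coverage fraction per message); NOT real data.
runs = {
    "model_a": [0.2, 0.5, 0.9, 1.0],
    "model_b": [0.1, 0.1, 0.3, 0.3],
    "crt_baseline": [0.05, 0.05, 0.1, 0.1],
}

for row in summarize(runs, target=0.9):
    print(row)
```

Ranking on final coverage first and message count second reflects the priority in verification: reaching the bins matters more than how quickly they were reached.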

Agent Features

Memory
best-iterative-message buffer (keeps successful past messages)
Planning
dialogue scheduling, missed-bin sampling
Tool Use
drives simulator via testbench (no external retrieval)
Frameworks
LLM4DV
Is Agentic
Yes

Architectures
chat-style LLM interaction

Optimization Features

Token Efficiency
message buffer and best-iterative-message sampling to fit context window
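
A bounded buffer of this kind might look like the sketch below. The selection rule is an assumption: it ranks past messages by how many new bins they hit and keeps the top few, which may not match the exact rule LLM4DV uses. The class name `BestMessageBuffer` is hypothetical.

```python
import heapq

class BestMessageBuffer:
    """Keep only the `capacity` past messages that hit the most new bins."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []   # min-heap of (new_bins_hit, insertion_order, message)
        self._count = 0   # insertion counter; also breaks ties between entries

    def add(self, message, new_bins_hit):
        entry = (new_bins_hit, self._count, message)
        self._count += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        else:
            # Push the new entry, then drop whichever entry is weakest.
            heapq.heappushpop(self._heap, entry)

    def context(self):
        """Retained messages in their original chronological order."""
        return [msg for _, _, msg in sorted(self._heap, key=lambda e: e[1])]

buf = BestMessageBuffer(capacity=2)
buf.add("stimuli A", new_bins_hit=1)
buf.add("stimuli B", new_bins_hit=5)
buf.add("stimuli C", new_bins_hit=0)
buf.add("stimuli D", new_bins_hit=3)
print(buf.context())  # prints: ['stimuli B', 'stimuli D']
```

Capping the buffer by message count is the simplest proxy for a token budget; a real setup would more likely cap by token count.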

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Data URLs

repositories for DUTs listed in references (links in paper)

Risks & Boundaries

Limitations

Evaluations cover eight DUTs only; results may not generalize to larger or industrial designs.

Context-window limits mean the DUT source code cannot always be included in the prompt.

When Not To Use

When formal proof-of-correctness is required instead of empirical coverage.

When the DUT source is too large to include and no good prompt workaround exists.

Failure Modes

LLM repeats past mistakes and stalls (mitigated by dialogue restart).

Hallucinated or invalid stimuli if prompts are unclear.
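
Both failure modes can be guarded against in the harness. The sketch below is illustrative only: it assumes stimuli are integers within a known legal range and that the harness records how many new bins each message hit. The names `parse_stimuli` and `should_restart` and the `patience` threshold are hypothetical.

```python
import re

def parse_stimuli(reply, lo, hi):
    """Extract integers from an LLM reply and drop any outside [lo, hi]."""
    values = [int(tok) for tok in re.findall(r"-?\d+", reply)]
    return [v for v in values if lo <= v <= hi]

def should_restart(new_bins_per_message, patience):
    """True if the last `patience` messages each hit no new bins.
    Assumes patience is a positive integer."""
    recent = new_bins_per_message[-patience:]
    return len(recent) == patience and all(n == 0 for n in recent)

# Out-of-range values (300, -1) are filtered before reaching the testbench.
print(parse_stimuli("Try [3, 300, -1, 42] next.", lo=0, hi=255))  # prints: [3, 42]

# Three unproductive messages in a row trigger a fresh dialogue.
print(should_restart([4, 2, 0, 0, 0], patience=3))  # prints: True
```

Filtering before simulation keeps malformed replies from wasting simulator time, and the stall check gives a simple trigger for starting a fresh dialogue.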

Core Entities

Models

gpt-3.5-turbo, llama-2-70b-chat, claude-3-sonnet, codellama-70b-instruct, llama-3-70b-instruct, claude-3.5-sonnet

Metrics

coverage rate, effective message count, average message count

Benchmarks

LLM4DV