Use LLMs to auto-generate hardware test inputs and recover coverage that random testing misses

Overview

Decision Snapshot: Needs Validation

Results are promising on several designs under test (DUTs), but they vary widely by model and prompt; more DUTs and open datasets are needed before the findings can be generalized.

Citations: 9

Evidence Strength: 0.60

Confidence: 0.75

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 45%

Production readiness: 55%

Novelty: 65%

Authors

Zixi Zhang, Balint Szekely, Pedro Gimenes, Greg Chadwick, Hugo McNally, Jianyi Cheng, Robert Mullins, Yiren Zhao

Links

Abstract / PDF / Data

Why It Matters For Business

LLM-driven stimulus generation can cut manual effort in hardware verification and replace inefficient random testing for many components, but it needs prompt tuning and careful model selection.

Who Should Care

CTO, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

This paper presents LLM4DV, an open benchmarking framework that uses prompted large language models (LLMs) to generate test stimuli for hardware design verification. The authors test six LLMs on eight hardware modules and introduce six prompting improvements. On the evaluated modules, LLM-based generation matched or outperformed naive constrained-random testing and reached 89.7%–100% coverage on many targets when prompts were optimized. Results vary widely by model and DUT; prompt design and including the DUT source code in the prompt help most.

Problem Statement

Hardware verification needs many targeted test inputs (stimuli). Creating those stimuli takes most of a chip project's engineering time and expertise. Random testing misses hard-to-hit states. The paper asks: can LLMs reduce human effort by reasoning about coverage plans and proposing targeted stimuli?

Main Contribution

LLM4DV: an open framework that orchestrates LLMs to produce hardware test stimuli and measure coverage.

Six practical prompting enhancements (e.g., missed-bin sampling, best-iterative-message sampling, dialogue restarting, few-shot examples, and including the DUT code in the prompt).
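As a rough illustration of the missed-bin sampling idea named above, the sketch below picks a small random subset of not-yet-hit coverage bins to mention in the next prompt, so the prompt stays short. The function names, the bin labels, and the prompt wording are all hypothetical; the paper's actual implementation may differ.

```python
import random


def sample_missed_bins(all_bins, hit_bins, k, rng=None):
    """Return up to k coverage bins that have not been hit yet.

    Hypothetical helper: bins are plain string labels here.
    """
    rng = rng or random.Random()
    hit = set(hit_bins)
    missed = [b for b in all_bins if b not in hit]  # keep coverage-plan order
    if len(missed) <= k:
        return missed
    return rng.sample(missed, k)


def build_prompt(missed_sample):
    """Format the sampled bins as an instruction (wording is an assumption)."""
    lines = "\n".join(f"- {b}" for b in missed_sample)
    return (
        "The following coverage bins are still unhit:\n"
        f"{lines}\n"
        "Propose stimuli that hit them."
    )
```

Sampling only a few missed bins per message is one plausible way to keep prompts within the context window when a coverage plan has thousands of bins.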

Key Findings

For several modules, LLMs reached full (100%) coverage on the evaluated coverage plans.

Numbers: 100% coverage on Asynchronous FIFO and AMPLE Weight Bank (Table III)

Practical Use: For straightforward components, try LLM-driven stimulus generation first; it can hit all tracked bins and save manual test design.

Evidence Ref: Table III

Performance varies widely by model on harder modules.

Numbers: Primitive Data Prefetcher Core coverage ranged 7.93%–98.84% across models (Table III)

Practical Use: Expect large model-to-model variation; validate multiple LLMs and pick the one that performs best for your DUT.

Evidence Ref: Table III

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Coverage range on evaluated DUTs	≈89.7%–100% on many DUTs with optimized prompts	naive constrained-random testing (CRT)	CRT failed on hard DUTs where LLMs reached high coverage	8 DUTs, 3883 coverage bins total	Abstract; Table III	Abstract; Table III
Primitive Data Prefetcher Core coverage (example)	7.93%–98.84% (model dependent)	CRT 0%	up to +98.84 percentage points vs CRT	Primitive Data Prefetcher Core (one DUT)	Table III	Table III

What To Try In 7 Days

Run LLM4DV on one simple DUT to compare against your CRT baseline.

Test 2–3 LLMs and measure coverage and messages-to-converge.

Add a few-shot example and, if small, include the DUT source in the prompt to see gains.
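To compare models in a trial like the one outlined above, two simple numbers are usually enough: the percentage of coverage bins hit and how many LLM messages it took to get there. The sketch below is a minimal version of both; the function names and the exact definition of "messages to converge" are assumptions for illustration, not the paper's metric definitions.

```python
def coverage_rate(hit_bins, total_bins):
    """Percentage of coverage bins hit (duplicates counted once)."""
    if total_bins <= 0:
        raise ValueError("total_bins must be positive")
    return 100.0 * len(set(hit_bins)) / total_bins


def messages_to_converge(hits_per_message, total_bins, target_pct=100.0):
    """1-based index of the message at which cumulative coverage first
    reaches target_pct, or None if it never does.

    hits_per_message: list of iterables, one per LLM message, each holding
    the bins hit by the stimuli from that message.
    """
    seen = set()
    for index, hits in enumerate(hits_per_message, start=1):
        seen.update(hits)
        if 100.0 * len(seen) / total_bins >= target_pct:
            return index
    return None
```

Running both functions on the per-message logs of each candidate model gives a like-for-like comparison against a constrained-random baseline measured on the same coverage plan.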

Agent Features

Memory

best-iterative-message buffer (keeps successful past messages)
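One way to picture such a buffer is a fixed-size store that keeps only the messages that hit the most new coverage bins. The sketch below uses a min-heap for this; the class name, the scoring rule (number of newly hit bins), and the eviction policy are assumptions, not details confirmed by the source.

```python
import heapq
import itertools


class BestMessageBuffer:
    """Keep the `capacity` most productive past messages (hypothetical design)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._heap = []  # min-heap of (new_hits, insertion_order, message)
        self._order = itertools.count()  # tie-breaker so messages are never compared

    def add(self, message, new_hits):
        """Record a message; messages that hit no new bins are ignored."""
        if new_hits <= 0:
            return
        entry = (new_hits, next(self._order), message)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        else:
            # Push the new entry, then drop whichever entry is least productive.
            heapq.heappushpop(self._heap, entry)

    def best(self):
        """Return the kept messages, most productive first."""
        return [message for _, _, message in sorted(self._heap, reverse=True)]
```

The kept messages could then be replayed as context in later prompts, which is how a buffer of this kind would help fit useful history into a limited context window.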

Planning

dialogue scheduling, missed-bin sampling

Tool Use

drives simulator via testbench (no external retrieval)
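The closed loop implied here (the model proposes stimuli, the testbench simulates them, and coverage feedback shapes the next prompt) can be sketched in a few lines. Everything below is a toy stand-in: `generate` plays the role of the LLM, `simulate` plays the role of the testbench and simulator, and the bin names are invented for illustration.

```python
from typing import Callable, Iterable, List, Set


def run_verification_loop(
    generate: Callable[[List[str]], Iterable[int]],
    simulate: Callable[[int], Set[str]],
    all_bins: List[str],
    max_messages: int = 10,
) -> Set[str]:
    """Ask for stimuli, simulate them, and feed missed bins back until done."""
    plan = set(all_bins)
    hit: Set[str] = set()
    for _ in range(max_messages):
        missed = [b for b in all_bins if b not in hit]
        if not missed:
            break  # every bin in the coverage plan has been hit
        for stimulus in generate(missed):
            hit |= simulate(stimulus) & plan  # ignore bins outside the plan
    return hit


def toy_simulate(value: int) -> Set[str]:
    """Toy DUT: reports 'even' or 'odd', plus 'big' for values >= 100."""
    bins = {"even" if value % 2 == 0 else "odd"}
    if value >= 100:
        bins.add("big")
    return bins


def toy_generate(missed: List[str]) -> List[int]:
    """Toy 'LLM': maps each missed bin name to a value expected to hit it."""
    table = {"even": 2, "odd": 3, "big": 100}
    return [table[b] for b in missed if b in table]
```

In a real setup `generate` would wrap a chat-model call and a parser for its reply, and `simulate` would drive the actual testbench; both are left abstract here on purpose.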

Frameworks

LLM4DV

Is Agentic

Yes

Architectures

chat-style LLM interaction

Optimization Features

Token Efficiency

message buffer and best-iterative-message sampling to fit context window

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Data URLs

repositories for DUTs listed in references (links in paper)

Risks & Boundaries

Limitations

Evaluations cover eight DUTs only; results may not generalize to larger or industrial designs.

Context-window limits mean the DUT source code cannot always be included in the prompt.

When Not To Use

When formal proof-of-correctness is required instead of empirical coverage.

When the DUT source is too large to include and no good prompt workaround exists.

Failure Modes

LLM repeats past mistakes and stalls (mitigated by dialogue restart).

Hallucinated or invalid stimuli if prompts are unclear.
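The stalling failure mode and its dialogue-restart mitigation can be illustrated with a simple patience counter: if several consecutive messages hit no new coverage bins, the conversation is reset. The class name and the `patience` threshold below are hypothetical; the source does not state the exact restart criterion.

```python
class StallDetector:
    """Signal a dialogue restart after `patience` unproductive messages in a row."""

    def __init__(self, patience):
        if patience < 1:
            raise ValueError("patience must be at least 1")
        self.patience = patience
        self._idle = 0  # consecutive messages that hit no new bins

    def update(self, new_hits):
        """Record one message's result; return True when a restart is due."""
        if new_hits > 0:
            self._idle = 0
            return False
        self._idle += 1
        if self._idle >= self.patience:
            self._idle = 0  # counter starts fresh after the restart
            return True
        return False
```

For example, with `patience=2` the detector fires on the second consecutive message that adds no coverage and then starts counting again from zero.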

Core Entities

Models

gpt-3.5-turbo, llama-2-70b-chat, claude-3-sonnet, codellama-70b-instruct, llama-3-70b-instruct, claude-3.5-sonnet

Metrics

coverage rate, effective message count, average message count

Benchmarks

LLM4DV
