Use LLMs to auto-generate hardware test inputs and recover coverage that random testing misses

October 6, 2023 · 7 min

Overview

Production Readiness

0.55

Novelty Score

0.65

Cost Impact Score

0.45

Citation Count

9

Authors

Zixi Zhang, Balint Szekely, Pedro Gimenes, Greg Chadwick, Hugo McNally, Jianyi Cheng, Robert Mullins, Yiren Zhao

Why It Matters For Business

LLM-driven stimulus generation can cut manual effort in hardware verification and replace inefficient random testing for many components, but it needs prompt tuning and careful model selection.

Summary TLDR

This paper presents LLM4DV, an open benchmarking framework that uses prompted large language models (LLMs) to generate test stimuli for hardware design verification. The authors test six LLMs on eight hardware modules, each acting as the device under test (DUT), and introduce six prompting improvements. On the evaluated modules, LLM-based generation matched or outperformed naive constrained-random testing (CRT) and reached 89.7%–100% coverage on many targets when prompts were optimized. Results vary widely by model and DUT; prompt design and supplying the DUT code help most.

Problem Statement

Hardware verification needs many targeted test inputs (stimuli). Creating those stimuli takes most of a chip project's engineering time and expertise. Random testing misses hard-to-hit states. The paper asks: can LLMs reduce human effort by reasoning about coverage plans and proposing targeted stimuli?

Main Contribution

LLM4DV: an open framework that orchestrates LLMs to produce hardware test stimuli and measure coverage.

Six practical prompting enhancements (e.g., missed-bin sampling, best-iterative-message sampling, dialogue restarting, few-shot examples, and including the DUT code in the prompt).

Evaluation of six commercial/open LLMs across eight device-under-test (DUT) modules with defined coverage plans and metrics.

Open-sourcing of the framework, prompts, and the DUT modules to let others reproduce and extend the benchmark.
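The generate, simulate, and measure cycle behind these contributions can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not LLM4DV's actual code: `toy_dut_bins`, `stub_generator`, and `run_loop` are hypothetical names, the "DUT" is a toy that maps an 8-bit value to one of 16 coverage bins, and the stub stands in for a real LLM call.

```python
import random
from typing import Callable, List, Set, Tuple

NUM_BINS = 16  # toy coverage plan: one bin per high nibble of an 8-bit input


def toy_dut_bins(stimulus: int) -> Set[int]:
    """Hypothetical stand-in for a simulator run: returns the bins a stimulus hits."""
    return {(stimulus & 0xFF) >> 4}


def stub_generator(missed: List[int], rng: random.Random) -> List[int]:
    """Stand-in for an LLM call: proposes values aimed at up to four uncovered bins."""
    return [(b << 4) | rng.randrange(16) for b in missed[:4]]


def run_loop(generate: Callable[[List[int], random.Random], List[int]],
             max_messages: int = 20, seed: int = 0) -> Tuple[float, int]:
    rng = random.Random(seed)
    covered: Set[int] = set()
    messages = 0
    while messages < max_messages and len(covered) < NUM_BINS:
        missed = sorted(set(range(NUM_BINS)) - covered)
        stimuli = generate(missed, rng)   # one "message" to the generator
        messages += 1
        for s in stimuli:                 # drive the toy DUT and collect coverage
            covered |= toy_dut_bins(s)
    return len(covered) / NUM_BINS, messages


coverage, messages = run_loop(stub_generator)
print(f"coverage={coverage:.0%} after {messages} messages")
```

In a real setup the stub would be replaced by a prompt-and-parse step against a chat model, and `toy_dut_bins` by a testbench that drives the simulator and reads back the coverage monitor.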

Key Findings

For several modules, LLMs reached full (100%) coverage on the evaluated coverage plans.

Numbers: 100% coverage on Asynchronous FIFO and AMPLE Weight Bank (Table III)

Performance varies widely by model on harder modules.

Numbers: Primitive Data Prefetcher Core coverage ranged from 7.93% to 98.84% across models (Table III)

Prompting and extra context materially improve results.

Numbers: best runs are often marked with few-shot (*) or DUT-code (†) in Table III

LLM4DV outperformed naive constrained-random testing (CRT) on evaluated hard cases.

Numbers: CRT reached 0% on Primitive Data Prefetcher Core, versus up to 98.84% for LLMs (Table III)

Results

Coverage range on evaluated DUTs

Value: ≈89.7%–100% on many DUTs with optimized prompts

Baseline: naive constrained-random testing (CRT)

Primitive Data Prefetcher Core coverage (example)

Value: 7.93%–98.84% (model dependent)

Baseline: CRT 0%

Convergence (messages to reach top coverage)

Value: effective message counts vary; some top runs use 1–36 effective messages

Baseline: faster convergence (fewer messages) is preferred

Who Should Care

What To Try In 7 Days

Run LLM4DV on one simple DUT to compare against your CRT baseline.

Test 2–3 LLMs and measure coverage and messages-to-converge.

Add a few-shot example and, if small, include the DUT source in the prompt to see gains.
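A comparison harness for these steps might look like the sketch below. Everything in it is an assumption for illustration: the 16-bit toy coverage model, the `MAGIC` value guarding a rare bin, and `targeted_generator`, which stands in for an LLM that has seen the DUT source. It records the two numbers suggested above, final coverage and the message at which coverage last improved.

```python
import random
from typing import Callable, Dict, List, Set

MAGIC = 0xBEEF  # hypothetical rare value guarding a hard-to-hit bin
ALL_BINS = {"low_half", "high_half", "magic"}


def bins_hit(value: int) -> Set[str]:
    """Toy coverage model for a 16-bit input (stand-in for a simulator run)."""
    hit = {"low_half" if value < 0x8000 else "high_half"}
    if value == MAGIC:
        hit.add("magic")
    return hit


def crt_generator(missed: Set[str], rng: random.Random) -> List[int]:
    """Naive constrained-random baseline: ignores coverage feedback."""
    return [rng.randrange(1 << 16) for _ in range(4)]


def targeted_generator(missed: Set[str], rng: random.Random) -> List[int]:
    """Stand-in for an LLM that has read the DUT code and knows MAGIC."""
    table = {"low_half": 0x0001, "high_half": 0x8001, "magic": MAGIC}
    return [table[b] for b in sorted(missed)]


def evaluate(generate: Callable, budget: int = 50, seed: int = 1) -> Dict[str, float]:
    rng = random.Random(seed)
    covered: Set[str] = set()
    last_gain = 0  # index of the last message that added coverage
    for msg in range(1, budget + 1):
        before = len(covered)
        for v in generate(ALL_BINS - covered, rng):
            covered |= bins_hit(v)
        if len(covered) > before:
            last_gain = msg
        if covered == ALL_BINS:
            break
    return {"coverage": len(covered) / len(ALL_BINS),
            "messages_to_converge": last_gain}


results = {name: evaluate(fn) for name, fn in
           [("crt", crt_generator), ("targeted", targeted_generator)]}
for name, r in results.items():
    print(name, r)
```

The random baseline has roughly a 1-in-65,536 chance per value of hitting the `magic` bin, which is the kind of hard-to-reach state where coverage feedback and DUT context are expected to pay off.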

Agent Features

Memory

  • best-iterative-message buffer (keeps successful past messages)

Planning

  • dialogue scheduling
  • missed-bin sampling
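Missed-bin sampling can be illustrated with a short sketch: rather than listing every unreached bin in each prompt, only a subset is shown. The uniform random choice, the function names, and the prompt wording below are assumptions for illustration; the paper's actual sampling policy and templates may differ.

```python
import random
from typing import List, Sequence, Set


def sample_missed_bins(all_bins: Sequence[str], covered: Set[str],
                       k: int, rng: random.Random) -> List[str]:
    """Pick at most k uncovered bins to mention in the next prompt."""
    missed = [b for b in all_bins if b not in covered]
    if len(missed) <= k:
        return missed
    return rng.sample(missed, k)  # assumption: uniform sampling without replacement


def build_prompt(missed_sample: List[str]) -> str:
    """Hypothetical prompt wording, not the paper's template."""
    lines = "\n".join(f"- {b}" for b in missed_sample)
    return ("The following coverage bins are still unreached:\n"
            f"{lines}\nPropose inputs that hit them.")


bins = [f"bin_{i}" for i in range(10)]
covered = {"bin_0", "bin_1", "bin_2"}
chosen = sample_missed_bins(bins, covered, k=3, rng=random.Random(0))
prompt = build_prompt(chosen)
print(prompt)
```

Showing a small, changing subset keeps prompts short and nudges the model toward different targets on each message.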

Tool Use

  • drives simulator via testbench (no external retrieval)

Frameworks

  • LLM4DV

Is Agentic

true

Architectures

  • chat-style LLM interaction

Optimization Features

Token Efficiency

  • message buffer and best-iterative-message sampling to fit context window
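One plausible reading of this buffer, sketched under assumptions: keep the most recent exchanges, then spend any remaining token budget on the older exchanges that uncovered the most new bins. The `Message` type, the four-characters-per-token estimate, and the selection policy are all hypothetical and may not match how LLM4DV implements it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    text: str
    new_bins: int  # how many new coverage bins this exchange produced


def rough_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); an assumption, not a tokenizer."""
    return max(1, len(text) // 4)


def select_context(history: List[Message], budget: int,
                   keep_recent: int = 2) -> List[Message]:
    """Keep recent messages first, then the highest-yield older ones,
    and return the selection in chronological order."""
    recent_idx = list(range(max(0, len(history) - keep_recent), len(history)))
    older_idx = sorted(range(len(history) - len(recent_idx)),
                       key=lambda i: history[i].new_bins, reverse=True)
    chosen, used = [], 0
    for i in recent_idx[::-1] + older_idx:
        cost = rough_tokens(history[i].text)
        if used + cost <= budget:
            chosen.append(i)
            used += cost
    return [history[i] for i in sorted(chosen)]


history = [
    Message("a" * 40, 5), Message("b" * 40, 0), Message("c" * 40, 3),
    Message("d" * 40, 1), Message("e" * 40, 0),
]
kept = select_context(history, budget=30)
print([m.text[0] for m in kept])
```

With a real model, `rough_tokens` would be replaced by the provider's tokenizer so the budget matches the actual context window.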

Reproducibility

Data Urls

  • repositories for DUTs listed in references (links in paper)

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Evaluations cover eight DUTs only; results may not generalize to larger or industrial designs.
  • Context window limits prevent always including DUT source code.
  • Large variance across LLMs means outcomes depend on model choice.
  • Trials limited to 700 messages; longer interactions not evaluated.

When Not To Use

  • When formal proof-of-correctness is required instead of empirical coverage.
  • When the DUT source is too large to include and no good prompt workaround exists.
  • For safety-critical systems until broader benchmarks validate reliability.

Failure Modes

  • LLM repeats past mistakes and stalls (mitigated by dialogue restart).
  • Hallucinated or invalid stimuli if prompts are unclear.
  • Failure to hit deeply nested or rare coverage bins despite many messages.
  • High monetary cost if many API calls are needed to converge.
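The stall-and-restart mitigation in the first failure mode could be triggered by a simple counter, as in this sketch. The `patience` threshold and the class and function names are assumptions; the source does not state how LLM4DV decides when to restart a dialogue.

```python
from typing import List


class StallDetector:
    """Signals a restart after `patience` consecutive messages with no new bins."""

    def __init__(self, patience: int = 3) -> None:
        self.patience = patience
        self.idle = 0

    def update(self, new_bins: int) -> bool:
        self.idle = 0 if new_bins > 0 else self.idle + 1
        if self.idle >= self.patience:
            self.idle = 0
            return True   # caller should clear the chat history and re-prompt
        return False


def restart_points(gains: List[int], patience: int = 3) -> List[int]:
    """Return the 1-based message indices at which a restart would trigger."""
    det = StallDetector(patience)
    return [i for i, g in enumerate(gains, start=1) if det.update(g)]


# Per-message new-bin counts from a hypothetical run.
gains = [4, 2, 0, 0, 0, 1, 0, 0, 0, 0]
print(restart_points(gains))
```

A lower `patience` restarts sooner and may save API calls on a stuck dialogue, at the risk of discarding context that was about to become useful.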

Core Entities

Models

  • gpt-3.5-turbo
  • llama-2-70b-chat
  • claude-3-sonnet
  • codellama-70b-instruct
  • llama-3-70b-instruct
  • claude-3.5-sonnet

Metrics

  • coverage rate
  • effective message count
  • average message count
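The three metrics can be computed along the following lines. The definitions here are assumptions based on how this summary uses the terms: coverage rate as hit bins over planned bins, effective message count as the first message at which the final coverage is reached, and average message count as a mean over trials. The paper's exact definitions may differ.

```python
from typing import List, Set


def coverage_rate(covered: Set[str], plan: Set[str]) -> float:
    """Fraction of bins in the coverage plan that were hit."""
    return len(covered & plan) / len(plan)


def effective_message_count(cumulative: List[int]) -> int:
    """1-based index of the first message at which the final coverage is
    reached; 0 if no bins were ever hit. Expects a non-decreasing list of
    cumulative covered-bin counts, one entry per message."""
    if not cumulative or cumulative[-1] == 0:
        return 0
    return cumulative.index(cumulative[-1]) + 1


def average_message_count(per_trial: List[int]) -> float:
    """Mean number of messages across trials (assumed definition)."""
    return sum(per_trial) / len(per_trial)


rate = coverage_rate({"a", "b"}, {"a", "b", "c", "d"})
effective = effective_message_count([3, 5, 5, 8, 8, 8])
average = average_message_count([4, 6, 8])
print(rate, effective, average)
```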

Benchmarks

  • LLM4DV