Overview
Production Readiness
0.55
Novelty Score
0.65
Cost Impact Score
0.45
Citation Count
9
Why It Matters For Business
LLM-driven stimulus generation can cut manual effort in hardware verification and replace inefficient random testing for many components, but it needs prompt tuning and careful model selection.
Summary TLDR
This paper presents LLM4DV, an open benchmarking framework that uses prompted large language models (LLMs) to generate test stimuli for hardware design verification. The authors test six LLMs on eight hardware modules and introduce six prompting improvements. On the evaluated modules, LLM-based generation matched or outperformed naive constrained-random testing and reached 89.7%–100% coverage on many targets when prompts were optimized. Results vary widely by model and by device under test (DUT); careful prompt design and supplying the DUT source code help most.
Problem Statement
Hardware verification needs many targeted test inputs (stimuli). Creating those stimuli takes most of a chip project's engineering time and expertise. Random testing misses hard-to-hit states. The paper asks: can LLMs reduce human effort by reasoning about coverage plans and proposing targeted stimuli?
Main Contribution
LLM4DV: an open framework that orchestrates LLMs to produce hardware test stimuli and measure coverage.
Six practical prompting enhancements (e.g., missed-bin sampling, best-iterative-message sampling, dialogue restarting, few-shot examples, and including the DUT code in the prompt).
Evaluation of six commercial/open LLMs across eight device-under-test (DUT) modules with defined coverage plans and metrics.
Open-sourcing of the framework, prompts, and the DUT modules to let others reproduce and extend the benchmark.
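As an illustration of missed-bin sampling, here is a minimal sketch, assuming coverage bins are identified by plain strings and that only a small random subset of unreached bins is shown to the model per message. The function names, bin names, and prompt wording are hypothetical and are not taken from the LLM4DV code.

```python
import random

def sample_missed_bins(all_bins, covered, k, rng):
    """Return up to k bins that are not yet covered, chosen uniformly at random."""
    missed = sorted(set(all_bins) - set(covered))
    return rng.sample(missed, min(k, len(missed)))

def build_prompt(missed_sample):
    """Format the sampled bins into a short request for new stimuli (wording is illustrative)."""
    lines = ["The following coverage bins are still unreached:"]
    lines += [f"- {name}" for name in missed_sample]
    lines.append("Propose a list of integer stimuli that would hit them.")
    return "\n".join(lines)

# Hypothetical coverage plan with ten bins, three of them already hit.
all_bins = [f"bin_{i}" for i in range(10)]
covered = {"bin_0", "bin_1", "bin_2"}

picked = sample_missed_bins(all_bins, covered, k=3, rng=random.Random(0))
prompt = build_prompt(picked)
print(prompt)
```

Sampling a subset rather than listing every missed bin keeps each message short, which matters when the coverage plan is large relative to the context window.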
Key Findings
For several modules, LLMs reached full (100%) coverage on the evaluated coverage plans.
Performance varies widely by model on harder modules.
Prompting and extra context materially improve results.
LLM4DV outperformed naive constrained-random testing (CRT) on the hard cases evaluated.
Results
Coverage range on evaluated DUTs
Primitive Data Prefetcher Core coverage (example)
Convergence (messages to reach top coverage)
Who Should Care
What To Try In 7 Days
Run LLM4DV on one simple DUT to compare against your CRT baseline.
Test 2–3 LLMs and measure coverage and messages-to-converge.
Add a few-shot example and, if small, include the DUT source in the prompt to see gains.
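To make the CRT comparison concrete, the sketch below measures coverage for a random stimulus stream and a hand-directed one on a toy coverage model. Everything here is an assumption for illustration: the 8-bit input, the 18 bins, and the function names stand in for a real DUT and testbench, and the directed list stands in for stimuli an LLM might propose.

```python
import random

TOTAL_BINS = 16 + 2  # 16 value ranges plus two corner bins (toy coverage plan)

def bins_hit(stimulus):
    """Toy coverage model: one bin per 16-value range of an 8-bit input, plus min/max corners."""
    hits = {f"range_{stimulus // 16}"}
    if stimulus == 0:
        hits.add("corner_min")
    if stimulus == 255:
        hits.add("corner_max")
    return hits

def coverage_of(stimuli):
    """Fraction of the coverage plan reached by a sequence of stimuli."""
    covered = set()
    for s in stimuli:
        covered |= bins_hit(s)
    return len(covered) / TOTAL_BINS

rng = random.Random(1)
crt_stimuli = [rng.randrange(256) for _ in range(40)]     # constrained-random baseline
directed_stimuli = [16 * i for i in range(16)] + [255]    # targeted stimuli, 17 values

crt_cov = coverage_of(crt_stimuli)
directed_cov = coverage_of(directed_stimuli)
print(f"CRT: {crt_cov:.1%}  directed: {directed_cov:.1%}")
```

In a real trial the same `coverage_of` bookkeeping would be fed by the simulator's coverage monitor, so both approaches are scored against an identical plan.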
Agent Features
Memory
- best-iterative-message buffer (keeps successful past messages)
Planning
- dialogue scheduling
- missed-bin sampling
Tool Use
- drives simulator via testbench (no external retrieval)
Frameworks
- LLM4DV
Is Agentic
true
Architectures
- chat-style LLM interaction
Optimization Features
Token Efficiency
- message buffer and best-iterative-message sampling to fit the context window
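A bounded buffer of the most productive past messages might look like the following sketch. It assumes each message is scored by the number of new coverage bins it hit; the class name, scoring rule, and eviction policy are hypothetical and may differ from what LLM4DV does.

```python
import heapq
from itertools import count

class BestMessageBuffer:
    """Keep at most `capacity` messages, preferring those that hit the most new bins."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []          # min-heap of (new_bins, arrival_order, message)
        self._order = count()    # tie-breaker so messages are never compared directly

    def add(self, message, new_bins):
        item = (new_bins, next(self._order), message)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif new_bins > self._heap[0][0]:
            # Evict the weakest stored message only for a strictly better one.
            heapq.heapreplace(self._heap, item)

    def messages(self):
        """Return the kept messages in their original chronological order."""
        return [m for _, _, m in sorted(self._heap, key=lambda it: it[1])]

buf = BestMessageBuffer(capacity=2)
for text, gain in [("a", 1), ("b", 5), ("c", 3), ("d", 0)]:
    buf.add(text, gain)
print(buf.messages())
```

Capping the buffer bounds how much dialogue history is replayed into each prompt, so the token cost per call stays roughly constant as the trial grows.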
Reproducibility
Data Urls
- repositories for DUTs listed in references (links in paper)
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Evaluations cover eight DUTs only; results may not generalize to larger or industrial designs.
- Context window limits prevent always including DUT source code.
- Large variance across LLMs means outcomes depend on model choice.
- Trials limited to 700 messages; longer interactions not evaluated.
When Not To Use
- When formal proof-of-correctness is required instead of empirical coverage.
- When the DUT source is too large to include and no good prompt workaround exists.
- For safety-critical systems until broader benchmarks validate reliability.
Failure Modes
- LLM repeats past mistakes and stalls (mitigated by dialogue restart).
- Hallucinated or invalid stimuli if prompts are unclear.
- Failure to hit deeply nested or rare coverage bins despite many messages.
- High monetary cost if many API calls are needed to converge.
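The stall failure mode and its dialogue-restart mitigation can be sketched as a simple patience counter. This assumes a restart is triggered after a fixed number of consecutive responses that hit no new bins; the `patience` threshold and the class itself are illustrative, not the paper's stated rule.

```python
class StallDetector:
    """Signal a dialogue restart after `patience` consecutive responses with no new bins."""

    def __init__(self, patience):
        self.patience = patience
        self.idle = 0  # consecutive unproductive responses seen so far

    def update(self, new_bins):
        """Record one response; return True when the dialogue should be restarted."""
        self.idle = 0 if new_bins > 0 else self.idle + 1
        if self.idle >= self.patience:
            self.idle = 0  # a fresh dialogue starts with a clean counter
            return True
        return False

detector = StallDetector(patience=3)
gains = [2, 0, 0, 1, 0, 0, 0, 4]  # new bins hit by each successive response (made-up data)
restarts = [i for i, g in enumerate(gains) if detector.update(g)]
print(restarts)
```

A cap on total messages alongside this check also bounds API spend when a trial never converges.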
Core Entities
Models
- gpt-3.5-turbo
- llama-2-70b-chat
- claude-3-sonnet
- codellama-70b-instruct
- llama-3-70b-instruct
- claude-3.5-sonnet
Metrics
- coverage rate
- effective message count
- average message count
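For orientation, the sketch below computes a coverage rate and a messages-to-peak count from a per-message coverage trace. The definitions are assumptions: coverage rate is taken as covered bins over total bins, and messages-to-peak as the first message at which the trial's best coverage is reached. The paper's exact definitions of effective and average message count may differ.

```python
def coverage_rate(covered_bins, total_bins):
    """Fraction of bins in the coverage plan that were hit."""
    return len(covered_bins) / total_bins

def messages_to_peak(coverage_trace):
    """1-based index of the first message at which the trial's maximum coverage is reached."""
    peak = max(coverage_trace)
    return coverage_trace.index(peak) + 1

def average_messages(traces):
    """Mean messages-to-peak across several trials."""
    return sum(messages_to_peak(t) for t in traces) / len(traces)

# Made-up traces: cumulative coverage after each message in two trials.
trial_a = [0.2, 0.5, 0.5, 0.9, 0.9, 0.9]
trial_b = [0.1, 0.9, 0.9]

rate = coverage_rate({"bin_0", "bin_3"}, total_bins=8)
peak_a = messages_to_peak(trial_a)
avg = average_messages([trial_a, trial_b])
print(rate, peak_a, avg)
```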
Benchmarks
- LLM4DV

