Use formal EDA feedback inside a multi-agent controller to improve Verilog generation without expensive fine-tuning.

September 24, 2025 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

0

Authors

Amulya Bhattaram, Janani Ramamoorthy, Ranit Gupta, Diana Marculescu, Dimitrios Stamoulis

Why It Matters For Business

You can improve RTL code generation accuracy and tune for hardware metrics without expensive model re-training. That cuts data and compute cost and lets engineering teams iterate quickly with existing LLMs.

Summary TLDR

VeriMaAS is a multi-agent controller that adaptively composes prompting operators (CoT, ReAct, Self-Refine, Debate, etc.) and uses synthesis/verification logs (Yosys, OpenSTA) as feedback. On standard RTL benchmarks it raises pass@k accuracy versus single-agent prompting and some fine-tuned baselines, while needing only a few hundred examples to tune the controller instead of tens of thousands for full fine-tuning.

Problem Statement

HDL/RTL code generation suffers from scarce public data and high fine-tuning costs. Existing approaches either need large amounts of supervision (fine-tuned models) or incur high inference cost (single-agent prompting). The paper asks: can automated multi-agent workflows that read EDA tool feedback find good Verilog without heavy fine-tuning?

Main Contribution

VeriMaAS: a cascading multi-agent controller that adaptively selects prompting operators and uses synthesis/verification logs to guide generation.

A lightweight tuning procedure for the controller that needs only a few hundred examples to set per-stage thresholds (versus tens of thousands for full fine-tuning).

Demonstration on VerilogEval and VeriThoughts benchmarks showing improved pass@k and optional PPA-aware optimization (area/power/delay) using the same controller framework.
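The cascading idea above can be illustrated with a minimal sketch. It assumes each stage wraps one prompting operator, samples several candidates, and escalates to the next stage when the fraction of candidates failing the EDA checks exceeds a per-stage threshold. The names `Stage`, `run_cascade`, and the checker callback are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical types: a generator maps a prompt to candidate Verilog strings,
# and a checker returns True when a candidate passes the EDA checks.
Generator = Callable[[str], List[str]]
Checker = Callable[[str], bool]


@dataclass
class Stage:
    name: str            # operator label, e.g. "CoT" or "Self-Refine"
    generate: Generator  # stand-in for an LLM call using that operator
    threshold: float     # escalate when the failure fraction exceeds this


def run_cascade(prompt: str, stages: List[Stage], check: Checker) -> Tuple[str, List[str]]:
    """Return (name of the stage where the cascade stopped, passing candidates)."""
    passing: List[str] = []
    last = ""
    for stage in stages:
        last = stage.name
        candidates = stage.generate(prompt)
        passing = [c for c in candidates if check(c)]
        # An empty candidate list counts as total failure.
        fail_frac = 1.0 - len(passing) / max(len(candidates), 1)
        if fail_frac <= stage.threshold:
            break  # enough candidates passed; skip the costlier operators
    return last, passing
```

In this sketch the cheaper operators go first, so the extra tokens of later stages are spent only on prompts where the early candidates mostly fail.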

Key Findings

VeriMaAS increases top-1 syntactic/functional accuracy (pass@1) on evaluated RTL benchmarks.

Numbers: Qwen2.5-7B on VeriThoughts, pass@1 44.90 -> 56.62 (+11.72) (Table 1).
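For readers reproducing pass@k figures like these, the sketch below computes the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), from n samples per task of which c pass. This summary does not state which estimator the paper uses, so treat that choice as an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples drawn, c of them passed."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list, k: int) -> float:
    """Average pass@k over tasks; results holds (n, c) pairs, one per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark score is then the mean over tasks, usually reported as a percentage.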

Controller tuning needs only a few hundred examples instead of the tens of thousands required for fine-tuning.

Numbers: Controller tuned using 500 sampled VeriThoughts datapoints; described as 'a few hundred' and 'order-of-magnitude' less.

The controller can be re-optimized for hardware metrics (PPA) and yield large area/delay reductions, with trade-offs.

Numbers: PPA-aware tuning reduced area by up to 28.79% on evaluated subsets (Table 3).

Results

VeriThoughts pass@1

Value: Qwen2.5-7B + VeriMaAS = 56.62 (baseline 44.90) -> +11.72

Baseline: Qwen2.5-7B, 44.90

VeriThoughts pass@1

Value: GPT 4o-mini + VeriMaAS = 83.09 (baseline 80.64) -> +2.45

Baseline: GPT 4o-mini, 80.64

PPA area reduction

Value: Up to 28.79% area reduction on selected tasks after PPA-aware tuning

Baseline: Standard VeriMaAS without PPA objective

Who Should Care

What To Try In 7 Days

Run a VeriMaAS prototype on a small internal set (≈500 tasks) and tune the per-stage thresholds.

Plug Yosys and OpenSTA into the generator loop to collect synthesis and timing logs as feedback.

Test PPA-aware tuning on a few high-value kernels to see area/delay trade-offs before broad rollout.
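As a starting point for the Yosys step, the sketch below runs a synthesis pass on one generated file and condenses the log into short feedback text for the next prompt. It covers Yosys only; OpenSTA would need its own wrapper. The helper names are hypothetical, the `ERROR`/`Warning` line filter is an assumption about the log format, and file paths containing spaces are not handled.

```python
import re
import subprocess
from typing import List, Tuple


def yosys_command(verilog_path: str, top: str) -> List[str]:
    """Build a Yosys invocation that reads the file, synthesizes it, and prints stats."""
    script = f"read_verilog {verilog_path}; synth -top {top}; stat"
    return ["yosys", "-p", script]


def summarize_log(log: str, max_lines: int = 20) -> str:
    """Keep only lines that look like errors or warnings (assumed log format)."""
    keep = [ln.strip() for ln in log.splitlines()
            if re.search(r"\b(ERROR|Warning)\b", ln)]
    return "\n".join(keep[:max_lines])


def run_yosys(verilog_path: str, top: str, timeout_s: int = 60) -> Tuple[bool, str]:
    """Run Yosys (must be on PATH) and return (succeeded, condensed feedback)."""
    proc = subprocess.run(
        yosys_command(verilog_path, top),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return proc.returncode == 0, summarize_log(proc.stdout + proc.stderr)
```

The boolean can serve as the pass/fail signal for a candidate, and the condensed text can be appended to the next prompt when a refinement operator is used.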

Agent Features

Planning

  • stage-wise cascade planning

Tool Use

  • Yosys
  • OpenSTA
  • Skywater PDK

Frameworks

  • VeriMaAS

Is Agentic

true

Architectures

  • cascading controller
  • multi-agent operator sequences

Collaboration

  • multi-agent coordination via controller

Optimization Features

Token Efficiency

  • moderate token overhead vs single CoT; lower than iterative Self-Refine in many cases

System Optimization

  • PPA-aware controller objective to reduce area/delay

Training Optimization

  • threshold tuning using a few hundred examples

Inference Optimization

  • adaptive stopping based on failure percentage
  • controller trades tokens vs utility (λ=1e-3)
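One way to read the token-versus-utility trade-off is as a penalized score, utility minus λ times tokens, maximized over candidate thresholds on a small tuning set. The sketch below does this for a two-stage cascade with a simple grid search. The exact objective, the record fields, and the function names are assumptions for illustration; only λ = 1e-3 comes from the summary above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

LAMBDA = 1e-3  # token penalty weight quoted above


@dataclass
class Record:
    """One tuning example with outcomes logged for both stages (hypothetical schema)."""
    fail_frac: float   # fraction of stage-1 candidates that failed EDA checks
    cheap_pass: float  # 1.0 if stopping after stage 1 yields a correct design
    cheap_tokens: int  # tokens spent when stopping after stage 1
    full_pass: float   # 1.0 if escalating to stage 2 yields a correct design
    full_tokens: int   # total tokens spent when escalating


def score(records: List[Record], threshold: float, lam: float = LAMBDA) -> float:
    """Mean of (utility - lam * tokens) when escalating iff fail_frac > threshold."""
    if not records:
        raise ValueError("records must be non-empty")
    total = 0.0
    for r in records:
        if r.fail_frac <= threshold:
            total += r.cheap_pass - lam * r.cheap_tokens
        else:
            total += r.full_pass - lam * r.full_tokens
    return total / len(records)


def tune_threshold(records: List[Record], grid: Iterable[float]) -> Tuple[float, float]:
    """Grid search; returns (best threshold, its score), first one wins on ties."""
    return max(((t, score(records, t)) for t in grid), key=lambda p: p[1])
```

Because only one scalar per stage is searched, a few hundred logged examples can be enough to pick a threshold, which matches the low tuning cost described here.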

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Benchmarks and PPA gains are limited to the evaluated datasets and the Skywater 130nm flow.
  • PPA-aware tuning can trade accuracy for area/power on some tasks.
  • Some Verilog tasks (simple gates) have little room for PPA gains, so optimization impact varies by task.

When Not To Use

  • If you already have a heavily fine-tuned RTL model and cannot afford any extra inference tokens.
  • When commercial PDKs or proprietary EDA tools are required and not available to integrate.
  • For extremely small-scale tasks where controller overhead outweighs gains (e.g., trivial gate tasks).

Failure Modes

  • Controller may add token cost and latency compared to a single prompt chain.
  • PPA optimization can increase power or slightly reduce pass@10 for some benchmarks.
  • Controller thresholds tuned on one dataset may not transfer perfectly; retuning may be required.

Core Entities

Models

  • GPT 4o-mini
  • o4-mini
  • Qwen2.5-7B
  • Qwen2.5-14B
  • Qwen3-8B
  • Qwen3-14B
  • RTLCoder-7B
  • RTLCoder-DeepSeek-7B
  • VeriThoughts-14B
  • DeepSeek-R1-Qwen-14B

Metrics

  • pass@1
  • pass@10
  • tokens per query
  • Area (post-synthesis)
  • Power (static)
  • Delay

Datasets

  • VeriThoughts
  • VerilogEval
  • MetRex (synthesis benchmark)

Benchmarks

  • VeriThoughts
  • VerilogEval
  • MetRex