Use formal EDA feedback inside a multi-agent controller to improve Verilog generation without expensive fine-tuning.

Overview

Decision Snapshot: Needs Validation

The method is a practical system-level recipe: clear engineering value, public prototype, and numeric gains on benchmarks. Results are empirical and benchmark-limited; some trade-offs (tokens, power, pass@10) appear in tables.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Amulya Bhattaram, Janani Ramamoorthy, Ranit Gupta, Diana Marculescu, Dimitrios Stamoulis

Links

Abstract / PDF / Code

Why It Matters For Business

You can improve RTL code generation accuracy and tune for hardware metrics without expensive model re-training. That cuts data and compute cost and lets engineering teams iterate quickly with existing LLMs.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager

Summary TLDR

VeriMaAS is a multi-agent controller that adaptively composes prompting operators (CoT, ReAct, Self-Refine, Debate, etc.) and uses synthesis/verification logs (Yosys, OpenSTA) as feedback. On standard RTL benchmarks it raises pass@k accuracy versus single-agent prompting and some fine-tuned baselines, while needing only a few hundred examples to tune the controller instead of tens of thousands for full fine-tuning.

Problem Statement

HDL/RTL code generation suffers from scarce public data and high fine-tuning costs. Existing approaches, whether single-agent prompting or fine-tuned models, require either large-scale supervision or high inference cost. The paper asks: can automated multi-agent workflows that read EDA tool feedback find good Verilog without heavy fine-tuning?

Main Contribution

VeriMaAS: a cascading multi-agent controller that adaptively selects prompting operators and uses synthesis/verification logs to guide generation.

A lightweight tuning procedure for the controller that needs only a few hundred examples to set per-stage thresholds (versus tens of thousands for full fine-tuning).
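The cascade with per-stage thresholds can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Stage` type, the `generate` and `check` callables, and the escalation rule (move to the next operator when the fraction of failing candidates exceeds the stage threshold) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical checker signature: design text -> (passed, tool log).
Check = Callable[[str], Tuple[bool, str]]


@dataclass
class Stage:
    name: str                                   # label only, e.g. "CoT" or "ReAct"
    generate: Callable[[str, str], List[str]]   # (spec, feedback) -> candidate designs
    threshold: float                            # stop here if failure fraction <= threshold


def run_cascade(
    spec: str, stages: List[Stage], check: Check
) -> Tuple[Optional[str], Optional[str]]:
    """Run stages in order, escalating while too many candidates fail.

    Returns (first passing design found or None, name of the last stage run).
    """
    feedback = ""
    best: Optional[str] = None
    last: Optional[str] = None
    for stage in stages:
        last = stage.name
        candidates = stage.generate(spec, feedback)
        results = [check(c) for c in candidates]
        passing = [c for c, (ok, _) in zip(candidates, results) if ok]
        if best is None and passing:
            best = passing[0]
        fail_frac = 1.0 - len(passing) / len(candidates) if candidates else 1.0
        if fail_frac <= stage.threshold:
            break  # this stage is good enough; do not spend more tokens
        # Pass the failing tool logs to the next, more expensive operator.
        feedback = "\n".join(log for ok, log in results if not ok)
    return best, last
```

Under this reading, tuning the controller amounts to choosing one `threshold` per stage on a small validation set, which is why a few hundred examples can suffice.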

Key Findings

VeriMaAS increases top-1 syntactic/functional accuracy (pass@1) on evaluated RTL benchmarks.

Numbers: Qwen2.5-7B, VeriThoughts pass@1 44.90 -> 56.62 (+11.72) (Table 1).

Practical Use: If you run open LLMs on RTL tasks, wrapping them with VeriMaAS can yield large absolute gains in top-1 correct designs without changing model weights. Try the controller before investing in fine-tuning.

Evidence Ref: Table 1

Controller tuning needs only a few hundred examples instead of tens of thousands required for fine-tuning.

Numbers: Controller tuned using 500 sampled VeriThoughts datapoints; described as 'a few hundred' and 'order-of-magnitude' less.

Practical Use: You can get most of the multi-agent benefit by collecting a small validation set (~500 tasks) and tuning thresholds, saving large compute and data costs.

Evidence Ref: Section 2 (controller tuning)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence Ref
VeriThoughts pass@1 (Qwen2.5-7B + VeriMaAS)	56.62	44.90	+11.72	VeriThoughts	Table 1
VeriThoughts pass@1 (GPT 4o-mini + VeriMaAS)	83.09	80.64	+2.45	VeriThoughts	Table 1
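For readers reproducing pass@k numbers like those above, a common choice is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per task and c of them pass. This summary does not state which estimator the paper uses, so treat the sketch below as an assumption; the `delta` helper simply rechecks the reported gains.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def delta(with_method: float, baseline: float) -> float:
    """Absolute gain in percentage points, rounded to two decimals."""
    return round(with_method - baseline, 2)


# Recheck the deltas reported in the results rows above.
print(delta(56.62, 44.90))  # Qwen2.5-7B
print(delta(83.09, 80.64))  # GPT 4o-mini
```

Benchmark-level pass@k is then the mean of `pass_at_k` over all tasks.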

What To Try In 7 Days

Run the VeriMaAS prototype on a small internal set (≈500 tasks) and tune per-stage thresholds.

Plug Yosys and OpenSTA into the generator loop to collect synthesis logs as feedback.

Test PPA-aware tuning on a few high-value kernels to see area/delay trade-offs before broad rollout.
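Collecting synthesis logs as feedback might look like the sketch below. It assumes `yosys` is on the PATH, that a nonzero exit code signals failure, and that error lines contain the string `ERROR`; the helper names are hypothetical and this is not the prototype's actual integration. The OpenSTA timing step is omitted because it needs a liberty file and a Tcl script specific to your flow.

```python
import subprocess
from typing import List, Tuple


def yosys_command(verilog_path: str, top: str) -> List[str]:
    """Build a Yosys invocation that reads, synthesizes, and reports cell stats."""
    script = f"read_verilog {verilog_path}; synth -top {top}; stat"
    return ["yosys", "-p", script]


def extract_errors(log: str) -> List[str]:
    """Keep only the lines a generator agent most likely needs to see."""
    return [line.strip() for line in log.splitlines() if "ERROR" in line]


def synth_feedback(verilog_path: str, top: str, timeout_s: int = 60) -> Tuple[bool, List[str]]:
    """Run Yosys on one design and return (synthesized_ok, error_lines)."""
    proc = subprocess.run(
        yosys_command(verilog_path, top),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    log = proc.stdout + proc.stderr
    return proc.returncode == 0, extract_errors(log)
```

The error lines returned by `synth_feedback` can be appended to the next prompt so that the following agent sees what the tool rejected.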

Agent Features

Planning

stage-wise cascade planning

Tool Use

Yosys, OpenSTA, Skywater PDK

Frameworks

VeriMaAS

Is Agentic

Yes

Architectures

cascading controller, multi-agent operator sequences

Collaboration

multi-agent coordination via controller

Optimization Features

Token Efficiency

moderate token overhead vs single CoT; lower than iterative Self-Refine in many cases

System Optimization

PPA-aware controller objective to reduce area/delay

Training Optimization

threshold tuning using a few hundred examples

Inference Optimization

adaptive stopping based on failure percentage

controller trades tokens vs utility (λ=1e-3)
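One plausible reading of the token/utility trade-off is a penalized score: accuracy minus λ times the average tokens per query. Only λ = 1e-3 comes from the source; the objective form, the function names, and the numbers in the usage example are assumptions for illustration.

```python
from typing import Dict, Tuple

LAMBDA = 1e-3  # value quoted in the source; the objective form below is assumed


def penalized_utility(accuracy: float, avg_tokens: float, lam: float = LAMBDA) -> float:
    """Accuracy (in percentage points) minus a linear token penalty."""
    return accuracy - lam * avg_tokens


def pick_best(configs: Dict[str, Tuple[float, float]], lam: float = LAMBDA) -> str:
    """Return the config name with the highest penalized utility.

    configs maps a name to (accuracy, average tokens per query).
    """
    return max(configs, key=lambda name: penalized_utility(*configs[name], lam))


# Hypothetical measurements, not taken from the paper.
candidates = {
    "cot_only": (45.0, 800.0),
    "cascade": (56.0, 2500.0),
}
print(pick_best(candidates))
```

With a small λ the more accurate but more expensive configuration wins; raising λ shifts the choice toward cheaper operator sequences.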

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/dstamoulis/maas/tree/verimaas/verithoughts (documentation migration in progress)

Risks & Boundaries

Limitations

Benchmarks and PPA gains are limited to the evaluated datasets and the Skywater 130nm flow.

PPA-aware tuning can trade accuracy for area/power on some tasks.

When Not To Use

If you already have a heavily fine-tuned RTL model and cannot afford any extra inference tokens.

When commercial PDKs or proprietary EDA tools are required and not available to integrate.

Failure Modes

Controller may add token cost and latency compared to a single prompt chain.

PPA optimization can increase power or slightly reduce pass@10 for some benchmarks.

Core Entities

Models

GPT 4o-mini, o4-mini, Qwen2.5-7B, Qwen2.5-14B, Qwen3-8B, Qwen3-14B, RTLCoder-7B, RTLCoder-DeepSeek-7B, VeriThoughts-14B, DeepSeek-R1-Qwen-14B

Metrics

pass@1, pass@10, tokens per query, Area (post-synthesis), Power (static), Delay

Datasets

VeriThoughts, VerilogEval, MetRex (synthesis benchmark)

Benchmarks

VeriThoughts, VerilogEval, MetRex
