ChipExpert: Open-source LLM tuned for integrated-circuit design

July 26, 2024 · 7 min read

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Ning Xu, Zhaoyang Zhang, Lei Qi, Wensuo Wang, Chao Zhang, Zihao Ren, Huaiyuan Zhang, Xin Cheng, Yanqi Zhang, Zhichao Liu, Qingwen Wei, Shiyang Wu, Lanlan Yang, Qianfeng Lu, Yiqun Ma, Mengyao Zhao, Junbo Liu, Yufan Song, Xin Geng, Jun Yang

Links

Abstract / PDF

Why It Matters For Business

ChipExpert provides an open, lower-cost assistant focused on IC design knowledge; it can speed onboarding, reduce expert time for Q&A, and be adapted into internal tools.

Summary TLDR

ChipExpert is an open-source 8B-parameter LLM adapted from Llama-3 and tuned specifically for integrated-circuit (IC) design. The authors built a 4.7B-token IC corpus for continued pretraining, generated >70k domain QA pairs with a multi-agent system (ChipInstruct) for supervised fine-tuning, and aligned the model with Direct Preference Optimization (DPO). They add a RAG layer (top-3 retrieved passages) and release ChatICD-Bench to evaluate IC knowledge. On this benchmark, ChipExpert matches or exceeds GPT-4 on many IC tasks (notably EDA and several advanced subdomains). Model, code, and benchmark are available online.

Problem Statement

General LLMs lack deep, usable IC design knowledge. Students and engineers face high learning costs and limited accessible, accurate domain materials. The paper aims to build and evaluate an open-source LLM tailored to IC design so practitioners get more accurate, domain-aware answers.

Main Contribution

ChipExpert: an open-source IC-design-focused LLM built from Llama-3 8B and released on HuggingFace.

A 4.7B-token IC corpus used for continued pretraining; domain knowledge is repeated 4x in the blend, giving 11.2B effective training tokens.

ChipInstruct: a multi-agent pipeline (GPT-4 agents) that synthesizes and vets >70k domain question-answer pairs for supervised fine-tuning.

Alignment via two-phase Direct Preference Optimization (DPO) with red-teaming and Llama Guard 2 checks.

RAG integration (embedding store + ANN retrieval of top-3 passages) to reduce hallucinations.

ChatICD-Bench: the first IC-design benchmark for foundational and advanced IC questions; released publicly.

Key Findings

ChipExpert beats GPT-4 on foundational EDA questions.

Numbers: ChipExpert 0.93 vs GPT-4 0.87

ChipExpert outperforms GPT-4 in many advanced IC subdomains.

Numbers: Outperforms GPT-4 in 6 of 9 advanced subdomains; +0.28 in compute-in-memory (CIM)

Continued pretraining upweights domain text by repeating it in the training mix.

Numbers: 4.7B original tokens → 11.2B training tokens after repeating domain knowledge 4x

RAG is used to reduce hallucination by attaching retrieved evidence.

Results

Human-eval score on EDA foundational questions

Value: 0.93 (ChipExpert)

Baseline: 0.87 (GPT-4)

Human-eval delta on compute-in-memory (advanced)

Value: +0.28 (ChipExpert over baseline)

Baseline: GPT-4

Advanced subdomain wins vs GPT-4

Value: 6 of 9 subdomains

Baseline: GPT-4

Who Should Care

What To Try In 7 Days

Run the released ChipExpert model on a few internal IC Q&A cases and compare its answers against your experts' answers

Run ChatICD-Bench on the models you are considering, then extend it with prompts representative of your own use cases

Add a RAG layer using your internal docs (embed + ANN, top-3 passages) to improve factuality quickly
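The retrieval step in the last item can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `embed` is a toy bag-of-words stand-in for a real embedding model, the search is exact brute-force cosine similarity where a production system would likely use an ANN index, and the documents, function names, and prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: sparse bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, passages, k=3):
    """Return the k passages most similar to the query (exact search, not ANN)."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages, k=3):
    """Prepend the top-k retrieved passages to the question as evidence."""
    top = retrieve(query, passages, k)
    evidence = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(top))
    return f"Use the passages below to answer.\n{evidence}\n\nQuestion: {query}"

# Hypothetical internal documents, for illustration only.
DOCS = [
    "Static timing analysis checks setup and hold constraints on every path.",
    "Clock tree synthesis balances skew across the clock network.",
    "Setup violations are fixed by reducing path delay or relaxing the clock period.",
    "Design rule checks verify layout geometry against foundry rules.",
    "Hold violations are fixed by inserting delay buffers on short paths.",
]

if __name__ == "__main__":
    print(build_prompt("How do I fix setup violations on a timing path?", DOCS))
```

Swapping `embed` for a real embedding model and `retrieve` for a vector-database query leaves `build_prompt` unchanged, which makes it straightforward to compare answers with and without retrieved evidence.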

Agent Features

Tool Use

  • RAG (vector DB + ANN retrieval)
  • LoRA
  • Flash Attention / GQA for efficiency

Frameworks

  • ChipInstruct
  • ModelLink
  • Marker

Architectures

  • Autoregressive transformer (Llama-3 8B base)
  • Instruction-tuned assistant

Collaboration

  • Multi-agent pipeline for data synthesis (ChipInstruct)

Optimization Features

Token Efficiency

  • Domain repeat strategy (repeat domain knowledge 4x in mix)
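The summary gives 4.7B original tokens, 11.2B training tokens, and a 4x domain repeat, but not how the corpus splits between domain and other text. As a rough sketch, assuming only the domain portion is repeated four times and everything else appears once (an assumption, not something stated in the source), the implied split can be solved for directly. The function name is hypothetical.

```python
def split_corpus(original_b, trained_b, repeat):
    """Solve d + g = original_b and repeat * d + g = trained_b.

    Returns (d, g) in billions of tokens, where d is the repeated (domain)
    portion and g is the portion seen once. Assumes nothing else is repeated.
    """
    domain = (trained_b - original_b) / (repeat - 1)
    other = original_b - domain
    return domain, other

domain_b, other_b = split_corpus(original_b=4.7, trained_b=11.2, repeat=4)
domain_share = 4 * domain_b / 11.2  # fraction of training tokens that are domain text

if __name__ == "__main__":
    print(f"domain ~ {domain_b:.2f}B, other ~ {other_b:.2f}B, "
          f"domain share of training mix ~ {domain_share:.0%}")
```

Under that assumption the figures imply roughly 2.2B domain tokens and 2.5B other tokens; a different mixing scheme would give a different split.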

Model Optimization

  • LoRA
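For readers unfamiliar with LoRA, the idea is to freeze the original weight matrix W and train only a low-rank update, so the layer computes W x + (alpha / r) · B (A x). The sketch below shows that forward pass in plain Python. The rank, scaling factor, and class name are illustrative choices, not the settings used for ChipExpert, which this summary does not give.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, r=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen, shape (d_out, d_in)
        # A gets small random values, B starts at zero, so the update is zero at init.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        """Number of parameters in A and B (W itself is not trained)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

if __name__ == "__main__":
    W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
    layer = LoRALinear(W, r=2)
    print(layer.forward([1.0, 2.0, 3.0]), layer.trainable_params())
```

Because B starts at zero, the adapted layer reproduces the base model exactly before training, and only the A and B entries need gradients and optimizer state.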

System Optimization

  • Trained on 8 Ascend-910B NPUs

Training Optimization

  • Continued pretraining on domain corpus
  • SFT
  • Two-phase DPO alignment (preference tuning)
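The preference-tuning step rests on the standard DPO objective, which for one (chosen, rejected) answer pair is -log σ(β · [(log π(chosen) - log π_ref(chosen)) - (log π(rejected) - log π_ref(rejected))]). The sketch below computes that loss from summed sequence log-probabilities. It covers a single pair only; the value of `beta` is an illustrative default, and the two-phase schedule and data used for ChipExpert are not modeled here.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    # How much more the policy favors each answer than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

if __name__ == "__main__":
    # Policy identical to the reference: loss is log(2).
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy has shifted probability toward the chosen answer: loss drops.
    print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

The loss falls as the policy raises the chosen answer's likelihood relative to the reference and lowers the rejected one's, with `beta` controlling how strongly the policy is allowed to drift from the reference.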

Inference Optimization

  • Flash Attention
  • GQA (Grouped-Query Attention)
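Grouped-query attention saves inference memory by letting several query heads share one key/value head, which shrinks the KV cache. The sketch below shows only the head-to-group mapping and the resulting cache ratio. The function names are hypothetical, and the 32 query heads and 8 key/value heads are example values taken from the published Llama-3 8B configuration, not from this summary.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Map a query head index to the key/value head its group shares."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be a multiple of n_kv_heads")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

def kv_cache_ratio(n_query_heads, n_kv_heads):
    """KV-cache size relative to full multi-head attention with the same head dimension."""
    return n_kv_heads / n_query_heads

if __name__ == "__main__":
    mapping = [kv_head_for(h, 32, 8) for h in range(32)]
    print(mapping)
    print(kv_cache_ratio(32, 8))
```

With these example values, each group of four consecutive query heads reads the same cached keys and values, so the cache holds a quarter of what full multi-head attention would store.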

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Weaker reported performance than GPT-4 in the analog circuit domain
  • Pretraining relies on publicly available texts; may miss proprietary or newest datasets
  • Synthetic QA generation depends on multi-agent correctness and may inject noise
  • Alignment uses model-based checks (Llama Guard 2) and manual review; residual unsafe outputs possible

When Not To Use

  • High-assurance analog circuit design decisions without human verification
  • Tasks requiring diagram/graph interpretation (no multimodal model yet)
  • Proprietary-IP-sensitive contexts without private-data RAG safeguards

Failure Modes

  • Hallucinations when RAG retrieval misses or returns irrelevant passages
  • Overfitting if further fine-tuned without fresh domain data
  • Misclassification of safe/unsafe outputs due to automated safety checks

Core Entities

Models

  • ChipExpert-8B-Instruct (fine-tuned Llama-3 8B)
  • Llama-3 8B (base)

Metrics

  • Human expert rating (0-1)
  • Automatic LLM multi-agent scoring + referee debate

Datasets

  • Custom IC continued-pretraining corpus (4.7B tokens)
  • Supervised QA pairs (>70k)
  • ChatICD-Bench (released)

Benchmarks

  • ChatICD-Bench