Customize RAG for EDA docs: domain-tuned retriever, reranker, generator + ORD-QA benchmark

Overview

Decision Snapshot: Ready For Pilot

The paper reports quantitative gains for the embedding model, the reranker, the generator, and the end-to-end RAG flow on two benchmarks. However, the datasets are small, and several labels were produced by GPT-4, which can bias results.

Citations: 1

Evidence Strength: 0.85

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Yuan Pu, Zhuolun He, Tairu Qiu, Haoyuan Wu, Bei Yu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Specialized RAG reduces wrong answers on complex EDA docs, improving self-serve support and lowering costly human support for tooling documentation.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, CTO, Founder

Summary TLDR

Off-the-shelf RAG systems miss EDA specifics. The authors build RAG-EDA: contrastive-finetuned embeddings, a reranker trained with GPT-4 supervision, and a two-stage domain-finetuned generator. They release ORD-QA (90 QA triplets). On ORD-QA, their embedding raises recall@20 from 0.66 to 0.733, their reranker improves recall@5 from 0.522 to 0.671, and the end-to-end RAG-EDA improves UniEval by ~0.07 absolute vs prior flows. Results are reported on OpenROAD docs and one commercial EDA tool. (All numbers are on the paper's benchmarks.)

Problem Statement

EDA tool documentation is dense and uses narrow terminology. Generic RAG components (embeddings, rerankers, generators) often retrieve weakly-related passages or produce wrong answers because they lack EDA knowledge and fine-grained filtering for similarly-worded but irrelevant docs.

Main Contribution

RAG-EDA: a full RAG pipeline customized for EDA documentation QA (retriever, reranker, generator).

Contrastive finetuning of text embeddings using EDA triplets to improve semantic retrieval.
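As a rough illustration of what contrastive finetuning on triplets optimizes, the sketch below computes an InfoNCE-style loss for one (query, positive, negatives) example using plain Python. The summary does not state the exact loss used by the authors, so the loss form, the `temperature` value, and the toy vectors are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_contrastive_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss for one (query, positive, negatives) example.

    Lower is better: the loss shrinks as the query moves closer to the
    positive passage and further from the negative passages.
    The loss form and temperature are assumptions, not the paper's recipe.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

# Toy 3-d "embeddings" (hypothetical values, for illustration only).
query = [1.0, 0.0, 0.0]
good_positive = [0.9, 0.1, 0.0]
hard_negative = [0.1, 0.9, 0.0]

loss_aligned = triplet_contrastive_loss(query, good_positive, [hard_negative])
loss_swapped = triplet_contrastive_loss(query, hard_negative, [good_positive])
print(f"aligned: {loss_aligned:.4f}  swapped: {loss_swapped:.4f}")
```

In a real finetuning run the vectors would come from the embedding model and the loss would be backpropagated through it; here the point is only that a well-aligned triplet yields a much smaller loss than a mislabeled one.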

Key Findings

Domain-finetuned embedding improves dense retrieval recall.

Numbers: recall@20 of 0.733 (ours) vs 0.66 (bge-large) and 0.634 (text-embedding-ada-002)

Practical Use: Finetune embedding with domain triplets to retrieve more relevant EDA docs before reranking.

Evidence Ref: Table 1

Contrastive-finetuned reranker keeps weakly-related passages out of the top ranks.

Numbers: reranker recall@5 of 0.671 (ours) vs 0.522 (bge-reranker-large) and 0.484 (RRF)

Practical Use: Use reranker trained on domain-labeled positives/negatives to avoid misleading context that causes hallucinations.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
embedding recall@20	0.733	bge-large-en-v1.5 0.66	+0.073	ORD-QA	Our finetuned embedding vs baselines	Table 1
reranker recall@5	0.671	bge-reranker-large 0.522	+0.149	ORD-QA	Contrastive-finetuned reranker improves top-5 recall	Table 2
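For readers who want to reproduce this kind of measurement on their own documents, here is a minimal sketch of recall@k averaged over questions, followed by a recomputation of the two deltas in the table from the stated values. The summary does not spell out the paper's exact recall definition, so the per-question "fraction of relevant chunks found in the top k" form is an assumption, and the chunk IDs are made up.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(ranked_ids[:k])
    hits = sum(1 for doc_id in relevant_ids if doc_id in top_k)
    return hits / len(relevant_ids)

def mean_recall_at_k(runs, k):
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one per question."""
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)

# Hypothetical toy run: two questions with made-up chunk IDs.
runs = [
    (["c3", "c7", "c1", "c9"], {"c1", "c2"}),  # 1 of 2 relevant chunks in top 3
    (["c5", "c4", "c8", "c6"], {"c4"}),        # 1 of 1 relevant chunk in top 3
]
score = mean_recall_at_k(runs, k=3)  # (0.5 + 1.0) / 2

# Deltas from the table, recomputed from the reported values.
embedding_delta = round(0.733 - 0.66, 3)
reranker_delta = round(0.671 - 0.522, 3)
print(score, embedding_delta, reranker_delta)
```

With only 90 questions in ORD-QA, a single question moves a per-question average by roughly 0.011, which is worth keeping in mind when reading deltas of this size.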

What To Try In 7 Days

Collect a small set (hundreds) of domain Q/A triplets and finetune embedding with contrastive sampling.

Swap to hybrid retrieval (BM25 + dense) and add a lightweight reranker trained on a small labeled set.

Pretrain an existing open chat model on a few textbook chunks and instruction-tune on generated QA pairs; measure UniEval/BLEU on a held-out set.
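The hybrid retrieval step above can be prototyped by fusing a BM25 ranking with a dense ranking. The sketch below uses reciprocal rank fusion (RRF), which the paper's results list as a baseline; it is offered as one simple way to merge candidates, not as the authors' pipeline. The document IDs are hypothetical, and `k=60` is the commonly used default rather than a value taken from the paper.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical rankings from a sparse and a dense retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Since the reported recall@5 for RRF (0.484) is well below the trained reranker (0.671), fusion like this is better treated as candidate generation, with a domain-trained reranker ordering the fused list afterwards.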

Optimization Features

Infra Optimization

trained on 16x A100 (40GB)

Model Optimization

LoRA

System Optimization

hybrid retrieval to reduce generator search space

Training Optimization

contrastive finetuning for embedding and reranker

Inference Optimization

4-bit quantization of generator for faster inference
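To make the idea of 4-bit quantization concrete, the sketch below applies symmetric absmax quantization to a handful of weights in plain Python. The summary does not say which quantization scheme or library the authors used, so this particular scheme, the integer range, and the sample weights are assumptions chosen for simplicity.

```python
def quantize_4bit(weights):
    """Symmetric absmax quantization of floats to integer codes in [-7, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # avoid a zero scale
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_4bit(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

# Hypothetical weights, for illustration only.
weights = [0.12, -0.5, 0.33, 0.07, -0.21]
codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, round(scale, 4), round(max_error, 4))
```

Each weight is stored as a small integer plus one shared scale, which is where the memory saving comes from; the cost is a rounding error of at most half the scale per weight.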

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/lesliepy99/RAG-EDA

Data URLs

https://github.com/lesliepy99/RAG-EDA

Risks & Boundaries

Limitations

ORD-QA is small (90 questions) and may not cover all EDA edge cases.

Many training/labeling artifacts rely on GPT-3.5/GPT-4 synthetic examples, which can bias models.

When Not To Use

You cannot collect domain-specific queries or labels for contrastive finetuning.

Your documentation is multimodal (diagrams, netlists) and not captured by text chunks alone.

Failure Modes

Weakly-related document slips into context and causes generator hallucination.

Generator overfits textbook language and misses practical command nuances.

Core Entities

Models

bge-large-en-v1.5, text-embedding-ada-002, bge-reranker-large, Qwen-14b-chat, Qwen1.5-14B-Chat, llama-2-13B-chat, Baichuan2-13B-Chat, GPT-4, Our embedding model, Our reranker, RAG-EDA-generator

Metrics

recall@k, BLEU, ROUGE-L, UniEval

Datasets

ORD-QA (90 QA triplets), embedding contrastive triplets (3,975), generator instruction-tuning (1,732 triplets), textbook chunks for pretrain (4,863 chunks)

Benchmarks

ORD-QA, Commercial EDA tool QA (60 triplets)

