Overview
The paper reports quantitative gains for the embedding model, reranker, generator, and end-to-end RAG pipeline on two benchmarks, but the datasets are small and several labels were produced by GPT-4, which can bias results.
Citations: 1
Evidence Strength: 0.85
Confidence: 0.90
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
Specialized RAG reduces wrong answers on complex EDA docs, improving self-serve support and lowering costly human support for tooling documentation.
Summary TLDR
Off-the-shelf RAG systems miss EDA specifics. The authors build RAG-EDA, which combines a contrastively finetuned embedding model, a reranker trained with GPT-4 supervision, and a two-stage domain-finetuned generator. They also release ORD-QA (90 QA triplets). On ORD-QA, their embedding raises recall@20 from 0.66 to 0.733, their reranker improves recall@5 from 0.522 to 0.671, and the end-to-end RAG-EDA improves UniEval by about 0.07 absolute over prior flows. Results are reported on OpenROAD docs and one commercial EDA tool. (All numbers are on the paper's benchmarks.)
Problem Statement
EDA tool documentation is dense and uses narrow terminology. Generic RAG components (embeddings, rerankers, generators) often retrieve weakly related passages or produce wrong answers because they lack EDA knowledge and cannot filter out similarly worded but irrelevant documents.
Main Contribution
RAG-EDA: a full RAG pipeline customized for EDA documentation QA (retriever, reranker, generator).
Contrastive finetuning of text embeddings using EDA triplets to improve semantic retrieval.
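The contrastive objective on (query, positive, negative) triplets can be illustrated with a margin-based triplet loss. This is a minimal sketch, not the paper's training code: the cosine distance, the `margin` value of 0.2, and the function names are assumptions chosen for illustration, and a real setup would compute this over batches of model embeddings in a deep-learning framework.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(query, positive, negative, margin=0.2):
    """Margin-based triplet loss on cosine distance (illustrative assumption).

    The loss is zero once the positive passage is closer to the query
    than the negative passage by at least `margin`.
    """
    pos_dist = 1.0 - cosine_similarity(query, positive)
    neg_dist = 1.0 - cosine_similarity(query, negative)
    return max(0.0, pos_dist - neg_dist + margin)
```

Minimizing this loss pulls a query embedding toward its relevant passage and pushes it away from a similarly worded but irrelevant one, which is the behavior the finetuned retriever is meant to learn.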
Key Findings
Domain-finetuned embedding improves dense retrieval recall.
Contrastive-finetuned reranker reduces the number of weakly related passages reaching the top ranks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| embedding recall@20 | 0.733 | 0.66 (bge-large-en-v1.5) | +0.073 | ORD-QA | Finetuned embedding vs. baselines | Table 1 |
| reranker recall@5 | 0.671 | 0.522 (bge-reranker-large) | +0.149 | ORD-QA | Contrastive-finetuned reranker improves top-5 recall | Table 2 |
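To make the recall@k rows concrete, the sketch below computes recall@k per question and averages it over a benchmark. It assumes recall@k means the fraction of a question's relevant chunks that appear in the top k retrieved results; the paper may define it differently, and the function names and data layout are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant chunk ids found in the top-k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each question needs at least one relevant chunk")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one per question."""
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)


def absolute_delta(value, baseline):
    """Absolute improvement over a baseline, rounded to three decimals."""
    return round(value - baseline, 3)
```

The Delta column is the plain difference between Value and Baseline: 0.733 - 0.66 = 0.073 for the embedding and 0.671 - 0.522 = 0.149 for the reranker.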
What To Try In 7 Days
Collect a small set (a few hundred) of domain Q/A triplets and finetune an embedding model with contrastive sampling.
Swap to hybrid retrieval (BM25 + dense) and add a lightweight reranker trained on a small labeled set.
Continue pretraining an existing open chat model on a few textbook chunks, then instruction-tune it on generated QA pairs; measure UniEval/BLEU on a held-out set.
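For the hybrid retrieval step, one simple way to merge a BM25 ranking with a dense ranking is reciprocal rank fusion (RRF). This summary does not say how the paper combines the two retrievers, so RRF is a swapped-in technique here, not the authors' method; the function name and the constant `k=60` (a commonly used default) are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document id so the
    output order is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

For example, fusing a BM25 ranking `["a", "b", "c"]` with a dense ranking `["b", "c", "a"]` puts `b` first, since it ranks near the top of both lists. The fused list can then be passed to the reranker, which only needs to score a short candidate set.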
Risks & Boundaries
Limitations
ORD-QA is small (90 questions) and may not cover all EDA edge cases.
Many training/labeling artifacts rely on GPT-3.5/GPT-4 synthetic examples, which can bias models.
When Not To Use
You cannot collect domain-specific queries or labels for contrastive finetuning.
Your documentation is multimodal (diagrams, netlists) and not captured by text chunks alone.
Failure Modes
A weakly related document slips into the context and causes the generator to hallucinate.
The generator overfits to textbook language and misses practical command nuances.

