Ask-EDA: a Slack-ready design chatbot that combines hybrid retrieval and an abbreviation lookup to reduce hallucinations

June 3, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

1

Authors

Luyao Shi, Michael Kazda, Bradley Sears, Nick Shropshire, Ruchir Puri


Why It Matters For Business

A hybrid RAG layer plus a small abbreviation lookup can cut wrong answers and boost recall on internal technical queries, which speeds up engineering work and reduces time spent hunting for docs.

Summary TLDR

Ask-EDA is a domain-specific chat assistant for chip design that pairs an LLM with a hybrid RAG layer (dense + sparse retrieval merged by reciprocal rank fusion) and an abbreviation de-hallucination (ADH) module. Evaluated on three 100-item, in-domain test sets, hybrid RAG improved recall over no-RAG by more than 40% on q2a-100 and more than 60% on cmds-100, and ADH improved recall on abbr-100 by more than 70%. The system runs over Slack and returns sources for user review. Key limits: a small, tailored knowledge base (≈400 MB, IBM-specific), a 249-entry abbreviation dictionary, and remaining LLM recall and hallucination issues.

Problem Statement

Design engineers struggle to find correct, up-to-date technical guidance and command syntax across scattered internal docs and Slack. Off-the-shelf LLMs hallucinate or lack current/institutional knowledge. The goal is a 24/7 assistant that returns accurate, sourced answers and reduces hallucinated abbreviation expansions.

Main Contribution

Built Ask-EDA: a chat assistant for electronic design that combines an LLM, hybrid retrieval (dense + sparse), and abbreviation de-hallucination.

Implemented a hybrid search pipeline using sentence-transformer dense vectors, BM25 sparse index, and reciprocal rank fusion (RRF).

Created three domain test sets (q2a-100, cmds-100, abbr-100) and measured ROUGE-Lsum F1 and Recall to quantify gains.

Integrated with Slack for conversational use and source review; provided a practical system prompt and deployment details.
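The sparse half of the hybrid search pipeline listed above can be illustrated with a minimal Okapi BM25 scorer. This is a sketch under stated assumptions, not the Ask-EDA implementation: the whitespace tokenizer, the `k1` and `b` defaults, and the function name `bm25_scores` are all illustrative choices that the source does not specify.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    Assumptions (not from the source): lowercase whitespace tokenization
    and the commonly used defaults k1=1.5, b=0.75.
    """
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "set clock period for timing analysis",
    "report power usage",
    "clock tree synthesis options",
]
scores = bm25_scores("clock period", docs)
```

Exact term matching of this kind is what lets a sparse index find literal command names that a dense embedder may blur together, which is one plausible reason to pair the two.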

Key Findings

Hybrid RAG substantially increases answer recall versus no retrieval.

Numbers: q2a-100: >40% recall improvement vs no-RAG; cmds-100: >60% recall improvement vs no-RAG

Abbreviation de-hallucination (ADH) greatly reduces wrong expansions.

Numbers: abbr-100: >70% recall improvement with ADH

Model choice affects extraction quality: Granite-13b-chat-v2.1 gives higher F1, while Llama2-13b-chat can reach similar or better recall at lower F1.

Without RAG, the LLMs had zero recall on command lookup (cmds-100).

Numbers: cmds-100: no-RAG Recall = 0

Results

q2a-100 Recall improvement (hybrid vs none)

Value: >40% relative increase

Baseline: no RAG

cmds-100 Recall improvement (hybrid vs none)

Value: >60% increase (best read as an absolute gain, since a relative change is undefined against a zero baseline)

Baseline: no RAG (Recall = 0)

abbr-100 Recall improvement with ADH

Value: >70% relative increase

Baseline: hybrid RAG without ADH

cmds-100 no-RAG recall

Value: 0.0

Baseline: no RAG

Model comparison (F1 vs Recall)

Value: Granite higher F1; Llama2 similar or higher Recall in some cases

Who Should Care

What To Try In 7 Days

Build a small hybrid index (dense + BM25) over your most-used internal docs and test recall on 50 common queries.

Add a curated abbreviation dictionary and inject exact matches into prompts for abbreviation-heavy domains.

Expose retrieval sources in the UI so engineers can verify answers quickly.
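The abbreviation step suggested above could look roughly like the following sketch. It assumes a case-sensitive, whole-token match and a simple prompt prefix; the dictionary entries, the prompt wording, and the function names are hypothetical, since the actual dictionary and prompt template are internal.

```python
import re

# Hypothetical entries for illustration; a real dictionary would be
# curated by subject-matter experts.
ABBREVIATIONS = {
    "LVS": "layout versus schematic",
    "DRC": "design rule check",
    "STA": "static timing analysis",
}


def find_abbreviations(query, dictionary):
    """Return dictionary entries whose key appears as a whole token."""
    found = {}
    for token in re.findall(r"[A-Za-z0-9]+", query):
        # Exact, case-sensitive match avoids expanding ordinary words.
        if token in dictionary and token not in found:
            found[token] = dictionary[token]
    return found


def build_prompt(query, dictionary):
    """Prepend known expansions so the LLM does not have to guess them."""
    found = find_abbreviations(query, dictionary)
    if not found:
        return query
    notes = "\n".join(
        f"{abbr} stands for {meaning}." for abbr, meaning in found.items()
    )
    return f"Known abbreviations:\n{notes}\n\nQuestion: {query}"


prompt = build_prompt("How do I run LVS after DRC?", ABBREVIATIONS)
```

Matching whole tokens rather than substrings keeps a query such as "start the run" from triggering the STA entry.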

Agent Features

Memory

  • short-term chat history included from recent prior questions
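One simple way to keep short-term history of this kind is a fixed-size window over recent turns. The sketch below is an assumption about how such a memory might be built: the window size, the choice to store answers alongside questions, and the `ChatHistory` class itself are hypothetical and not taken from the source.

```python
from collections import deque


class ChatHistory:
    """Keep only the most recent turns for inclusion in the next prompt."""

    def __init__(self, max_turns=3):
        # deque with maxlen silently drops the oldest turn when full.
        self.turns = deque(maxlen=max_turns)

    def add(self, question, answer):
        self.turns.append((question, answer))

    def as_context(self):
        """Render the retained turns as plain text, oldest first."""
        return "\n".join(
            f"User: {q}\nAssistant: {a}" for q, a in self.turns
        )


history = ChatHistory(max_turns=2)
history.add("q1", "a1")
history.add("q2", "a2")
history.add("q3", "a3")
context = history.as_context()
```

Bounding the window keeps the history from crowding retrieved documents out of the context.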

Tool Use

  • Slack API (conversational interface)
  • Source listing for user verification

Frameworks

  • LangChain for ingestion
  • ChromaDB for dense storage

Is Agentic

true

Architectures

  • single-turn LLM with retrieval-augmented context

Collaboration

  • SME-built abbreviation dictionary; feedback collection via Slack (not used in eval)

Optimization Features

Token Efficiency

  • chunking documents to control context length (2048 chunk size, 256 overlap)
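A sliding-window splitter with the stated sizes can be sketched as follows. The source does not say whether 2048 and 256 are characters or tokens; this example assumes characters, and `chunk_text` is a hypothetical helper rather than the LangChain splitter used in the system.

```python
def chunk_text(text, chunk_size=2048, overlap=256):
    """Split text into fixed-size chunks that share `overlap` characters.

    Assumption: sizes are measured in characters, not tokens.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "".join(str(i % 10) for i in range(5000))
chunks = chunk_text(sample)
```

The overlap means a sentence cut at a chunk boundary still appears intact in one of the two neighboring chunks.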

System Optimization

  • reciprocal rank fusion (RRF) to merge dense and sparse results
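Reciprocal rank fusion scores each document by summing 1 / (k + rank) over every ranked list in which it appears. A minimal sketch follows; the constant `k=60` is the value commonly used in the RRF literature and is an assumption here, as the source does not state which value Ask-EDA uses. The document ids are placeholders.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranking.

    Each document receives sum(1 / (k + rank)) over the lists that
    contain it, with ranks starting at 1. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["a", "b", "c"]   # placeholder ids from the dense index
sparse_hits = ["b", "d", "a"]  # placeholder ids from the BM25 index
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
# fused == ["b", "a", "d", "c"]
```

Because only ranks are used, the dense similarity scores and the BM25 scores never need to be put on a common scale.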

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Knowledge base is IBM-specific and ~400MB; results may not generalize to other orgs.
  • Abbreviation dictionary has 249 entries; only ~25% are general industry terms.
  • Evaluations use small 100-item test sets per task; results may be noisy.
  • Some LLMs still fail to recall injected abbreviations or to produce concise final answers.
  • No RLHF or fine-tuning on the ingested design data was applied in this study.

When Not To Use

  • When you need perfect recall on open-ended, up-to-the-minute sources not ingested into the index.
  • When handling highly sensitive or confidential data unless retrieval and access controls are hardened.
  • When you require full public reproducibility, since the datasets and code are internal.

Failure Modes

  • LLM ignores injected abbreviation info and hallucinates expansions despite ADH.
  • Hybrid context overwhelms the LLM, leading to lower F1 even with higher recall.
  • Sparse or dense retrieval misses key command docs if chunking or indexing parameters are suboptimal.

Core Entities

Models

  • Granite-13b-chat-v2.1
  • Llama2-13b-chat
  • all-MiniLM-L6-v2 (embedder)

Metrics

  • ROUGE-Lsum F1
  • Recall
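Both metrics are based on the longest common subsequence (LCS) between a reference answer and the generated answer. The sketch below computes a simplified, single-segment ROUGE-L; the real ROUGE-Lsum variant additionally splits text on newlines and aggregates over sentences, which is omitted here. The tokenization and the example strings are assumptions for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """Simplified ROUGE-L recall, precision, and F1 on whitespace tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    recall = lcs / len(ref) if ref else 0.0
    precision = lcs / len(cand) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


result = rouge_l("set the clock period to ten", "set clock period")
# recall = 3/6, precision = 3/3, f1 = 2/3
```

Recall rewards covering the reference, while F1 also penalizes extra text, so a verbose answer can score high on recall and lower on F1.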

Datasets

  • q2a-100
  • cmds-100
  • abbr-100
  • internal doc corpus (≈400 MB; ~10.2k command pages; ~5k params; 30 Slack channels; 18k Q&A)