bRAGgen: a self-updating RAG system that pulls real-time medical evidence for bariatric surgery Q&A

May 22, 2025 | 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Yash Kumar Atri, Thomas H Shin, Thomas Hartvigsen

Why It Matters For Business

bRAGgen reduces the risk of outdated or inaccurate patient guidance by automatically fetching authoritative evidence and integrating it into the model. Answer quality improves, and most update cycles still complete in 10–20 seconds.

Summary TLDR

The paper presents bRAGgen, an adaptive retrieval-augmented generation (RAG) system for bariatric surgery patient Q&A, and bRAGq, a 1,302-question expert-validated dataset. bRAGgen combines a small semantic cache (Faiss + SentenceTransformer), an MDP-guided web retriever (DuckDuckGo API) that favors authoritative domains ('.gov' and '.edu'), LoRA adaptation on Llama3-8B, and an online learning loop gated by confidence (perplexity threshold τp=4.5, cache similarity threshold τc=0.7). In both expert and LLM-as-judge evaluations, bRAGgen+Llama3-8B scored highest (expert avg 4.51 vs 4.05 for the best baseline; LLM-as-judge 4.44 vs 3.96). Most update cycles complete in 10–20s. Code and data are on GitHub.

Problem Statement

Patients need timely, evidence-based guidance on bariatric care, but LLM knowledge goes stale and static RAG corpora introduce noise. The paper targets an automated system that detects low-confidence answers and integrates fresh, authoritative medical evidence into the model in real time.

Main Contribution

bRAGgen: an adaptive, confidence-aware RAG system that triggers retrieval and parameter adaptation when outputs are uncertain.

bRAGq: a domain-specific dataset of 1,302 bariatric surgery questions validated by a board-certified bariatric surgeon.

A modular architecture combining a semantic cache (Faiss + SentenceTransformer), an MDP-guided web retriever (DuckDuckGo + BM25), LoRA-based parametric edits, online learning, and constrained decoding with BERTScore validation.

Two-phase evaluation: a single-expert human review (105 responses) and large-scale LLM-as-judge scoring (ChatGPT-4o), with high agreement between the two (Spearman ρ=0.94).

Key Findings

bRAGgen with Llama3-8B achieved the highest expert average score across factuality, clinical relevance, and comprehensiveness.

Numbers: Expert avg 4.51 (bRAGgen Llama3-8B) vs 4.05 (best baseline, MedGraphRAG); Δ = +0.46

LLM-as-Judge (ChatGPT-4o) replicated the expert ranking and magnitude of improvements.

Numbers: LLM-as-Judge avg 4.44 (bRAGgen Llama3-8B) vs 3.96 (best baseline); Δ = +0.48; Spearman ρ = 0.94 vs expert

bRAGq mixes PubMedQA-derived and synthetically generated questions.

Numbers: 1,302 questions total = 611 from PubMedQA (201 of those flagged as not representative) + 691 synthetically generated

The self-update pipeline runs quickly enough for interactive use.

Numbers: Most edit/update operations complete within 10–20 seconds

Results

Expert average score (Factuality+Relevance+Comprehensiveness)

Value: 4.51

Baseline: MedGraphRAG 4.05

LLM-as-Judge average score

Value: 4.44

Baseline: Best baseline 3.96

Expert–LLM-as-Judge rank correlation

Value: ρ = 0.94

Pipeline update latency

Value: 10–20 seconds (majority)
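
For readers who want to check an expert vs LLM-as-judge rank correlation on their own scores, here is a minimal sketch of Spearman's ρ in plain Python. The score lists are made-up illustration values, not numbers from the paper, and the helper names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-system average scores (illustration only).
expert_scores = [4.5, 4.0, 3.6, 3.2, 3.9]
judge_scores = [4.4, 3.8, 3.7, 3.0, 3.9]
rho = spearman_rho(expert_scores, judge_scores)
```

With these toy inputs the two raters disagree on one adjacent pair of systems, so ρ comes out just below 1. In practice `scipy.stats.spearmanr` does the same job if SciPy is available.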

Who Should Care

What To Try In 7 Days

Run a small prototype: Llama3-8B + LoRA + a 500-doc semantic cache (Faiss) and test on 50 bRAGq questions.

Implement a perplexity gate (τp=4.5) to trigger retrieval only when answers are uncertain.

Use DuckDuckGo + domain filters (.gov/.edu) and BM25 to fetch top authoritative documents for a few sample queries.
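
The domain filter and BM25 step can be prototyped without any search API. The sketch below assumes search results have already been fetched into dicts with `url` and `text` keys (a hypothetical shape), keeps only '.gov' and '.edu' hosts, and ranks the rest with a plain Okapi BM25 implementation. The `k1` and `b` values are common defaults rather than settings from the paper, and the example URLs are placeholders.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse

AUTHORITATIVE_SUFFIXES = (".gov", ".edu")  # domains named in the summary


def is_authoritative(url):
    """True when the URL's host ends in one of the trusted suffixes."""
    host = urlparse(url).hostname or ""
    return host.endswith(AUTHORITATIVE_SUFFIXES)


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document against the query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = (sum(len(t) for t in tokenized) / n) or 1.0
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank_authoritative(query, results, top_k=3):
    """Drop non-authoritative hosts, then return the top_k results by BM25."""
    kept = [r for r in results if is_authoritative(r["url"])]
    scores = bm25_scores(query, [r["text"] for r in kept])
    ranked = sorted(zip(scores, kept), key=lambda pair: pair[0], reverse=True)
    return [r for _, r in ranked[:top_k]]


# Placeholder results standing in for a real search response.
results = [
    {"url": "https://health.example.gov/b12",
     "text": "Vitamin B12 supplements after gastric bypass surgery"},
    {"url": "https://blog.example.com/tips",
     "text": "Gastric bypass vitamin tips, vitamin, vitamin"},
    {"url": "https://med.example.edu/wellness",
     "text": "Exercise guidance for general wellness"},
]
top = rank_authoritative("vitamin supplements after gastric bypass", results)
```

Filtering before ranking means a keyword-stuffed page on a non-authoritative host never competes with the trusted sources, which is the behavior the summary describes.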

Agent Features

Memory

  • Semantic cache (query-document pairs)
  • Experience buffer with Faiss-based KNN management
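
A rough sketch of a capped semantic cache follows. It uses a brute-force cosine search in plain Python as a stand-in for a Faiss index, and expects embeddings to be supplied by the caller (for example from a SentenceTransformer model). The class name, the oldest-first eviction rule, and the method names are assumptions made for illustration. The summary states only the similarity threshold (τc=0.7) and an example cap of 500 documents.

```python
import math
from collections import OrderedDict


class SemanticCache:
    """Toy query-to-document cache with a cosine-similarity hit threshold."""

    def __init__(self, threshold=0.7, max_size=500):
        self.threshold = threshold
        self.max_size = max_size
        self._entries = OrderedDict()  # key -> (unit vector, document)

    def __len__(self):
        return len(self._entries)

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(v * v for v in vec))
        if norm == 0:
            raise ValueError("embedding must be non-zero")
        return [v / norm for v in vec]

    def add(self, key, embedding, document):
        """Insert or refresh an entry, evicting the oldest when over the cap."""
        self._entries[key] = (self._normalize(embedding), document)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # assumed policy: oldest first

    def lookup(self, embedding):
        """Return (document, similarity) on a hit, (None, best similarity) on a miss."""
        query = self._normalize(embedding)
        best_doc, best_sim = None, -1.0
        for vec, doc in self._entries.values():
            sim = sum(a * b for a, b in zip(query, vec))
            if sim > best_sim:
                best_doc, best_sim = doc, sim
        if best_sim >= self.threshold:
            return best_doc, best_sim
        return None, best_sim


cache = SemanticCache(threshold=0.7, max_size=2)
cache.add("q1", [1.0, 0.0, 0.0], "doc A")
cache.add("q2", [0.0, 1.0, 0.0], "doc B")
hit_doc, hit_sim = cache.lookup([0.9, 0.1, 0.0])    # close to q1: cache hit
miss_doc, miss_sim = cache.lookup([0.0, 0.0, 1.0])  # orthogonal: cache miss
cache.add("q3", [0.0, 0.0, 1.0], "doc C")           # over the cap: q1 is evicted
evicted_doc, _ = cache.lookup([1.0, 0.0, 0.0])
```

A miss is the signal to fall through to web retrieval. Swapping the loop in `lookup` for a Faiss inner-product index over the same unit vectors would keep the interface unchanged.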

Planning

  • Perplexity-based gating to trigger retrieval and adaptation
  • MDP-guided search prioritizing authoritative domains
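
The perplexity gate can be sketched in a few lines, assuming the model exposes per-token log-probabilities (natural log) for its own answer. The function names are hypothetical, and the direction of the test (perplexity above τp means "uncertain, go retrieve") is the usual reading of the description rather than a detail confirmed here.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def needs_retrieval(token_logprobs, tau_p=4.5):
    """Trigger retrieval and adaptation when the answer looks uncertain."""
    return perplexity(token_logprobs) > tau_p


# Toy answers: every token at probability 0.9 versus every token at 0.1.
confident_answer = [math.log(0.9)] * 10
uncertain_answer = [math.log(0.1)] * 10

skip_update = needs_retrieval(confident_answer)  # perplexity ~1.11, below the gate
run_update = needs_retrieval(uncertain_answer)   # perplexity 10, above the gate
```

Logging the gate's decisions against a few hand-labeled answers is a cheap way to check whether τp=4.5 suits a different base model or tokenizer.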

Tool Use

  • DuckDuckGo API
  • PubMed / PMC / NIH as primary sources

Frameworks

  • LoRA

Is Agentic

true

Architectures

  • LoRA
  • Semantic cache (Faiss) + web retriever (MDP-guided)

Optimization Features

Token Efficiency

  • Context selection via cache and BM25 to limit prompt length

Infra Optimization

  • Faiss indexing for fast similarity search

Model Optimization

  • LoRA

System Optimization

  • Cache size cap (e.g., 500 docs) to bound compute

Training Optimization

  • Online learning with regularized cross-entropy and Frobenius-norm regularizer
  • Experience buffer updates via nearest-neighbor diversity
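
One plausible way to write the regularized objective is shown below. The summary names only the ingredients (cross-entropy, a Frobenius-norm regularizer, LoRA), so the exact form, the weight λ, and the choice to penalize the low-rank update are assumptions. Here W_0 is the frozen base weight, BA is the LoRA update with rank r, and x_1..x_T are the tokens of a retrieved passage.

```latex
\Delta W = BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)

\mathcal{L}(A, B) \;=\;
  -\frac{1}{T} \sum_{t=1}^{T} \log p_{W_0 + BA}\!\left(x_t \mid x_{<t}\right)
  \;+\; \lambda \,\lVert BA \rVert_F^{2}
```

The second term keeps each online edit small, which is one way to limit the interference between cumulative edits listed under Limitations.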

Inference Optimization

  • Semantic cache lookup to reduce retrieval latency
  • Constrained decoding to avoid unsafe words
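
A minimal sketch of constrained decoding by logit masking: banned token ids get a score of negative infinity before the next token is picked, so they can never be selected. The toy vocabulary, the banned set, and the greedy selection are illustrative assumptions. The summary does not list the actual unsafe words or the decoding strategy.

```python
import math


def mask_banned(logits, banned_ids):
    """Return a copy of logits with banned token ids set to -inf."""
    return [-math.inf if i in banned_ids else x for i, x in enumerate(logits)]


def greedy_pick(logits, banned_ids):
    """Pick the highest-scoring token id that is not banned."""
    masked = mask_banned(logits, banned_ids)
    best = max(range(len(masked)), key=lambda i: masked[i])
    if masked[best] == -math.inf:
        raise ValueError("every token is banned")
    return best


# Toy vocabulary and next-token scores (illustration only).
vocab = ["stop", "continue", "your", "medication"]
logits = [2.5, 1.0, 0.3, 0.1]
banned = {0}  # hypothetical: forbid "stop" in this context

unconstrained = vocab[greedy_pick(logits, set())]
constrained = vocab[greedy_pick(logits, banned)]
```

Here the unconstrained choice is "stop" and the constrained choice is "continue". Word-level bans over subword tokenizers need multi-token matching, which this single-step sketch does not cover.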

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Single expert reviewer for human evaluation (105 examples) limits clinical generality.
  • Scalability: many cumulative edits may cause interference or capacity saturation.
  • Generalization beyond locally edited facts is unproven.
  • Reliance on web retrieval exposes system to source selection bias and possible conflicting evidence.
  • Regulatory and deployment risks for direct patient use without clinician oversight.

When Not To Use

  • For autonomous high-stakes clinical decisions without clinician supervision.
  • In environments with no internet or restricted web access (system relies on web retrieval).
  • When certified medical device-level validation and regulatory clearance are required.

Failure Modes

  • Retrieval returns conflicting or low-quality documents that increase hallucinations.
  • Cache eviction removes still-relevant documents, causing knowledge gaps.
  • Perplexity gate misfires (false negatives or positives) leading to missed updates or unnecessary edits.
  • Parametric updates overfit to recent documents and degrade earlier correct knowledge.

Core Entities

Models

  • Llama3-8B
  • Phi-3
  • Mistral Instruct
  • ChatGPT-4o
  • RAG²
  • MedGraphRAG

Metrics

  • Factuality
  • Clinical Relevance
  • Comprehensiveness
  • Average score
  • Spearman correlation

Datasets

  • bRAGq
  • PubMedQA

Benchmarks

  • bRAGq (this work)

Context Entities

Models

  • ASMBS guidelines (cited source)

Datasets

  • PubMed / PMC / NIH content (retrieval sources)