bRAGgen: a self-updating RAG system that pulls real-time medical evidence for bariatric surgery Q&A

May 22, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

The system is practical: evaluated with expert review and LLM-as-judge, shows improved scores, and runs edits in tens of seconds; main limits are a single expert reviewer and prototype-scale testing.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Yash Kumar Atri, Thomas H Shin, Thomas Hartvigsen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

bRAGgen reduces the risk of outdated or inaccurate patient guidance by auto-fetching and integrating authoritative evidence, improving answer quality while keeping runtime latency low.

Who Should Care

Summary TLDR

The paper presents bRAGgen, an adaptive retrieval-augmented generation (RAG) system for bariatric surgery patient Q&A, and bRAGq, a 1,302-question expert-validated dataset. bRAGgen combines a small semantic cache (Faiss + SentenceTransformer), an MDP-guided web retriever (DuckDuckGo API) that favors authoritative domains ('.gov' and '.edu'), LoRA adaptation of Llama3-8B, and an online learning loop gated by confidence (perplexity threshold τp=4.5, cache similarity threshold τc=0.7). In both expert and LLM-as-judge evaluations, bRAGgen+Llama3-8B scored highest (expert avg 4.51 vs 4.05 for the best baseline; LLM-as-judge avg 4.44 vs 3.96). Most update cycles complete in 10–20 s. Code and data are on GitHub.
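The confidence gate described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the thresholds τp=4.5 and τc=0.7 come from the summary, but the function names, the action labels, and the order in which the two thresholds are checked are assumptions made for the example.

```python
import math

TAU_P = 4.5  # perplexity threshold reported in the summary
TAU_C = 0.7  # cache similarity threshold reported in the summary


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def choose_action(token_logprobs, best_cache_similarity):
    """Hypothetical routing; how the two thresholds combine is an assumption."""
    if perplexity(token_logprobs) <= TAU_P:
        return "answer"              # model looks confident enough
    if best_cache_similarity >= TAU_C:
        return "use_cache"           # uncertain, but the cache has a close match
    return "retrieve_and_adapt"      # uncertain and cache miss: fetch evidence, update


# Natural-log probabilities of the generated tokens (toy values).
print(choose_action([-0.1, -0.2], best_cache_similarity=0.0))
print(choose_action([-2.0, -2.0], best_cache_similarity=0.9))
print(choose_action([-2.0, -2.0], best_cache_similarity=0.3))
```

In practice the token log-probabilities would come from the language model's own output scores, and the cache similarity from the nearest-neighbor lookup in the semantic cache.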

Problem Statement

Patients need timely, evidence-based bariatric care guidance, but LLMs have stale knowledge and static RAG corpora create noise. The paper targets an automated system that detects low confidence and integrates fresh, authoritative medical evidence into the model in real time.

Main Contribution

bRAGgen: an adaptive, confidence-aware RAG system that triggers retrieval and parameter adaptation when outputs are uncertain.

bRAGq: a domain-specific dataset of 1,302 bariatric surgery questions validated by a board-certified bariatric surgeon.

Key Findings

bRAGgen with Llama3-8B achieved the highest expert average score across factuality, clinical relevance, and comprehensiveness.

Numbers: Expert avg 4.51 (bRAGgen Llama3-8B) vs 4.05 (best baseline MedGraphRAG); Δ+0.46

Practical Use: Use confidence-gated retrieval + LoRA edits to raise clinical answer quality versus static offline RAG.

Evidence Ref: Table 3; Section 7.1

LLM-as-Judge (ChatGPT-4o) replicated the expert ranking and magnitude of improvements.

Numbers: LLM-as-Judge avg 4.44 (bRAGgen Llama3-8B) vs 3.96 (best baseline); Δ+0.48; Spearman ρ=0.94 vs expert

Practical Use: You can use an LLM-as-judge as a scalable proxy for early-stage evaluation to reduce expert time.

Evidence Ref: Table 4; Section 7.3
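To check how well an LLM judge tracks an expert on your own data, Spearman's ρ can be computed as the Pearson correlation of the two rank vectors. The sketch below uses only the standard library; the scores are toy values invented for illustration, not figures from the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / math.sqrt(var_x * var_y)


# Toy per-system average scores (illustrative only).
expert_scores = [4.5, 4.0, 3.2, 3.8, 2.9]
judge_scores = [4.4, 3.9, 3.0, 3.5, 3.1]
print(round(spearman_rho(expert_scores, judge_scores), 2))  # 0.9
```

A high ρ only says the two raters order the systems similarly; it does not show that their absolute scores agree.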

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Expert average score (Factuality+Relevance+Comprehensiveness) | 4.51 | MedGraphRAG 4.05 | +0.46 | Expert review (105 responses) on bRAGq | Table 3; Section 7.1 | Table 3
LLM-as-Judge average score | 4.44 | Best baseline 3.96 | +0.48 | ChatGPT-4o judging on bRAGq | Table 4; Section 7.2 | Table 4

What To Try In 7 Days

Run a small prototype: Llama3-8B + LoRA + a 500-doc semantic cache (Faiss) and test on 50 bRAGq questions.

Implement a perplexity gate (τp=4.5) to trigger retrieval only when answers are uncertain.

Use DuckDuckGo + domain filters (.gov/.edu) and BM25 to fetch top authoritative documents for a few sample queries.
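The domain-filter and BM25 step from the list above might look like the sketch below. It assumes the search results have already been fetched as (url, text) pairs, so no search API is called; the function names, the whitespace tokenizer, the BM25 parameters (k1=1.5, b=0.75), and the example URLs are all illustrative choices, not details taken from the paper.

```python
import math
from collections import Counter
from urllib.parse import urlparse

AUTHORITATIVE_SUFFIXES = (".gov", ".edu")


def is_authoritative(url):
    """True if the URL's hostname ends in .gov or .edu."""
    host = (urlparse(url).hostname or "").lower()
    return host.endswith(AUTHORITATIVE_SUFFIXES)


def tokenize(text):
    # Naive whitespace tokenizer; an assumption made for this sketch.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document in docs for the query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    if n == 0:
        return []
    avgdl = sum(len(toks) for toks in tokenized) / n
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank_authoritative(query, results, top_k=3):
    """Keep .gov/.edu results, then return their URLs ordered by BM25 score."""
    kept = [(url, text) for url, text in results if is_authoritative(url)]
    scores = bm25_scores(query, [text for _, text in kept])
    ranked = sorted(zip(kept, scores), key=lambda pair: pair[1], reverse=True)
    return [url for (url, _), _ in ranked[:top_k]]


# Hypothetical search results for illustration.
results = [
    ("https://www.example.gov/a", "protein intake after gastric sleeve surgery"),
    ("https://blog.example.com/b", "protein intake after gastric sleeve surgery"),
    ("https://med.example.edu/c", "parking information for visitors"),
]
print(rank_authoritative("protein after sleeve", results))
```

Filtering on the parsed hostname rather than on the raw URL string avoids matching paths such as `example.com/report.gov`.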

Agent Features

Memory
Semantic cache (query-document pairs); Experience buffer with Faiss-based KNN management
Planning
Perplexity-based gating to trigger retrieval and adaptation; MDP-guided search prioritizing authoritative domains
Tool Use
DuckDuckGo API; PubMed / PMC / NIH as primary sources
Frameworks
LoRA
Is Agentic

Yes

Architectures
LoRA; Semantic cache (Faiss) + web retriever (MDP-guided)

Optimization Features

Token Efficiency
Context selection via cache and BM25 to limit prompt length
Infra Optimization
Faiss indexing for fast similarity search
Model Optimization
LoRA
System Optimization
Cache size cap (example 500 docs) to bound compute
Training Optimization
Online learning with regularized cross-entropy and Frobenius-norm regularizer; Experience buffer updates via nearest-neighbor diversity
Inference Optimization
Semantic cache lookup to reduce retrieval latency; Constrained decoding to avoid unsafe words
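A size-capped semantic cache with a similarity threshold can be sketched in plain Python as below. This stands in for the Faiss index and SentenceTransformer embeddings named in the summary: embeddings are passed in as plain lists, lookup is a linear scan, and eviction is least-recently-used. The LRU policy and the class and method names are assumptions for this example; the summary does not specify the eviction rule.

```python
import math
from collections import OrderedDict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class SemanticCache:
    """Bounded cache keyed by embedding similarity (hypothetical design)."""

    def __init__(self, capacity=500, threshold=0.7):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.threshold = threshold
        self._items = OrderedDict()  # key -> (embedding, document)

    def __len__(self):
        return len(self._items)

    def add(self, key, embedding, document):
        if key in self._items:
            self._items.pop(key)             # re-adding refreshes recency
        elif len(self._items) >= self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry
        self._items[key] = (embedding, document)

    def lookup(self, query_embedding):
        """Return the closest document if it clears the threshold, else None."""
        best_key, best_sim = None, -1.0
        for key, (embedding, _) in self._items.items():
            sim = cosine(query_embedding, embedding)
            if sim > best_sim:
                best_key, best_sim = key, sim
        if best_key is None or best_sim < self.threshold:
            return None
        self._items.move_to_end(best_key)    # mark the hit as recently used
        return self._items[best_key][1]


cache = SemanticCache(capacity=2, threshold=0.7)
cache.add("a", [1.0, 0.0], "doc a")
cache.add("b", [0.0, 1.0], "doc b")
print(cache.lookup([0.9, 0.1]))   # close to "a": cache hit
print(cache.lookup([-1.0, 0.0]))  # nothing similar enough: None
```

A linear scan is adequate at the 500-document scale mentioned above; an approximate-nearest-neighbor index becomes worthwhile only for much larger caches.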

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Single expert reviewer for human evaluation (105 examples) limits clinical generality.

Scalability: many cumulative edits may cause interference or capacity saturation.

When Not To Use

For autonomous high-stakes clinical decisions without clinician supervision.

In environments with no internet or restricted web access (system relies on web retrieval).

Failure Modes

Retrieval returns conflicting or low-quality documents that increase hallucinations.

Cache eviction removes still-relevant documents, causing knowledge gaps.

Core Entities

Models

Llama3-8B, Phi-3, Mistral Instruct, ChatGPT-4o, RAG 2, MedGraphRAG

Metrics

Factuality, Clinical Relevance, Comprehensiveness, Average score, Spearman correlation

Datasets

bRAGq, PubMedQA

Benchmarks

bRAGq (this work)

Context Entities

Models

ASMBS guidelines (cited source)

Datasets

PubMed / PMC / NIH content (retrieval sources)