bRAGgen: a self-updating RAG system that pulls real-time medical evidence for bariatric surgery Q&A

Overview

Decision Snapshot: Needs Validation

The system is practical: it was evaluated with expert review and an LLM-as-judge, scores higher than the baselines on both, and completes most update cycles in 10–20 s. The main limits are a single expert reviewer and prototype-scale testing.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Yash Kumar Atri, Thomas H Shin, Thomas Hartvigsen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

bRAGgen reduces the risk of outdated or inaccurate patient guidance by auto-fetching and integrating authoritative evidence, improving answer quality while keeping runtime latency low.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Founder

Summary TLDR

The paper presents bRAGgen, an adaptive retrieval-augmented generation (RAG) system for bariatric surgery patient Q&A, and bRAGq, a 1,302-question expert-validated dataset. bRAGgen combines a small semantic cache (Faiss + SentenceTransformer), an MDP-guided web retriever (DuckDuckGo API) that favors authoritative domains ('.gov' and '.edu'), LoRA adaptation on Llama3-8B, and an online learning loop gated by confidence (perplexity threshold τp=4.5, cache similarity τc=0.7). In expert and LLM-as-judge tests, bRAGgen with Llama3-8B scored best (expert avg 4.51 vs best baseline 4.05; LLM-as-judge 4.44 vs 3.96). Most update cycles complete in 10–20 s. Code and data are on GitHub.
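The confidence gate mentioned above can be illustrated with a minimal sketch. It assumes that perplexity is computed as the exponential of the mean negative log-likelihood over the generated tokens (natural log) and that a value above τp triggers retrieval; the function names and the sample log-probabilities are hypothetical, not taken from the paper's code.

```python
import math

TAU_P = 4.5  # perplexity threshold reported in the summary


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def needs_retrieval(token_logprobs, tau_p=TAU_P):
    """Assumed gate direction: perplexity above tau_p means 'uncertain'."""
    return perplexity(token_logprobs) > tau_p


# Hypothetical per-token natural-log probabilities for two answers.
confident_answer = [-0.2, -0.1, -0.3, -0.25]
uncertain_answer = [-2.0, -1.8, -1.5, -2.2]

for name, logprobs in [("confident", confident_answer),
                       ("uncertain", uncertain_answer)]:
    print(name, round(perplexity(logprobs), 2), needs_retrieval(logprobs))
```

In a real deployment the log-probabilities would come from the language model's own scores for the tokens it generated; only the thresholding step is shown here.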

Problem Statement

Patients need timely, evidence-based bariatric care guidance, but LLMs have stale knowledge and static RAG corpora create noise. The paper targets an automated system that detects low confidence and integrates fresh, authoritative medical evidence into the model in real time.

Main Contribution

bRAGgen: an adaptive, confidence-aware RAG system that triggers retrieval and parameter adaptation when outputs are uncertain.

bRAGq: a domain-specific dataset of 1,302 bariatric surgery questions validated by a board-certified bariatric surgeon.

Key Findings

bRAGgen with Llama3-8B achieved the highest expert average score across factuality, clinical relevance, and comprehensiveness.

Numbers: Expert avg 4.51 (bRAGgen Llama3-8B) vs 4.05 (best baseline MedGraphRAG); Δ+0.46

Practical Use: Use confidence-gated retrieval + LoRA edits to raise clinical answer quality versus static offline RAG.

Evidence Ref: Table 3; Section 7.1

LLM-as-Judge (ChatGPT-4o) replicated the expert ranking and magnitude of improvements.

Numbers: LLM-as-Judge avg 4.44 (bRAGgen Llama3-8B) vs 3.96 (best baseline); Δ+0.48; Spearman ρ=0.94 vs expert

Practical Use: You can use an LLM-as-judge as a scalable proxy for early-stage evaluation to reduce expert time.

Evidence Ref: Table 4; Section 7.3
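Before relying on an LLM-as-judge as a proxy, it helps to check how well its ranking agrees with expert scores on a shared sample. The sketch below computes Spearman's ρ in plain Python (average ranks for ties, then Pearson correlation on the ranks). The score lists are hypothetical placeholders, not values from the paper.

```python
import math


def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Hypothetical per-system average scores from an expert and an LLM judge.
expert_scores = [4.5, 4.0, 3.6, 3.9, 3.2]
judge_scores = [4.4, 3.7, 3.8, 3.9, 3.0]
print(round(spearman_rho(expert_scores, judge_scores), 3))
```

A high ρ on a small expert-scored sample is a reason to trust the judge for ranking systems, not evidence that its absolute scores are calibrated.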

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Expert average score (Factuality+Relevance+Comprehensiveness)	4.51	MedGraphRAG 4.05	+0.46	Expert review (105 responses) on bRAGq	Table 3; Section 7.1	Table 3
LLM-as-Judge average score	4.44	Best baseline 3.96	+0.48	ChatGPT-4o judging on bRAGq	Table 4; Section 7.2	Table 4

What To Try In 7 Days

Run a small prototype: Llama3-8B + LoRA + a 500-doc semantic cache (Faiss) and test on 50 bRAGq questions.

Implement a perplexity gate (τp=4.5) to trigger retrieval only when answers are uncertain.

Use DuckDuckGo + domain filters (.gov/.edu) and BM25 to fetch top authoritative documents for a few sample queries.
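The retrieval step in the last item can be prototyped without any search dependency. The sketch below assumes search results are already available as (URL, text) pairs, keeps only hosts ending in `.gov` or `.edu`, and ranks the survivors with a small self-contained BM25 implementation. The URLs, texts, helper names, and the `k1`/`b` defaults are illustrative assumptions; the search call itself is deliberately left out.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse

AUTHORITATIVE_SUFFIXES = (".gov", ".edu")


def is_authoritative(url):
    """True when the URL's hostname ends in an authoritative suffix."""
    host = (urlparse(url).hostname or "").lower()
    return host.endswith(AUTHORITATIVE_SUFFIXES)


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    if n == 0:
        return []
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency per term
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank_authoritative(query, results, top_k=3):
    """results: list of (url, text) pairs; returns [(url, score), ...]."""
    kept = [(u, t) for u, t in results if is_authoritative(u)]
    scores = bm25_scores(query, [t for _, t in kept])
    ranked = sorted(zip(kept, scores), key=lambda p: p[1], reverse=True)
    return [(u, s) for (u, _), s in ranked[:top_k]]


# Hypothetical search results; the URLs are placeholders.
results = [
    ("https://health.example.gov/diet", "Protein intake targets after gastric sleeve surgery"),
    ("https://med.example.edu/parking", "Campus parking information"),
    ("https://blog.example.com/tips", "protein intake after gastric sleeve"),
]
for url, score in rank_authoritative("protein intake after gastric sleeve", results):
    print(url, round(score, 3))
```

Filtering before scoring means the `.com` page never competes, even though its text matches the query exactly; whether that trade-off is acceptable depends on how strict the domain policy needs to be.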

Agent Features

Memory

Semantic cache (query-document pairs)

Experience buffer with Faiss-based KNN management

Planning

Perplexity-based gating to trigger retrieval and adaptation

MDP-guided search prioritizing authoritative domains

Tool Use

DuckDuckGo API

PubMed / PMC / NIH as primary sources

Frameworks

LoRA

Is Agentic

Yes

Architectures

LoRA

Semantic cache (Faiss) + web retriever (MDP-guided)

Optimization Features

Token Efficiency

Context selection via cache and BM25 to limit prompt length

Infra Optimization

Faiss indexing for fast similarity search

Model Optimization

LoRA

System Optimization

Cache size cap (example 500 docs) to bound compute

Training Optimization

Online learning with regularized cross-entropy and Frobenius-norm regularizer

Experience buffer updates via nearest-neighbor diversity

Inference Optimization

Semantic cache lookup to reduce retrieval latency

Constrained decoding to avoid unsafe words
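The capped semantic cache listed above can be approximated with a brute-force stand-in for the Faiss index. This is a minimal sketch under several assumptions: embeddings are plain lists of floats (in the described system they would come from a SentenceTransformer model), a cosine similarity of at least τc=0.7 counts as a hit, and the oldest entry is evicted once the 500-document cap is reached. The class name and the eviction policy are illustrative, not the paper's implementation.

```python
import math
from collections import deque


class SemanticCache:
    """Bounded cache of (embedding, document) pairs with cosine lookup."""

    def __init__(self, max_size=500, tau_c=0.7):
        self.tau_c = tau_c
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.entries = deque(maxlen=max_size)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def add(self, embedding, document):
        self.entries.append((list(embedding), document))

    def lookup(self, query_embedding):
        """Return (document, similarity); document is None on a miss."""
        best_doc, best_sim = None, -1.0
        for emb, doc in self.entries:
            sim = self._cosine(query_embedding, emb)
            if sim > best_sim:
                best_doc, best_sim = doc, sim
        if best_sim >= self.tau_c:
            return best_doc, best_sim
        return None, best_sim


# Toy 2-D embeddings, purely for illustration.
cache = SemanticCache(max_size=2)
cache.add([1.0, 0.0], "doc about post-operative diet")
cache.add([0.0, 1.0], "doc about exercise guidance")
print(cache.lookup([0.9, 0.1]))   # close to the first entry: a hit
print(cache.lookup([-1.0, 0.0]))  # nothing similar enough: a miss
```

Oldest-first eviction is the simplest policy to reason about, but it is also the one most likely to drop still-relevant documents, which is the failure mode noted under Risks & Boundaries.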

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/yashkumaratri/bRAGgen

Data URLs

https://github.com/yashkumaratri/bRAGgen

Risks & Boundaries

Limitations

Single expert reviewer for human evaluation (105 examples) limits clinical generality.

Scalability: many cumulative edits may cause interference or capacity saturation.

When Not To Use

For autonomous high-stakes clinical decisions without clinician supervision.

In environments with no internet or restricted web access (system relies on web retrieval).

Failure Modes

Retrieval returns conflicting or low-quality documents that increase hallucinations.

Cache eviction removes still-relevant documents, causing knowledge gaps.

Core Entities

Models

Llama3-8B, Phi-3, Mistral Instruct, ChatGPT-4o, RAG 2, MedGraphRAG

Metrics

Factuality, Clinical Relevance, Comprehensiveness, Average score, Spearman correlation

Datasets

bRAGq, PubMedQA

Benchmarks

bRAGq (this work)

Context Entities

Models

ASMBS guidelines (cited source)

Datasets

PubMed / PMC / NIH content (retrieval sources)

You May Also Want to Read

MTRAG: a human-made benchmark of multi-turn RAG conversations that stresses retrieval, unanswerables, and later-turn context.

Atomic fact-checking for medical RAG LLMs boosts factuality and traceability

Build query-specific evidence graphs on the fly to fix missing links and filter distractor facts

RAGLAB: an open, modular toolkit to reproduce, compare and develop RAG algorithms fairly

InsQABench: a Chinese insurance QA benchmark plus SQL-ReAct and RAG-ReAct methods