Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
bRAGgen reduces the risk of outdated or inaccurate patient guidance by auto-fetching and integrating authoritative evidence, improving answer quality while keeping runtime latency low.
Summary TLDR
The paper presents bRAGgen, an adaptive retrieval-augmented generation (RAG) system for bariatric surgery patient Q&A, and bRAGq, a 1,302-question expert-validated dataset. bRAGgen combines a small semantic cache (Faiss + SentenceTransformer), an MDP-guided web retriever (DuckDuckGo API) that favors authoritative domains ('.gov' and '.edu'), LoRA adaptation on Llama3-8B, and an online learning loop gated by confidence (perplexity threshold τp=4.5, cache similarity τc=0.7). In expert and LLM-as-judge evaluations, bRAGgen+Llama3-8B scored best (expert average 4.51 vs 4.05 for the best baseline; LLM-as-judge 4.44 vs 3.96). Most update cycles complete in 10–20 s. Code and data are on GitHub.
Problem Statement
Patients need timely, evidence-based bariatric care guidance, but LLMs have stale knowledge and static RAG corpora create noise. The paper targets an automated system that detects low confidence and integrates fresh, authoritative medical evidence into the model in real time.
Main Contribution
bRAGgen: an adaptive, confidence-aware RAG system that triggers retrieval and parameter adaptation when outputs are uncertain.
bRAGq: a domain-specific dataset of 1,302 bariatric surgery questions validated by a board-certified bariatric surgeon.
A modular architecture combining a semantic cache (Faiss + SentenceTransformer), an MDP-guided web retriever (DuckDuckGo + BM25), LoRA-based parametric edits, online learning, and constrained decoding with BERTScore validation.
Two-phase evaluation: single-expert human review (105 responses) and large-scale LLM-as-judge (ChatGPT-4o) with high alignment (Spearman ρ=0.94).
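The expert and LLM-as-judge alignment is reported as a Spearman rank correlation. A minimal sketch of how such a value can be computed is shown here, assuming one average score per system from each rater. The score lists are made-up placeholders, not the paper's data, and the helper names `rank` and `spearman` are illustrative.

```python
def rank(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Placeholder per-system averages (hypothetical, for illustration only).
expert_scores = [4.5, 4.0, 3.6, 3.2, 2.9]
judge_scores = [4.4, 3.9, 3.7, 3.0, 3.1]
rho = spearman(expert_scores, judge_scores)
```

A value near 1 means the two raters order the systems almost identically, even if their absolute scores differ.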
Key Findings
bRAGgen with Llama3-8B achieved the highest expert average score across factuality, clinical relevance, and comprehensiveness.
LLM-as-Judge (ChatGPT-4o) replicated the expert ranking and magnitude of improvements.
bRAGq contains 1,302 bariatric surgery questions validated by a board-certified bariatric surgeon.
Most self-update cycles complete in 10–20 s, fast enough for interactive use.
Results
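Expert average score (Factuality+Relevance+Comprehensiveness)
4.51 (bRAGgen+Llama3-8B) vs 4.05 (best baseline)
LLM-as-Judge average score
4.44 (bRAGgen+Llama3-8B) vs 3.96 (best baseline)
Expert–LLM-as-Judge rank correlation
Spearman ρ=0.94
Pipeline update latency
10–20 s for most update cycles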
Who Should Care
What To Try In 7 Days
Run a small prototype: Llama3-8B + LoRA + a 500-doc semantic cache (Faiss) and test on 50 bRAGq questions.
Implement a perplexity gate (τp=4.5) to trigger retrieval only when answers are uncertain.
Use DuckDuckGo + domain filters (.gov/.edu) and BM25 to fetch top authoritative documents for a few sample queries.
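The perplexity gate in the second step can be prototyped without a model by working from per-token log-probabilities. This is a sketch under assumptions: perplexity is the standard exp of the negative mean log-probability, and a value above τp=4.5 is treated as "uncertain". The function names are illustrative and the summary does not give the paper's exact gating rule.

```python
import math

TAU_P = 4.5  # perplexity threshold reported in the summary


def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per generated token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def needs_retrieval(token_logprobs, tau_p=TAU_P):
    """Assumption: perplexity above tau_p signals low confidence."""
    return perplexity(token_logprobs) > tau_p


# Toy inputs: every token at p=0.9 (confident) or p=0.1 (uncertain).
confident = [math.log(0.9)] * 10
uncertain = [math.log(0.1)] * 10
```

With these toy inputs, `needs_retrieval(confident)` is false and `needs_retrieval(uncertain)` is true. In a real prototype the log-probabilities would come from the generation call of whichever inference library is in use.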
Agent Features
Memory
- Semantic cache (query-document pairs)
- Experience buffer with Faiss-based KNN management
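To make the cache behavior concrete, here is a small pure-Python stand-in for the Faiss-backed semantic cache. It assumes cosine similarity, a hit when the best score is at least τc=0.7, and oldest-first eviction once the cap is reached. The eviction policy and the `SemanticCache` class are assumptions for illustration, not the paper's implementation.

```python
import math
from collections import OrderedDict

TAU_C = 0.7     # cache similarity threshold from the summary
MAX_DOCS = 500  # example cache cap from the summary


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    def __init__(self, max_docs=MAX_DOCS, tau_c=TAU_C):
        self.max_docs = max_docs
        self.tau_c = tau_c
        self.entries = OrderedDict()  # doc_id -> (embedding, text)

    def add(self, doc_id, embedding, text):
        self.entries[doc_id] = (embedding, text)
        self.entries.move_to_end(doc_id)
        while len(self.entries) > self.max_docs:
            self.entries.popitem(last=False)  # assumed policy: evict oldest

    def lookup(self, query_embedding):
        """Return (doc_id, text, score) for the best match, or None on a miss."""
        best = None
        for doc_id, (emb, text) in self.entries.items():
            score = cosine(query_embedding, emb)
            if best is None or score > best[2]:
                best = (doc_id, text, score)
        if best is not None and best[2] >= self.tau_c:
            return best
        return None


cache = SemanticCache(max_docs=2)
cache.add("a", [1.0, 0.0], "doc A")
cache.add("b", [0.0, 1.0], "doc B")
hit = cache.lookup([0.9, 0.1])     # close to "a"
miss = cache.lookup([-1.0, -1.0])  # similar to nothing in the cache
cache.add("c", [1.0, 1.0], "doc C")  # exceeds the cap, so "a" is evicted
```

A linear scan is enough for a few hundred entries. Faiss becomes useful when the index grows or when lookups must stay fast under load.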
Planning
- Perplexity-based gating to trigger retrieval and adaptation
- MDP-guided search prioritizing authoritative domains
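The summary does not describe the MDP itself, so the sketch below covers only the domain-priority idea: search results from '.gov' and '.edu' hosts receive a fixed bonus before ranking. The bonus value, the `(url, relevance)` input shape, and the function names are assumptions, and the URLs are placeholders.

```python
from urllib.parse import urlparse

AUTHORITATIVE_SUFFIXES = (".gov", ".edu")  # domains the summary says are favored


def authority_bonus(url, bonus=1.0):
    """Assumed reward shaping: fixed bonus for .gov/.edu hosts, otherwise 0."""
    host = (urlparse(url).hostname or "").lower()
    return bonus if host.endswith(AUTHORITATIVE_SUFFIXES) else 0.0


def rank_results(results):
    """Sort (url, relevance) pairs by relevance plus the authority bonus."""
    return sorted(results, key=lambda r: r[1] + authority_bonus(r[0]), reverse=True)


# Placeholder search results with hypothetical relevance scores.
results = [
    ("https://example.com/post", 0.9),
    ("https://www.example.gov/guide", 0.6),
    ("https://med.example.edu/faq", 0.5),
]
ranked = rank_results(results)
```

Here the two authoritative hosts outrank the more "relevant" commercial page. How large the bonus should be relative to relevance is a tuning choice.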
Tool Use
- DuckDuckGo API
- PubMed / PMC / NIH as primary sources
Frameworks
- LoRA
Is Agentic
true
Architectures
- LoRA
- Semantic cache (Faiss) + web retriever (MDP-guided)
Optimization Features
Token Efficiency
- Context selection via cache and BM25 to limit prompt length
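As an illustration of BM25-based context selection, the following sketch scores tokenized documents against a query using the common Okapi BM25 formula. The parameter values k1=1.5 and b=0.75 are conventional defaults rather than values from the paper, and the three documents are toy examples.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of docs containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "protein intake after gastric sleeve surgery".split(),
    "hospital parking and visiting hours".split(),
    "vitamin supplements after bariatric surgery".split(),
]
query = "protein after surgery".split()
scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Keeping only the top-scoring passages is one way to bound prompt length. The cutoff is a design choice that trades recall against context size.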
Infra Optimization
- Faiss indexing for fast similarity search
Model Optimization
- LoRA
System Optimization
- Cache size cap (example 500 docs) to bound compute
Training Optimization
- Online learning with a cross-entropy loss plus a Frobenius-norm regularizer
- Experience buffer updates via nearest-neighbor diversity
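One plausible reading of the regularized objective is a cross-entropy term plus a Frobenius-norm penalty on the low-rank LoRA update. The sketch below assumes the form CE + λ‖BA‖_F² with toy matrices. The exact loss, the penalized quantity, and the value of λ are not given in this summary, so all three are assumptions.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target token under a probability vector."""
    return -math.log(probs[target_index])


def frobenius_sq(matrix):
    """Squared Frobenius norm: the sum of squared entries."""
    return sum(v * v for row in matrix for v in row)


def matmul(b, a):
    """Plain-Python product of B (d x r) and A (r x k)."""
    return [[sum(b_row[t] * a[t][j] for t in range(len(a)))
             for j in range(len(a[0]))] for b_row in b]


def regularized_loss(probs, target_index, lora_b, lora_a, lam=0.01):
    """Assumed form: CE + lam * ||B A||_F^2, penalizing the size of the update."""
    delta_w = matmul(lora_b, lora_a)
    return cross_entropy(probs, target_index) + lam * frobenius_sq(delta_w)


lora_b = [[1.0], [2.0]]  # toy B with d=2, r=1
lora_a = [[0.5, -0.5]]   # toy A with r=1, k=2
loss = regularized_loss([0.25, 0.75], 1, lora_b, lora_a, lam=0.1)
```

The penalty discourages large parametric edits, which is one way to limit the interference between cumulative updates.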
Inference Optimization
- Semantic cache lookup to reduce retrieval latency
- Constrained decoding to avoid unsafe words
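Constrained decoding is often implemented by masking the logits of disallowed tokens before a token is chosen. The sketch below shows that idea with greedy selection over a toy vocabulary. The banned list and the logits are invented for illustration, and the paper's actual constraint mechanism may differ.

```python
import math


def mask_banned(logits, banned_ids):
    """Set banned token logits to -inf so they can never be selected."""
    return [(-math.inf if i in banned_ids else x) for i, x in enumerate(logits)]


def greedy_pick(logits, banned_ids):
    """Return the index of the highest-scoring token that is not banned."""
    masked = mask_banned(logits, banned_ids)
    best = max(range(len(masked)), key=masked.__getitem__)
    if masked[best] == -math.inf:
        raise ValueError("every token is banned")
    return best


vocab = ["take", "skip", "double", "your", "dose"]  # toy vocabulary
logits = [1.2, 2.5, 2.0, 0.3, 0.1]                  # toy next-token scores
banned = {vocab.index("skip"), vocab.index("double")}
choice = vocab[greedy_pick(logits, banned)]
```

Without the mask the model would pick "skip". With it, the highest-scoring allowed token, "take", is chosen instead.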
Reproducibility
Code Available
true (GitHub)
Data Available
true (GitHub)
Open Source Status
- partial
Risks & Boundaries
Limitations
- Single expert reviewer for human evaluation (105 examples) limits clinical generality.
- Scalability: many cumulative edits may cause interference or capacity saturation.
- Generalization beyond locally edited facts is unproven.
- Reliance on web retrieval exposes system to source selection bias and possible conflicting evidence.
- Regulatory and deployment risks for direct patient use without clinician oversight.
When Not To Use
- For autonomous high-stakes clinical decisions without clinician supervision.
- In environments with no internet or restricted web access (system relies on web retrieval).
- When certified medical device-level validation and regulatory clearance are required.
Failure Modes
- Retrieval returns conflicting or low-quality documents that increase hallucinations.
- Cache eviction removes still-relevant documents, causing knowledge gaps.
- Perplexity gate misfires: false negatives cause missed updates, and false positives cause unnecessary edits.
- Parametric updates overfit to recent documents and degrade earlier correct knowledge.
Core Entities
Models
- Llama3-8B
- Phi-3
- Mistral Instruct
- ChatGPT-4o
- RAG 2
- MedGraphRAG
Metrics
- Factuality
- Clinical Relevance
- Comprehensiveness
- Average score
- Spearman correlation
Datasets
- bRAGq
- PubMedQA
Benchmarks
- bRAGq (this work)
Context Entities
Models
- ASMBS guidelines (cited source)
Datasets
- PubMed / PMC / NIH content (retrieval sources)

