Bailicai: a medical RAG system that gates retrieval, decomposes tasks with DAGs, and fine-tunes on curated medical data

Overview

Decision Snapshot: Needs Validation

The approach is practical: it combines curated fine-tuning, retrieval gating, and DAG decomposition; evidence comes from benchmark gains and ablations on standard datasets.

Citations: 5

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 70%

Authors

Cui Long, Yongbin Liu, Chunping Ouyang, Ying Yu

Why It Matters For Business

Bailicai shows you can run an 8B open model locally with curated medical fine-tuning and selective retrieval to match or exceed ChatGPT-3.5 on medical QA, reducing API costs and privacy risk.

Who Should Care

ML Engineer, Data Scientist, CTO

Summary TLDR

Bailicai is a practical retrieval-augmented generation (RAG) framework built for medical question answering. It adds three specialized modules—Self-Knowledge Boundary Identification (decides if retrieval is needed), Directed Acyclic Graph (DAG) task decomposition (splits complex queries), and Medical Knowledge Injection (fine-tunes with curated medical data and hard negatives)—on top of RAG. Trained with LoRA on Meta-Llama-3-8B and using MedCPT + Faiss retrieval, Bailicai (8B) scores 71.82% average on five medical benchmarks, outperforms ChatGPT-3.5 by ~6 points, and shows better robustness to distracting documents. Key practical wins: fewer unnecessary retrieval calls, structured retrievals

Problem Statement

Open-source LLMs underperform proprietary models in medicine and hallucinate. Standard RAG can help but suffers from noisy/irrelevant documents and always-on retrieval costs. The problem: how to combine domain fine-tuning and smarter, selective retrieval so open models get high accuracy and lower hallucination in medical QA.

Main Contribution

A multi-module RAG framework (Bailicai) combining Medical Knowledge Injection, Self-Knowledge Boundary Identification, DAG task decomposition, and RAG.

A curated Bailicai medical dataset (173k+ training entries) built from UltraMedical with model-oriented filtering and hard negatives.

Key Findings

Bailicai (8B) obtains a 71.82% average accuracy across five medical QA benchmarks.

Numbers: Average = 71.82% (Table V)

Practical Use: You can run an 8B local model with RAG and curated fine-tuning to reach near state-of-the-art medical QA without calling large closed APIs.

Evidence Ref: Table V

Bailicai beats ChatGPT-3.5 by 5.97 percentage points on the same benchmark suite.

Numbers: 71.82% vs 65.85% => +5.97 pts (Table V)

Practical Use: Deploying Bailicai locally can match or exceed consumer-grade API models for medical QA, reducing privacy exposure and API costs.

Evidence Ref: Table V

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	71.82%	65.85% (ChatGPT-3.5)	+5.97 pts	MedQA, MedMCQA, MMLU-Med, PubMedQA, BioASQ	Reported Bailicai (8B) average	Table V
Accuracy	65.85%	—	-5.97 pts vs Bailicai	MedQA, MedMCQA, MMLU-Med, PubMedQA, BioASQ	Reported ChatGPT-3.5 average	Table V
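The delta in the table is a plain difference of the two reported averages. The short check below reproduces it; the relative gain is an extra figure derived here for illustration and is not reported in the source.

```python
# Averages as reported in the Results table (Table V of the paper).
bailicai_avg = 71.82
chatgpt35_avg = 65.85

# Absolute gap in percentage points.
delta_pts = round(bailicai_avg - chatgpt35_avg, 2)

# Same gap expressed relative to the baseline (derived here, not from the paper).
relative_gain = round(100 * delta_pts / chatgpt35_avg, 2)

print(f"delta = +{delta_pts} pts")
print(f"relative gain over baseline = {relative_gain}%")
```

Percentage points and relative percent are easy to confuse, so it helps to state which one a comparison uses.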

What To Try In 7 Days

Train a small pilot: fine-tune an 8B open model on 50–100k high-quality medical Q&A using MODS-like selection.

Add a lightweight retrieval gate: implement a classifier to skip retrieval for 'known' queries and measure retrieval call reduction.

Index PubMed with a dense encoder (MedCPT or similar) and a reranker; test top-1 vs top-5 retrieval accuracy trade-offs.
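For the top-1 versus top-5 comparison in the last step, one simple measure is the hit rate at k: the share of queries whose known-relevant document appears in the first k results. The sketch below uses made-up query and document ids; the function name `hit_rate_at_k` and the data are illustrative assumptions, not part of Bailicai.

```python
from typing import Dict, List


def hit_rate_at_k(ranked: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of queries whose gold document appears in the top-k results."""
    hits = sum(1 for query_id, docs in ranked.items() if gold[query_id] in docs[:k])
    return hits / len(ranked)


# Hypothetical retrieval output: query id -> ranked document ids (best first).
ranked = {
    "q1": ["d3", "d1", "d9", "d4", "d7"],
    "q2": ["d2", "d8", "d5", "d6", "d1"],
    "q3": ["d4", "d6", "d7", "d2", "d5"],
    "q4": ["d9", "d3", "d8", "d1", "d2"],
}
# Hypothetical gold labels: the one document that answers each query.
gold = {"q1": "d1", "q2": "d2", "q3": "d5", "q4": "d6"}

for k in (1, 5):
    print(f"hit@{k} = {hit_rate_at_k(ranked, gold, k):.2f}")
```

A larger k usually raises the hit rate but also puts more text into the prompt, which matters under a tight context limit.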

Agent Features

Planning

Directed Acyclic Graph Task Decomposition (structured planning for sub-tasks)
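A minimal sketch of how a DAG of sub-tasks could be executed in dependency order, using Python's standard `graphlib` module. The sub-task names and the `run_subtask` placeholder are hypothetical; the source does not describe Bailicai's decomposition at this level of detail.

```python
from graphlib import TopologicalSorter

# Hypothetical decomposition: each sub-task maps to the sub-tasks it depends on.
subtasks = {
    "identify_condition": set(),
    "list_first_line_drugs": {"identify_condition"},
    "check_contraindications": {"identify_condition"},
    "final_answer": {"list_first_line_drugs", "check_contraindications"},
}


def run_subtask(name, upstream):
    """Placeholder for a retrieval + generation call on one sub-task."""
    return f"{name}({', '.join(sorted(upstream))})"


def execute_dag(graph):
    """Run sub-tasks so that every dependency finishes before its dependents."""
    results = {}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        upstream = {dep: results[dep] for dep in graph[name]}
        results[name] = run_subtask(name, upstream)
    return order, results


order, results = execute_dag(subtasks)
print(order)
print(results["final_answer"])
```

Keeping the plan acyclic guarantees that such an ordering exists, so no sub-task waits on its own output.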

Tool Use

Selective retrieval gate (Self-Knowledge Boundary Identification)
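The gating idea can be sketched as a wrapper that calls the retriever only when a "knows" check fails. Everything below is an illustrative assumption: `toy_gate` is a keyword heuristic standing in for a trained classifier or the fine-tuned model's own judgment, and the retriever and generator are stubs.

```python
def answer_with_gate(question, knows, retrieve, generate):
    """Return (answer, retrieved_flag); skip retrieval when the gate says 'known'."""
    if knows(question):
        return generate(question, []), False
    docs = retrieve(question)
    return generate(question, docs), True


# Hypothetical stand-ins for the real components.
def toy_gate(question):
    return question.lower().startswith("define")


def toy_retrieve(question):
    return [f"doc about: {question}"]


def toy_generate(question, docs):
    return f"answer to {question!r} using {len(docs)} docs"


questions = [
    "Define hypertension.",
    "Define anemia.",
    "Which anticoagulant is preferred in severe renal impairment?",
    "What is the second-line therapy after metformin failure?",
]

retrieval_calls = sum(
    answer_with_gate(q, toy_gate, toy_retrieve, toy_generate)[1] for q in questions
)
reduction = 1 - retrieval_calls / len(questions)
print(f"retrieval calls: {retrieval_calls}/{len(questions)}, reduction: {reduction:.0%}")
```

Tracking the retrieved flag per question also makes it possible to audit cases where the gate skipped retrieval and the answer was wrong.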

Optimization Features

Token Efficiency

Model context limits set to 2816 tokens for MMedical; retrieval may be trimmed to avoid overflow
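One way to keep retrieved text inside a fixed context limit is to add documents in relevance order until the budget is spent. This is a sketch under assumptions: whitespace splitting stands in for the model tokenizer, and `reserve_for_answer` is a hypothetical parameter, not a value from the source.

```python
def trim_to_budget(docs, question_tokens, max_tokens=2816, reserve_for_answer=256,
                   count_tokens=lambda text: len(text.split())):
    """Keep the highest-ranked documents that fit in the remaining token budget.

    docs is assumed to be ordered most relevant first. Token counts use
    whitespace splitting here, which only approximates a real tokenizer.
    """
    budget = max_tokens - question_tokens - reserve_for_answer
    kept = []
    for doc in docs:
        cost = count_tokens(doc)
        if cost > budget:
            break  # stop at the first document that no longer fits
        kept.append(doc)
        budget -= cost
    return kept


# Three hypothetical 1000-token documents and a 60-token question.
docs = ["a " * 1000, "b " * 1000, "c " * 1000]
kept = trim_to_budget(docs, question_tokens=60)
print(f"kept {len(kept)} of {len(docs)} documents")
```

Stopping at the first overflow preserves the ranking order; an alternative is to truncate the last document instead of dropping it.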

Infra Optimization

Faiss+HNSW index for scalable nearest-neighbor search

Model Optimization

LoRA

System Optimization

Two-stage retrieval (coarse HNSW + fine reranker) to reduce candidate set
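The two-stage pattern can be illustrated without any external library: a cheap vector search narrows the corpus, then a more careful scorer reorders the survivors. In this sketch, brute-force cosine search stands in for the HNSW lookup and a word-overlap score stands in for a learned reranker; the vectors and texts are made up.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def coarse_search(query_vec, index, k):
    """Brute-force nearest neighbours; a stand-in for an approximate HNSW lookup."""
    return sorted(index, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)[:k]


def overlap_score(query, text):
    """Toy reranker: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    return len(query_words & set(text.lower().split())) / len(query_words)


def two_stage(query, query_vec, index, coarse_k=3, final_k=1):
    candidates = coarse_search(query_vec, index, coarse_k)
    reranked = sorted(candidates, key=lambda item: overlap_score(query, item["text"]),
                      reverse=True)
    return [item["text"] for item in reranked[:final_k]]


# Hypothetical corpus with 2-D embeddings.
index = [
    {"text": "metformin is first line therapy for type 2 diabetes", "vec": [0.9, 0.1]},
    {"text": "insulin therapy for type 1 diabetes", "vec": [1.0, 0.0]},
    {"text": "aspirin for secondary stroke prevention", "vec": [0.0, 1.0]},
    {"text": "diabetes epidemiology worldwide", "vec": [0.8, 0.3]},
]
query = "first line therapy type 2 diabetes"
top = two_stage(query, [1.0, 0.05], index)
print(top)
```

In this toy data the coarse stage ranks the insulin passage first, and the second stage promotes the metformin passage, which is the kind of correction a reranker is meant to provide.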

Training Optimization

Model-oriented data selection (MODS/MoDS) and k-center greedy to choose diverse high-quality instruc
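k-center greedy picks a diverse subset by repeatedly adding the point that is farthest from everything already chosen. The sketch below shows the generic algorithm on toy 2-D points; in a data-selection setting the points would be embeddings of candidate training examples, and the starting index is an arbitrary assumption.

```python
import math


def k_center_greedy(points, k, first=0):
    """Return indices of k points, each new pick farthest from the current picks."""
    selected = [first]
    # Distance from every point to its nearest selected centre.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(selected) < min(k, len(points)):
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        # A point's nearest-centre distance can only shrink after a new pick.
        min_dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return selected


# Hypothetical embeddings: two tight pairs and one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
chosen = k_center_greedy(points, k=3)
print(chosen)
```

With three picks the selection covers each cluster once and skips the near-duplicates, which is the behaviour wanted when trimming redundant examples.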

Inference Optimization

Self-Knowledge Boundary Identification to avoid unnecessary retrieval calls

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Token-length constraints (a 2816-token context limit) can truncate retrieved context and hurt performance on datasets that include golden documents (PubMedQA).

Results are for QA benchmarks; not evaluated on clinical deployment metrics or safety-critical workflows.

When Not To Use

When you must include extensive golden context that exceeds model token limits.

When you cannot index a high-quality biomedical retrieval corpus (e.g., PubMed).

Failure Modes

Wrong 'know' classification: gating may skip needed retrieval and produce incomplete answers.

Retrieval of pseudo-relevant documents can still introduce hallucinatory or misleading content.

Core Entities

Models

Bailicai, Meta-Llama-3-8B, Meta-Llama-3-70B, Med-PaLM2, Flan-PaLM, ChatGPT-3.5, ChatGPT-4, Self-BioRAG, OpenBioLLM, PMC-LLaMA, BioMistral, MedCPT, bge-reranker-large

Metrics

Accuracy, Average score (across benchmarks)

Datasets

Bailicai dataset, UltraMedical, PubMed, Wikipedia, StatPearls, Medical Textbooks, Merge corpus (54.2M chunks)

Benchmarks

MedQA, MedMCQA, MMLU-Med, PubMedQA, BioASQ

Context Entities

Models

Flan-PaLM, Med-PaLM2, Mistral-7B-v0.3, Meta-Llama-3-70B

Metrics

Accuracy

Datasets

UltraMedical (source), PubMed search logs (used by MedCPT)

Benchmarks

USMLE-adjacent datasets referenced (context for MedQA style tasks)
