Bailicai: a medical RAG system that gates retrieval, decomposes tasks with DAGs, and fine-tunes on curated medical data

July 24, 2024 (8 min read)

Overview

Decision Snapshot: Needs Validation

The approach is practical: it combines curated fine-tuning, retrieval gating, and DAG decomposition; evidence comes from benchmark gains and ablations on standard datasets.

Citations: 5

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 70%

Authors

Cui Long, Yongbin Liu, Chunping Ouyang, Ying Yu

Why It Matters For Business

Bailicai shows you can run an 8B open model locally with curated medical fine-tuning and selective retrieval to match or exceed ChatGPT-3.5 on medical QA, reducing API costs and privacy risk.

Summary TLDR

Bailicai is a retrieval-augmented generation (RAG) framework built for medical question answering. On top of RAG it adds three specialized modules: Self-Knowledge Boundary Identification (decides whether retrieval is needed), Directed Acyclic Graph (DAG) task decomposition (splits complex queries into sub-tasks), and Medical Knowledge Injection (fine-tunes with curated medical data and hard negatives). Trained with LoRA on Meta-Llama-3-8B and using MedCPT + Faiss retrieval, Bailicai (8B) scores a 71.82% average on five medical benchmarks, outperforms ChatGPT-3.5 by 5.97 percentage points, and shows better robustness to distracting documents. Key practical wins: fewer unnecessary retrieval calls, structured retrievals

Problem Statement

Open-source LLMs underperform proprietary models in medicine and hallucinate. Standard RAG can help but suffers from noisy/irrelevant documents and always-on retrieval costs. The problem: how to combine domain fine-tuning and smarter, selective retrieval so open models get high accuracy and lower hallucination in medical QA.

Main Contribution

A multi-module RAG framework (Bailicai) combining Medical Knowledge Injection, Self-Knowledge Boundary Identification, DAG task decomposition, and RAG.

A curated Bailicai medical dataset (173k+ training entries) built from UltraMedical with model-oriented filtering and hard negatives.

Key Findings

Bailicai (8B) obtains a 71.82% average accuracy across five medical QA benchmarks.

Numbers: Average = 71.82% (Table V)

Practical Use: You can run an 8B local model with RAG and curated fine-tuning to reach near state-of-the-art medical QA without calling large closed APIs.

Evidence Ref: Table V

Bailicai beats ChatGPT-3.5 by 5.97 percentage points on the same benchmark suite.

Numbers: 71.82% vs 65.85% => +5.97 pts (Table V)

Practical Use: Deploying Bailicai locally can match or exceed consumer-grade API models for medical QA, reducing privacy exposure and API costs.

Evidence Ref: Table V

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 71.82% | n/a | n/a | MedQA, MedMCQA, MMLU-Med, PubMedQA, BioASQ | Reported Bailicai (8B) average | Table V
Accuracy | 65.85% | n/a | -5.97 pts vs Bailicai | MedQA, MedMCQA, MMLU-Med, PubMedQA, BioASQ | Reported ChatGPT-3.5 average | Table V

What To Try In 7 Days

Train a small pilot: fine-tune an 8B open model on 50–100k high-quality medical Q&A using MODS-like selection.

Add a lightweight retrieval gate: implement a classifier to skip retrieval for 'known' queries and measure retrieval call reduction.

Index PubMed with a dense encoder (MedCPT or similar) and a reranker; test top-1 vs top-5 retrieval accuracy trade-offs.
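The retrieval-gate experiment in the list above could be prototyped roughly as follows. This is a minimal sketch, not the paper's Self-Knowledge Boundary Identification module: `GateStats`, `answer_with_gate`, and the keyword-based `toy_gate` are hypothetical names introduced here, and the gate, retriever, and generator are passed in as plain callables so a real classifier and model can be swapped in later.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GateStats:
    """Counts how often the gate actually triggered a retrieval call."""
    total: int = 0
    retrieved: int = 0

    @property
    def reduction(self) -> float:
        """Fraction of queries answered without a retrieval call."""
        if self.total == 0:
            return 0.0
        return 1.0 - self.retrieved / self.total


def answer_with_gate(
    query: str,
    needs_retrieval: Callable[[str], bool],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    stats: GateStats,
) -> str:
    """Call the retriever only when the gate says the model lacks the knowledge."""
    stats.total += 1
    docs: List[str] = []
    if needs_retrieval(query):
        stats.retrieved += 1
        docs = retrieve(query)
    return generate(query, docs)


def toy_gate(query: str) -> bool:
    """Placeholder heuristic (an assumption, not the paper's classifier):
    retrieve only when the query hints at fast-changing evidence."""
    hints = ("trial", "guideline", "latest")
    return any(hint in query.lower() for hint in hints)
```

Running a held-out query set through `answer_with_gate` and reading `stats.reduction` gives the retrieval-call reduction; comparing answer accuracy with and without the gate on the same set shows what the skipped calls cost.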

Agent Features

Planning
Directed Acyclic Graph Task Decomposition (structured planning for sub-tasks)
Tool Use
Selective retrieval gate (Self-Knowledge Boundary Identification)
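
To make the DAG planning idea concrete, here is a small sketch of executing sub-tasks in dependency order. It assumes the decomposition has already been produced (for example by the model) as a mapping from each sub-task to its prerequisites; `run_dag` and `solve` are hypothetical names, and the standard-library `graphlib.TopologicalSorter` stands in for whatever scheduling the paper actually uses.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Callable, Dict, List


def run_dag(
    subtasks: Dict[str, List[str]],
    solve: Callable[[str, Dict[str, str]], str],
) -> Dict[str, str]:
    """Solve sub-tasks so that every prerequisite is answered first.

    `subtasks` maps a sub-task name to the names it depends on.
    `solve` receives the sub-task and the answers of its prerequisites.
    A CycleError is raised while iterating if the plan is not acyclic.
    """
    results: Dict[str, str] = {}
    for task in TopologicalSorter(subtasks).static_order():
        context = {dep: results[dep] for dep in subtasks.get(task, [])}
        results[task] = solve(task, context)
    return results
```

A plan such as `{"final_answer": ["diagnosis", "drug_choice"], "drug_choice": ["diagnosis"], "diagnosis": []}` is solved as diagnosis, then drug choice, then the final answer, with each step seeing only the answers it declared as dependencies.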

Optimization Features

Token Efficiency
Model context limits set to 2816 tokens for MMedical; retrieval may be trimmed to avoid overflow
Infra Optimization
Faiss+HNSW index for scalable nearest-neighbor search
Model Optimization
LoRA
System Optimization
Two-stage retrieval (coarse HNSW + fine reranker) to reduce candidate set
Training Optimization
Model-oriented data selection (MODS/MoDS) and k-center greedy to choose diverse high-quality instruc
Inference Optimization
Self-Knowledge Boundary Identification to avoid unnecessary retrieval calls
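
The two-stage retrieval pattern listed above (a cheap coarse search followed by a more expensive reranker on a small candidate set) can be illustrated without any external libraries. In this sketch a brute-force dot-product search stands in for the Faiss HNSW index and a word-overlap score stands in for a cross-encoder reranker such as bge-reranker-large; `two_stage_retrieve`, `word_overlap`, `coarse_k`, and `final_k` are hypothetical names chosen for illustration.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Doc = Tuple[str, Sequence[float]]  # (text, embedding)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def word_overlap(query: str, text: str) -> float:
    """Toy reranker: number of distinct words shared by query and document."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def two_stage_retrieve(
    query_vec: Sequence[float],
    query_text: str,
    corpus: List[Doc],
    rerank_score: Callable[[str, str], float] = word_overlap,
    coarse_k: int = 50,
    final_k: int = 5,
) -> List[str]:
    """Stage 1 keeps the coarse_k nearest documents by embedding similarity;
    stage 2 reorders only those candidates with the slower scorer."""
    coarse = heapq.nlargest(coarse_k, corpus, key=lambda d: dot(query_vec, d[1]))
    reranked = sorted(coarse, key=lambda d: rerank_score(query_text, d[0]), reverse=True)
    return [text for text, _ in reranked[:final_k]]
```

The design point is that the reranker never sees the full corpus: its cost scales with `coarse_k`, not with the 54.2M-chunk index, so `coarse_k` is the main knob for trading latency against recall.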

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Token-length constraints (≈2812) can truncate retrieved context and hurt datasets that include golden documents (PubMedQA).

Results are for QA benchmarks; not evaluated on clinical deployment metrics or safety-critical workflows.
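
The token-length limitation noted above suggests trimming retrieved context explicitly rather than letting the prompt overflow. The sketch below keeps the highest-ranked documents that fit within a budget; it is an illustration under stated assumptions, not the paper's procedure. `fit_context` is a hypothetical helper, whitespace splitting stands in for the model's real tokenizer, and `reserve` is an assumed allowance for instructions and the generated answer.

```python
from typing import Callable, List


def whitespace_tokens(text: str) -> int:
    """Crude token count; replace with the model tokenizer in practice."""
    return len(text.split())


def fit_context(
    question: str,
    docs: List[str],
    budget: int,
    count_tokens: Callable[[str], int] = whitespace_tokens,
    reserve: int = 0,
) -> List[str]:
    """Keep documents in rank order until the next one would exceed the budget.

    `docs` is assumed to be sorted from most to least relevant.
    """
    remaining = budget - count_tokens(question) - reserve
    kept: List[str] = []
    for doc in docs:
        cost = count_tokens(doc)
        if cost > remaining:
            break  # stop at the first overflow so the kept set is a ranking prefix
        kept.append(doc)
        remaining -= cost
    return kept
```

Stopping at the first document that does not fit keeps the result a clean prefix of the ranking; skipping it and continuing would pack more text in but could favor short, lower-ranked passages over a long golden document.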

When Not To Use

When you must include extensive golden context that exceeds model token limits.

When you cannot index a high-quality biomedical retrieval corpus (e.g., PubMed).

Failure Modes

Wrong 'know' classification: gating may skip needed retrieval and produce incomplete answers.

Retrieval of pseudo-relevant documents can still introduce hallucinatory or misleading content.

Core Entities

Models

Bailicai, Meta-Llama-3-8B, Meta-Llama-3-70B, Med-PaLM2, Flan-PaLM, ChatGPT-3.5, ChatGPT-4, Self-BioRAG, OpenBioLLM, PMC-LLaMA, BioMistral, MedCPT, bge-reranker-large

Metrics

Accuracy, Average score (across benchmarks)

Datasets

Bailicai dataset, UltraMedical, PubMed, Wikipedia, StatPearls, Medical Textbooks, Merge corpus (54.2M chunks)

Benchmarks

MedQA, MedMCQA, MMLU-Med, PubMedQA, BioASQ

Context Entities

Models

Flan-PaLM, MedPaLM2, Mistral-7B-v0.3, Meta-Llama-3-70B

Metrics

Accuracy

Datasets

UltraMedical (source), PubMed search logs (used by MedCPT)

Benchmarks

USMLE-adjacent datasets referenced (context for MedQA-style tasks)