Bailicai: a medical RAG system that gates retrieval, decomposes tasks with DAGs, and fine-tunes on curated medical data

July 24, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

5

Authors

Cui Long, Yongbin Liu, Chunping Ouyang, Ying Yu

Links

Abstract / PDF

Why It Matters For Business

Bailicai shows you can run an 8B open model locally with curated medical fine-tuning and selective retrieval to match or exceed ChatGPT-3.5 on medical QA, reducing API costs and privacy risk.

Summary TLDR

Bailicai is a practical retrieval-augmented generation (RAG) framework built for medical question answering. It adds three specialized modules—Self-Knowledge Boundary Identification (decides if retrieval is needed), Directed Acyclic Graph (DAG) task decomposition (splits complex queries), and Medical Knowledge Injection (fine-tunes with curated medical data and hard negatives)—on top of RAG. Trained with LoRA on Meta-Llama-3-8B and using MedCPT + Faiss retrieval, Bailicai (8B) scores 71.82% average on five medical benchmarks, outperforms ChatGPT-3.5 by ~6 points, and shows better robustness to distracting documents. Key practical wins: fewer unnecessary retrieval calls, structured retrievals

Problem Statement

Open-source LLMs underperform proprietary models in medicine and hallucinate. Standard RAG can help but suffers from noisy/irrelevant documents and always-on retrieval costs. The problem: how to combine domain fine-tuning and smarter, selective retrieval so open models get high accuracy and lower hallucination in medical QA.

Main Contribution

A multi-module RAG framework (Bailicai) combining Medical Knowledge Injection, Self-Knowledge Boundary Identification, DAG task decomposition, and RAG.

A curated Bailicai medical dataset (173k+ training entries) built from UltraMedical with model-oriented filtering and hard negatives.

A two-stage dense retrieval pipeline (MedCPT + Faiss/HNSW + reranker) with tuned selection to reduce noise and retrieval cost.

Key Findings

Bailicai (8B) obtains a 71.82% average accuracy across five medical QA benchmarks.

Numbers: Average = 71.82% (Table V)

Bailicai beats ChatGPT-3.5 by 5.97 percentage points on the same benchmark suite.

Numbers: 71.82% vs 65.85% => +5.97 pts (Table V)

Ablation shows the full four-module stack improves MedQA by 8.88 percentage points and MMLU-Med by 5.41 percentage points over the Meta-Llama-3-8B baseline.

Numbers: +8.88 pts (MedQA), +5.41 pts (MMLU-Med) (Table VI)

Compared with the specialized retrieval model Self-BioRAG, Bailicai improves average accuracy by 20.72 percentage points on the evaluated datasets.

Numbers: Bailicai avg 71.82% vs Self-BioRAG 51.10% => +20.72 pts (Results)

Among the retrieval corpora tested, PubMed gave the best results, with a 71.58% average.

Numbers: PubMed average = 71.58% (Table VIII)

Results

Bailicai (8B) average accuracy

Value: 71.82%

ChatGPT-3.5 average accuracy

Value: 65.85%

Meta-Llama-3-8B baseline average

Value: 67.07%

Self-BioRAG average accuracy

Value: 51.10%

PubMed-only retrieval average

Value: 71.58%
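
The headline gaps quoted in Key Findings are differences in percentage points, not relative percent changes. A minimal sketch of that arithmetic, using only the averages stated in this summary (the dictionary keys and the `gap_in_points` helper are illustrative names, not something from the paper):

```python
from decimal import Decimal

# Average accuracies (in percent) as quoted in this summary.
averages = {
    "Bailicai (8B)": Decimal("71.82"),
    "ChatGPT-3.5": Decimal("65.85"),
    "Meta-Llama-3-8B baseline": Decimal("67.07"),
    "Self-BioRAG": Decimal("51.10"),
}


def gap_in_points(system: str, reference: str) -> Decimal:
    """Percentage-point gap between two averages (not a relative change)."""
    return averages[system] - averages[reference]


gap_vs_chatgpt = gap_in_points("Bailicai (8B)", "ChatGPT-3.5")
gap_vs_self_biorag = gap_in_points("Bailicai (8B)", "Self-BioRAG")
print(gap_vs_chatgpt, gap_vs_self_biorag)
```

Decimal is used instead of float so the subtraction reproduces the two-decimal figures exactly.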

What To Try In 7 Days

Train a small pilot: fine-tune an 8B open model on 50–100k high-quality medical Q&A using MODS-like selection.

Add a lightweight retrieval gate: implement a classifier to skip retrieval for 'known' queries and measure retrieval call reduction.

Index PubMed with a dense encoder (MedCPT or similar) and a reranker; test top-1 vs top-5 retrieval accuracy trade-offs.
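
One way to make the top-1 vs top-5 comparison concrete is a hit rate at k: the fraction of questions whose known-relevant passage appears in the first k retrieved results. The sketch below is only an illustration; the document IDs and rankings are made-up toy data, and `hit_rate_at_k` is a hypothetical helper rather than anything the paper defines.

```python
from typing import List, Sequence


def hit_rate_at_k(ranked_ids: Sequence[Sequence[str]], gold_ids: Sequence[str], k: int) -> float:
    """Fraction of queries whose gold document is among the top-k retrieved IDs."""
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("need one gold ID per ranked list")
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)


# Toy retrieval output: one best-first ranking of five document IDs per query.
ranked: List[List[str]] = [
    ["d3", "d1", "d9", "d2", "d7"],  # gold d3 is at rank 1
    ["d2", "d7", "d4", "d5", "d6"],  # gold d4 is at rank 3
    ["d5", "d6", "d1", "d0", "d8"],  # gold d8 is at rank 5
    ["d0", "d4", "d1", "d2", "d3"],  # gold d9 was not retrieved
]
gold = ["d3", "d4", "d8", "d9"]

hit_at_1 = hit_rate_at_k(ranked, gold, k=1)
hit_at_5 = hit_rate_at_k(ranked, gold, k=5)
print(f"hit@1={hit_at_1:.2f} hit@5={hit_at_5:.2f}")
```

A larger k raises the chance of including the right passage but also adds more context tokens and more potential distractors, so it is worth tracking answer accuracy alongside the hit rate.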

Agent Features

Planning

  • Directed Acyclic Graph Task Decomposition (structured planning for sub-tasks)
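
The idea of DAG task decomposition can be sketched as follows: sub-tasks are nodes, an edge means "needs the answer of", and execution follows a topological order so every sub-task sees the results it depends on. This is a minimal illustration using Python's standard `graphlib`; the sub-task names and the `solve_stub` function are hypothetical placeholders, and the paper's own decomposition format is not reproduced here.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Callable, Dict, List, Set

# Each sub-task maps to the set of sub-tasks whose answers it needs first.
subtasks: Dict[str, Set[str]] = {
    "identify_condition": set(),
    "list_first_line_drugs": {"identify_condition"},
    "check_contraindications": {"identify_condition"},
    "recommend_treatment": {"list_first_line_drugs", "check_contraindications"},
}


def plan(dag: Dict[str, Set[str]]) -> List[str]:
    """Return an execution order in which prerequisites always come first."""
    try:
        return list(TopologicalSorter(dag).static_order())
    except CycleError as exc:
        raise ValueError("decomposition is not acyclic") from exc


def run(dag: Dict[str, Set[str]], solve: Callable[[str, Dict[str, str]], str]) -> Dict[str, str]:
    """Solve sub-tasks in order, passing each one the answers of its prerequisites."""
    results: Dict[str, str] = {}
    for task in plan(dag):
        context = {dep: results[dep] for dep in dag[task]}
        results[task] = solve(task, context)
    return results


def solve_stub(task: str, context: Dict[str, str]) -> str:
    """Stand-in for an LLM or retrieval call; records which inputs it received."""
    return f"{task}({', '.join(sorted(context))})"


order = plan(subtasks)
results = run(subtasks, solve_stub)
print(order)
```

Rejecting cyclic plans up front matters because a cycle would mean two sub-tasks each wait on the other's answer.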

Tool Use

  • Selective retrieval gate (Self-Knowledge Boundary Identification)
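
A retrieval gate of this kind can be prototyped as a thin wrapper that asks a "do I already know this?" predicate before calling the retriever, and counts how many retrieval calls it skipped. The sketch below assumes three hypothetical callables (`knows`, `retrieve`, `generate`); the keyword-based `knows` is a toy stand-in, not the Self-Knowledge Boundary Identification module itself.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GatedRag:
    """Wraps a generator with a gate that decides whether to retrieve at all."""

    knows: Callable[[str], bool]               # hypothetical gate: True = skip retrieval
    retrieve: Callable[[str], List[str]]       # hypothetical retriever
    generate: Callable[[str, List[str]], str]  # hypothetical generator
    retrieval_calls: int = 0
    total_queries: int = 0

    def answer(self, question: str) -> str:
        self.total_queries += 1
        docs: List[str] = []
        if not self.knows(question):
            self.retrieval_calls += 1
            docs = self.retrieve(question)
        return self.generate(question, docs)

    def call_reduction(self) -> float:
        """Fraction of queries answered without a retrieval call."""
        if self.total_queries == 0:
            return 0.0
        return 1 - self.retrieval_calls / self.total_queries


# Toy stand-ins so the sketch runs end to end.
rag = GatedRag(
    knows=lambda q: "aspirin" in q.lower(),
    retrieve=lambda q: [f"passage about: {q}"],
    generate=lambda q, docs: f"answer({q!r}, evidence={len(docs)})",
)
questions = [
    "What is the usual aspirin dose for adults?",
    "Which gene is mutated in cystic fibrosis?",
    "Does aspirin interact with warfarin?",
    "What is first-line therapy for H. pylori?",
]
answers = [rag.answer(q) for q in questions]
print(f"retrieval calls avoided: {rag.call_reduction():.0%}")
```

In a real pilot the reduction figure should be read together with accuracy on the skipped queries, since a gate that wrongly says "known" trades retrieval cost for incomplete answers.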

Optimization Features

Token Efficiency

  • Model context limits set to 2816 tokens for MMedical; retrieval may be trimmed to avoid overflow
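
Trimming retrieved context to a token budget might look like the sketch below. It keeps whole passages in rank order until the budget is spent, rather than cutting a passage mid-sentence. The 2816 default is the limit quoted above; the answer reserve, the whitespace token counter, and the function name are assumptions for illustration, and a real setup would count tokens with the model's own tokenizer.

```python
from typing import Callable, List


def trim_to_budget(
    question: str,
    docs: List[str],
    max_tokens: int = 2816,
    reserve_for_answer: int = 256,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> List[str]:
    """Keep best-first passages until the context budget is used up.

    `docs` is assumed to be sorted from most to least relevant. Whitespace
    splitting is only a rough stand-in for a real tokenizer.
    """
    budget = max_tokens - reserve_for_answer - count_tokens(question)
    kept: List[str] = []
    for doc in docs:
        cost = count_tokens(doc)
        if cost > budget:
            break  # stop at the first passage that does not fit, preserving rank order
        kept.append(doc)
        budget -= cost
    return kept


# Tiny budget so the effect is visible: 20 total, 5 reserved, 3 used by the question.
docs = [
    "one two three four five",
    "one two three four five six",
    "one two three four",
]
kept = trim_to_budget("what is metformin", docs, max_tokens=20, reserve_for_answer=5)
print(len(kept))
```

Stopping at the first passage that does not fit keeps the highest-ranked evidence intact; a variant could skip oversized passages and continue, at the cost of admitting lower-ranked text.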

Infra Optimization

  • Faiss+HNSW index for scalable nearest-neighbor search

Model Optimization

  • LoRA

System Optimization

  • Two-stage retrieval (coarse HNSW + fine reranker) to reduce candidate set
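
The two-stage pattern can be illustrated without any external library: a cheap vector similarity pass narrows the corpus to a few candidates, then a more expensive scorer reorders only those candidates. In the sketch below, brute-force cosine similarity stands in for the HNSW index and a word-overlap count stands in for the reranker; the vectors, texts, and function names are all toy assumptions.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def word_overlap(query: str, text: str) -> float:
    """Toy reranker score: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def two_stage_retrieve(
    query_vec: Sequence[float],
    query_text: str,
    index: Dict[str, Sequence[float]],
    texts: Dict[str, str],
    rerank_score: Callable[[str, str], float] = word_overlap,
    coarse_k: int = 3,
    final_k: int = 1,
) -> List[str]:
    # Stage 1 (coarse): nearest neighbours by embedding similarity.
    coarse = sorted(index, key=lambda d: cosine(query_vec, index[d]), reverse=True)[:coarse_k]
    # Stage 2 (fine): rescore only the surviving candidates with the costlier scorer.
    reranked = sorted(coarse, key=lambda d: rerank_score(query_text, texts[d]), reverse=True)
    return reranked[:final_k]


index = {"d1": [1.0, 0.0], "d2": [0.9, 0.1], "d3": [0.0, 1.0], "d4": [0.8, 0.2]}
texts = {
    "d1": "aspirin history",
    "d2": "metformin dosing in renal impairment",
    "d3": "unrelated cardiology note",
    "d4": "metformin side effects",
}
top1 = two_stage_retrieve([1.0, 0.0], "metformin dosing renal", index, texts)
top2 = two_stage_retrieve([1.0, 0.0], "metformin dosing renal", index, texts, final_k=2)
print(top1, top2)
```

Note that the fine stage can only reorder what the coarse stage returned, so `coarse_k` bounds the recall of the whole pipeline.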

Training Optimization

  • Model-oriented data selection (MODS/MoDS) and k-center greedy to choose diverse high-quality instruc
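
For readers unfamiliar with k-center greedy, the sketch below shows the generic algorithm: start from one point, then repeatedly add the point that is farthest from everything already selected, which spreads the selection across the embedding space. The 2-D points are toy data and the choice of starting index is an assumption; how the paper embeds and scores examples is not reproduced here.

```python
import math
from typing import List, Sequence


def k_center_greedy(points: Sequence[Sequence[float]], k: int, first: int = 0) -> List[int]:
    """Return indices of k points chosen by farthest-first traversal."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    selected = [first]
    # Distance from every point to its nearest selected center so far.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


# Three loose clusters: near the origin, near (5, 5), and a lone point at (10, 0).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
chosen = k_center_greedy(points, k=3)
print(chosen)
```

On this toy data the selection takes one point from each cluster instead of two near-duplicates, which is the diversity property the data-selection step relies on.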

Inference Optimization

  • Self-Knowledge Boundary Identification to avoid unnecessary retrieval calls

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Token-length constraints (≈2812) can truncate retrieved context and hurt datasets that include golden documents (PubMedQA).
  • Results are for QA benchmarks; not evaluated on clinical deployment metrics or safety-critical workflows.
  • No public code or dataset release stated, which limits exact reproduction.

When Not To Use

  • When you must include extensive golden context that exceeds model token limits.
  • When you cannot index a high-quality biomedical retrieval corpus (e.g., PubMed).
  • For non-medical tasks where domain-specific fine-tuning and corpora are not available.

Failure Modes

  • Wrong 'know' classification: gating may skip needed retrieval and produce incomplete answers.
  • Retrieval of pseudo-relevant documents can still introduce hallucinatory or misleading content.
  • Token overflow when many retrieved docs are concatenated, leading to truncated evidence and lower accuracy.

Core Entities

Models

  • Bailicai
  • Meta-Llama-3-8B
  • Meta-Llama-3-70B
  • Med-PaLM2
  • Flan-PaLM
  • ChatGPT-3.5
  • ChatGPT-4
  • Self-BioRAG
  • OpenBioLLM
  • PMC-LLaMA
  • BioMistral
  • MedCPT
  • bge-reranker-large

Metrics

  • Accuracy
  • Average score (across benchmarks)

Datasets

  • Bailicai dataset
  • UltraMedical
  • PubMed
  • Wikipedia
  • StatPearls
  • Medical Textbooks
  • Merge corpus (54.2M chunks)

Benchmarks

  • MedQA
  • MedMCQA
  • MMLU-Med
  • PubMedQA
  • BioASQ

Context Entities

Models

  • Flan-PaLM
  • Med-PaLM2
  • Mistral-7B-v0.3
  • Meta-Llama-3-70B

Metrics

  • Accuracy

Datasets

  • UltraMedical (source)
  • PubMed search logs (used by MedCPT)

Benchmarks

  • USMLE-adjacent datasets referenced (context for MedQA-style tasks)