Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.7
Citation Count
5
Why It Matters For Business
Bailicai shows you can run an 8B open model locally with curated medical fine-tuning and selective retrieval to match or exceed ChatGPT-3.5 on medical QA, reducing API costs and privacy risk.
Summary TLDR
Bailicai is a practical retrieval-augmented generation (RAG) framework built for medical question answering. It adds three specialized modules—Self-Knowledge Boundary Identification (decides if retrieval is needed), Directed Acyclic Graph (DAG) task decomposition (splits complex queries), and Medical Knowledge Injection (fine-tunes with curated medical data and hard negatives)—on top of RAG. Trained with LoRA on Meta-Llama-3-8B and using MedCPT + Faiss retrieval, Bailicai (8B) scores 71.82% average on five medical benchmarks, outperforms ChatGPT-3.5 by ~6 points, and shows better robustness to distracting documents. Key practical wins: fewer unnecessary retrieval calls, structured retrievals
Problem Statement
Open-source LLMs underperform proprietary models in medicine and hallucinate. Standard RAG can help but suffers from noisy/irrelevant documents and always-on retrieval costs. The problem: how to combine domain fine-tuning and smarter, selective retrieval so open models get high accuracy and lower hallucination in medical QA.
Main Contribution
A multi-module RAG framework (Bailicai) combining Medical Knowledge Injection, Self-Knowledge Boundary Identification, DAG task decomposition, and RAG.
A curated Bailicai medical dataset (173k+ training entries) built from UltraMedical with model-oriented filtering and hard negatives.
A two-stage dense retrieval pipeline (MedCPT + Faiss/HNSW + reranker) with tuned selection to reduce noise and retrieval cost.
Key Findings
Bailicai (8B) obtains a 71.82% average accuracy across five medical QA benchmarks.
Bailicai beats ChatGPT-3.5 by 5.97 percentage points on the same benchmark suite.
Ablation shows the full four-module stack improves MedQA by 8.88% and MMLU-Med by 5.41% over the Meta-Llama-3-8B baseline.
Compared to a specialized Self-BioRAG retrieval model, Bailicai improves average performance by ~20.72 points on the evaluated datasets.
Among the corpora tested, PubMed gave the best retrieval performance, with a 71.58% average.
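The gaps reported above imply the comparison models' averages. A minimal sketch of that arithmetic, assuming each gap is a plain difference of average accuracy on the same five benchmarks; the derived values are implied by the figures above, not quoted from the paper:

```python
# Figures quoted in the findings above.
bailicai_avg = 71.82        # average accuracy (%) across five benchmarks
gap_vs_chatgpt35 = 5.97     # percentage points
gap_vs_self_biorag = 20.72  # percentage points (approximate in the source)

# Assumption: each gap is Bailicai's average minus the other model's average.
implied_chatgpt35_avg = round(bailicai_avg - gap_vs_chatgpt35, 2)
implied_self_biorag_avg = round(bailicai_avg - gap_vs_self_biorag, 2)

print(f"Implied ChatGPT-3.5 average: {implied_chatgpt35_avg:.2f}%")
print(f"Implied Self-BioRAG average: {implied_self_biorag_avg:.2f}%")
```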
Results
Accuracy (values not preserved in this copy): Meta-Llama-3-8B baseline average; PubMed-only retrieval average.
Who Should Care
What To Try In 7 Days
Train a small pilot: fine-tune an 8B open model on 50–100k high-quality medical Q&A using MODS-like selection.
Add a lightweight retrieval gate: implement a classifier to skip retrieval for 'known' queries and measure retrieval call reduction.
Index PubMed with a dense encoder (MedCPT or similar) and a reranker; test top-1 vs top-5 retrieval accuracy trade-offs.
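For the top-1 versus top-5 comparison, one simple measure is the share of queries with at least one relevant passage in the top k. The sketch below is an illustration, not the paper's evaluation code; the function name `hit_rate_at_k` and the toy document ids are hypothetical.

```python
from typing import Sequence, Set


def hit_rate_at_k(
    rankings: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
    k: int,
) -> float:
    """Fraction of queries with at least one relevant doc id in the top k."""
    if len(rankings) != len(relevant):
        raise ValueError("rankings and relevant must have the same length")
    if not rankings:
        return 0.0
    hits = sum(
        1
        for ranked, gold in zip(rankings, relevant)
        if any(doc_id in gold for doc_id in ranked[:k])
    )
    return hits / len(rankings)


# Toy data: three queries, each with a ranked list of retrieved doc ids.
rankings = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
relevant = [{"d1"}, {"d2"}, {"d0"}]

top1 = hit_rate_at_k(rankings, relevant, k=1)
top3 = hit_rate_at_k(rankings, relevant, k=3)
print(f"hit@1={top1:.2f} hit@3={top3:.2f}")
```

A larger k usually raises the hit rate but also puts more, and possibly noisier, text into the prompt, which is the trade-off to measure.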
Agent Features
Planning
- Directed Acyclic Graph Task Decomposition (structured planning for sub-tasks)
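To illustrate what a DAG decomposition buys at execution time, the sketch below runs sub-tasks in dependency order using Python's standard `graphlib`. The sub-task names and the `solve_stub` function are hypothetical; the source does not describe how Bailicai represents or executes its graphs.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set

# Hypothetical decomposition of one complex question.
# Each sub-task maps to the set of sub-tasks it depends on.
subtasks: Dict[str, Set[str]] = {
    "identify_condition": set(),
    "list_first_line_drugs": {"identify_condition"},
    "check_contraindications": {"identify_condition"},
    "recommend_treatment": {"list_first_line_drugs", "check_contraindications"},
}


def run_in_dependency_order(
    graph: Dict[str, Set[str]],
    solve: Callable[[str, Dict[str, str]], str],
) -> Dict[str, str]:
    """Solve each sub-task after its dependencies, passing their results in."""
    results: Dict[str, str] = {}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    for task in TopologicalSorter(graph).static_order():
        dependency_results = {dep: results[dep] for dep in graph[task]}
        results[task] = solve(task, dependency_results)
    return results


def solve_stub(task: str, dependency_results: Dict[str, str]) -> str:
    """Stand-in for a model call; records which dependencies were available."""
    return f"{task}({', '.join(sorted(dependency_results))})"


results = run_in_dependency_order(subtasks, solve_stub)
order = list(results)
print(order)
```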
Tool Use
- Selective retrieval gate (Self-Knowledge Boundary Identification)
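The control flow of a retrieval gate is simple: consult the retriever only when the gate judges the question to be outside the model's knowledge. The sketch below shows that flow and how to measure the reduction in retrieval calls. All three callables are toy stand-ins (hypothetical); how the gate itself is implemented is not shown here.

```python
from typing import Callable, List


def answer_with_gate(
    question: str,
    is_known: Callable[[str], bool],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
) -> str:
    """Call the retriever only when the gate says the model lacks the knowledge."""
    docs = [] if is_known(question) else retrieve(question)
    return generate(question, docs)


# Toy stand-ins; a real gate would be a trained classifier or the LLM itself.
retrieval_log: List[str] = []


def toy_is_known(question: str) -> bool:
    return "aspirin" in question.lower()


def toy_retrieve(question: str) -> List[str]:
    retrieval_log.append(question)  # record every retrieval call
    return [f"passage about: {question}"]


def toy_generate(question: str, docs: List[str]) -> str:
    return f"answer using {len(docs)} passage(s)"


questions = [
    "What class of drug is aspirin?",
    "What is the dosing of a rarely used agent?",
]
answers = [
    answer_with_gate(q, toy_is_known, toy_retrieve, toy_generate)
    for q in questions
]
call_reduction = 1 - len(retrieval_log) / len(questions)
print(answers, f"retrieval calls avoided: {call_reduction:.0%}")
```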
Optimization Features
Token Efficiency
- Model context limits set to 2816 tokens for MMedical; retrieval may be trimmed to avoid overflow
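One way to avoid overflow is to keep whole passages, best first, until the budget runs out. The sketch below assumes a whitespace word count as a stand-in for the model tokenizer and a hypothetical `reserve_for_answer` allowance; neither comes from the source.

```python
from typing import Callable, List


def fit_to_budget(
    question: str,
    docs: List[str],
    max_tokens: int = 2816,
    reserve_for_answer: int = 256,  # hypothetical allowance for the generated answer
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> List[str]:
    """Keep the highest-ranked docs that fit; docs are assumed sorted best first."""
    budget = max_tokens - reserve_for_answer - count_tokens(question)
    kept: List[str] = []
    for doc in docs:
        cost = count_tokens(doc)
        if cost > budget:
            break  # stop at the first doc that does not fit, preserving rank order
        kept.append(doc)
        budget -= cost
    return kept


# Toy run with a tiny budget so the trimming is visible.
docs = [
    "one two three four five",
    "one two three four five six",
    "one two three four",
]
kept = fit_to_budget("a b c", docs, max_tokens=20, reserve_for_answer=5)
print(len(kept), "of", len(docs), "passages kept")
```

Dropping whole low-ranked passages keeps each remaining passage intact, at the cost of losing evidence that ranked lower.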
Infra Optimization
- Faiss+HNSW index for scalable nearest-neighbor search
Model Optimization
- LoRA
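LoRA freezes a weight matrix and trains a low-rank update made of two small factors, so the trainable parameter count per adapted matrix is rank × (d_in + d_out) instead of d_in × d_out. The sketch below works that arithmetic for one matrix; the dimensions and rank are illustrative assumptions, since the source does not state the LoRA configuration used.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)


d_in = d_out = 4096  # illustrative square projection matrix
rank = 16            # hypothetical rank, not taken from the source

full = d_in * d_out
lora = lora_trainable_params(d_in, d_out, rank)
fraction = lora / full

print(f"full: {full:,}  LoRA: {lora:,}  fraction trained: {fraction:.4%}")
```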
System Optimization
- Two-stage retrieval (coarse HNSW + fine reranker) to reduce candidate set
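The two-stage idea can be sketched without any retrieval library: a cheap score narrows the corpus to a few candidates, and a costlier scorer orders only those. In this toy version an exact dot product stands in for the approximate HNSW search, and a lookup table stands in for the reranker model; both are assumptions for illustration.

```python
import heapq
from typing import Callable, Dict, List, Sequence


def two_stage_retrieve(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    rerank_score: Callable[[str], float],
    coarse_k: int = 3,
    final_k: int = 1,
) -> List[str]:
    """Coarse top-k by dot product, then rerank only those candidates."""

    def dot(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    # Stage 1: cheap similarity over the whole corpus (stand-in for HNSW search).
    candidates = heapq.nlargest(
        coarse_k, doc_vecs, key=lambda doc_id: dot(query_vec, doc_vecs[doc_id])
    )
    # Stage 2: the expensive scorer runs on the small candidate set only.
    reranked = sorted(candidates, key=rerank_score, reverse=True)
    return reranked[:final_k]


doc_vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.8, 0.2]}
toy_rerank_scores = {"a": 0.2, "b": 0.9, "c": 1.0, "d": 0.5}  # hypothetical scores

top = two_stage_retrieve([1.0, 0.0], doc_vecs, toy_rerank_scores.__getitem__)
print(top)
```

Document "c" has the highest reranker score but never reaches stage two, which shows the trade-off: the coarse stage caps reranking cost but also bounds what the reranker can recover.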
Training Optimization
- Model-oriented data selection (MODS/MoDS) and k-center greedy to choose diverse high-quality instruc
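The k-center greedy step can be sketched in a few lines: starting from one point, repeatedly add the point farthest from everything already selected, which spreads the selection across the data. This is the generic algorithm on toy 2-D points; in practice the points would be embedding vectors of training examples, and how the paper embeds or seeds them is not stated here.

```python
import math
from typing import List, Sequence


def k_center_greedy(
    points: Sequence[Sequence[float]], k: int, first: int = 0
) -> List[int]:
    """Return indices of k points chosen by farthest-point (k-center greedy) selection."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [first]
    # Distance from every point to its nearest selected center so far.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        min_dist = [
            min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)
        ]
    return selected


# Two tight clusters plus one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0), (0.0, 4.0)]
chosen = k_center_greedy(points, k=3)
print(chosen)
```

With k=3 the selection takes one point from each cluster and the isolated point, rather than two near-duplicates from the same cluster.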
Inference Optimization
- Self-Knowledge Boundary Identification to avoid unnecessary retrieval calls
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Token-length constraints (≈2816 tokens) can truncate retrieved context and hurt accuracy on datasets that supply golden documents (PubMedQA).
- Results are for QA benchmarks; not evaluated on clinical deployment metrics or safety-critical workflows.
- No public code or dataset release stated, which limits exact reproduction.
When Not To Use
- When you must include extensive golden context that exceeds model token limits.
- When you cannot index a high-quality biomedical retrieval corpus (e.g., PubMed).
- For non-medical tasks where domain-specific fine-tuning and corpora are not available.
Failure Modes
- Wrong 'know' classification: gating may skip needed retrieval and produce incomplete answers.
- Retrieval of pseudo-relevant documents can still introduce hallucinatory or misleading content.
- Token overflow when many retrieved docs are concatenated, leading to truncated evidence and lower accuracy.
Core Entities
Models
- Bailicai
- Meta-Llama-3-8B
- Meta-Llama-3-70B
- Med-PaLM2
- Flan-PaLM
- ChatGPT-3.5
- ChatGPT-4
- Self-BioRAG
- OpenBioLLM
- PMC-LLaMA
- BioMistral
- MedCPT
- bge-reranker-large
Metrics
- Accuracy
- Average score (across benchmarks)
Datasets
- Bailicai dataset
- UltraMedical
- PubMed
- Wikipedia
- StatPearls
- Medical Textbooks
- Merge corpus (54.2M chunks)
Benchmarks
- MedQA
- MedMCQA
- MMLU-Med
- PubMedQA
- BioASQ
Context Entities
Models
- Flan-PaLM
- Med-PaLM2
- Mistral-7B-v0.3
- Meta-Llama-3-70B
Metrics
- Accuracy
Datasets
- UltraMedical (source)
- PubMed search logs (used by MedCPT)
Benchmarks
- USMLE-adjacent datasets referenced (context for MedQA-style tasks)

