Overview
The approach is practical: it combines curated fine-tuning, retrieval gating, and DAG decomposition; evidence comes from benchmark gains and ablations on standard datasets.
Citations: 5
Evidence Strength: 0.70
Confidence: 0.78
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/5
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 70%
Why It Matters For Business
Bailicai shows you can run an 8B open model locally with curated medical fine-tuning and selective retrieval to match or exceed ChatGPT-3.5 on medical QA, reducing API costs and privacy risk.
Who Should Care
Summary TLDR
Bailicai is a retrieval-augmented generation (RAG) framework built for medical question answering. It adds three specialized modules on top of RAG: Self-Knowledge Boundary Identification (decides whether retrieval is needed), Directed Acyclic Graph (DAG) task decomposition (splits complex queries), and Medical Knowledge Injection (fine-tunes with curated medical data and hard negatives). Trained with LoRA on Meta-Llama-3-8B and using MedCPT + Faiss for retrieval, Bailicai (8B) averages 71.82% accuracy across five medical benchmarks, outperforms ChatGPT-3.5 by about 6 percentage points, and is more robust to distracting documents. Key practical wins: fewer unnecessary retrieval calls, structured retrievals
Problem Statement
Open-source LLMs underperform proprietary models in medicine and hallucinate. Standard RAG can help but suffers from noisy/irrelevant documents and always-on retrieval costs. The problem: how to combine domain fine-tuning and smarter, selective retrieval so open models get high accuracy and lower hallucination in medical QA.
Main Contribution
A multi-module RAG framework (Bailicai) combining Medical Knowledge Injection, Self-Knowledge Boundary Identification, DAG task decomposition, and RAG.
A curated Bailicai medical dataset (173k+ training entries) built from UltraMedical with model-oriented filtering and hard negatives.
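As a rough illustration of how the boundary gate and DAG decomposition could fit together at inference time, the sketch below orders sub-questions by dependency and marks retrieval only for those flagged as outside the model's knowledge. The sub-question names, the `needs_retrieval` flags, and the `plan` helper are hypothetical; the summary does not describe Bailicai's actual data structures.

```python
from graphlib import TopologicalSorter

# Hypothetical decomposition of one complex query into sub-questions.
# Each key maps to the set of sub-questions it depends on.
dependencies = {
    "diagnosis": set(),
    "first_line_drug": {"diagnosis"},
    "contraindications": {"first_line_drug"},
    "final_answer": {"diagnosis", "contraindications"},
}

# Hypothetical gate output: True means the model is judged NOT to know the answer.
needs_retrieval = {
    "diagnosis": False,
    "first_line_drug": True,
    "contraindications": True,
    "final_answer": False,
}


def plan(dependencies, needs_retrieval):
    """Return (sub_question, action) pairs in a dependency-respecting order."""
    order = TopologicalSorter(dependencies).static_order()
    return [
        (q, "retrieve" if needs_retrieval[q] else "answer_directly")
        for q in order
    ]


steps = plan(dependencies, needs_retrieval)
for name, action in steps:
    print(f"{name}: {action}")
```

Because prerequisites always come first, each sub-answer can be passed into the prompts of the steps that depend on it, and retrieval is issued only where the gate asks for it.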
Key Findings
Bailicai (8B) obtains a 71.82% average accuracy across five medical QA benchmarks.
Bailicai beats ChatGPT-3.5 by 5.97 percentage points on the same benchmark suite.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 71.82% | — | — | MedQA, MedMCQA, MMLU-Med, PubMedQA, BioASQ | Reported Bailicai (8B) average | Table V |
| Accuracy | 65.85% | — | -5.97 pts vs Bailicai | MedQA, MedMCQA, MMLU-Med, PubMedQA, BioASQ | Reported ChatGPT-3.5 average | Table V |
What To Try In 7 Days
Train a small pilot: fine-tune an 8B open model on 50–100k high-quality medical Q&A using MODS-like selection.
Add a lightweight retrieval gate: implement a classifier to skip retrieval for 'known' queries and measure retrieval call reduction.
Index PubMed with a dense encoder (MedCPT or similar) and a reranker; test top-1 vs top-5 retrieval accuracy trade-offs.
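The two measurements suggested above (retrieval-call reduction from a gate, and top-1 versus top-5 retrieval quality) can be prototyped before any model is trained. The sketch below is a minimal, assumption-laden version: `known_score` stands in for the output of whatever gate classifier you build, the 0.5 threshold is arbitrary, and the document IDs are toy data.

```python
from dataclasses import dataclass


@dataclass
class Query:
    text: str
    known_score: float  # hypothetical gate confidence that the model already knows the answer


def should_retrieve(query, threshold=0.5):
    """Retrieve only when the gate is not confident the model knows the answer."""
    return query.known_score < threshold


def retrieval_reduction(queries, threshold=0.5):
    """Fraction of retrieval calls saved relative to always-on retrieval."""
    if not queries:
        return 0.0
    skipped = sum(1 for q in queries if not should_retrieve(q, threshold))
    return skipped / len(queries)


def hit_rate_at_k(ranked_ids, relevant_ids, k):
    """Share of queries whose top-k results contain at least one relevant document."""
    hits = sum(1 for ranked, rel in zip(ranked_ids, relevant_ids) if set(ranked[:k]) & rel)
    return hits / len(ranked_ids)


queries = [
    Query("What is the normal adult resting heart rate?", 0.92),
    Query("Which recent trial compared drug A and drug B?", 0.10),
    Query("What vitamin deficiency causes scurvy?", 0.88),
    Query("How is drug C dosed in renal failure?", 0.35),
]
reduction = retrieval_reduction(queries)

# Toy retrieval results: ranked document IDs per query and the relevant IDs.
ranked = [
    ["d3", "d1", "d7", "d4", "d9"],
    ["d2", "d9", "d4", "d1", "d6"],
    ["d5", "d6", "d8", "d2", "d3"],
]
relevant = [{"d1"}, {"d2"}, {"d0"}]

print(f"Retrieval calls skipped: {reduction:.0%}")
print(f"hit@1={hit_rate_at_k(ranked, relevant, 1):.2f}, hit@5={hit_rate_at_k(ranked, relevant, 5):.2f}")
```

Tracking answer accuracy alongside the skip rate matters here, since a gate that skips too aggressively trades retrieval cost for incomplete answers.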
Agent Features
Planning
Tool Use
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Model-oriented data selection (MODS/MoDS) and k-center greedy to choose diverse high-quality instruc
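For readers unfamiliar with k-center greedy, the sketch below shows the standard farthest-first heuristic on toy 2-D points: each new pick is the point farthest from everything already selected, which favors diversity. This illustrates the general technique only, not the paper's exact selection pipeline; the toy embeddings, the starting index, and the function name are assumptions.

```python
import math


def k_center_greedy(points, k, first=0):
    """Pick k indices so that each new pick is the point farthest from those already chosen."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [first]
    # Distance from every point to its nearest selected center so far.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(selected) < k:
        # The farthest remaining point becomes the next center.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


# Toy 2-D "embeddings": two tight clusters and one outlier.
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
picked = k_center_greedy(embeddings, k=3)
print(picked)
```

In practice the points would be instruction embeddings from a sentence encoder, and near-duplicates within a cluster are naturally skipped because their distance to an already chosen center is small.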
Inference Optimization
Risks & Boundaries
Limitations
Token-length constraints (≈2812 tokens) can truncate retrieved context and hurt performance on datasets that include golden documents (PubMedQA).
Results are for QA benchmarks; not evaluated on clinical deployment metrics or safety-critical workflows.
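The token-length limitation can be made concrete with a small context-packing sketch: ranked documents are added until a fixed budget runs out, and anything that does not fit is dropped whole instead of being cut mid-passage. Whitespace splitting is a crude stand-in for a real tokenizer, and the `reserve_for_answer` parameter and the treatment of 2812 as a total budget are assumptions, not details from the source.

```python
def count_tokens(text):
    """Crude token count; a real system would use the model's tokenizer."""
    return len(text.split())


def pack_context(question, documents, max_tokens=2812, reserve_for_answer=256):
    """Greedily keep ranked documents that fit in the remaining token budget."""
    budget = max_tokens - reserve_for_answer - count_tokens(question)
    kept, dropped = [], []
    for doc in documents:  # assumed to be sorted by retrieval score, best first
        cost = count_tokens(doc)
        if cost <= budget:
            kept.append(doc)
            budget -= cost
        else:
            dropped.append(doc)
    return kept, dropped


question = "Does drug X reduce mortality?"
documents = [
    " ".join(["golden"] * 40),   # long golden document
    " ".join(["related"] * 30),  # does not fit after the first one
    " ".join(["short"] * 5),     # small enough to fill the remaining space
]
kept, dropped = pack_context(question, documents, max_tokens=60, reserve_for_answer=10)
print(len(kept), "kept,", len(dropped), "dropped")
```

Logging what lands in `dropped` is a cheap way to see how often golden context is being lost on datasets such as PubMedQA.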
When Not To Use
When you must include extensive golden context that exceeds model token limits.
When you cannot index a high-quality biomedical retrieval corpus (e.g., PubMed).
Failure Modes
Wrong 'know' classification: gating may skip needed retrieval and produce incomplete answers.
Retrieval of pseudo-relevant documents can still introduce hallucinated or misleading content.

