Overview
Production Readiness
0.3
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
4
Why It Matters For Business
BiMediX shows that bilingual medical accuracy can be delivered at much lower serving cost: it matches or exceeds large 70B medical models on accuracy while generating tokens roughly 8x faster, which makes research deployments and low-latency prototypes cheaper.
Summary TLDR
BiMediX is a bilingual (English/Arabic) medical chatbot built by instruction-tuning a Mixtral MoE model on BiMed1.3M — a 1.3M-sample, 632M-token bilingual medical instruction set. The team used a semi-automated English→Arabic translation pipeline with human checks to create an Arabic medical benchmark. BiMediX improves diagnostic-style multiple-choice accuracy modestly over strong English medical baselines (≈+2.5% vs Med42) while delivering much faster inference (≈180.6 tps vs ~20.9 tps for 70B models). The release includes code, model, and evaluation assets for research use only; it is not ready for clinical deployment.
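The speedup follows directly from the two throughput figures quoted above. A quick check, taking those tokens-per-second values as given (the variable names are illustrative):

```python
# Throughput figures as quoted in the summary (tokens per second).
bimedix_tps = 180.6
dense_70b_tps = 20.9

# Ratio of the two throughputs gives the inference speedup.
speedup = bimedix_tps / dense_70b_tps
print(f"Throughput ratio: {speedup:.1f}x")  # prints "Throughput ratio: 8.6x"
```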
Problem Statement
Medical LLM work is dominated by English models and large dense models. Arabic medical data and evaluation tools are scarce. There is a need for a bilingual medical LLM that can (1) handle multi-turn clinical-style chats, (2) work in Arabic and English, and (3) be efficient to run.
Main Contribution
BiMediX: the first bilingual (English/Arabic) medical Mixture-of-Experts (MoE) LLM designed for multi-turn chats, MCQA, and open QA.
BiMed1.3M: an instruction-tuning dataset of ~1.31M bilingual medical interactions (632.3M tokens), including 249.7k synthesized multi-turn chats.
A semi-automated English→Arabic iterative translation pipeline with human refinement to create a high-quality Arabic medical benchmark and bilingual instructions.
Parameter-efficient bilingual instruction tuning of Mixtral-8x7B using QLoRA adapters on the experts and the router (training ~4% of parameters), which gives strong performance with limited compute.
A public bilingual evaluation benchmark (translated from existing English medical datasets) and release of code/models for research.
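The "~4% of params" figure for the adapter setup can be sanity-checked with a back-of-the-envelope count. The sketch below assumes the layer dimensions of the public Mixtral-8x7B configuration (hidden size 4096, expert feed-forward size 14336, 32 layers, 8 experts, roughly 46.7B total parameters) and assumes adapters on the three expert projections plus the router; neither assumption is stated in this summary, and the rank of 128 is the value reported elsewhere on this page.

```python
# Assumed Mixtral-8x7B dimensions (public model config, not from this summary).
HIDDEN = 4096          # model hidden size
FFN = 14336            # expert feed-forward (intermediate) size
LAYERS = 32
EXPERTS = 8
TOTAL_PARAMS = 46.7e9  # approximate total parameter count
RANK = 128             # adapter rank reported on this page


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA adapter adds two matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumption: each expert has three linear projections between HIDDEN and FFN.
per_expert = 3 * lora_params(HIDDEN, FFN, RANK)
# Assumption: the router is one linear layer from HIDDEN to one logit per expert.
router = lora_params(HIDDEN, EXPERTS, RANK)

trainable = LAYERS * (EXPERTS * per_expert + router)
fraction = trainable / TOTAL_PARAMS
print(f"Trainable adapter parameters: {trainable / 1e9:.2f}B")
print(f"Share of total parameters: {fraction:.1%}")
```

Under these assumptions the count comes out near 1.8B adapter parameters, which is close to the reported ~4% share.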
Key Findings
BiMediX beats Med42 and Meditron on English medical benchmarks.
BiMediX substantially improves Arabic medical accuracy over a large Arabic-centric model.
BiMediX is much faster at inference than large 70B medical models.
BiMed1.3M is large and bilingual with multi-turn dialogues.
Results
Accuracy
Inference throughput
Inference latency
Training compute and time
Who Should Care
What To Try In 7 Days
Download BiMediX code/model and run the bilingual benchmark locally to profile latency and accuracy on your hardware.
Use a subset of BiMed1.3M with QLoRA adapters to fine-tune a smaller Mixtral- or Llama-family model for domain-specific workflows.
Translate 5–10k domain examples into your target language using the paper's semi-automated pipeline and validate translations with native speakers.
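For the first item, a small harness is enough to get accuracy, latency, and throughput numbers on your own hardware. This is a minimal sketch, not the paper's evaluation code: `generate` is a hypothetical wrapper around whatever inference stack you use, assumed to return the predicted answer letter and the number of tokens generated.

```python
import time
from typing import Callable, List, Tuple


def profile_mcqa(
    generate: Callable[[str], Tuple[str, int]],
    examples: List[Tuple[str, str]],
) -> dict:
    """Run `generate` on (prompt, gold_answer) pairs and report simple metrics.

    `generate` is a hypothetical model wrapper returning
    (predicted_answer_letter, number_of_generated_tokens).
    """
    correct = 0
    total_tokens = 0
    start = time.perf_counter()
    for prompt, gold in examples:
        answer, n_tokens = generate(prompt)
        # Case-insensitive exact match on the answer letter.
        correct += int(answer.strip().upper() == gold.strip().upper())
        total_tokens += n_tokens
    elapsed = time.perf_counter() - start
    n = len(examples)
    return {
        "accuracy": correct / n if n else 0.0,
        "mean_latency_s": elapsed / n if n else 0.0,
        "tokens_per_sec": total_tokens / elapsed if elapsed > 0 else 0.0,
    }


# Stand-in model so the sketch runs without any weights: always answers "A".
def dummy_generate(prompt: str) -> Tuple[str, int]:
    return "A", 1


report = profile_mcqa(dummy_generate, [("Q1 ...", "A"), ("Q2 ...", "B")])
print(report)
```

Swapping `dummy_generate` for a real model call keeps the rest unchanged. Timing the whole loop includes prompt processing, so the tokens-per-second value is an end-to-end figure rather than pure decode speed.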
Agent Features
Memory
- 32,768-token context window
Frameworks
- LoRA
- PEFT
- PyTorch
- DeepSpeed
- ZeRO
Architectures
- MoE
- Mixtral-8x7B (sparse experts, router)
Optimization Features
Token Efficiency
- Instruction tuning on 632M medical tokens focused on domain interactions
Infra Optimization
- 8× A100-80GB GPUs; training completed in ~35 hours (roughly 280 GPU-hours)
Model Optimization
- Sparse MoE activation: only ~13B of the 47B total parameters are active per token
- Router directs tokens to two experts for sparse compute
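Top-2 routing, as described in the two items above, can be sketched in a few lines. The example is a toy illustration in plain Python, not Mixtral's implementation: the router logits are passed in directly (in the real model a linear layer computes them from the token), and renormalizing with a softmax over only the two selected logits is an assumption about how the mixing weights are formed.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def top2_moe(
    x: Vector,
    router_logits: Sequence[float],
    experts: Sequence[Callable[[Vector], Vector]],
) -> Vector:
    """Send x to the two highest-scoring experts and mix their outputs."""
    # Indices of the two largest router logits.
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top2 = order[:2]
    # Softmax over the two selected logits only, so the weights sum to 1.
    m = max(router_logits[i] for i in top2)
    exps = [math.exp(router_logits[i] - m) for i in top2]
    weights = [e / sum(exps) for e in exps]
    # Only the chosen experts are evaluated; the others are skipped entirely.
    out = [0.0] * len(x)
    for w, i in zip(weights, top2):
        y = experts[i](x)
        out = [o + w * v for o, v in zip(out, y)]
    return out


# Toy experts: expert k scales its input by k.
experts = [lambda v, k=k: [k * a for a in v] for k in range(8)]
logits = [0.0, 0.0, 0.0, 2.0, 0.0, 2.0, 0.0, 0.0]
result = top2_moe([1.0, 2.0], logits, experts)
print(result)
```

Here experts 3 and 5 tie for the highest logit, so each gets weight 0.5 and the other six experts are never called. That skipped work is where the compute saving comes from.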
System Optimization
- Low-rank adapters (rank 128, α=64) avoid full fine-tuning
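With rank 128 and α=64, the standard LoRA formulation scales the low-rank update by α/rank = 0.5. The sketch below shows that formulation on tiny matrices in plain Python; it is illustrative only, and the toy shapes and values are not from the paper.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply matrix m by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(
    w: Matrix, a: Matrix, b: Matrix, x: List[float], rank: int, alpha: float
) -> List[float]:
    """y = W x + (alpha / rank) * B (A x); W stays frozen, only A and B train."""
    scale = alpha / rank
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [y + scale * u for y, u in zip(base, update)]


# Scaling factor for the reported setting.
reported_scale = 64 / 128  # 0.5

# Toy example with the same alpha/rank ratio (rank 2, alpha 1).
identity = [[1.0, 0.0], [0.0, 1.0]]
y = lora_forward(w=identity, a=identity, b=identity, x=[2.0, 4.0], rank=2, alpha=1.0)
print(reported_scale, y)
```

Because only `a` and `b` receive gradients, the trainable parameter count grows with the rank rather than with the full weight matrix size.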
Training Optimization
- QLoRA (LoRA adapters on a quantized base model)
- Gradient checkpointing and DeepSpeed ZeRO for memory efficiency
Inference Optimization
- Sparse activation yields large throughput gains (≈180.6 tokens/sec vs ~20.9 tokens/sec for 70B models)
- Lower latency makes real-time chat feasible
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Model can hallucinate, produce toxic or biased outputs
- Limited human evaluation; medical accuracy not guaranteed
- No explicit safety or alignment mechanisms integrated
- Arabic translations depend on LLM plus human checks and may still contain errors
When Not To Use
- Do not use for clinical decision-making or unattended diagnoses
- Avoid deployment in high-stakes patient care without expert oversight
- Do not replace certified medical professionals or regulatory processes
Failure Modes
- Hallucinated facts or incorrect medical recommendations
- Overconfident but wrong answers on MCQA or open questions
- Translation errors that change clinical meaning
- Biases inherited from pretraining data affecting minority groups
Core Entities
Models
- BiMediX
- Mixtral-8x7B
- Mixtral
- Jais-30B
- Med42-70B
- Meditron-70B
- PMC-LLaMA-13B
- Clinical Camel-70B
Metrics
- Accuracy
- latency (s)
- tokens/sec
Datasets
- BiMed1.3M
- PubMedQA
- MedMCQA
- MedQA
- Medical MMLU
- HealthCareMagic
- iCliniq
- Medical Meadow
- UMLS
- LiveQA
- MedicationQA
Benchmarks
- BiMediX Arabic benchmark (this paper)
- Bilingual benchmark (this paper)
- MedMCQA
- MedQA
- PubMedQA
- Medical MMLU

