Overview
Production Readiness
1
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
Eir-8B shows tangible gains on Thai medical QA, translation, and 18 clinical tasks, so hospitals and health-tech teams can build higher-quality Thai clinical assistants while keeping data on-premises.
Summary TLDR
Eir-8B is an 8B-parameter LLM adapted from LLaMA 3.1 Instruct-8B and fine-tuned on Thai/English clinical content plus synthetic data. It targets Thai medical tasks (QA, translation, EHR processing) and outperforms similar-size open models on medical benchmarks and a custom 18-task clinical test. The team used LoRA fine-tuning, DeepSpeed ZeRO Stage 2, SLERP model merging, data filtering, RAG for QA generation, and both automatic metrics and human and GPT-4o scoring. The authors caution that the model is not ready for clinical deployment without further trials.
Problem Statement
Thai medical NLP is under-resourced. Off-the-shelf LLMs miss Thai medical terms, and hospital data privacy prevents cloud API use. The paper aims to build a Thai medical LLM that understands transliterated medical terms, works on Thai clinical tasks, and can be deployed inside hospital networks.
Main Contribution
Built Eir-8B and Eir-8B-prob by adapting LLaMA 3.1 Instruct-8B with LoRA and model merging (SLERP).
Constructed a mixed Thai/English clinical pretraining corpus (~100k pages) and added 266k synthetic QA pairs for instruction tuning.
Created the Clinically Adapted Model Enhanced test, a suite of 18 Thai clinical tasks, and evaluated models on it with human and GPT-4o scoring.
Showed stronger performance than comparable open Thai LLMs on medical benchmarks, translation (BLEU), and the 18-task clinical suite.
Key Findings
Eir-8B-prob achieves a higher average medical benchmark score than Typhoon-v1.5x-8B-instruct.
Eir-8B produced the best medical translation BLEU score among tested models.
On the 18-task clinical suite, scored 0–10 with GPT-4o as judge, Eir-8B averaged higher than both GPT-4o itself and the 8B baselines.
Results
Avg medical benchmark (MMLU-style)
Clinical 18-task average (0–10)
Medical translation quality (BLEU)
Thai language general score (M3Exam)
Who Should Care
What To Try In 7 Days
Run the authors' public model/tools locally on a small held-out clinical set to reproduce BLEU and clinical-task scores.
Apply LoRA fine-tuning on your in-house clinical notes (after IRB/consent) and validate with clinician reviewers.
Use RAG+embeddings to assemble a small Thai clinical knowledge base and test retrieval quality on typical queries.
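The retrieval check in the last step can be sketched as a recall@k measurement. This is a minimal illustration, not the authors' pipeline: `embed` is a toy bag-of-words stand-in for a real embedding model, the documents and queries are invented English examples, and all function names are hypothetical. Thai text is not whitespace-delimited, so a real test would need a Thai tokenizer or a multilingual embedding model in place of `embed`.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def recall_at_k(labelled_queries, docs, k):
    """Fraction of (query, relevant_doc) pairs whose relevant doc is in the top k."""
    hits = sum(1 for q, rel in labelled_queries if rel in retrieve(q, docs, k))
    return hits / len(labelled_queries)


# Invented toy knowledge base and labelled queries (assumptions for illustration).
docs = [
    "metformin is a first line drug for type 2 diabetes",
    "amoxicillin is an antibiotic for bacterial infection",
    "paracetamol relieves fever and mild pain",
]
labelled_queries = [
    ("drug for type 2 diabetes", docs[0]),
    ("antibiotic for infection", docs[1]),
    ("fever and pain", docs[2]),
]
score = recall_at_k(labelled_queries, docs, k=1)
```

Tracking recall@k on a few dozen clinician-written queries gives a quick signal on whether the knowledge base and embedding choice are adequate before any generation step is added.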
Agent Features
Architectures
- Transformer (LLaMA 3.1 Instruct-8B)
- LoRA
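As a rough illustration of what LoRA adds to a Transformer layer, the sketch below computes a forward pass with a frozen weight matrix `W` plus a trainable low-rank update `B A`, scaled by `alpha / r`. It uses plain Python lists to stay self-contained. The matrix sizes, rank, and `alpha` are illustrative assumptions, not values reported for Eir-8B.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x). W stays frozen; only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_in, d_out, r):
    """Trainable parameters in the adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


# Tiny example with rank r = 1 (all values are made up for illustration).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen 3 x 3 weight
A = [[1.0, 0.0, 0.0]]                                    # r x d_in
B = [[0.0], [2.0], [0.0]]                                # d_out x r
x = [1.0, 2.0, 3.0]
y = lora_forward(W, A, B, x, alpha=2, r=1)  # [1.0, 6.0, 3.0]

# For a hypothetical 4096 x 4096 projection with r = 16:
full_params = 4096 * 4096                           # 16,777,216
adapter_params = lora_param_count(4096, 4096, 16)   # 131,072
```

The parameter comparison shows why the approach suits limited GPU memory: the adapter for that hypothetical layer is 128 times smaller than the full weight matrix.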
Optimization Features
Token Efficiency
- Vocabulary size reported as 2,048 tokens
Infra Optimization
- Training on 4 × NVIDIA A100 40GB (approx. 105 hours total)
Model Optimization
- LoRA
- Spherical Linear Interpolation merging (SLERP)
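SLERP merging interpolates between two sets of weights along the arc between them rather than along a straight line. The sketch below applies the standard SLERP formula to flat weight vectors. It is a minimal illustration under assumptions: the function name, the fallback to linear interpolation for near-parallel vectors, and the interpolation factor `t` are not taken from the paper, and real merges are typically applied tensor by tensor.

```python
import math


def slerp(w_a, w_b, t, eps=1e-8):
    """Spherically interpolate between two flat weight vectors, t in [0, 1]."""
    norm_a = math.sqrt(sum(x * x for x in w_a))
    norm_b = math.sqrt(sum(x * x for x in w_b))
    # Cosine of the angle between the vectors, clamped so acos stays in domain.
    cos_omega = sum(x * y for x, y in zip(w_a, w_b)) / (norm_a * norm_b + eps)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-6:
        # Vectors are nearly parallel: plain linear interpolation is stable here.
        return [(1 - t) * x + t * y for x, y in zip(w_a, w_b)]
    coef_a = math.sin((1 - t) * omega) / sin_omega
    coef_b = math.sin(t * omega) / sin_omega
    return [coef_a * x + coef_b * y for x, y in zip(w_a, w_b)]


# Halfway between two orthogonal unit vectors stays on the unit circle.
merged = slerp([1.0, 0.0], [0.0, 1.0], t=0.5)
```

Unlike a plain average, which would give `[0.5, 0.5]` here, the spherical midpoint keeps unit norm, which is the usual motivation for preferring SLERP when merging model weights.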
Training Optimization
- DeepSpeed ZeRO Stage 2
- Gradient checkpointing
- LoRA
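A DeepSpeed configuration for a setup like this might look like the sketch below, which builds the JSON config as a Python dict. The keys are standard DeepSpeed config fields, but every value (micro-batch size, accumulation steps, bf16) is a placeholder assumption, since the summary does not report the authors' settings. Gradient checkpointing is usually switched on at the model level rather than in this file, for example via `model.gradient_checkpointing_enable()` in Hugging Face Transformers.

```python
import json


def build_ds_config(micro_batch, grad_accum):
    """Build a ZeRO Stage 2 DeepSpeed config dict (values are placeholders)."""
    return {
        "train_micro_batch_size_per_gpu": micro_batch,
        "gradient_accumulation_steps": grad_accum,
        "bf16": {"enabled": True},  # assumption: bf16 mixed precision on A100
        "zero_optimization": {
            "stage": 2,  # shard optimizer states and gradients across GPUs
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
    }


def effective_batch_size(config, n_gpus):
    """Global batch size = per-GPU micro-batch x accumulation steps x GPUs."""
    return (
        config["train_micro_batch_size_per_gpu"]
        * config["gradient_accumulation_steps"]
        * n_gpus
    )


config = build_ds_config(micro_batch=2, grad_accum=8)  # hypothetical values
global_batch = effective_batch_size(config, n_gpus=4)  # 4 GPUs as in the summary
config_json = json.dumps(config, indent=2)             # text to save as ds_config.json
```

ZeRO Stage 2 partitions optimizer states and gradients but keeps a full copy of the parameters on each GPU, which pairs naturally with LoRA because only the small adapter weights carry optimizer state.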
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Not ready for clinical use without randomized trials and safety validation.
- Heavy reliance on synthetic data and GPT-4 translation may introduce artifacts.
- Evaluation partly scored by GPT-4o, which can introduce judge bias.
- In-hospital deployment is described, but the internal EHR data are not public.
When Not To Use
- Do not use for autonomous clinical decision-making or real-time diagnosis.
- Avoid deploying without clinician oversight and local safety audits.
Failure Modes
- Hallucinated or incorrect medical advice despite high benchmark scores.
- Translation mismatches for complex terminology or context.
- Bias from synthetic or translated training data.
- Privacy leaks if on-premises security is misconfigured.
Core Entities
Models
- Eir-8B
- Eir-8B-prob
- LLaMA 3.1 Instruct-8B
- Typhoon-v1.5x-8B-instruct
- OpenThaiGPT-beta-7B
- BioMistral-7B
- Mistral-7B
- Gemma-2-9B
- GPT-3.5 Turbo 1106
- GPT-4-0613
- GPT-4o
Metrics
- Accuracy
- BLEU (translation)
- Average clinical score (0–10)
- M3Exam score
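To make the BLEU metric concrete, the sketch below computes a simplified sentence-level BLEU: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. It assumes a single reference, whitespace tokenization, and no smoothing, so its numbers will not match the paper's. A reproduction should use an established tool such as sacreBLEU with a tokenizer suited to Thai. Scores here are on a 0 to 1 scale; papers often report the same quantity multiplied by 100.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU without smoothing, in [0, 1]."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any empty precision zeroes the score
        log_precision += math.log(overlap / total) / max_n
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) >= len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision)


# Invented example sentences (assumptions for illustration).
reference = "the patient has a mild fever"
exact = bleu("the patient has a mild fever", reference)
short = bleu("the patient has", reference, max_n=2)
```

In the second call every n-gram in the candidate matches, so the score is driven entirely by the brevity penalty, which shows how BLEU discourages translations that drop content.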
Datasets
- MedQA
- MedMCQA
- PubMedQA
- MMLU (medical subset)
- ThaiExam
- M3Exam
- XNLI
- XCOPA
- Open PMC Patient
- OpenPatient
- Custom Thai medical QA (266,080 synthetic pairs)
Benchmarks
- MMLU (medical subsets)
- MedQA/MedMCQA/PubMedQA
- ThaiExam / M3Exam
- Clinically Adapted 18-task test

