Eir-8B: an 8B-parameter Thai medical LLM that improves medical QA, translation, and 18 clinical tasks

September 13, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

Benchmark tables and translation scores support the model-level gains; however, the evaluation relies on GPT-4o scoring and synthetic data, and the authors explicitly warn against immediate clinical deployment.

Citations: 1

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 100%

Novelty: 60%

Authors

Yutthakorn Thiprak, Rungtam Ngodngamthaweesuk, Songtam Ngodngamtaweesuk

Why It Matters For Business

Eir-8B shows tangible gains on Thai medical QA, translation, and 18 clinical tasks, so hospitals and health-tech teams can build higher-quality Thai clinical assistants while keeping data on-premises.

Who Should Care

Summary TLDR

Eir-8B is an 8B-parameter LLM adapted from LLaMA-3.1 and fine-tuned with Thai/English clinical content plus synthetic data. It targets Thai medical tasks (QA, translation, EHR processing) and outperforms similar-size open models on medical benchmarks and a custom 18-task clinical test. The team used LoRA fine-tuning, DeepSpeed Stage 2, SLERP model merging, data filtering, RAG for QA generation, and both automatic and human/GPT-4o scoring. The authors caution the model is not yet ready for clinical deployment without further trials.

Problem Statement

Thai medical NLP is under-resourced. Off-the-shelf LLMs miss Thai medical terms, and hospital data privacy prevents cloud API use. The paper aims to build a Thai medical LLM that understands transliterated medical terms, works on Thai clinical tasks, and can be deployed inside hospital networks.

Main Contribution

Built Eir-8B and Eir-8B-prob by adapting LLaMA 3.1 Instruct-8B with LoRA and model merging (SLERP).

Constructed a mixed Thai/English clinical pretraining corpus (~100k pages) and added 266k synthetic QA pairs for instruction tuning.
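
As an illustration of the SLERP merging step named above, here is a minimal sketch of spherical linear interpolation between two flattened weight vectors. The summary does not say how the authors applied SLERP (per tensor, per layer, or with which interpolation factor), so the function name `slerp`, the factor `t`, and the toy vectors are assumptions for illustration only.

```python
import math


def slerp(w_a, w_b, t, eps=1e-6):
    """Spherically interpolate between two equal-length weight vectors.

    t=0 returns w_a, t=1 returns w_b. Hypothetical helper, not the authors' code.
    """
    norm_a = math.sqrt(sum(x * x for x in w_a))
    norm_b = math.sqrt(sum(y * y for y in w_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Angle is undefined for a zero vector: fall back to linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(w_a, w_b)]

    # Cosine of the angle between the two vectors, clamped for numerical safety.
    cos_omega = sum(x * y for x, y in zip(w_a, w_b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)

    if sin_omega < eps:
        # Nearly parallel vectors: SLERP degenerates to linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(w_a, w_b)]

    coef_a = math.sin((1 - t) * omega) / sin_omega
    coef_b = math.sin(t * omega) / sin_omega
    return [coef_a * x + coef_b * y for x, y in zip(w_a, w_b)]


# Toy example: halfway between two orthogonal unit vectors.
merged = slerp([1.0, 0.0], [0.0, 1.0], t=0.5)
```

Unlike plain averaging, the halfway point between two orthogonal unit vectors keeps unit length, which is the usual motivation for SLERP when merging checkpoints.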

Key Findings

Eir-8B-prob achieves a higher average medical benchmark score than Typhoon-v1.5x-8B-instruct.

Numbers: Avg MMLU: Eir-8B+Prob 80.2 vs Typhoon 69.1 (≈ +11.1)

Practical Use: If you need top open-source performance on multiple medical-choice benchmarks, use Eir-8B+prob style ensembling/prompting rather than Typhoon 8B.

Evidence Ref: Table 4

Eir-8B produced the best medical translation BLEU score among tested models.

Numbers: BLEU = 61.10 (Eir-8B) vs 35.74 (LLaMA 3.1 base)

Practical Use: For medical English–Thai translation in Thailand, Eir-8B gives materially better literal and term-preserving translations; validate outputs with clinicians before deployment.

Evidence Ref: Table 6

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Avg medical benchmark (MMLU-style) | Eir-8B: 71.9; Eir-8B+Prob: 80.2 | Typhoon-v1.5x-8B-instruct: 69.1 | Eir-8B+Prob +11.1 vs Typhoon; Eir-8B +2.8 vs Typhoon | Table 4 (aggregated columns) | Aggregated MMLU and medical QA scores across multiple domains | Table 4
Clinical 18-task average (0–10) | Eir-8B: 7.11 | GPT-4o: 6.38 | +0.73 (≈11% relative) | Clinically Adapted Model Enhanced test | Table 7 (scores per task) | Table 7
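
The deltas reported in the results above follow from simple arithmetic on the quoted scores. The short sketch below recomputes the absolute and relative differences; the helper name `deltas` is hypothetical, and the input numbers are the ones quoted in this summary.

```python
def deltas(score, baseline):
    """Return (absolute delta, relative delta in percent), both rounded."""
    absolute = score - baseline
    relative_pct = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative_pct, 1)


# Scores as quoted in this summary (Table 4 and Table 7 of the paper).
prob_vs_typhoon = deltas(80.2, 69.1)   # Eir-8B+Prob vs Typhoon-v1.5x-8B-instruct
base_vs_typhoon = deltas(71.9, 69.1)   # Eir-8B vs Typhoon-v1.5x-8B-instruct
clinical_vs_gpt4o = deltas(7.11, 6.38) # Eir-8B vs GPT-4o, 18-task average

print(prob_vs_typhoon, base_vs_typhoon, clinical_vs_gpt4o)
```

The clinical-task difference of 0.73 points corresponds to roughly 11.4% relative to the GPT-4o baseline, consistent with the "≈11% relative" figure.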

What To Try In 7 Days

Run the authors' public model/tools locally on a small held-out clinical set to reproduce BLEU and clinical-task scores.

Apply LoRA fine-tuning on your in-house clinical notes (after IRB/consent) and validate with clinician reviewers.

Use RAG+embeddings to assemble a small Thai clinical knowledge base and test retrieval quality on typical queries.
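
To make the retrieval check in the last step concrete, here is a minimal sketch of ranking knowledge-base snippets against a query by cosine similarity. It uses character trigram counts as a stand-in for a real embedding model, partly because Thai text is not whitespace-delimited; the function names, the toy English snippets, and the choice of trigrams are all assumptions, not the authors' pipeline.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams after lowercasing and dropping whitespace."""
    text = "".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=2):
    """Return the k documents most similar to the query."""
    q_vec = char_ngrams(query)
    ranked = sorted(docs, key=lambda d: cosine(q_vec, char_ngrams(d)), reverse=True)
    return ranked[:k]


# Toy knowledge base (placeholder snippets; swap in your own Thai clinical text).
docs = [
    "Metformin is a first-line drug for type 2 diabetes.",
    "Amoxicillin is an antibiotic used for bacterial infections.",
    "Hypertension is treated with ACE inhibitors or diuretics.",
]
top = retrieve("first-line drug for diabetes", docs, k=1)
```

In a real pilot the trigram vectors would be replaced by embeddings from a multilingual model, while the ranking logic and a small set of query/expected-document pairs can stay the same for measuring retrieval quality.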

Agent Features

Architectures
Transformer (LLaMA 3.1 Instruct-8B), LoRA

Optimization Features

Token Efficiency
Vocabulary size reported as 2,048 tokens
Infra Optimization
Training on 4 × NVIDIA A100 40GB (approx. 105 hours total)
Model Optimization
LoRA, Spherical Linear Interpolation merging (SLERP)
Training Optimization
DeepSpeed ZeRO Stage 2, Gradient checkpointing, LoRA
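
For a rough sense of why LoRA keeps fine-tuning affordable on this hardware, the sketch below compares the parameters in one full weight matrix with the parameters in its low-rank adapter. LoRA replaces the update to a d_out × d_in matrix with two factors of shapes d_out × r and r × d_in. The 4096 dimension matches the LLaMA 3.1 8B hidden size, but the rank of 16 is a hypothetical value, since this summary does not report the rank the authors used.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full matrix params, LoRA adapter params, adapter/full ratio)."""
    full = d_out * d_in               # parameters updated by full fine-tuning
    adapter = rank * (d_in + d_out)   # parameters in the two low-rank factors
    return full, adapter, adapter / full


# One 4096 x 4096 projection with an assumed rank of 16.
full, adapter, ratio = lora_param_counts(4096, 4096, rank=16)
print(f"full={full:,} adapter={adapter:,} ratio={ratio:.4%}")
```

Under these assumptions the adapter trains under 1% of the parameters of the matrix it modifies, which combines well with ZeRO Stage 2 sharding and gradient checkpointing to reduce memory use.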

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Not ready for clinical use without randomized trials and safety validation.

Heavy reliance on synthetic data and GPT-4 translation may introduce artifacts.

When Not To Use

Do not use for autonomous clinical decision-making or real-time diagnosis.

Avoid deploying without clinician oversight and local safety audits.

Failure Modes

Hallucinated or incorrect medical advice despite high benchmark scores.

Translation mismatches for complex terminology or context.

Core Entities

Models

Eir-8B, Eir-8B-prob, LLaMA 3.1 Instruct-8B, Typhoon-v1.5x-8B-instruct, OpenThaiGPT-beta-7B, BioMistral-7B, Mistral-7B, Gemma-2-9B, GPT-3.5 Turbo 1106, GPT-4-0613, GPT-4o

Metrics

Accuracy, BLEU (translation), Average clinical score (0–10), M3Exam score

Datasets

MedQA, MedMCQA, PubMedQA, MMLU (medical subset), ThaiExam, M3Exam, XNLI, XCOPA, Open PMC Patient, OpenPatient, Custom Thai medical QA (266,080 synthetic pairs)

Benchmarks

MMLU (medical subsets), MedQA/MedMCQA/PubMedQA, ThaiExam / M3Exam, Clinically Adapted 18-task test