Eir-8B: an 8B-parameter Thai medical LLM that improves medical QA, translation, and 18 clinical tasks

Overview

Decision Snapshot: Ready For Pilot

Model-level gains are supported by benchmark tables and translation scores. However, the evaluation relies on GPT-4o scoring and synthetic data, and the authors explicitly warn against immediate clinical deployment.

Citations: 1

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 100%

Novelty: 60%

Authors

Yutthakorn Thiprak, Rungtam Ngodngamthaweesuk, Songtam Ngodngamtaweesuk

Why It Matters For Business

Eir-8B shows tangible gains on Thai medical QA, translation, and 18 clinical tasks, so hospitals and health-tech teams can build higher-quality Thai clinical assistants while keeping data on-premises.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

Eir-8B is an 8B-parameter LLM adapted from LLaMA-3.1 and fine-tuned with Thai/English clinical content plus synthetic data. It targets Thai medical tasks (QA, translation, EHR processing) and outperforms similar-size open models on medical benchmarks and a custom 18-task clinical test. The team used LoRA fine-tuning, DeepSpeed Stage 2, SLERP model merging, data filtering, RAG for QA generation, and both automatic and human/GPT-4o scoring. The authors caution the model is not yet ready for clinical deployment without further trials.

Problem Statement

Thai medical NLP is under-resourced. Off-the-shelf LLMs miss Thai medical terms, and hospital data privacy prevents cloud API use. The paper aims to build a Thai medical LLM that understands transliterated medical terms, works on Thai clinical tasks, and can be deployed inside hospital networks.

Main Contribution

Built Eir-8B and Eir-8B-prob by adapting LLaMA 3.1 Instruct-8B with LoRA and model merging (SLERP).

Constructed a mixed Thai/English clinical pretraining corpus (~100k pages) and added 266k synthetic QA pairs for instruction tuning.
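To make the merging step concrete, here is a minimal sketch of spherical linear interpolation (SLERP) between two flattened weight vectors. It illustrates the general technique only: the interpolation factor `t`, the per-vector treatment, and the fallback to linear interpolation are assumptions, not details taken from the paper, and the inputs are assumed to be non-zero vectors.

```python
import math

def slerp(t, v0, v1, eps=1e-8):
    """Spherically interpolate between two non-zero vectors; t=0 gives v0, t=1 gives v1."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    # Cosine of the angle between the vectors, clamped so acos stays in its domain.
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    if math.sin(omega) < eps:
        # Nearly collinear vectors: fall back to plain linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]

# Toy example: halfway between two orthogonal unit vectors.
merged = slerp(0.5, [1.0, 0.0], [0.0, 1.0])
```

In a real merge the same interpolation would be applied tensor by tensor across two checkpoints that share an architecture; the toy vectors here only show that the midpoint stays on the unit circle instead of shrinking toward the origin as a linear average would.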

Key Findings

Eir-8B-prob achieves a higher average medical benchmark score than Typhoon-v1.5x-8B-instruct.

Numbers: Avg MMLU: Eir-8B+Prob 80.2 vs Typhoon 69.1 (Δ ≈ +11.1)

Practical Use: If you need the strongest score among the compared open models on multiple-choice medical benchmarks, use Eir-8B+Prob style ensembling/prompting rather than Typhoon-v1.5x-8B-instruct.

Evidence Ref: Table 4

Eir-8B produced the best medical translation BLEU score among tested models.

Numbers: BLEU = 61.10 (Eir-8B) vs 35.74 (LLaMA 3.1 base)

Practical Use: For medical English–Thai translation in Thailand, Eir-8B gives materially better literal and term-preserving translations; validate outputs with clinicians before deployment.

Evidence Ref: Table 6
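For readers unfamiliar with how a BLEU figure such as 61.10 is produced, the following is a simplified sketch of the metric: clipped n-gram precision up to 4-grams combined with a brevity penalty, scaled to 0–100. It assumes a single reference, pre-tokenized input, and no smoothing. The summary does not say which BLEU implementation or Thai tokenizer the authors used, so this should not be expected to reproduce Table 6 exactly.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU on token lists, scaled to 0-100."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clipping: a candidate n-gram is credited at most as often as it occurs in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return 100 * bp * math.exp(sum(log_precisions) / max_n)

# Toy example with whitespace tokenization (Thai text needs a word segmenter first).
reference = "the cat sat on the mat".split()
short_candidate = "the cat sat on".split()
score = bleu(short_candidate, reference)
```

The short candidate matches the reference on every n-gram it contains, so its score is reduced only by the brevity penalty. For reproducing published numbers, an established implementation with a documented tokenizer is a safer choice than a hand-rolled one.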

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Avg medical benchmark (MMLU-style)	Eir-8B: 71.9; Eir-8B+Prob: 80.2	Typhoon-v1.5x-8B-instruct: 69.1	Eir-8B+Prob +11.1 vs Typhoon; Eir-8B +2.8 vs Typhoon	Table 4 (aggregated columns)	Aggregated MMLU and medical QA scores across multiple domains	Table 4
Clinical 18-task average (0–10)	Eir-8B: 7.11	GPT-4o: 6.38	+0.73 (≈11% relative)	Clinically Adapted Model Enhanced test	Table 7 (scores per task)	Table 7
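The deltas in the table can be checked with a few lines of arithmetic. The sketch below recomputes the absolute and relative differences from the scores quoted above; the function name and row labels are illustrative.

```python
def delta(score, baseline):
    """Return (absolute difference, relative difference in percent)."""
    absolute = score - baseline
    return absolute, 100 * absolute / baseline

# (score, baseline) pairs copied from the results table.
rows = {
    "Eir-8B+Prob vs Typhoon (avg medical benchmark)": (80.2, 69.1),
    "Eir-8B vs Typhoon (avg medical benchmark)": (71.9, 69.1),
    "Eir-8B vs GPT-4o (clinical 18-task average)": (7.11, 6.38),
}

for name, (score, baseline) in rows.items():
    absolute, relative = delta(score, baseline)
    print(f"{name}: {absolute:+.2f} absolute, {relative:+.1f}% relative")
```

Note that the two benchmark rows are on a 0–100 accuracy scale while the clinical row is on a 0–10 judged scale, so the absolute deltas are not comparable across rows.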

What To Try In 7 Days

Run the authors' public model/tools locally on a small held-out clinical set to reproduce BLEU and clinical-task scores.

Apply LoRA fine-tuning on your in-house clinical notes (after IRB/consent) and validate with clinician reviewers.

Use RAG+embeddings to assemble a small Thai clinical knowledge base and test retrieval quality on typical queries.
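For the retrieval check in the last item, a small harness like the one below can measure recall@k before any real embedding model is wired in. Everything here is a hypothetical stand-in: `embed` is a bag-of-words counter rather than a learned embedding, and the knowledge-base entries and labeled queries are toy placeholders, not data from the paper.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=1):
    """Return the ids of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda doc_id: cosine(q, embed(docs[doc_id])), reverse=True)
    return ranked[:k]

def recall_at_k(labeled_queries, docs, k=1):
    """Fraction of queries whose expected document appears in the top k."""
    hits = sum(expected in retrieve(query, docs, k) for query, expected in labeled_queries)
    return hits / len(labeled_queries)

# Hypothetical knowledge base and labeled queries, for illustration only.
docs = {
    "htn": "hypertension management blood pressure targets",
    "dm": "diabetes management blood glucose monitoring",
    "asthma": "asthma inhaler technique and dosing",
}
labeled_queries = [
    ("blood pressure targets", "htn"),
    ("glucose monitoring", "dm"),
    ("inhaler dosing", "asthma"),
]
recall = recall_at_k(labeled_queries, docs, k=1)
```

Swapping `embed` and `cosine` for a multilingual embedding model and a vector index keeps the evaluation loop unchanged, which makes it easy to compare retrievers on the same labeled queries.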

Agent Features

Architectures

Transformer (LLaMA 3.1 Instruct-8B), LoRA
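As a rough illustration of why LoRA keeps fine-tuning cheap, the sketch below counts the trainable parameters added to a single projection matrix. The 4096 hidden size matches LLaMA 3.1 8B; the rank of 16 is a hypothetical value chosen for the example, since this summary does not report the rank or target modules the authors used.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

d_in = d_out = 4096  # hidden size of a LLaMA 3.1 8B query/output projection
rank = 16            # hypothetical LoRA rank, not taken from the paper

full = d_in * d_out                               # parameters in the frozen weight matrix
lora = lora_trainable_params(d_in, d_out, rank)   # parameters actually trained
fraction = lora / full

print(f"full: {full:,}  lora: {lora:,}  trainable fraction: {fraction:.4%}")
```

Under these assumptions the adapter trains well under 1% of the parameters in each adapted matrix, which is what makes an 8B model tractable on a small GPU node.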

Optimization Features

Token Efficiency

Vocabulary size reported as 2,048 tokens

Infra Optimization

Training on 4 × NVIDIA A100 40GB (approx. 105 hours total)

Model Optimization

LoRA, Spherical Linear Interpolation merging (SLERP)

Training Optimization

DeepSpeed ZeRO Stage 2, Gradient checkpointing, LoRA
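A minimal sketch of a DeepSpeed configuration that enables ZeRO Stage 2 is shown below, built as a Python dictionary and serialized to JSON. Only the `stage: 2` setting reflects the summary; the batch size, accumulation steps, and bf16 precision are placeholder assumptions, not the authors' hyperparameters.

```python
import json

# Illustrative values only: everything except the ZeRO stage is an assumption.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                    # shard optimizer states and gradients across GPUs
        "overlap_comm": True,          # overlap gradient reduction with the backward pass
        "contiguous_gradients": True,  # reduce memory fragmentation during backward
    },
}

# Serialize to the JSON form that DeepSpeed launchers and trainers accept.
config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

Gradient checkpointing is usually switched on at the model level rather than in this file, for example through the training framework's own option, so it does not appear in the dictionary above.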

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Not ready for clinical use without randomized trials and safety validation.

Heavy reliance on synthetic data and GPT-4 translation may introduce artifacts.

When Not To Use

Do not use for autonomous clinical decision-making or real-time diagnosis.

Avoid deploying without clinician oversight and local safety audits.

Failure Modes

Hallucinated or incorrect medical advice despite high benchmark scores.

Translation mismatches for complex terminology or context.

Core Entities

Models

Eir-8B, Eir-8B-prob, LLaMA 3.1 Instruct-8B, Typhoon-v1.5x-8B-instruct, OpenThaiGPT-beta-7B, BioMistral-7B, Mistral-7B, Gemma-2-9B, GPT-3.5 Turbo 1106, GPT-4-0613, GPT-4o

Metrics

Accuracy, BLEU (translation), Average clinical score (0–10), M3Exam score

Datasets

MedQA, MedMCQA, PubMedQA, MMLU (medical subset), ThaiExam, M3Exam, XNLI, XCOPA, Open PMC Patient, OpenPatient, Custom Thai medical QA (266,080 synthetic pairs)

Benchmarks

MMLU (medical subsets), MedQA/MedMCQA/PubMedQA, ThaiExam / M3Exam, Clinically Adapted 18-task test
