Eir-8B: an 8B-parameter Thai medical LLM that improves medical QA, translation, and 18 clinical tasks

September 13, 2024 · 7 min

Overview

Production Readiness

1

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

1

Authors

Yutthakorn Thiprak, Rungtam Ngodngamthaweesuk, Songtam Ngodngamtaweesuk

Links

Abstract / PDF

Why It Matters For Business

Eir-8B shows tangible gains on Thai medical QA, translation, and 18 clinical tasks, so hospitals and health-tech teams can build higher-quality Thai clinical assistants while keeping data on-premises.

Summary TLDR

Eir-8B is an 8B-parameter LLM adapted from LLaMA 3.1 and fine-tuned on Thai/English clinical content plus synthetic data. It targets Thai medical tasks (QA, translation, EHR processing) and outperforms similarly sized open models on medical benchmarks and a custom 18-task clinical test. The team used LoRA fine-tuning, DeepSpeed ZeRO Stage 2, SLERP model merging, data filtering, RAG for QA generation, and both automatic and human/GPT-4o scoring. The authors caution that the model is not ready for clinical deployment without further trials.

Problem Statement

Thai medical NLP is under-resourced. Off-the-shelf LLMs miss Thai medical terms, and hospital data privacy prevents cloud API use. The paper aims to build a Thai medical LLM that understands transliterated medical terms, works on Thai clinical tasks, and can be deployed inside hospital networks.

Main Contribution

Built Eir-8B and Eir-8B-prob by adapting LLaMA 3.1 Instruct-8B with LoRA and model merging (SLERP).

Constructed a mixed Thai/English clinical pretraining corpus (~100k pages) and added 266k synthetic QA pairs for instruction tuning.

Created a Clinically Adapted Model Enhanced test of 18 Thai clinical tasks and evaluated models with human and GPT-4o scoring.

Showed stronger performance than comparable open Thai LLMs on medical benchmarks, translation (BLEU), and the 18-task clinical suite.

Key Findings

Eir-8B-prob achieves a higher average medical benchmark score than Typhoon-v1.5x-8B-instruct.

Numbers: Avg MMLU: Eir-8B+Prob 80.2 vs Typhoon 69.1 (Δ ≈ +11.1)

Eir-8B produced the best medical translation BLEU score among tested models.

Numbers: BLEU = 61.10 (Eir-8B) vs 35.74 (LLaMA 3.1 base)

On the 18-task clinical suite, scored 0–10 by GPT-4o, Eir-8B averaged higher than both GPT-4o and the 8B baselines.

Numbers: Average = 7.11 (Eir-8B) vs 6.38 (GPT-4o), about +11% relative
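The deltas quoted above mix absolute and relative differences, so it can help to recompute them from the reported scores. The snippet below is a small sketch that does only that arithmetic; the helper name `relative_gain` is illustrative and not from the paper.

```python
def relative_gain(score, baseline):
    """Relative improvement of score over baseline, in percent."""
    return (score - baseline) / baseline * 100.0

# Scores as reported in the findings above.
delta_mmlu = 80.2 - 69.1                    # absolute gap in benchmark points
gain_clinical = relative_gain(7.11, 6.38)   # Eir-8B vs GPT-4o on the 0-10 suite
gain_bleu = relative_gain(61.10, 35.74)     # Eir-8B vs LLaMA 3.1 base

print(f"MMLU-style average gap: {delta_mmlu:.1f} points")
print(f"Clinical suite relative gain: {gain_clinical:.1f}%")
print(f"BLEU relative gain: {gain_bleu:.1f}%")
```

Note that the benchmark gap is an absolute difference in points, while the clinical-suite figure is a relative percentage; the two should not be compared directly.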

Results

Avg medical benchmark (MMLU-style)

Value: Eir-8B 71.9; Eir-8B+Prob 80.2

Baseline: Typhoon-v1.5x-8B-instruct 69.1

Clinical 18-task average (0–10)

Value: Eir-8B 7.11

Baseline: GPT-4o 6.38

Medical translation quality (BLEU)

Value: Eir-8B BLEU = 61.10

Baseline: Meta LLaMA 3.1-8B BLEU = 35.74

Thai language general score (M3Exam)

Value: Eir-8B M3Exam = 0.458

Baseline: Meta LLaMA 3.1-8B Instruct M3Exam = 0.446

Who Should Care

What To Try In 7 Days

Run the authors' public model/tools locally on a small held-out clinical set to reproduce BLEU and clinical-task scores.

Apply LoRA fine-tuning on your in-house clinical notes (after IRB/consent) and validate with clinician reviewers.

Use RAG+embeddings to assemble a small Thai clinical knowledge base and test retrieval quality on typical queries.
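For the retrieval check in the last item, a minimal sketch of a top-k recall test is shown below. It is dependency-free and uses a toy bag-of-words vector as a stand-in for a real embedding model, which is an assumption made purely for illustration; the English placeholder documents, the function names, and the test queries are all hypothetical. A real Thai knowledge base would need a multilingual embedding model and a Thai tokenizer, since Thai text is not whitespace-delimited.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, docs, k=2):
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def recall_at_k(test_cases, docs, k=2):
    # Fraction of queries whose expected document appears in the top k.
    hits = sum(1 for query, expected in test_cases
               if expected in retrieve(query, docs, k))
    return hits / len(test_cases)

# Hypothetical placeholder knowledge base and queries.
knowledge_base = [
    "metformin is a first line drug for type 2 diabetes",
    "amoxicillin is an antibiotic used for bacterial infections",
    "paracetamol is used for fever and mild pain",
]
test_cases = [
    ("which drug treats type 2 diabetes", knowledge_base[0]),
    ("what antibiotic treats bacterial infections", knowledge_base[1]),
]

score = recall_at_k(test_cases, knowledge_base, k=1)
print(f"recall@1 = {score:.2f}")
```

Swapping `embed` for a call to an actual embedding model leaves the rest of the harness unchanged, which makes it easy to compare retrieval quality across candidate models on the same query set.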

Agent Features

Architectures

  • Transformer (LLaMA 3.1 Instruct-8B)
  • LoRA
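As a rough illustration of what LoRA adds to a frozen transformer layer, the sketch below applies a low-rank update to a tiny weight matrix in plain Python. The matrices, the rank, and the `alpha` value are made up for the example; the summary does not state which rank or target modules the authors used, and a real run would use a library such as Hugging Face PEFT rather than hand-written loops.

```python
def matvec(matrix, vector):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    # y = W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained.
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def lora_trainable_params(d_out, d_in, r):
    # A is r x d_in and B is d_out x r.
    return r * (d_in + d_out)

# Hypothetical toy layer: 3 inputs, 2 outputs, rank 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.5]]
B_init = [[0.0], [0.0]]       # B starts at zero, so the adapter is a no-op
B_trained = [[1.0], [-1.0]]   # pretend values after some training
x = [1.0, 2.0, 3.0]

y_init = lora_forward(W, A, B_init, x, alpha=2, r=1)
y_tuned = lora_forward(W, A, B_trained, x, alpha=2, r=1)
print(y_init, y_tuned)

# Parameter count for one 4096 x 4096 projection at an assumed rank of 16.
print(lora_trainable_params(4096, 4096, 16), "vs", 4096 * 4096)
```

Because `B` is initialized to zero, the adapted layer reproduces the base model exactly at the start of training, and only the small `A` and `B` matrices need gradients and optimizer state.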

Optimization Features

Token Efficiency

  • Vocabulary size reported as 2,048 tokens

Infra Optimization

  • Training on 4 × NVIDIA A100 40GB (approx. 105 hours total)

Model Optimization

  • LoRA
  • Spherical Linear Interpolation merging (SLERP)
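SLERP interpolates along the arc between two vectors rather than along the straight line between them. The sketch below shows the formula on flat Python lists; it assumes, as merging tools commonly do, that the operation is applied to each pair of corresponding weight tensors after flattening. The summary does not give the interpolation factor or the checkpoints the authors merged, so `t` and the example vectors are placeholders.

```python
import math

def slerp(v0, v1, t, eps=1e-6):
    """Spherical linear interpolation between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v0, v1))
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))

    def lerp():
        # Plain linear interpolation, used as a fallback.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]

    if norm0 < eps or norm1 < eps:
        return lerp()

    # Angle between the two vectors, with the cosine clamped for safety.
    cos_omega = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:
        # Nearly parallel vectors: the arc degenerates to a line.
        return lerp()

    s0 = math.sin((1 - t) * omega) / sin_omega
    s1 = math.sin(t * omega) / sin_omega
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]

# Placeholder "weights" from two hypothetical checkpoints.
merged = slerp([1.0, 0.0], [0.0, 1.0], t=0.5)
print(merged)
```

For unit-length inputs the result stays on the unit sphere, whereas a plain average of the same two vectors would shrink the norm; that property is the usual motivation for preferring SLERP over linear averaging when merging.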

Training Optimization

  • DeepSpeed ZeRO Stage 2
  • Gradient checkpointing
  • LoRA
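A DeepSpeed ZeRO Stage 2 setup is typically expressed as a JSON config. The sketch below builds one as a Python dict and writes it out. The key names are standard DeepSpeed config keys, but every value here (batch sizes, accumulation steps, bf16, clipping) is an assumption for illustration; the summary reports only the stage, the use of gradient checkpointing, and the 4-GPU setup.

```python
import json

NUM_GPUS = 4  # matches the 4 x A100 setup reported above

# Hypothetical values; only the ZeRO stage comes from the summary.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "gradient_clipping": 1.0,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                    # shard optimizer state and gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

# Effective global batch size implied by the settings above.
effective_batch = (
    NUM_GPUS
    * ds_config["train_micro_batch_size_per_gpu"]
    * ds_config["gradient_accumulation_steps"]
)

config_json = json.dumps(ds_config, indent=2)
print(config_json)
print("effective batch size:", effective_batch)

# Gradient checkpointing is usually enabled on the model rather than in this
# file, e.g. model.gradient_checkpointing_enable() with Hugging Face models.
```

Stage 2 shards optimizer state and gradients across the GPUs while keeping a full copy of the parameters on each one, which pairs naturally with LoRA since only the small adapter matrices carry optimizer state.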

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Not ready for clinical use without randomized trials and safety validation.
  • Heavy reliance on synthetic data and GPT-4 translation may introduce artifacts.
  • Evaluation partly scored by GPT-4o, which can introduce judge bias.
  • Internal/hospital deployment details are described but internal EHR data are not public.

When Not To Use

  • Do not use for autonomous clinical decision-making or real-time diagnosis.
  • Avoid deploying without clinician oversight and local safety audits.

Failure Modes

  • Hallucinated or incorrect medical advice despite high benchmark scores.
  • Translation mismatches for complex terminology or context.
  • Bias from synthetic or translated training data.
  • Privacy leaks if on-premises security is misconfigured.

Core Entities

Models

  • Eir-8B
  • Eir-8B-prob
  • LLaMA 3.1 Instruct-8B
  • Typhoon-v1.5x-8B-instruct
  • OpenThaiGPT-beta-7B
  • BioMistral-7B
  • Mistral-7B
  • Gemma-2-9B
  • GPT-3.5 Turbo 1106
  • GPT-4-0613
  • GPT-4o

Metrics

  • Accuracy
  • BLEU (translation)
  • Average clinical score (0–10)
  • M3Exam score

Datasets

  • MedQA
  • MedMCQA
  • PubMedQA
  • MMLU (medical subset)
  • ThaiExam
  • M3Exam
  • XNLI
  • XCOPA
  • Open PMC Patient
  • OpenPatient
  • Custom Thai medical QA (266,080 synthetic pairs)

Benchmarks

  • MMLU (medical subsets)
  • MedQA/MedMCQA/PubMedQA
  • ThaiExam / M3Exam
  • Clinically Adapted 18-task test