RedWhale: adapt an English LLM to Korean with small-data continual pretraining and tokenizer tweaks

Overview

Decision Snapshot: Needs Validation

The paper shows practical wins (token reduction, improved KoBEST) and a clear recipe, but experiments are limited to specific corpora, two proprietary QA sets, and single-GPU runs; broader validation is needed before widespread production use.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.74

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Anh-Dung Vo, Minseong Jung, Wonbeen Lee, Daewoo Choi

Why It Matters For Business

You can adapt an English LLM to Korean with far less compute than training from scratch by filtering data, using a Korean-aware tokenizer, initializing new tokens from existing weights, and training in stages. The paper reports ~9.7B tokens and ~498 H100 GPU hours, which lowers cost and makes deployment feasible for teams without large GPU clusters.

Who Should Care

ML Engineer, Product Manager, Founder

Summary TLDR

RedWhale adapts an English LLM to Korean by (1) filtering a Korean corpus, (2) building a hybrid Korean-aware tokenizer, (3) initializing new token embeddings using subtoken + head decomposition, and (4) a staged continual-pretraining schedule with LoRA. On KoBEST and internal Financial QA, RedWhale matches or slightly beats top Korean-adapted models (EEVE) after ~9.7B tokens and ~498 H100 GPU hours. Key wins: ~0.57 token ratio (fewer tokens per input), a practical initialization that cuts pretraining loss, and a low-memory staged workflow for limited hardware.

Problem Statement

English-centric LLMs struggle on Korean due to different token structure and limited Korean corpora. Training full models from scratch is costly and often impossible on limited hardware. The paper asks: can we efficiently adapt an English LLM to Korean with limited compute, a tailored tokenizer, careful data filtering, and staged continual pretraining?

Main Contribution

A four-step pipeline for language adaptation: corpus filtering, Korean tokenizer adaptation, five initialization methods for new tokens, and a multistage training schedule with LoRA.

A hybrid SentencePiece tokenizer tuned for Korean (vocab ≈20k) that reduces token counts relative to the base model's tokenizer (Token Ratio ≈0.57).
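As a rough illustration of the Token Ratio figure, the sketch below assumes TR is the total token count under the adapted tokenizer divided by the total under the base tokenizer, so values below 1 mean fewer tokens per input. The function `token_ratio` and the two toy tokenizers are hypothetical stand-ins, not the paper's code; in practice you would pass the real base and Korean tokenizers.

```python
from typing import Callable, Iterable, List

Tokenizer = Callable[[str], List[str]]


def token_ratio(texts: Iterable[str], new_tok: Tokenizer, base_tok: Tokenizer) -> float:
    """Total tokens under new_tok divided by total tokens under base_tok.

    Assumed definition: a value below 1.0 means the new tokenizer
    needs fewer tokens for the same text.
    """
    new_total = 0
    base_total = 0
    for text in texts:
        new_total += len(new_tok(text))
        base_total += len(base_tok(text))
    if base_total == 0:
        raise ValueError("base tokenizer produced no tokens")
    return new_total / base_total


# Toy stand-ins (hypothetical): byte-level splitting mimics an English-centric
# tokenizer falling back to bytes on Hangul; whitespace splitting mimics a
# tokenizer that has whole Korean words in its vocabulary.
def byte_tokenizer(text: str) -> List[str]:
    return [str(b) for b in text.encode("utf-8")]


def word_tokenizer(text: str) -> List[str]:
    return text.split()


sample = ["한국어 토크나이저", "금융 질문"]
tr = token_ratio(sample, word_tokenizer, byte_tokenizer)
print(f"Token Ratio: {tr:.3f}")
```

Measuring on a held-out sample of your own corpus, rather than on the tokenizer's training data, gives a less optimistic estimate.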

Key Findings

RedWhale's fine-tuned model (SFT) achieves KoBEST average 80.83%, slightly above EEVE's 79.42% on evaluated tasks.

Numbers: KoBEST AVG: RedWhale-SFT 0.8083 vs EEVE-SFT 0.7942

Practical Use: If you need a Korean-adapted model for downstream tasks, RedWhale gives marginally better KoBEST accuracy than leading public adaptations; try its SFT checkpoint as a starting point for production fine-tuning.

Evidence Ref: Table 5

Token-decomposition initialization for both embedding and LM head cut pretraining eval loss from ~17.5 to ~11.1 and improved next-token accuracy from 0.0036 to 0.0930.

Numbers: Loss 17.5061→11.0905; Accuracy 0.0036→0.0930

Practical Use: When adding many new language tokens, initialize new embeddings by decomposing tokens into pretrained subtokens and include LM-head weights to speed convergence and lower loss.

Evidence Ref: Table 3
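The initialization idea above can be sketched in a few lines. This is a minimal illustration that assumes a plain average over the base-vocabulary subtokens of each new token, applied to both the input embedding and the LM-head rows; the paper compares several variants, and the names `init_new_token`, `base_tokenize`, `base_embed`, and `base_head` are hypothetical.

```python
from typing import Callable, Dict, List

Vector = List[float]


def mean_rows(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def init_new_token(
    token: str,
    base_tokenize: Callable[[str], List[int]],
    base_embed: List[Vector],
    base_head: List[Vector],
) -> Dict[str, Vector]:
    """Initialize one new token from the base model's subtoken rows.

    Assumption: the new token is split by the base tokenizer and its
    embedding and LM-head rows are set to the mean of the subtoken rows.
    """
    ids = base_tokenize(token)
    if not ids:
        raise ValueError(f"base tokenizer produced no subtokens for {token!r}")
    return {
        "embedding": mean_rows([base_embed[i] for i in ids]),
        "lm_head": mean_rows([base_head[i] for i in ids]),
    }


# Toy base model (hypothetical): 3-token character vocabulary, 2-dim weights.
base_vocab = {"a": 0, "b": 1, "c": 2}
base_embed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
base_head = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]


def char_tokenize(text: str) -> List[int]:
    return [base_vocab[ch] for ch in text]


new_rows = init_new_token("ab", char_tokenize, base_embed, base_head)
print(new_rows)
```

With real models the same loop would run over every token added by the Korean tokenizer, writing the results into the resized embedding and LM-head matrices before any training starts.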

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
KoBEST AVG (pretrained)	0.6672	upstage/SOLAR-10.7B 0.5023	+0.1649	KoBEST (BoolQ,COPA,HellaSwag,SentiNeg)	RedWhale pretrained layers improved KoBEST AVG to 0.6672	Table 5
KoBEST AVG (SFT)	0.8083	yanolja/EEVE-Korean-10.8B-v1.0-SFT 0.7942	+0.0141	KoBEST after supervised fine-tuning	RedWhale-SFT outperforms EEVE-SFT by ~1.41 percentage points	Table 5

What To Try In 7 Days

Train a 20k-vocab SentencePiece on your Korean corpus and measure Token Ratio vs base tokenizer.

Implement token-decomposition initialization (subtoken average + LM-head) for new language tokens and run short CLM checks.

Run a staged training test: train embedding+head for one epoch, then a few transformer layers, then LoRA consolidation.
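The staged test in the last item can be planned as a list of stages, each naming which parameters are unfrozen. The sketch below is only an assumption about how such a schedule might be encoded: the stage order (embedding and head, then odd layers, then even layers, then LoRA), the `Stage` class, and the parameter-name patterns are all hypothetical and would need to match your model's actual parameter names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    name: str
    patterns: List[str]  # substrings of parameter names to unfreeze

    def is_trainable(self, param_name: str) -> bool:
        return any(p in param_name for p in self.patterns)


def build_stages(num_layers: int) -> List[Stage]:
    """Assumed schedule: embed+head, odd layers, even layers, LoRA adapters."""
    odd = [i for i in range(num_layers) if i % 2 == 1]
    even = [i for i in range(num_layers) if i % 2 == 0]
    return [
        Stage("embed_and_head", ["embed_tokens", "lm_head"]),
        # Trailing dot keeps "layers.1." from matching "layers.11."
        Stage("odd_layers", [f"layers.{i}." for i in odd]),
        Stage("even_layers", [f"layers.{i}." for i in even]),
        Stage("lora_consolidation", ["lora_"]),
    ]


# Hypothetical parameter names for a 4-layer model.
param_names = ["model.embed_tokens.weight", "lm_head.weight"] + [
    f"model.layers.{i}.mlp.weight" for i in range(4)
]

for stage in build_stages(num_layers=4):
    unfrozen = [n for n in param_names if stage.is_trainable(n)]
    print(f"{stage.name}: {unfrozen}")
```

In a training loop, each stage would set `requires_grad` (or the framework's equivalent) from `is_trainable` before running its epochs, so only one slice of the model holds optimizer state at a time.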

Optimization Features

Token Efficiency

custom tokenizer reduces token count (TR ≈0.57)

Infra Optimization

able to run on a single H100 or smaller multi-GPU setups; avoids OOM via staged updates

Model Optimization

token-decomposition initialization for embedding and LM head

vocabulary pruning to 20k tokens

System Optimization

LoRA

Training Optimization

LoRA

use FlashAttention2 and SDPA to reduce memory
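To see why LoRA helps on limited hardware, it is enough to count trainable parameters. LoRA freezes a weight matrix of shape d_out × d_in and trains two low-rank factors with r × (d_in + d_out) parameters in total. The dimensions and rank below are illustrative assumptions, not values reported in the paper.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the full weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for LoRA factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical projection size and rank, chosen only for illustration.
d_in = d_out = 4096
rank = 16

full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
print(f"full: {full:,}  lora: {lora:,}  fraction: {lora / full:.4%}")
```

Fewer trainable parameters also means less gradient and optimizer state to keep in memory, which is where most of the savings come from.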

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://www.aihub.or.kr

https://huggingface.co/datasets/davidkim205/kollm-converations

https://ko.wikipedia.org

Risks & Boundaries

Limitations

Corpus filtering lacks a formal quantitative ablation; filtered selection may reduce diversity or remove useful samples.

Staged training ordering (odd/even layers) is proposed but not compared to other layer schedules.

When Not To Use

When you have abundant compute and prefer training from scratch for full control.

If your target domain requires very large, diverse web-scale corpora that may be harmed by aggressive filtering.

Failure Modes

Filtered corpus may remove rare but important language patterns, hurting downstream generalization.

A large fraction of new tokens has little overlap with the base vocabulary, so initialization may still be insufficient without substantially more training.

Core Entities

Models

RedWhale

yanolja/EEVE-Korean-10.8B-v1.0

upstage/SOLAR-10.7B-v1.0

mistralai/Mistral-7B-v0.1

meta-llama/Llama-2-7b-hf

beomi/open-llama-2-ko-7b

Metrics

CLM Loss

Evaluation Loss

Accuracy

Datasets

AI Hub (Korean)

Web-crawled Korean data (authors)

SFT

KoBEST benchmark

Financial QA-1 (proprietary)

Financial QA-2 (proprietary)

Benchmarks

KoBEST

Context Entities

Models

yanolja/EEVE-Korean-2.8B-v1.0

upstage/SOLAR-10.7B-Instruct-v1.0

beomi/OPEN-SOLAR-KO-10.7B

microsoft/Phi-2

Metrics

Token Ratio (TR)

Datasets

Korean Wikipedia

AI Hub public corpora

Modu Corpus

Benchmarks

internal Financial QA tasks
