RedWhale: adapt an English LLM to Korean with small-data continual pretraining and tokenizer tweaks

August 21, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper shows practical wins (token reduction, improved KoBEST) and a clear recipe, but experiments are limited to specific corpora, two proprietary QA sets, and single-GPU runs; broader validation is needed before widespread production use.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.74

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Anh-Dung Vo, Minseong Jung, Wonbeen Lee, Daewoo Choi

Why It Matters For Business

You can adapt an English LLM to Korean with far less compute than training from scratch by filtering data, using a Korean-aware tokenizer, initializing new token embeddings from pretrained subtokens, and training in stages. This lowers cost and enables deployments for teams without massive GPU clusters.

Summary TLDR

RedWhale adapts an English LLM to Korean by (1) filtering a Korean corpus, (2) building a hybrid Korean-aware tokenizer, (3) initializing new token embeddings using subtoken + head decomposition, and (4) running a staged continual-pretraining schedule with LoRA. On KoBEST and internal Financial QA, RedWhale matches or slightly beats top Korean-adapted models (EEVE) after ~9.7B tokens and ~498 H100 GPU hours. Key wins: a ~0.57 token ratio (fewer tokens per input), a practical initialization that cuts pretraining loss, and a low-memory staged workflow for limited hardware.

Problem Statement

English-centric LLMs struggle on Korean due to different token structure and limited Korean corpora. Training full models from scratch is costly and often impossible on limited hardware. The paper asks: can we efficiently adapt an English LLM to Korean with limited compute, a tailored tokenizer, careful data filtering, and staged continual pretraining?

Main Contribution

A four-step pipeline for language adaptation: corpus filtering, Korean tokenizer adaptation, five initialization methods for new tokens, and a multistage training schedule with LoRA.

A hybrid SentencePiece tokenizer tuned for Korean (vocab ≈20k) that reduces token counts relative to the base model's tokenizer (Token Ratio ≈0.57).
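A minimal sketch of how the Token Ratio could be measured. It assumes TR is the adapted tokenizer's total token count divided by the base tokenizer's total over the same texts; the summary does not spell out the formula. The two tokenizer functions are hypothetical stand-ins; in practice they would wrap the trained SentencePiece model and the base model's tokenizer.

```python
from typing import Callable, Iterable, List

Tokenizer = Callable[[str], List[str]]


def token_ratio(texts: Iterable[str], adapted: Tokenizer, base: Tokenizer) -> float:
    """Total adapted-tokenizer tokens divided by total base-tokenizer tokens.

    Assumed definition: values below 1.0 mean the adapted tokenizer
    needs fewer tokens than the base tokenizer for the same input.
    """
    texts = list(texts)
    adapted_total = sum(len(adapted(t)) for t in texts)
    base_total = sum(len(base(t)) for t in texts)
    if base_total == 0:
        raise ValueError("base tokenizer produced no tokens")
    return adapted_total / base_total


# Hypothetical stand-ins for real tokenizers (illustration only).
def char_tokenize(text: str) -> List[str]:
    """Mimics a tokenizer that fragments Korean into single characters."""
    return [ch for ch in text if not ch.isspace()]


def word_tokenize(text: str) -> List[str]:
    """Mimics a tokenizer with whole-word Korean entries."""
    return text.split()


sample = ["한국어 모델", "토큰 비율"]
print(round(token_ratio(sample, word_tokenize, char_tokenize), 3))
```

Measuring on a held-out sample of the target corpus, rather than on the tokenizer's own training text, gives a fairer estimate.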

Key Findings

RedWhale's fine-tuned model (SFT) achieves KoBEST average 80.83%, slightly above EEVE's 79.42% on evaluated tasks.

Numbers: KoBEST AVG, RedWhale-SFT 0.8083 vs. EEVE-SFT 0.7942

Practical Use: If you need a Korean-adapted model for downstream tasks, RedWhale gives marginally better KoBEST task accuracy than leading public adaptations; try its SFT checkpoint for production fine-tuning.

Evidence Ref: Table 5

Token-decomposition initialization for both embedding and LM head cut pretraining eval loss from ~17.5 to ~11.1 and improved next-token accuracy from 0.0036 to 0.0930.

Numbers: Loss 17.5061 → 11.0905; Accuracy 0.0036 → 0.0930

Practical Use: When adding many new language tokens, initialize new embeddings by decomposing tokens into pretrained subtokens and include LM-head weights to speed convergence and lower loss.

Evidence Ref: Table 3
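The initialization idea can be sketched as follows, assuming each new token is split by the base tokenizer and its new row is the plain average of the pretrained rows of those subtokens, applied to both the input embedding and the LM head. The function names, the toy vocabulary, and the choice of an unweighted mean are assumptions for illustration, not the authors' code. Plain Python lists stand in for weight tensors.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def init_new_rows(
    new_tokens: Sequence[str],
    base_tokenize: Callable[[str], List[int]],
    embedding: Sequence[Vector],
    lm_head: Sequence[Vector],
) -> Dict[str, Dict[str, Vector]]:
    """Average the pretrained rows of each new token's base subtokens.

    The same decomposition is used for the input embedding and the LM head,
    so neither matrix starts the new token from random values.
    """
    rows: Dict[str, Dict[str, Vector]] = {}
    for token in new_tokens:
        ids = base_tokenize(token)
        if not ids:
            raise ValueError(f"no base subtokens for {token!r}")
        rows[token] = {
            "embedding": mean_vector([embedding[i] for i in ids]),
            "lm_head": mean_vector([lm_head[i] for i in ids]),
        }
    return rows


# Toy base vocabulary and weights (hypothetical values).
toy_vocab = {"a": 0, "b": 1, "c": 2}
toy_tokenize = lambda text: [toy_vocab[ch] for ch in text]
toy_embedding = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
toy_lm_head = [[0.0, 2.0], [4.0, 0.0], [1.0, 1.0]]

print(init_new_rows(["ab"], toy_tokenize, toy_embedding, toy_lm_head))
```

A short causal-LM evaluation before and after swapping in the new rows is a cheap way to confirm the initialization helps on your own data.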

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| KoBEST AVG (pretrained) | 0.6672 | upstage/SOLAR-10.7B: 0.5023 | +0.1649 | KoBEST (BoolQ, COPA, HellaSwag, SentiNeg) | RedWhale pretrained layers improved KoBEST AVG to 0.6672 | Table 5 |
| SFT | 0.8083 | yanolja/EEVE-Korean-10.8B-v1.0-SFT: 0.7942 | +0.0141 | KoBEST after supervised fine-tuning | RedWhale-SFT outperforms EEVE-SFT by ~1.41 percentage points | Table 5 |
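The deltas reported above are simple differences of accuracy fractions. The helper below is a hypothetical convenience, not something from the paper; it recomputes them and converts to percentage points.

```python
from typing import Tuple


def delta_points(value: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute delta, delta in percentage points) for two accuracy fractions."""
    delta = round(value - baseline, 4)
    return delta, round(delta * 100, 2)


print(delta_points(0.6672, 0.5023))  # pretrained RedWhale vs. SOLAR-10.7B
print(delta_points(0.8083, 0.7942))  # RedWhale-SFT vs. EEVE-SFT
```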

What To Try In 7 Days

Train a 20k-vocab SentencePiece on your Korean corpus and measure Token Ratio vs base tokenizer.

Implement token-decomposition initialization (subtoken average + LM-head) for new language tokens and run short CLM checks.

Run a staged training test: train embedding+head for one epoch, then a few transformer layers, then LoRA consolidation.
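The staged schedule in the last step can be sketched as a per-stage mask over parameter names. This is a minimal illustration under stated assumptions: the stage names, the split of the middle stage into odd and even layers, and the parameter naming pattern (`embed_tokens`, `lm_head`, `.layers.<i>.`, `lora_`) are hypothetical and should be adjusted to your model.

```python
import re
from typing import Dict, List, Optional

# Assumed stage order: embedding + head, odd layers, even layers, LoRA.
STAGES = ("embed_head", "odd_layers", "even_layers", "lora")


def layer_index(name: str) -> Optional[int]:
    """Extract the transformer layer index from a parameter name, if any."""
    match = re.search(r"\.layers\.(\d+)\.", name)
    return int(match.group(1)) if match else None


def trainable_mask(param_names: List[str], stage: str) -> Dict[str, bool]:
    """Map each parameter name to whether it is updated in the given stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    mask: Dict[str, bool] = {}
    for name in param_names:
        idx = layer_index(name)
        is_lora = "lora_" in name
        if stage == "embed_head":
            train = not is_lora and ("embed_tokens" in name or "lm_head" in name)
        elif stage == "odd_layers":
            train = not is_lora and idx is not None and idx % 2 == 1
        elif stage == "even_layers":
            train = not is_lora and idx is not None and idx % 2 == 0
        else:  # "lora": only adapter weights are updated
            train = is_lora
        mask[name] = train
    return mask


# Hypothetical parameter names for a tiny two-layer model.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.1.mlp.up_proj.weight",
    "model.layers.1.self_attn.q_proj.lora_A.weight",
    "lm_head.weight",
]
for stage in STAGES:
    enabled = [n for n, on in trainable_mask(names, stage).items() if on]
    print(stage, enabled)
```

With a PyTorch model, the mask could be applied by setting `param.requires_grad = mask[name]` while iterating over `model.named_parameters()`. Freezing everything outside the current stage is what keeps optimizer state, and therefore memory, small.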

Optimization Features

Token Efficiency: custom tokenizer reduces token count (TR ≈0.57)
Infra Optimization: runs on a single H100 or smaller multi-GPU setups; avoids OOM via staged updates
Model Optimization: token-decomposition initialization for embedding and LM head; vocabulary pruning to 20k tokens
System Optimization: LoRA
Training Optimization: LoRA; FlashAttention2 and SDPA to reduce memory

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Corpus filtering lacks a formal quantitative ablation; filtered selection may reduce diversity or remove useful samples.

Staged training ordering (odd/even layers) is proposed but not compared to other layer schedules.

When Not To Use

When you have abundant compute and prefer training from scratch for full control.

If your target domain requires very large, diverse web-scale corpora that may be harmed by aggressive filtering.

Failure Modes

Filtered corpus may remove rare but important language patterns, hurting downstream generalization.

A large fraction of new tokens have little overlap with the base vocabulary, so initialization may still be insufficient without substantially more training.

Core Entities

Models

RedWhale, yanolja/EEVE-Korean-10.8B-v1.0, upstage/SOLAR-10.7B-v1.0, mistralai/Mistral-7B-v0.1, meta-llama/Llama-2-7b-hf, beomi/open-llama-2-ko-7b

Metrics

CLM Loss, Evaluation Loss, Accuracy

Datasets

AI Hub (Korean), Web-crawled Korean data (authors), SFT, KoBEST benchmark, Financial QA-1 (proprietary), Financial QA-2 (proprietary)

Benchmarks

KoBEST

Context Entities

Models

yanolja/EEVE-Korean-2.8B-v1.0, upstage/SOLAR-10.7B-Instruct-v1.0, beomi/OPEN-SOLAR-KO-10.7B, microsoft/Phi-2

Metrics

Token Ratio (TR)

Datasets

Korean Wikipedia, AI Hub public corpora, Modu Corpus

Benchmarks

internal Financial QA tasks