Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
You can adapt an English LLM to Korean with far less compute by filtering data, using a Korean-aware tokenizer, initializing new tokens smartly, and training in stages. This lowers cost and enables deployments for teams without massive GPU clusters.
Summary TLDR
RedWhale adapts an English LLM to Korean by (1) filtering a Korean corpus, (2) building a hybrid Korean-aware tokenizer, (3) initializing new token embeddings using subtoken + head decomposition, and (4) running a staged continual-pretraining schedule with LoRA. On KoBEST and internal Financial QA, RedWhale matches or slightly beats top Korean-adapted models (EEVE) after ~9.7B tokens and ~498 H100 GPU hours. Key wins: a Token Ratio of ~0.57 (fewer tokens per input), a practical initialization that cuts pretraining loss, and a low-memory staged workflow for limited hardware.
Problem Statement
English-centric LLMs struggle on Korean due to different token structure and limited Korean corpora. Training full models from scratch is costly and often impossible on limited hardware. The paper asks: can we efficiently adapt an English LLM to Korean with limited compute, a tailored tokenizer, careful data filtering, and staged continual pretraining?
Main Contribution
A four-step pipeline for language adaptation: corpus filtering, Korean tokenizer adaptation, five initialization methods for new tokens, and a multistage training schedule with LoRA.
A hybrid SentencePiece tokenizer tuned for Korean (vocab ≈20k) that reduces token counts relative to the base model's tokenizer (Token Ratio ≈0.57).
Empirical comparison of five initialization methods; token decomposition applied to both embedding and LM head performed best.
A staged training recipe (embedding/head → odd/even layers → full model with LoRA) intended to avoid OOM and fit limited GPUs.
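The token-decomposition idea above can be illustrated with a small sketch. It assumes the new token's vector is the mean of the base-vocabulary vectors of the subtokens it decomposes into, applied once to the embedding matrix and once to the LM head. The function name `init_by_decomposition`, the toy vocabulary, and the 2-dimensional vectors are hypothetical, and the paper's exact procedure may differ.

```python
from typing import Callable, Dict, List

Vector = List[float]


def init_by_decomposition(
    new_token: str,
    base_tokenize: Callable[[str], List[str]],
    base_matrix: Dict[str, Vector],
) -> Vector:
    """Return the mean of the base vectors of the subtokens of new_token."""
    pieces = [p for p in base_tokenize(new_token) if p in base_matrix]
    if not pieces:
        raise ValueError(f"no known subtokens for {new_token!r}")
    dim = len(base_matrix[pieces[0]])
    return [sum(base_matrix[p][i] for p in pieces) / len(pieces) for i in range(dim)]


# Toy stand-ins (assumptions): a character-level "base tokenizer" and 2-d vectors.
toy_embedding = {"한": [1.0, 0.0], "국": [0.0, 1.0], "어": [1.0, 1.0]}
toy_lm_head = {"한": [0.5, 0.5], "국": [1.5, 0.5], "어": [1.0, 2.0]}
char_tokenize = lambda text: list(text)

# The same averaging is applied to both matrices.
new_embedding_row = init_by_decomposition("한국어", char_tokenize, toy_embedding)
new_lm_head_row = init_by_decomposition("한국어", char_tokenize, toy_lm_head)
```

In a real model the dictionaries would be replaced by rows of the embedding and LM-head weight matrices, indexed by the base tokenizer's token ids.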
Key Findings
RedWhale's fine-tuned model (SFT) achieves KoBEST average 80.83%, slightly above EEVE's 79.42% on evaluated tasks.
Token-decomposition initialization for both embedding and LM head cut pretraining eval loss from ~17.5 to ~11.1 and improved next-token accuracy from 0.0036 to 0.0930.
Custom corpus filtering reduced raw Korean data from 121.6GB to 43.9GB and produced ~9.7B training tokens (counted with the custom tokenizer).
The adapted tokenizer produced a Token Ratio of ≈0.57 (fewer tokens per input) enabling longer effective contexts and faster input processing.
Training used one NVIDIA H100 for ~498 GPU hours for the reported pretraining schedule.
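As a rough illustration of how the Token Ratio could be measured, the sketch below assumes TR is the number of tokens produced by the adapted tokenizer divided by the number produced by the base tokenizer over the same text. The two toy tokenizers and the helper names are hypothetical stand-ins, not the paper's tokenizers.

```python
from typing import Callable, Iterable, List

Tokenizer = Callable[[str], List[str]]


def token_ratio(texts: Iterable[str], adapted: Tokenizer, base: Tokenizer) -> float:
    """Adapted-token count divided by base-token count over the same texts."""
    adapted_total = 0
    base_total = 0
    for text in texts:
        adapted_total += len(adapted(text))
        base_total += len(base(text))
    if base_total == 0:
        raise ValueError("base tokenizer produced no tokens")
    return adapted_total / base_total


def equivalent_base_tokens(context_window: int, ratio: float) -> float:
    """Base-tokenizer tokens covered by a window filled with adapted tokens."""
    return context_window / ratio


# Toy stand-ins (assumptions): UTF-8 bytes approximate a tokenizer with no
# Korean vocabulary; whitespace-split words approximate a Korean-aware one.
byte_level = lambda text: [str(b) for b in text.encode("utf-8")]
word_level = lambda text: text.split()

sample = ["한국어 모델", "토큰 비율"]
toy_ratio = token_ratio(sample, word_level, byte_level)
```

Under this definition, and taking a 4096-token window purely as an example, a ratio of 0.57 means the window holds text that would need roughly 7,186 base-tokenizer tokens.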
Results
KoBEST AVG (pretrained)
SFT
Initialization: eval loss
Pretraining tokens
Training compute
Token Ratio (TR)
Who Should Care
What To Try In 7 Days
Train a 20k-vocab SentencePiece on your Korean corpus and measure Token Ratio vs base tokenizer.
Implement token-decomposition initialization (subtoken average + LM-head) for new language tokens and run short CLM checks.
Run a staged training test: train embedding+head for one epoch, then a few transformer layers, then LoRA consolidation.
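The staged test in the last item can be prototyped as a freeze plan before touching any training code. The sketch below assumes a four-stage ordering (embedding and head, odd layers, even layers, then LoRA adapters over the full model) and parameter names such as `embed_tokens`, `lm_head`, and `layers.{i}.` that follow a common decoder naming pattern. The stage names, the helpers `trainable_in_stage` and `freeze_plan`, and the choice to count layers from 0 are all assumptions rather than the paper's specification.

```python
from typing import Dict, List

STAGES = ["embed_head", "odd_layers", "even_layers", "lora_full"]


def trainable_in_stage(param_name: str, stage: str) -> bool:
    """Decide whether a parameter is updated in the given stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    is_lora = "lora_" in param_name
    if stage == "lora_full":
        return is_lora  # only adapter weights train in the final stage
    if is_lora:
        return False  # adapters stay frozen in the earlier stages
    if stage == "embed_head":
        return param_name.startswith(("embed_tokens", "lm_head"))
    if not param_name.startswith("layers."):
        return False
    layer_index = int(param_name.split(".")[1])
    wants_odd = stage == "odd_layers"
    return (layer_index % 2 == 1) == wants_odd


def freeze_plan(param_names: List[str]) -> Dict[str, List[str]]:
    """Map each stage to the parameters that stay trainable in it."""
    return {s: [n for n in param_names if trainable_in_stage(n, s)] for s in STAGES}


# Hypothetical parameter names for a 4-layer toy model.
toy_params = ["embed_tokens.weight", "lm_head.weight"]
toy_params += [f"layers.{i}.mlp.weight" for i in range(4)]
toy_params += [f"layers.{i}.attn.lora_A" for i in range(4)]
plan = freeze_plan(toy_params)
```

With PyTorch, the same predicate could drive `param.requires_grad` while iterating over `model.named_parameters()` at the start of each stage; the actual parameter names depend on the model implementation.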
Optimization Features
Token Efficiency
- custom tokenizer reduces token count (TR ≈0.57)
Infra Optimization
- able to run on a single H100 or smaller multi-GPU setups; avoids OOM via staged updates
Model Optimization
- token-decomposition initialization for embedding and LM head
- vocabulary pruning to 20k tokens
System Optimization
- LoRA
Training Optimization
- LoRA
- use FlashAttention2 and SDPA (scaled dot-product attention) to reduce memory
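For context on why LoRA helps on limited hardware: a rank-r adapter on a d_out × d_in weight matrix trains r·(d_in + d_out) values instead of d_out·d_in. The short sketch below works through that arithmetic. The dimensions and rank are illustrative assumptions, not the paper's settings.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the dense weight matrix that the adapter sits beside."""
    return d_in * d_out


# Illustrative sizes (assumptions): one 4096 x 4096 projection, rank 16.
d_in, d_out, rank = 4096, 4096, 16
lora_count = lora_trainable_params(d_in, d_out, rank)
full_count = full_params(d_in, d_out)
fraction = lora_count / full_count
```

For these example sizes the adapter trains 131,072 values against 16,777,216 in the dense matrix, under 1% of the original.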
Reproducibility
Data Urls
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Corpus filtering lacks a formal quantitative ablation; filtered selection may reduce diversity or remove useful samples.
- Staged training ordering (odd/even layers) is proposed but not compared to other layer schedules.
- Financial QA datasets are proprietary and may limit external replication of domain results.
When Not To Use
- When you have abundant compute and prefer training from scratch for full control.
- If your target domain requires very large, diverse web-scale corpora that may be harmed by aggressive filtering.
Failure Modes
- Filtered corpus may remove rare but important language patterns, hurting downstream generalization.
- A large fraction of new tokens has little overlap with the base vocab, so initialization may still be insufficient without substantially more training.
- Staged updates could converge to local optima if not tuned carefully (learning rates, schedules).
Core Entities
Models
- RedWhale
- yanolja/EEVE-Korean-10.8B-v1.0
- upstage/SOLAR-10.7B-v1.0
- mistralai/Mistral-7B-v0.1
- meta-llama/Llama-2-7b-hf
- beomi/open-llama-2-ko-7b
Metrics
- CLM Loss
- Evaluation Loss
- Accuracy
Datasets
- AI Hub (Korean)
- Web-crawled Korean data (authors)
- SFT
- KoBEST benchmark
- Financial QA-1 (proprietary)
- Financial QA-2 (proprietary)
Benchmarks
- KoBEST
Context Entities
Models
- yanolja/EEVE-Korean-2.8B-v1.0
- upstage/SOLAR-10.7B-Instruct-v1.0
- beomi/OPEN-SOLAR-KO-10.7B
- microsoft/Phi-2
Metrics
- Token Ratio (TR)
Datasets
- Korean Wikipedia
- AI Hub public corpora
- Modu Corpus
Benchmarks
- internal Financial QA tasks

