RedWhale: adapt an English LLM to Korean with small-data continual pretraining and tokenizer tweaks

August 21, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.7

Citation Count

0

Authors

Anh-Dung Vo, Minseong Jung, Wonbeen Lee, Daewoo Choi

Why It Matters For Business

You can adapt an English LLM to Korean with far less compute by filtering data, using a Korean-aware tokenizer, initializing new tokens smartly, and training in stages. This lowers cost and makes deployment feasible for teams without massive GPU clusters.

Summary TLDR

RedWhale adapts an English LLM to Korean by (1) filtering a Korean corpus, (2) building a hybrid Korean-aware tokenizer, (3) initializing new token embeddings using subtoken + head decomposition, and (4) running a staged continual-pretraining schedule with LoRA. On KoBEST and internal Financial QA tasks, RedWhale matches or slightly beats the top Korean-adapted baseline (EEVE) after ~9.7B tokens and ~498 H100 GPU hours. Key wins: a token ratio of ~0.57 (fewer tokens per input than the base tokenizer), a practical initialization that cuts pretraining loss, and a low-memory staged workflow for limited hardware.

Problem Statement

English-centric LLMs struggle on Korean due to different token structure and limited Korean corpora. Training full models from scratch is costly and often impossible on limited hardware. The paper asks: can we efficiently adapt an English LLM to Korean with limited compute, a tailored tokenizer, careful data filtering, and staged continual pretraining?

Main Contribution

A four-step pipeline for language adaptation: corpus filtering, Korean tokenizer adaptation, five initialization methods for new tokens, and a multistage training schedule with LoRA.

A hybrid SentencePiece tokenizer tuned for Korean (vocab ≈20k) that reduces token counts relative to the base model's tokenizer (Token Ratio ≈0.57).

Empirical comparison of five initialization methods; token decomposition applied to both embedding and LM head performed best.

A staged training recipe (embedding/head → odd/even layers → full model with LoRA) intended to avoid OOM and fit limited GPUs.
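The staged recipe above can be sketched as a rule for choosing which parameters to unfreeze at each stage. This is a minimal illustration, not the authors' code: the function name `stage_trainable`, the stage labels, and the treatment of odd and even layers as two separate stages are assumptions, and the parameter names follow the Hugging Face Llama/Mistral-style layout (`model.embed_tokens`, `model.layers.N`, `lm_head`). The final LoRA stage is not covered here.

```python
import re

# Matches the layer index in names such as "model.layers.12.mlp.up_proj.weight".
LAYER_RE = re.compile(r"layers\.(\d+)\.")


def stage_trainable(param_names, stage):
    """Return the parameter names to unfreeze in one (hypothetical) stage."""
    if stage not in ("embed_head", "odd_layers", "even_layers"):
        raise ValueError(f"unknown stage: {stage}")
    selected = []
    for name in param_names:
        match = LAYER_RE.search(name)
        if stage == "embed_head":
            # Only the input embedding and the LM head are updated.
            keep = match is None and ("embed_tokens" in name or "lm_head" in name)
        elif stage == "odd_layers":
            keep = match is not None and int(match.group(1)) % 2 == 1
        else:  # even_layers
            keep = match is not None and int(match.group(1)) % 2 == 0
        if keep:
            selected.append(name)
    return selected


example_names = [
    "model.embed_tokens.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.1.mlp.up_proj.weight",
    "model.norm.weight",
    "lm_head.weight",
]
for stage in ("embed_head", "odd_layers", "even_layers"):
    print(stage, stage_trainable(example_names, stage))
```

In a real training loop, the returned names would decide which parameters get `requires_grad = True`. Updating only a subset at a time means optimizer state is held for that subset only, which is one plausible reading of how the staged recipe avoids OOM on limited GPUs.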

Key Findings

RedWhale's fine-tuned model (SFT) achieves a KoBEST average of 80.83%, slightly above EEVE's 79.42% on the evaluated tasks.

Numbers: KoBEST AVG: RedWhale-SFT 0.8083 vs. EEVE-SFT 0.7942

Token-decomposition initialization for both embedding and LM head cut pretraining eval loss from ~17.5 to ~11.1 and improved next-token accuracy from 0.0036 to 0.0930.

Numbers: Loss 17.5061→11.0905; Accuracy 0.0036→0.0930
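A toy sketch of what token-decomposition initialization might look like: a new Korean token is split into subtokens by the base tokenizer, and its new rows in the embedding matrix and the LM head are set from the corresponding base rows. Plain averaging for both matrices is an assumption here, and the paper's exact rule may differ. All names (`init_new_token`, `base_tokenize`, `base_vocab`) and the two-dimensional vectors are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(col) for col in zip(*rows)]


def init_new_token(new_token, base_tokenize, base_vocab, embed, lm_head):
    """Build embedding and LM-head rows for a new token from its base subtokens.

    base_tokenize: callable splitting a string into base-vocabulary pieces
    base_vocab:    dict mapping each piece to its row index
    embed/lm_head: lists of row vectors indexed by base token id
    """
    ids = [base_vocab[piece] for piece in base_tokenize(new_token)]
    if not ids:
        raise ValueError("new token produced no base subtokens")
    new_embed_row = mean_vector([embed[i] for i in ids])
    new_head_row = mean_vector([lm_head[i] for i in ids])
    return new_embed_row, new_head_row


# Toy base model: two single-character tokens with 2-dimensional rows.
base_vocab = {"한": 0, "국": 1}
embed = [[1.0, 0.0], [3.0, 2.0]]
lm_head = [[0.0, 2.0], [4.0, 0.0]]

# list("한국") acts as a stand-in base tokenizer that splits into characters.
embed_row, head_row = init_new_token("한국", list, base_vocab, embed, lm_head)
print(embed_row, head_row)
```

The design intent is that a new token starts near the meaning the base model already assigns to its pieces, instead of starting from noise, which is consistent with the lower initial loss reported above.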

Custom corpus filtering reduced raw Korean data from 121.6GB to 43.9GB (about 36% retained) and produced ~9.7B training tokens, as counted with the custom tokenizer.

Numbers: Raw 121.6GB → Processed 43.9GB; Train tokens ≈9.7B

The adapted tokenizer produced a Token Ratio of ≈0.57 (fewer tokens per input than the base tokenizer), enabling longer effective contexts and faster input processing.

Numbers: Token Ratio (TR) ≈0.57 average
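Token Ratio can be measured with a few lines once both tokenizers are available as callables. The sketch below assumes TR is the adapted tokenizer's token count divided by the base tokenizer's count, summed over a sample of texts; the paper may instead average per-document ratios. The function name and the toy tokenizers (whitespace split and character split) are stand-ins, not the real tokenizers.

```python
def token_ratio(texts, new_tokenize, base_tokenize):
    """Tokens from the adapted tokenizer divided by tokens from the base one."""
    new_total = sum(len(new_tokenize(t)) for t in texts)
    base_total = sum(len(base_tokenize(t)) for t in texts)
    if base_total == 0:
        raise ValueError("base tokenizer produced no tokens")
    return new_total / base_total


# Stand-in tokenizers: whitespace split plays the adapted tokenizer,
# per-character split plays a base tokenizer that fragments Korean heavily.
sample = ["한국어 모델"]
ratio = token_ratio(sample, str.split, list)
print(f"TR = {ratio:.2f}")
```

A ratio below 1.0 means the adapted tokenizer needs fewer tokens for the same text. At TR ≈0.57, a fixed context window holds roughly 1/0.57 ≈ 1.75 times as much Korean text.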

The reported pretraining schedule used one NVIDIA H100 for about 498 GPU hours.

Numbers: Training ≈498 H100 GPU hours; pretraining ≈700 ZFLOPS estimated
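For budgeting, the reported figures can be turned into a rough throughput estimate. This back-of-the-envelope sketch assumes a single pass over the ~9.7B tokens and that all ~498 GPU hours went to that pass; staged training may revisit data, so the result is an order-of-magnitude guide only.

```python
TRAIN_TOKENS = 9.7e9  # reported training tokens (custom tokenizer)
GPU_HOURS = 498       # reported H100 GPU hours


def throughput(tokens, gpu_hours):
    """Return (tokens per GPU hour, tokens per GPU second)."""
    per_hour = tokens / gpu_hours
    return per_hour, per_hour / 3600


per_hour, per_second = throughput(TRAIN_TOKENS, GPU_HOURS)
print(f"{per_hour / 1e6:.1f}M tokens per GPU hour")
print(f"{per_second:,.0f} tokens per GPU second")
```

Under these assumptions the figures work out to roughly 19.5M tokens per GPU hour, or about 5.4k tokens per second, which gives a starting point for estimating the cost of a similar run on a different corpus size.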

Results

KoBEST AVG (pretrained)

Value: 0.6672

Baseline: upstage/SOLAR-10.7B, 0.5023

KoBEST AVG (SFT)

Value: 0.8083

Baseline: yanolja/EEVE-Korean-10.8B-v1.0-SFT, 0.7942

Initialization: eval loss

Value: 11.0905

Baseline: random init, eval loss 17.5061

Pretraining tokens

Value: 9.7B

Training compute

Value: ≈498 GPU hours (H100)

Token Ratio (TR)

Value: 0.57

Baseline: base tokenizer, 1.0

Who Should Care

What To Try In 7 Days

Train a 20k-vocab SentencePiece on your Korean corpus and measure Token Ratio vs base tokenizer.

Implement token-decomposition initialization (subtoken average + LM-head) for new language tokens and run short CLM checks.

Run a staged training test: train embedding+head for one epoch, then a few transformer layers, then LoRA consolidation.

Optimization Features

Token Efficiency

  • custom tokenizer reduces token count (TR ≈0.57)

Infra Optimization

  • able to run on a single H100 or smaller multi-GPU setups; avoids OOM via staged updates

Model Optimization

  • token-decomposition initialization for embedding and LM head
  • vocabulary pruning to 20k tokens

System Optimization

  • LoRA

Training Optimization

  • LoRA
  • FlashAttention-2 and SDPA (scaled dot-product attention) to reduce memory

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Corpus filtering lacks a formal quantitative ablation; filtered selection may reduce diversity or remove useful samples.
  • Staged training ordering (odd/even layers) is proposed but not compared to other layer schedules.
  • Financial QA datasets are proprietary and may limit external replication of domain results.

When Not To Use

  • When you have abundant compute and prefer training from scratch for full control.
  • If your target domain requires very large, diverse web-scale corpora that may be harmed by aggressive filtering.

Failure Modes

  • Filtered corpus may remove rare but important language patterns, hurting downstream generalization.
  • A large fraction of new tokens has little overlap with the base vocabulary, so initialization may still be insufficient without substantially more training.
  • Staged updates could converge to local optima if not tuned carefully (learning rates, schedules).

Core Entities

Models

  • RedWhale
  • yanolja/EEVE-Korean-10.8B-v1.0
  • upstage/SOLAR-10.7B-v1.0
  • mistralai/Mistral-7B-v0.1
  • meta-llama/Llama-2-7b-hf
  • beomi/open-llama-2-ko-7b

Metrics

  • CLM Loss
  • Evaluation Loss
  • Accuracy

Datasets

  • AI Hub (Korean)
  • Web-crawled Korean data (authors)
  • SFT
  • KoBEST benchmark
  • Financial QA-1 (proprietary)
  • Financial QA-2 (proprietary)

Benchmarks

  • KoBEST

Context Entities

Models

  • yanolja/EEVE-Korean-2.8B-v1.0
  • upstage/SOLAR-10.7B-Instruct-v1.0
  • beomi/OPEN-SOLAR-KO-10.7B
  • microsoft/Phi-2

Metrics

  • Token Ratio (TR)

Datasets

  • Korean Wikipedia
  • AI Hub public corpora
  • Modu Corpus

Benchmarks

  • internal Financial QA tasks