Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.5
Citation Count
1
Why It Matters For Business
Private test sets and language-tailored benchmarks give more reliable model comparisons and prevent overfitting to public datasets, so product teams can trust model selection for Korean deployments.
Summary TLDR
This paper builds the Open Ko-LLM Leaderboard and the Ko-H5 benchmark to evaluate Korean LLMs. Ko-H5 reuses four English tasks (translated + human-reviewed) and adds Ko-CommonGen v2 (created from scratch) to increase task diversity. The authors keep test sets private, show overlap with popular Korean training sets is below 1%, analyze task correlations and temporal trends, and report quick saturation on some tasks (2–6 weeks to hit score 60). They recommend private tests, adding orthogonal tasks, tracking saturation, and community hygiene (model cards/licenses) to keep leaderboards useful.
Problem Statement
English-centric LLM benchmarks miss language-specific behaviors. Korean LLMs need a tailored, robust evaluation suite that avoids data leakage and captures diverse capabilities beyond standard English-based tasks.
Main Contribution
Open Ko-LLM Leaderboard: a Hugging Face-based leaderboard for Korean LLMs with private test sets.
Ko-H5 benchmark: five Korean tasks. Four (Ko-ARC, Ko-HellaSwag, Ko-MMLU, Ko-TruthfulQA) are translated from English tasks with human review, and Ko-CommonGen v2 is created from scratch.
Data-leakage analysis showing private tests have minimal overlap (<1%) with common training sets.
Multi-angle analyses: inter-task correlations, temporal score trends by model size/type, and saturation statistics leading to practical rules for expanding benchmarks.
Operational findings and community recommendations (model card quality, submission hygiene).
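The summary does not say how overlap between the private tests and training sets was measured. As a minimal sketch, assuming exact match after text normalization (an assumption, not the authors' method), a leakage check could look like the following. The function names and the toy sentences are hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC normalization, lowercase, and collapse whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).lower().split())


def overlap_rate(test_items: list[str], train_items: list[str]) -> float:
    """Fraction of test items whose normalized text appears verbatim in the training set."""
    if not test_items:
        return 0.0
    train_set = {normalize(t) for t in train_items}
    hits = sum(1 for t in test_items if normalize(t) in train_set)
    return hits / len(test_items)


# Hypothetical toy data, not drawn from Ko-H5 or any real corpus.
train = ["하늘은 왜 파란가요?", "서울은 한국의 수도입니다."]
test = ["하늘은  왜 파란가요?", "물은 몇 도에서 끓나요?"]
rate = overlap_rate(test, train)
print(f"overlap: {rate:.1%}")
```

Exact matching only catches verbatim copies. Paraphrased or partially copied items would need a fuzzier comparison such as n-gram overlap.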
Key Findings
Private Ko-H5 test sets have very low overlap with popular Korean training corpora.
Ko-CommonGen v2 adds an evaluation axis that is less correlated with truthfulness and partially orthogonal to reasoning tasks.
Some Ko-H5 tasks saturate quickly on the leaderboard.
Model size and training stage shape performance trends and correlations.
Instruction-tuned models closely follow pretrained model improvements with a short delay.
Many leaderboard submissions have documentation or availability issues.
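To illustrate what "less correlated" or "partially orthogonal" means for tasks, the sketch below computes pairwise Pearson correlations between per-task scores across models. The choice of Pearson correlation, the task names, and the scores are all assumptions made for illustration. They are not taken from the paper.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def task_correlations(scores: dict[str, list[float]]) -> dict[tuple[str, str], float]:
    """Correlation for every unordered pair of tasks; each list holds one score per model."""
    tasks = list(scores)
    return {
        (a, b): pearson(scores[a], scores[b])
        for i, a in enumerate(tasks)
        for b in tasks[i + 1:]
    }


# Hypothetical scores for four models on three placeholder tasks.
scores = {
    "task_a": [1.0, 2.0, 3.0, 4.0],
    "task_b": [2.0, 4.0, 6.0, 8.0],
    "task_c": [3.0, 1.0, 4.0, 2.0],
}
corr = task_correlations(scores)
for pair, r in corr.items():
    print(pair, round(r, 3))
```

A task whose correlation with the existing tasks is close to zero measures something the others do not, which is the case for adding it to a benchmark.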
Results
Private test overlap: below 1% with popular Korean training sets
Weeks to reach score 60: 2–6 on some tasks
Submission issues: model card problems
Who Should Care
What To Try In 7 Days
Add private holdout tests for your Korean evaluation to check leakage quickly.
Run per-task time-to-threshold tracking (e.g., weeks to score X) to detect saturation.
Include at least one generation/common-knowledge task (like CommonGen) to surface orthogonal capabilities.
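The time-to-threshold tracking suggested above can be sketched as follows. This assumes time is counted from a chosen start date, such as the leaderboard launch, to the first submission that meets the threshold. The function name, the dates, and the scores are hypothetical.

```python
from datetime import date
from typing import Optional


def weeks_to_threshold(
    submissions: list[tuple[date, float]],
    start: date,
    threshold: float = 60.0,
) -> Optional[float]:
    """Weeks from `start` to the earliest submission scoring >= threshold, or None."""
    hit_dates = [day for day, score in submissions if day >= start and score >= threshold]
    if not hit_dates:
        return None
    return (min(hit_dates) - start).days / 7


# Hypothetical submission log for a single task.
start = date(2024, 1, 1)
submissions = [
    (date(2024, 1, 8), 52.0),
    (date(2024, 1, 22), 61.5),
    (date(2024, 2, 5), 64.0),
]
weeks = weeks_to_threshold(submissions, start)
print(weeks)
```

Running this per task and comparing the results shows which tasks saturate first. A return value of None means the threshold has not been reached yet.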
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Ko-H5 inherits its structure from the English Open LLM Leaderboard and is partly static, so its tasks risk saturating.
- The leaderboard caps submissions at 30B parameters, so it cannot evaluate larger models.
- Temporal analyses cover only a little over four months; longer trends may differ.
- Ko-HellaSwag received little manual review due to cost; quality could improve with more human curation.
When Not To Use
- When you need evaluation of models >30B parameters (leaderboard cap).
- When you require fully public test sets for reproducible public benchmarking (Ko-H5 tests are private).
- As a final arbiter of real-time model quality: the leaderboard is a snapshot and evolves over time.
Failure Modes
- Performance saturation on easy/common-knowledge tasks reduces discriminative power.
- Translation or cultural mismatch despite human review may introduce noise.
- Model-card and hosting issues reduce reproducibility and block reuse.
- Temporal spikes tied to community releases may bias perceived progress.
Core Entities
Models
- submitted Korean LLMs on Open Ko-LLM Leaderboard
- pretrained models
- instruction-tuned models
- RL-tuned models (not analyzed deeply)
Metrics
- Ko-H5 aggregated score
- per-task scores
- time-to-threshold (weeks to score 60)
Datasets
- Ko-H5
- Ko-ARC
- Ko-HellaSwag
- Ko-MMLU
- Ko-TruthfulQA
- Ko-CommonGen v2
Benchmarks
- Open Ko-LLM Leaderboard
- Open LLM Leaderboard (English)

