Ko-H5 and an open Korean LLM leaderboard: private tests, new Korean tasks, and when benchmarks stop helping

May 31, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The leaderboard and Ko-H5 are practical for Korean model evaluation today, but private tests and some datasets are proprietary or limited; evidence is moderate and mostly descriptive.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 50%

Authors

Chanjun Park, Hyeonwoo Kim, Dahyun Kim, Seonghwan Cho, Sanghoon Kim, Sukyung Lee, Yungi Kim, Hwalsuk Lee

Links

Abstract / PDF / Code

Why It Matters For Business

Private test sets and language-tailored benchmarks give more reliable model comparisons and prevent overfitting to public datasets, so product teams can trust model selection for Korean deployments.

Summary TLDR

This paper builds the Open Ko-LLM Leaderboard and the Ko-H5 benchmark to evaluate Korean LLMs. Ko-H5 reuses four English tasks (translated and human-reviewed) and adds Ko-CommonGen v2, created from scratch, to increase task diversity. The authors keep the test sets private, show that overlap with popular Korean training sets is below 1%, analyze task correlations and temporal trends, and report quick saturation on some tasks: submissions reached a score of 60 in about 2 weeks on Ko-CommonGen v2 and about 6 weeks on Ko-HellaSwag. They recommend private tests, adding orthogonal tasks, tracking saturation, and community hygiene (model cards and licenses) to keep leaderboards useful.

Problem Statement

English-centric LLM benchmarks miss language-specific behaviors. Korean LLMs need a tailored, robust evaluation suite that avoids data leakage and captures diverse capabilities beyond standard English-based tasks.

Main Contribution

Open Ko-LLM Leaderboard: a Hugging Face-based leaderboard for Korean LLMs with private test sets.

Ko-H5 benchmark: five Korean tasks (Ko-ARC, Ko-HellaSwag, Ko-MMLU, Ko-TruthfulQA, Ko-CommonGen v2). Four are assembled through machine and human translation; Ko-CommonGen v2 is curated natively.

Key Findings

Private Ko-H5 test sets have very low overlap with popular Korean training corpora.

Numbers: Overlap < 1% across tasks (Table 2)

Practical Use: Use private test sets to reduce data contamination risk when evaluating open models; expect substantially less leakage than with public benchmarks.

Evidence Ref: Table 2
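A contamination check of this kind can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the summary does not say which similarity measure the paper uses, so character n-gram Jaccard similarity is an assumption, chosen because it needs no Korean tokenizer. The function names (`char_ngrams`, `jaccard`, `overlap_rate`) are hypothetical. The default threshold of 0.05 is borrowed from the Results table, though it may not mean the same thing under this measure.

```python
from typing import Iterable, List, Set


def char_ngrams(text: str, n: int = 5) -> Set[str]:
    """Return the set of character n-grams of a whitespace-normalized string."""
    normalized = " ".join(text.split())
    if len(normalized) < n:
        # Short strings are treated as a single gram (empty strings give no grams).
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def overlap_rate(
    test_items: Iterable[str],
    train_items: Iterable[str],
    threshold: float = 0.05,  # assumption: value borrowed from the Results table
    n: int = 5,
) -> float:
    """Fraction of test items whose similarity to any training item is >= threshold."""
    train_grams: List[Set[str]] = [char_ngrams(t, n) for t in train_items]
    tests = list(test_items)
    if not tests:
        return 0.0
    flagged = 0
    for item in tests:
        grams = char_ngrams(item, n)
        if any(jaccard(grams, tg) >= threshold for tg in train_grams):
            flagged += 1
    return flagged / len(tests)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    test = ["서울은 대한민국의 수도이다.", "고양이는 포유류에 속한다."]
    train = ["서울은 대한민국의 수도이다. 인구가 많다.", "오늘 날씨가 맑다."]
    print(overlap_rate(test, train))  # 0.5: only the first test item is flagged
```

The pairwise loop is quadratic in corpus size. For real training corpora, an approximate method such as MinHash would likely be needed.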

Ko-CommonGen v2 adds an evaluation axis that is less correlated with truthfulness and partially orthogonal to reasoning tasks.

Numbers: Ko-CommonGen v2 shows moderate correlation with ARC/HellaSwag/MMLU and low correlation with TruthfulQA (Figure 2)

Practical Use: Add generation-style and commonsense tasks (like CommonGen) to catch capabilities that standard reasoning/QA tasks miss.

Evidence Ref: Figure 2
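To run a similar orthogonality check on your own evaluation suite, one option is a task-by-task correlation matrix over per-model scores. The sketch below uses Pearson correlation, which is an assumption, since the summary does not state which coefficient the paper uses. The function names and the scores in the usage example are invented for illustration.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(scores: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise Pearson correlation between tasks.

    `scores` maps task name -> one score per model, in the same model order.
    """
    tasks = list(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in tasks} for a in tasks}


if __name__ == "__main__":
    # Hypothetical per-model scores, NOT leaderboard data.
    scores = {
        "task_a": [40.0, 45.0, 52.0, 58.0],
        "task_b": [38.0, 44.0, 50.0, 57.0],
        "task_c": [61.0, 48.0, 55.0, 50.0],
    }
    for task, row in correlation_matrix(scores).items():
        print(task, {k: round(v, 2) for k, v in row.items()})
```

A new task that correlates near 1.0 with every existing task adds little discriminative signal. Low or moderate correlation suggests it measures something the others do not.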

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Private test overlap | All overlaps < 1% | | | Ko-H5 vs popular training sets | Aggressive dedup with similarity threshold 0.05; all overlaps under 1% (Table 2) | Table 2 |
| Weeks to reach score 60 | Ko-CommonGen v2 ≈2; Ko-HellaSwag ≈6; Ko-TruthfulQA ≈13; Ko-ARC/Ko-MMLU not reached | | | Individual Ko-H5 tasks | Time-to-60 reported per task (Table 3, Figure 7) | Table 3 |

What To Try In 7 Days

Add private holdout tests for your Korean evaluation to check leakage quickly.

Run per-task time-to-threshold tracking (e.g., weeks to score X) to detect saturation.

Include at least one generation/common-knowledge task (like CommonGen) to surface orthogonal capabilities.
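The time-to-threshold tracking suggested above can be sketched as a small helper. This is a sketch under assumptions, not the paper's procedure: the function name, the `(date, score)` input format, and the choice to measure from a caller-supplied start date are all hypothetical. The default threshold of 60 mirrors the score used in the summary.

```python
from datetime import date
from typing import List, Optional, Tuple


def weeks_to_threshold(
    submissions: List[Tuple[date, float]],
    start: date,
    threshold: float = 60.0,
) -> Optional[float]:
    """Weeks from `start` until the first submission scoring >= threshold.

    Returns None if no submission on or after `start` reaches the threshold.
    """
    hit_dates = [d for d, score in submissions if d >= start and score >= threshold]
    if not hit_dates:
        return None
    return (min(hit_dates) - start).days / 7


if __name__ == "__main__":
    # Hypothetical submission log for one task, NOT leaderboard data.
    launch = date(2024, 1, 1)
    log = [
        (date(2024, 1, 8), 55.0),
        (date(2024, 1, 15), 61.0),
        (date(2024, 2, 1), 70.0),
    ]
    print(weeks_to_threshold(log, launch))  # 2.0
```

Running this per task and comparing the results gives a rough saturation signal: a task that crosses the threshold within a few weeks is likely to stop separating models soon.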

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Ko-H5 inherits its structure from the English Open LLM Leaderboard and is partly static, so its tasks risk saturating.

The leaderboard caps submissions at 30B parameters, so it cannot evaluate larger models.

When Not To Use

When you need evaluation of models >30B parameters (leaderboard cap).

When you require fully public test sets for reproducible public benchmarking (Ko-H5 tests are private).

Failure Modes

Performance saturation on easy/common-knowledge tasks reduces discriminative power.

Translation errors or cultural mismatches that survive human review may introduce noise.

Core Entities

Models

Submitted Korean LLMs on the Open Ko-LLM Leaderboard

Pretrained models

Instruction-tuned models

RL-tuned models (not analyzed deeply)

Metrics

Ko-H5 aggregated score

Per-task scores

Time-to-threshold (weeks to score 60)

Datasets

Ko-H5

Ko-ARC

Ko-HellaSwag

Ko-MMLU

Ko-TruthfulQA

Ko-CommonGen v2

Benchmarks

Open Ko-LLM Leaderboard

Open LLM Leaderboard (English)