Ko-H5 and an open Korean LLM leaderboard: private tests, new Korean tasks, and when benchmarks stop helping

May 31, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.5

Citation Count

1

Authors

Chanjun Park, Hyeonwoo Kim, Dahyun Kim, Seonghwan Cho, Sanghoon Kim, Sukyung Lee, Yungi Kim, Hwalsuk Lee

Links

Abstract / PDF

Why It Matters For Business

Private test sets and language-tailored benchmarks give more reliable model comparisons and prevent overfitting to public datasets, so product teams can trust model selection for Korean deployments.

Summary TLDR

This paper builds the Open Ko-LLM Leaderboard and the Ko-H5 benchmark to evaluate Korean LLMs. Ko-H5 reuses four English tasks (translated + human-reviewed) and adds Ko-CommonGen v2 (created from scratch) to increase task diversity. The authors keep test sets private, show that test-set overlap with popular Korean training sets is below 1%, analyze task correlations and temporal trends, and report quick saturation on some tasks (2 to 6 weeks to reach a score of 60). They recommend private tests, adding orthogonal tasks, tracking saturation, and community hygiene (model cards/licenses) to keep leaderboards useful.

Problem Statement

English-centric LLM benchmarks miss language-specific behaviors. Korean LLMs need a tailored, robust evaluation suite that avoids data leakage and captures diverse capabilities beyond standard English-based tasks.

Main Contribution

Open Ko-LLM Leaderboard: a Hugging Face-based leaderboard for Korean LLMs with private test sets.

Ko-H5 benchmark: five Korean tasks (Ko-ARC, Ko-HellaSwag, Ko-MMLU, Ko-TruthfulQA, Ko-CommonGen v2) assembled via machine+human translation and some native curation.

Data-leakage analysis showing private tests have minimal overlap (<1%) with common training sets.

Multi-angle analyses: inter-task correlations, temporal score trends by model size/type, and saturation statistics leading to practical rules for expanding benchmarks.

Operational findings and community recommendations (model card quality, submission hygiene).

Key Findings

Private Ko-H5 test sets have very low overlap with popular Korean training corpora.

Numbers: Overlap < 1% across tasks (Table 2)
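This summary does not describe how the paper measured overlap, so the following is only a minimal sketch of one common approach: normalize the text and count test items that appear verbatim inside any training document. The function names and the sample sentences are hypothetical and used purely for illustration.

```python
import unicodedata


def normalize(text: str) -> str:
    """NFKC-normalize, lowercase, and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).lower().split())


def overlap_rate(test_items, training_texts):
    """Fraction of test items found verbatim (after normalization)
    inside at least one training text.

    Assumption: exact substring match; the paper's own criterion may differ.
    """
    if not test_items:
        return 0.0
    corpus = [normalize(doc) for doc in training_texts]
    hits = 0
    for item in test_items:
        needle = normalize(item)
        # Skip empty items so that "" does not trivially match every document.
        if needle and any(needle in doc for doc in corpus):
            hits += 1
    return hits / len(test_items)


if __name__ == "__main__":
    # Hypothetical data, not taken from Ko-H5 or any real training corpus.
    tests = ["서울은 한국의 수도이다.", "고양이는 귀엽다."]
    train = ["뉴스: 서울은  한국의 수도이다. 인구는 많다."]
    print(f"overlap: {overlap_rate(tests, train):.1%}")
```

Exact substring matching is cheap but misses paraphrased leakage; an n-gram or fuzzy match would be a stricter variant of the same check.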

Ko-CommonGen v2 adds an evaluation axis that is less correlated with truthfulness and partially orthogonal to reasoning tasks.

Numbers: Ko-CommonGen v2 shows moderate correlation with ARC/HellaSwag/MMLU and low correlation with TruthfulQA (Figure 2)
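To check whether a candidate task adds a new evaluation axis, one option is to correlate per-model scores between every pair of tasks. The sketch below assumes Pearson correlation (the summary does not say which coefficient the paper used) and uses made-up scores; `task_correlations` is a hypothetical helper name.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def task_correlations(scores):
    """scores maps task name -> list of scores, one per model, in the same model order.
    Returns a dict keyed by (task_a, task_b) pairs."""
    return {
        (a, b): pearson(scores[a], scores[b])
        for a, b in combinations(scores, 2)
    }


if __name__ == "__main__":
    # Illustrative numbers only; these are not leaderboard results.
    scores = {
        "task_a": [40.0, 45.0, 52.0, 58.0],
        "task_b": [38.0, 44.0, 50.0, 59.0],
        "task_c": [47.0, 41.0, 49.0, 43.0],
    }
    for pair, r in task_correlations(scores).items():
        print(pair, round(r, 2))
```

A task whose correlations with the existing suite stay low is a candidate for adding discriminative power rather than repeating what the other tasks already measure.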

Some Ko-H5 tasks saturate quickly on the leaderboard.

Numbers: Weeks to reach score 60: Ko-CommonGen v2 ≈2w, Ko-HellaSwag ≈6w, Ko-TruthfulQA ≈13w, Ko-ARC/Ko-MMLU not reached (Table 3)

Model size and training stage shape performance trends and correlations.

Numbers: Smaller models (0–3B) lag and show negative/low correlations across some tasks; mid/large models (3–14B) show higher, in

Instruction-tuned models closely follow pretrained model improvements with a short delay.

Numbers: High time-series correlation at 0–2 week lag between pretrained and instruction-tuned scores (Figure 6)
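A lagged correlation like the one described here can be approximated by shifting one weekly score series against the other and correlating the aligned parts. This is a rough sketch under assumptions: Pearson correlation, weekly aggregation, and the helper name `lagged_correlation` are all illustrative choices, and the series are synthetic.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def lagged_correlation(leader, follower, max_lag):
    """Correlate leader[t] with follower[t + lag] for lag = 0..max_lag (in weeks).
    Returns a dict mapping lag -> correlation."""
    result = {}
    for lag in range(max_lag + 1):
        shifted = follower[lag:]
        n = min(len(leader), len(shifted))
        result[lag] = pearson(leader[:n], shifted[:n])
    return result


if __name__ == "__main__":
    # Synthetic weekly scores: the follower repeats the leader one week later.
    pretrained = [1, 3, 2, 5, 4, 6, 8, 7]
    instruction_tuned = [0] + pretrained[:-1]
    by_lag = lagged_correlation(pretrained, instruction_tuned, max_lag=3)
    best = max(by_lag, key=by_lag.get)
    print("lag with highest correlation:", best)
```

With only a few months of weekly points, each lag is estimated from a handful of samples, so the result is best read as a rough indication rather than a precise delay.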

Many leaderboard submissions have documentation or availability issues.

Numbers: Model-card related issues in 62.3% of 772 submissions (Table 4)

Results

Private test overlap

Value: All overlaps < 1%

Weeks to reach score 60

Value: Ko-CommonGen v2 ≈2; Ko-HellaSwag ≈6; Ko-TruthfulQA ≈13; Ko-ARC/Ko-MMLU not reached

Submission issues: model card problems

Value: 62.3% of submissions

Who Should Care

What To Try In 7 Days

Add private holdout tests for your Korean evaluation to check leakage quickly.

Run per-task time-to-threshold tracking (e.g., weeks to score X) to detect saturation.

Include at least one generation/common-knowledge task (like CommonGen) to surface orthogonal capabilities.
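The time-to-threshold tracking suggested above can be sketched as follows. This assumes a task counts as having reached the threshold the first time any submission scores at or above it; whether the paper uses the best score or an average is not stated in this summary. The function name, dates, and scores are hypothetical.

```python
from datetime import date


def weeks_to_threshold(submissions, start, threshold=60.0):
    """Weeks from `start` until a task's score first reaches `threshold`.

    submissions: iterable of (submission_date, task_name, score).
    Returns task_name -> weeks as a float, or None if never reached.
    """
    first_hit = {}
    tasks = set()
    for day, task, score in submissions:
        tasks.add(task)
        if score >= threshold and (task not in first_hit or day < first_hit[task]):
            first_hit[task] = day
    return {
        task: (first_hit[task] - start).days / 7 if task in first_hit else None
        for task in tasks
    }


if __name__ == "__main__":
    # Hypothetical submission log, not real leaderboard data.
    launch = date(2024, 1, 1)
    log = [
        (date(2024, 1, 8), "generation", 55.0),
        (date(2024, 1, 15), "generation", 61.0),
        (date(2024, 2, 12), "commonsense", 60.0),
        (date(2024, 1, 20), "reasoning", 40.0),
    ]
    for task, weeks in sorted(weeks_to_threshold(log, launch).items()):
        print(task, "not reached" if weeks is None else f"{weeks:.1f} weeks")
```

Tasks that cross the threshold within a few weeks are the ones most likely to lose discriminative power first, which makes them natural candidates for replacement or for a harder companion task.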

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Ko-H5 inherits its structure from the English Open LLM Leaderboard and is partly static, so its tasks risk saturating.
  • The leaderboard caps submissions at 30B parameters, so it cannot evaluate larger models.
  • Temporal analyses cover only a little over four months; longer trends may differ.
  • Ko-HellaSwag received little manual review due to cost; quality could improve with more human curation.

When Not To Use

  • When you need evaluation of models >30B parameters (leaderboard cap).
  • When you require fully public test sets for reproducible public benchmarking (Ko-H5 tests are private).
  • As a final arbiter of real-time model quality: the leaderboard is a snapshot and evolves over time.

Failure Modes

  • Performance saturation on easy/common-knowledge tasks reduces discriminative power.
  • Translation or cultural mismatch despite human review may introduce noise.
  • Model-card and hosting issues reduce reproducibility and block reuse.
  • Temporal spikes tied to community releases may bias perceived progress.

Core Entities

Models

  • submitted Korean LLMs on Open Ko-LLM Leaderboard
  • pretrained models
  • instruction-tuned models
  • RL-tuned models (not analyzed deeply)

Metrics

  • Ko-H5 aggregated score
  • per-task scores
  • time-to-threshold (weeks to score 60)

Datasets

  • Ko-H5
  • Ko-ARC
  • Ko-HellaSwag
  • Ko-MMLU
  • Ko-TruthfulQA
  • Ko-CommonGen v2

Benchmarks

  • Open Ko-LLM Leaderboard
  • Open LLM Leaderboard (English)