Ko-H5 and an open Korean LLM leaderboard: private tests, new Korean tasks, and when benchmarks stop helping

Overview

Decision Snapshot: Needs Validation

The leaderboard and Ko-H5 are practical for Korean model evaluation today, but private tests and some datasets are proprietary or limited; evidence is moderate and mostly descriptive.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 50%

Authors

Chanjun Park, Hyeonwoo Kim, Dahyun Kim, Seonghwan Cho, Sanghoon Kim, Sukyung Lee, Yungi Kim, Hwalsuk Lee

Links

Abstract / PDF / Code

Why It Matters For Business

Private test sets and language-tailored benchmarks give more reliable model comparisons and prevent overfitting to public datasets, so product teams can trust model selection for Korean deployments.

Who Should Care

CTO, Product Manager, ML Engineer, Founder, Data Scientist, Engineering Lead

Summary TLDR

This paper builds the Open Ko-LLM Leaderboard and the Ko-H5 benchmark to evaluate Korean LLMs. Ko-H5 reuses four English tasks (translated, then human-reviewed) and adds Ko-CommonGen v2 (created from scratch) to increase task diversity. The authors keep the test sets private, show that overlap with popular Korean training sets is below 1%, analyze task correlations and temporal trends, and report quick saturation on some tasks (Ko-CommonGen v2 and Ko-HellaSwag reach a score of 60 in roughly 2 and 6 weeks, respectively). They recommend private tests, adding orthogonal tasks, tracking saturation, and community hygiene (model cards and licenses) to keep leaderboards useful.

Problem Statement

English-centric LLM benchmarks miss language-specific behaviors. Korean LLMs need a tailored, robust evaluation suite that avoids data leakage and captures diverse capabilities beyond standard English-based tasks.

Main Contribution

Open Ko-LLM Leaderboard: a Hugging Face-based leaderboard for Korean LLMs with private test sets.

Ko-H5 benchmark: five Korean tasks (Ko-ARC, Ko-HellaSwag, Ko-MMLU, Ko-TruthfulQA, Ko-CommonGen v2). Four are translated from English with human review; Ko-CommonGen v2 is created from scratch.

Key Findings

Private Ko-H5 test sets have very low overlap with popular Korean training corpora.

Numbers: Overlap < 1% across tasks (Table 2)

Practical Use: Use private test sets to reduce data contamination risk when evaluating open models; expect substantially less leakage than with public benchmarks.

Evidence Ref: Table 2
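
A leakage check of this kind can be approximated in a few lines. The sketch below is a minimal illustration, not the authors' procedure: it assumes character 5-gram Jaccard similarity (chosen here because it needs no Korean tokenizer) and a hypothetical flagging threshold of 0.8, which is unrelated to the 0.05 threshold reported in the paper. The function names and sample sentences are made up for the example.

```python
def char_ngrams(text, n=5):
    """Return the set of character n-grams of whitespace-normalized text."""
    normalized = " ".join(text.split())
    return {normalized[i:i + n] for i in range(max(len(normalized) - n + 1, 0))}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def overlap_rate(test_items, train_items, threshold=0.8, n=5):
    """Fraction of test items with at least one near-duplicate in train_items.

    threshold is a hypothetical value for illustration only.
    """
    if not test_items:
        return 0.0
    train_grams = [char_ngrams(t, n) for t in train_items]
    flagged = 0
    for item in test_items:
        grams = char_ngrams(item, n)
        if any(jaccard(grams, tg) >= threshold for tg in train_grams):
            flagged += 1
    return flagged / len(test_items)


# Toy data: the first test item also appears verbatim in the training sample.
test_items = ["서울은 대한민국의 수도이다.", "고양이는 물을 싫어한다."]
train_items = ["서울은 대한민국의 수도이다.", "오늘 날씨가 좋다."]
rate = overlap_rate(test_items, train_items)
print(f"flagged fraction: {rate:.2f}")
```

The pairwise comparison is quadratic in corpus size, so a real training corpus would need an index (for example MinHash) rather than this brute-force loop.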

Ko-CommonGen v2 adds an evaluation axis that is less correlated with truthfulness and partially orthogonal to reasoning tasks.

Numbers: Ko-CommonGen v2 shows moderate correlation with ARC, HellaSwag, and MMLU, and low correlation with TruthfulQA (Figure 2)

Practical Use: Add generation-style and commonsense tasks (like CommonGen) to catch capabilities that standard reasoning/QA tasks miss.

Evidence Ref: Figure 2
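
To run a similar orthogonality check on your own task suite, one option is to correlate per-task scores across models. The sketch below assumes Pearson correlation; this summary does not state which correlation measure the paper uses, and the scores shown are invented placeholders, not leaderboard data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant scores
    return cov / math.sqrt(var_x * var_y)


def task_correlations(scores):
    """Map each unordered task pair to the correlation of its score lists.

    scores: {task_name: [score for model 0, score for model 1, ...]},
    with models in the same order for every task.
    """
    tasks = sorted(scores)
    return {
        (a, b): pearson(scores[a], scores[b])
        for i, a in enumerate(tasks)
        for b in tasks[i + 1:]
    }


# Placeholder scores for four hypothetical models on three hypothetical tasks.
scores = {
    "task_a": [40.0, 45.0, 50.0, 55.0],
    "task_b": [30.0, 40.0, 50.0, 60.0],
    "task_c": [52.0, 41.0, 47.0, 44.0],
}
for pair, r in task_correlations(scores).items():
    print(pair, round(r, 3))
```

A task whose scores correlate weakly with every other task is a candidate for adding a new evaluation axis; one that correlates near 1.0 with an existing task adds little discriminative signal.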

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Private test overlap	All overlaps < 1%	—	—	Ko-H5 vs popular training sets	Aggressive dedup with similarity threshold 0.05; all overlaps under 1% (Table 2)	Table 2
Weeks to reach score 60	Ko-CommonGen v2 ≈2; Ko-HellaSwag ≈6; Ko-TruthfulQA ≈13; Ko-ARC/Ko-MMLU not reached	—	—	individual Ko-H5 tasks	Time-to-60 reported per task (Table 3, Figure 7)	Table 3

What To Try In 7 Days

Add private holdout tests for your Korean evaluation to check leakage quickly.

Run per-task time-to-threshold tracking (e.g., weeks to score X) to detect saturation.

Include at least one generation/common-knowledge task (like CommonGen) to surface orthogonal capabilities.
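
The time-to-threshold tracking suggested above can be sketched as follows. This is a minimal illustration under stated assumptions: the submission log format (a list of date and score pairs per task) and the function name are hypothetical, and the threshold of 60 simply mirrors the "weeks to score 60" metric from the paper.

```python
from datetime import date


def weeks_to_threshold(history, threshold=60.0):
    """Weeks from the first submission until any score reaches threshold.

    history: list of (submission_date, score) pairs for one task.
    Returns None if the threshold is never reached.
    """
    if not history:
        return None
    ordered = sorted(history)  # chronological order
    start = ordered[0][0]
    for day, score in ordered:
        if score >= threshold:
            return (day - start).days / 7
    return None


# Invented submission log for one task (not leaderboard data).
history = [
    (date(2024, 1, 1), 40.0),
    (date(2024, 1, 15), 61.0),
    (date(2024, 1, 8), 55.0),
]
print(weeks_to_threshold(history))
```

A task that crosses the threshold within a few weeks of launch is likely to saturate early and may need a harder replacement or complement.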

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://huggingface.co/spaces/upstage/open-ko-llm-leaderboard

Risks & Boundaries

Limitations

Ko-H5 inherits its structure from the English Open LLM Leaderboard and is partly static, so tasks are at risk of saturation.

The leaderboard caps submissions at 30B parameters, so it cannot evaluate larger models.

When Not To Use

When you need evaluation of models >30B parameters (leaderboard cap).

When you require fully public test sets for reproducible public benchmarking (Ko-H5 tests are private).

Failure Modes

Performance saturation on easy/common-knowledge tasks reduces discriminative power.

Translation or cultural mismatch despite human review may introduce noise.

Core Entities

Models

submitted Korean LLMs on Open Ko-LLM Leaderboard, pretrained models, instruction-tuned models, RL-tuned models (not analyzed deeply)

Metrics

Ko-H5 aggregated score, per-task scores, time-to-threshold (weeks to score 60)

Datasets

Ko-H5, Ko-ARC, Ko-HellaSwag, Ko-MMLU, Ko-TruthfulQA, Ko-CommonGen v2

Benchmarks

Open Ko-LLM Leaderboard, Open LLM Leaderboard (English)

