KoLA: a focused, evolving benchmark that measures LLM world knowledge and flags hallucinated creations

June 15, 2023 | 8 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

24

Authors

Jifan Yu, Xiaozhi Wang, Shangqing Tu, Shulin Cao, Daniel Zhang-Li, Xin Lv, Hao Peng, Zijun Yao, Xiaohan Zhang, Hanming Li, Chunyang Li, Zheyuan Zhang, Yushi Bai, Yantao Liu, Amy Xin, Nianyi Lin, Kaifeng Yun, Linlu Gong, Jianhui Chen, Zhili Wu, Yunjia Qi, Weikai Li, Yong Guan, Kaisheng Zeng, Ji Qi, Hailong Jin, Jinxin Liu, Yu Gu, Yuan Yao, Ning Ding, Lei Hou, Zhiyuan Liu, Bin Xu, Jie Tang, Juanzi Li

Why It Matters For Business

KoLA gives a practical, evolving way to compare models on factual recall, understanding, reasoning, and creation while automatically flagging hallucinated facts. This helps when choosing models for QA, knowledge work, or content generation.

Summary TLDR

KoLA is a benchmark that targets LLMs' world knowledge. It organizes 19 tasks into a four-level taxonomy (memorize, understand, apply, create), uses both a "known" corpus (Wikipedia/Wikidata5M) and a periodically crawled "evolving" corpus, and introduces a contrastive scoring system plus a self-contrast metric (comparing free vs knowledge-grounded completions) to detect hallucination in generated knowledge. Authors ran two seasons evaluating 28 models and report practical findings about model size, instruction tuning, and open-source gaps. The benchmark and toolkit are maintained and updated every ~90 days.

Problem Statement

Current LLM benchmarks mix many tasks without modeling how knowledge abilities relate, and test sets can be leaked or stale. KoLA aims to (1) stratify knowledge abilities into four actionable levels, (2) pair "known" and "evolving" data to reduce training-data bias, and (3) provide comparable, automated metrics (standardized scores and a self-contrast measure) that highlight when generated knowledge is hallucinated.

Main Contribution

A four-level cognitive taxonomy for world knowledge: Knowledge Memorization, Understanding, Applying, Creating.

A dual data design: Known data (Wikipedia/Wikidata5M) plus an evolving corpus (≥500 recent articles per season) to test unseen and time-sensitive knowledge.

A contrastive evaluation system: standardized cross-task scores plus a self-contrast metric (compare free vs knowledge-grounded completions) to detect hallucination automatically.

A publicly maintained leaderboard and toolkit; two seasons of evaluations covering 28 open-source and commercial LLMs with diagnostic analyses.

Key Findings

Model size strongly predicts memorization for non-aligned models.

Numbers: Spearman ρ = 0.79 between Knowledge Memorization (KM) rank and model size (non-aligned models)

Instruction tuning (alignment) increases size correlation with higher-level abilities but can reduce memorization.

Numbers: the size correlation for Knowledge Applying (KA) rose from 0.02 to 0.53 after instruction tuning; the size correlation for Knowledge Memorization (KM) dropped to 0.34 (the "alignment tax")

Self-contrast metric relates to human judgments of faithfulness.

Numbers: Spearman ρ = 0.61 between ∂(T,T_k) and human faithfulness; removing it drops correlation with human quality by 32%

Open-source models trail commercial APIs on these knowledge tasks.

Numbers: Open-source average standardized z-score = -0.29 (below overall average)
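
The correlations reported above are Spearman rank correlations. As a rough illustration of how such a number is computed, here is a minimal pure-Python sketch. The model sizes and scores in it are made up for the example and are not taken from KoLA, and the helper names `average_ranks` and `spearman_rho` are hypothetical. It also correlates size with a score rather than with a leaderboard rank, so the sign would flip if ranks (1 = best) were used instead.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Illustrative (made-up) data: parameter counts in billions and a
# memorization score for six hypothetical non-aligned models.
sizes_b = [6, 7, 13, 20, 65, 130]
km_scores = [12.0, 10.5, 18.0, 17.0, 30.0, 41.0]
rho = spearman_rho(sizes_b, km_scores)
print(f"Spearman rho = {rho:.3f}")
```

The sketch does not guard against a constant input (which would divide by zero), so a real analysis would need that check or a library routine.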

Results

Self-contrast vs human faithfulness

Value: Spearman ρ = 0.61

Memorization–size correlation (non-aligned models)

Value: Spearman ρ = 0.79

Open-source average standardized z-score

Value: -0.29

Baseline: overall average 0.00

Instruction tuning effect on size correlation for KA

Value: rose from 0.02 to 0.53

Baseline: 0.02 (pre-tuning)

Who Should Care

What To Try In 7 Days

Run your top candidate models on KoLA's public tasks or examples to see which level (memorize/understand/apply/create) they struggle with.

Add a self-contrast check: generate a free completion and a knowledge-grounded completion for the same prompt, then compute their ROUGE-L similarity; low similarity suggests the free completion contains hallucinated facts.

If you rely on memorized facts, test both a base model and its instruction-tuned variant to measure any "alignment tax" on recall.
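
The self-contrast check suggested above can be prototyped in a few lines. The following is a minimal sketch, not KoLA's implementation: it assumes whitespace tokenization, uses the LCS-based ROUGE-L F-measure, and the function names and the `threshold` value are hypothetical choices that would need tuning on your own data.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F-measure on lowercased, whitespace-split tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def self_contrast(free_completion, grounded_completion, threshold=0.5):
    """Return (similarity, flagged). `threshold` is an assumed cutoff."""
    score = rouge_l_f1(free_completion, grounded_completion)
    return score, score < threshold

# Made-up example: the free completion disagrees with the grounded one
# on the place and the year.
free = "the treaty was signed in paris in 1951"
grounded = "the treaty was signed in rome in 1957"
score, flagged = self_contrast(free, grounded, threshold=0.8)
print(score, flagged)
```

Surface overlap is a blunt signal: two completions can share most tokens and still disagree on the one fact that matters, so flagged and unflagged samples are both worth spot-checking by hand.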

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Coverage limited to 19 English datasets focusing on entities, concepts, and events.
  • Evolving test sets are small (≈500 articles per season) and may not cover all domains.
  • Self-contrast can undervalue genuinely novel but correct model-generated knowledge that differs from human references.
  • Some tasks require long or structured inputs that exceed certain models' context windows, causing missing scores.

When Not To Use

  • When you need evaluations in non-English languages or multimodal tasks.
  • When your application depends on domain-specific knowledge not covered by KoLA's datasets.
  • If you require absolute, non-relative scoring without comparison to other models.

Failure Modes

  • Self-contrast may wrongly flag novel facts that are correct but absent from the references.
  • Instruction tuning may improve reasoning but reduce raw memorization (alignment tax).
  • Standardized scores depend on the model pool: adding or removing high performers shifts relative scores.
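
The pool dependence noted in the last item can be seen in a small sketch. It assumes the standardized score is a per-task z-score followed by min-max scaling to 0-100 over all z-scores in the pool; the exact scaling details are an assumption here, and the function name `standardize`, the model names, and the raw scores are all hypothetical.

```python
from statistics import mean, pstdev

def standardize(raw_by_task):
    """Map {task: {model: raw_score}} to {(task, model): score in 0-100}.

    Step 1: z-score each task across the models in the pool.
    Step 2: min-max scale all z-scores to the range 0-100.
    """
    z = {}
    for task, scores in raw_by_task.items():
        vals = list(scores.values())
        mu, sigma = mean(vals), pstdev(vals)
        for model, s in scores.items():
            z[(task, model)] = (s - mu) / sigma if sigma > 0 else 0.0
    lo, hi = min(z.values()), max(z.values())
    span = hi - lo
    # If every z-score is identical, fall back to the midpoint.
    return {k: 100.0 * (v - lo) / span if span > 0 else 50.0
            for k, v in z.items()}

# Made-up raw scores for one task and three models.
pool = {"task_a": {"m1": 40.0, "m2": 50.0, "m3": 60.0}}
before = standardize(pool)

# Add one strong model to the pool; the raw score of m2 is unchanged.
bigger_pool = {"task_a": {**pool["task_a"], "m4": 90.0}}
after = standardize(bigger_pool)

print(before[("task_a", "m2")], after[("task_a", "m2")])
```

In this toy pool, m2 moves from about 50 to about 20 without its raw score changing, which is why standardized scores from different seasons or model pools should not be compared directly.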

Core Entities

Models

  • GPT-4
  • GPT-3.5-turbo
  • InstructGPT davinci v2
  • GPT-3 davinci v1
  • Cohere-command
  • FLAN-UL2
  • FLAN-T5
  • LLaMA
  • Llama2-chat
  • GLM-130B
  • GPT-J
  • GPT-NeoX
  • BLOOM
  • Vicuna
  • Alpaca
  • Tulu
  • ChatGLM
  • ChatGLM2-32k
  • J2-Jumbo-Instruct
  • RedPajama-Instruct
  • Dolly-v2
  • UL2
  • GPT-JT
  • InternLM-chat-8k

Metrics

  • Standardized score (z-score, then min-max scaling to 0-100)
  • Exact Match (EM)
  • Token F1
  • Accuracy
  • ROUGE-L (used in self-contrast)
  • Precision/Recall/F1 for RE and event tasks
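
For readers unfamiliar with the answer-matching metrics in this list, here is a minimal sketch of Exact Match and Token F1. It assumes the common SQuAD-style normalization (lowercasing, stripping punctuation and English articles); whether KoLA normalizes answers this way is not stated here, so treat the details as an assumption. The function names are hypothetical.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    p, g = normalize(prediction).split(), normalize(gold).split()
    if not p or not g:
        # Both empty counts as a match; one empty counts as a miss.
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(token_f1("Paris, France", "Paris"))
```

Exact Match is strict and rewards only complete answers, while Token F1 gives partial credit for overlapping tokens, so reporting both helps separate near misses from outright errors.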

Datasets

  • Wikipedia (Wikidata5M subset)
  • Evolving news corpus (seasonal)
  • Evolving fiction corpus (AO3)
  • HotpotQA
  • 2WikiMultihopQA
  • MuSiQue
  • KQA Pro
  • KoRC
  • DocRED
  • FewNERD
  • MAVEN
  • MAVEN-ERE
  • COPEN
  • Encyclopedic (MAVEN subset)

Benchmarks

  • KoLA (this paper)
  • LAMA-style probing
  • HotpotQA
  • DocRED
  • KQA Pro
  • MuSiQue