KoLA: a focused, evolving benchmark that measures LLM world knowledge and flags hallucinated creations

Overview

Decision Snapshot: Ready For Pilot

KoLA is production-ready as an evaluation service and diagnostic tool; its automated self-contrast metric is validated against human judgments but has limitations for truly novel correct generations.

Citations: 24

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Jifan Yu, Xiaozhi Wang, Shangqing Tu, Shulin Cao, Daniel Zhang-Li, Xin Lv, Hao Peng, Zijun Yao, Xiaohan Zhang, Hanming Li, Chunyang Li, Zheyuan Zhang, Yushi Bai, Yantao Liu, Amy Xin, Nianyi Lin, Kaifeng Yun, Linlu Gong, Jianhui Chen, Zhili Wu, Yunjia Qi, Weikai Li, Yong Guan, Kaisheng Zeng, Ji Qi, Hailong Jin, Jinxin Liu, Yu Gu, Yuan Yao, Ning Ding, Lei Hou, Zhiyuan Liu, Bin Xu, Jie Tang, Juanzi Li

Why It Matters For Business

KoLA gives a practical, evolving way to compare models on factual recall, understanding, reasoning, and creation while flagging hallucinated facts automatically—helpful when choosing models for QA, knowledge work, or content generation.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder, Engineering Lead

Summary TLDR

KoLA is a benchmark that targets LLMs' world knowledge. It organizes 19 tasks into a four-level taxonomy (memorize, understand, apply, create), uses both a "known" corpus (Wikipedia/Wikidata5M) and a periodically crawled "evolving" corpus, and introduces a contrastive scoring system plus a self-contrast metric (comparing free vs knowledge-grounded completions) to detect hallucination in generated knowledge. Authors ran two seasons evaluating 28 models and report practical findings about model size, instruction tuning, and open-source gaps. The benchmark and toolkit are maintained and updated every ~90 days.

Problem Statement

Current LLM benchmarks mix many tasks without modeling how knowledge abilities relate, and test sets can be leaked or stale. KoLA aims to (1) stratify knowledge abilities into four actionable levels, (2) pair "known" and "evolving" data to reduce training-data bias, and (3) provide comparable, automated metrics (standardized scores and a self-contrast measure) that highlight when generated knowledge is hallucinated.

Main Contribution

A four-level cognitive taxonomy for world knowledge: Knowledge Memorization, Understanding, Applying, Creating.

A dual data design: Known data (Wikipedia/Wikidata5M) plus an evolving corpus (≥500 recent articles per season) to test unseen and time-sensitive knowledge.

Key Findings

Model size strongly predicts memorization for non-aligned models.

Numbers: Spearman ρ = 0.79 between Knowledge Memorization (KM) rank and model size (non-aligned models)

Practical Use: When you need factual recall from training data, prefer larger base models unless alignment steps have been applied.

Evidence Ref: Sec 3 (Overall Performance)

Instruction tuning (alignment) increases size correlation with higher-level abilities but can reduce memorization.

Numbers: Knowledge Applying (KA) size correlation rose from 0.02 to 0.53 after instruction tuning; Knowledge Memorization (KM) correlation dropped to 0.34 (alignment tax)

Practical Use: Expect instruction-tuned models to be better at reasoning and creative tasks, but test memorization-sensitive use cases separately.

Evidence Ref: Sec 3 (Overall Performance)
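The size correlations quoted above are Spearman rank correlations. The following is a minimal sketch of how such a figure could be computed for your own set of models, assuming you have one size value and one per-level score per model. The function names and the sample numbers are illustrative placeholders, not values or code from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical inputs: parameter counts (billions) and memorization scores.
model_sizes = [7, 13, 30, 65]
km_scores = [21.0, 35.5, 33.0, 48.2]
rho = spearman(model_sizes, km_scores)
```

Running this separately for base models and for their instruction-tuned variants is one way to check whether the pattern described above also shows up in your own candidates.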

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Self-contrast vs human faithfulness	Spearman ρ = 0.61	—	removing self-contrast -> 32% drop in correlation	Knowledge Creating tasks (KC)	Sec 3 (Design Analysis)	Sec 3
Memorization–size correlation (non-aligned models)	Spearman ρ = 0.79	—	—	Knowledge Memorization (KM)	Sec 3 (Overall Performance)	Sec 3, Table 2

What To Try In 7 Days

Run your top candidate models on KoLA's public tasks or examples to see which level (memorize/understand/apply/create) they struggle with.

Add a self-contrast check: generate free completion and a knowledge-grounded completion and compute ROUGE-L similarity to catch hallucinated facts.

If you rely on memorized facts, test both a base model and its instruction-tuned variant to measure any "alignment tax" on recall.
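The self-contrast check suggested above can be prototyped without external libraries. This is a minimal sketch, assuming whitespace tokenization, the F1 form of ROUGE-L, and a hypothetical flagging threshold of 0.5; the paper's exact tokenization, ROUGE-L variant, and any cutoff may differ, and the function names here are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between two strings (assumption: lowercase, split on whitespace)."""
    c = candidate.lower().split()
    r = reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision = lcs / len(c)
    recall = lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def self_contrast_flag(free_completion, grounded_completion, threshold=0.5):
    """Return (similarity, flagged).

    flagged is True when the free completion diverges from the
    knowledge-grounded one. The threshold is a hypothetical default
    and should be tuned on your own data.
    """
    score = rouge_l_f1(free_completion, grounded_completion)
    return score, score < threshold
```

A low similarity is only a signal to review the output: as noted under Failure Modes, a free completion can be correct and still differ from the grounded one.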

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Coverage limited to 19 English datasets focusing on entities, concepts, and events.

Evolving test sets are small (≈500 articles per season) and may not cover all domains.

When Not To Use

When you need evaluations in non-English languages or multimodal tasks.

When your application depends on domain-specific knowledge not covered by KoLA's datasets.

Failure Modes

Self-contrast flags may miss novel correct facts that are valid but not present in references.

Instruction tuning may improve reasoning but reduce raw memorization (alignment tax).

Core Entities

Models

GPT-4, GPT-3.5-turbo, InstructGPT davinci v2, GPT-3 davinci v1, Cohere-command, FLAN-UL2, FLAN-T5, LLaMa, Llama2-chat, GLM-130B, GPT-J, GPT-NeoX, BLOOM, Vicuna, Alpaca, Tulu, ChatGLM, ChatGLM2-32k, J2-Jumbo-Instruct, RedPajama-Instruct, Dolly-v2, UL2, GPT-JT, Internlm-chat-8k

Metrics

Standardized score (z then min-max to 0-100), Exact Match (EM), Token F1, Accuracy, ROUGE-L (used in self-contrast), Precision/Recall/F1 for RE and event tasks
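The standardized score listed above ("z then min-max to 0-100") can be sketched as follows. This assumes the z-score is taken per task across models using the population standard deviation, and that the min-max step is applied over all tasks' z-scores together; both choices are assumptions made for illustration, and the paper's exact formula should be checked before comparing against published numbers.

```python
from statistics import mean, pstdev


def standardize(raw):
    """Map raw task metrics to a shared 0-100 scale.

    raw: dict of task name -> dict of model name -> raw metric value.
    Returns a dict with the same shape holding standardized scores.
    """
    z = {}
    for task, scores in raw.items():
        values = list(scores.values())
        mu = mean(values)
        sigma = pstdev(values)  # assumption: population standard deviation
        z[task] = {
            model: ((v - mu) / sigma if sigma > 0 else 0.0)
            for model, v in scores.items()
        }

    # Assumption: min and max are taken over every task's z-scores at once.
    all_z = [s for task_scores in z.values() for s in task_scores.values()]
    lo, hi = min(all_z), max(all_z)
    if hi == lo:
        # Degenerate case (all models tie everywhere): place everyone mid-scale.
        return {t: {m: 50.0 for m in ms} for t, ms in z.items()}

    return {
        task: {model: 100.0 * (s - lo) / (hi - lo) for model, s in task_scores.items()}
        for task, task_scores in z.items()
    }
```

Standardizing per task first puts metrics with different ranges (for example EM and ROUGE-L) on a comparable footing before they are rescaled for reporting.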

Datasets

Wikipedia (Wikidata5M subset), Evolving news corpus (seasonal), Evolving fiction corpus (AO3), HotpotQA, 2WikiMultihopQA, MuSiQue, KQA Pro, KoRC, DocRED, FewNERD, MAVEN, MAVEN-ERE, COPEN, Encyclopedic (MAVEN subset)

Benchmarks

KoLA (this paper), LAMA-style probing, HotpotQA, DocRED, KQA Pro, MuSiQue
