Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
24
Why It Matters For Business
KoLA gives a practical, evolving way to compare models on factual recall, understanding, reasoning, and creation while automatically flagging hallucinated facts. This is helpful when choosing models for QA, knowledge work, or content generation.
Summary TLDR
KoLA is a benchmark that targets LLMs' world knowledge. It organizes 19 tasks into a four-level taxonomy (memorize, understand, apply, create), uses both a "known" corpus (Wikipedia/Wikidata5M) and a periodically crawled "evolving" corpus, and introduces a contrastive scoring system plus a self-contrast metric (comparing free vs. knowledge-grounded completions) to detect hallucination in generated knowledge. The authors ran two seasons of evaluation covering 28 models and report practical findings about model size, instruction tuning, and the gap between open-source and commercial models. The benchmark and toolkit are maintained and updated roughly every 90 days.
Problem Statement
Current LLM benchmarks mix many tasks without modeling how knowledge abilities relate, and test sets can be leaked or stale. KoLA aims to (1) stratify knowledge abilities into four actionable levels, (2) pair "known" and "evolving" data to reduce training-data bias, and (3) provide comparable, automated metrics (standardized scores and a self-contrast measure) that highlight when generated knowledge is hallucinated.
Main Contribution
A four-level cognitive taxonomy for world knowledge: Knowledge Memorization, Understanding, Applying, Creating.
A dual data design: Known data (Wikipedia/Wikidata5M) plus an evolving corpus (≥500 recent articles per season) to test unseen and time-sensitive knowledge.
A contrastive evaluation system: standardized cross-task scores plus a self-contrast metric (compare free vs knowledge-grounded completions) to detect hallucination automatically.
A publicly maintained leaderboard and toolkit; two seasons of evaluations covering 28 open-source and commercial LLMs with diagnostic analyses.
Key Findings
Model size strongly predicts memorization for non-aligned models.
Instruction tuning (alignment) strengthens the correlation between model size and higher-level abilities but can reduce memorization.
The self-contrast metric correlates with human judgments of faithfulness.
Open-source models trail commercial APIs on these knowledge tasks.
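The size findings above are statements about correlation across a pool of models. A minimal sketch of how such a check could be run on your own results is below. It assumes Spearman rank correlation (this summary does not state which coefficient the authors used), and the model sizes and scores are made-up illustrative numbers, not results from the paper. The helper names `average_ranks`, `pearson`, and `spearman` are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: parameter counts (billions) and memorization scores.
sizes_b = [6, 7, 20, 65, 130]
memorization = [12.0, 15.5, 14.0, 30.2, 41.0]
rho = spearman(sizes_b, memorization)
print(f"Spearman rho (size vs memorization): {rho:.2f}")
```

Running the same check separately on base models and on instruction-tuned models is one way to see whether the pattern described above holds for the models you care about.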
Results
Self-contrast vs human faithfulness
Memorization–size correlation (non-aligned models)
Open-source average standardized z-score
Instruction tuning effect on size correlation for Knowledge Applying (KA)
Who Should Care
What To Try In 7 Days
Run your top candidate models on KoLA's public tasks or examples to see which level (memorize/understand/apply/create) they struggle with.
Add a self-contrast check: generate a free completion and a knowledge-grounded completion for the same prompt, then compute their ROUGE-L similarity to catch hallucinated facts.
If you rely on memorized facts, test both a base model and its instruction-tuned variant to measure any "alignment tax" on recall.
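The self-contrast check suggested above can be sketched in a few lines. This is a simplified illustration, not KoLA's toolkit: the function names, the whitespace tokenization, and the two example completions are assumptions, and the paper's exact ROUGE-L configuration may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on lowercased whitespace tokens (a simplification)."""
    c = candidate.lower().split()
    r = reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision = lcs / len(c)
    recall = lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def self_contrast(free_completion, grounded_completion):
    """Similarity between the free and the knowledge-grounded completion."""
    return rouge_l_f1(free_completion, grounded_completion)


# Hypothetical completions produced by the same model for the same prompt:
# one generated freely, one generated with reference knowledge in context.
free = "The bridge opened in 1932 and was designed by John Smith"
grounded = "The bridge opened in 1937 and was designed by Joseph Strauss"
score = self_contrast(free, grounded)
print(f"self-contrast ROUGE-L F1: {score:.2f}")
```

A low score means the free completion diverges from what the model writes when given the relevant knowledge, which is a signal to inspect the free completion for invented facts. Any threshold for "low" is something you would need to calibrate on your own data.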
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Coverage limited to 19 English datasets focusing on entities, concepts, and events.
- Evolving test sets are small (≈500 articles per season) and may not cover all domains.
- Self-contrast can undervalue genuinely novel but correct model-generated knowledge that differs from human references.
- Some tasks require long or structured inputs that exceed certain models' context windows, causing missing scores.
When Not To Use
- When you need evaluations in non-English languages or multimodal tasks.
- When your application depends on domain-specific knowledge not covered by KoLA's datasets.
- If you require absolute, non-relative scoring without comparison to other models.
Failure Modes
- Self-contrast may wrongly flag novel facts that are correct but absent from the references.
- Instruction tuning may improve reasoning but reduce raw memorization (alignment tax).
- Standardized scores depend on the model pool: adding or removing high performers shifts relative scores.
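The pool dependence of standardized scores can be seen in a small sketch. It assumes the score is a per-task z-score across models followed by min-max scaling to 0-100 over all z-scores in the pool; the function name `standardized_scores`, the model and task names, and the raw numbers are hypothetical.

```python
from statistics import mean, pstdev


def standardized_scores(raw):
    """raw: {task: {model: raw_score}} -> {task: {model: score in 0..100}}."""
    z = {}
    for task, by_model in raw.items():
        mu = mean(by_model.values())
        sigma = pstdev(by_model.values())
        # Guard against a task on which every model scores the same.
        z[task] = {m: (x - mu) / sigma if sigma > 0 else 0.0
                   for m, x in by_model.items()}
    # Assumption: min and max are taken over all tasks and models together.
    all_z = [v for by_model in z.values() for v in by_model.values()]
    lo, hi = min(all_z), max(all_z)
    span = hi - lo
    return {task: {m: 100 * (v - lo) / span if span > 0 else 50.0
                   for m, v in by_model.items()}
            for task, by_model in z.items()}


# Made-up raw scores on two tasks with different scales.
pool = {
    "task_1": {"model_a": 40.0, "model_b": 55.0, "model_c": 70.0},
    "task_2": {"model_a": 0.2, "model_b": 0.5, "model_c": 0.8},
}
before = standardized_scores(pool)

# Add one strong model to the pool and rescore everyone.
bigger_pool = {
    "task_1": {**pool["task_1"], "model_d": 95.0},
    "task_2": {**pool["task_2"], "model_d": 0.95},
}
after = standardized_scores(bigger_pool)

print("model_b on task_1 before:", round(before["task_1"]["model_b"], 1))
print("model_b on task_1 after: ", round(after["task_1"]["model_b"], 1))
```

The raw score of `model_b` never changes, yet its standardized score drops once a stronger model joins the pool. Scores from different leaderboard snapshots are therefore only comparable if the model pool is the same.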
Core Entities
Models
- GPT-4
- GPT-3.5-turbo
- InstructGPT davinci v2
- GPT-3 davinci v1
- Cohere-command
- FLAN-UL2
- FLAN-T5
- LLaMA
- Llama2-chat
- GLM-130B
- GPT-J
- GPT-NeoX
- BLOOM
- Vicuna
- Alpaca
- Tulu
- ChatGLM
- ChatGLM2-32k
- J2-Jumbo-Instruct
- RedPajama-Instruct
- Dolly-v2
- UL2
- GPT-JT
- InternLM-chat-8k
Metrics
- Standardized score (z-score across models, then min-max scaled to 0-100)
- Exact Match (EM)
- Token F1
- Accuracy
- ROUGE-L (used in self-contrast)
- Precision/Recall/F1 for RE and event tasks
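For readers unfamiliar with Exact Match and Token F1 from the list above, the following sketch shows how they are commonly computed for short-answer QA. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles); KoLA's own normalization may differ, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred = normalize(prediction).split()
    ref = normalize(gold).split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = token_f1("Paris, France", "Paris")
print(f"EM={em:.0f}  F1={f1:.2f}")
```

Exact Match is strict and rewards only fully correct answers, while Token F1 gives partial credit for answers that contain the gold tokens along with extra words.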
Datasets
- Wikipedia (Wikidata5M subset)
- Evolving news corpus (seasonal)
- Evolving fiction corpus (AO3)
- HotpotQA
- 2WikiMultihopQA
- MuSiQue
- KQA Pro
- KoRC
- DocRED
- Few-NERD
- MAVEN
- MAVEN-ERE
- COPEN
- Encyclopedic (MAVEN subset)
Benchmarks
- KoLA (this paper)
- LAMA-style probing
- HotpotQA
- DocRED
- KQA Pro
- MuSiQue

