CausalBench: a 15‑dataset benchmark to measure LLM causal learning from correlation to full causal graphs

Overview

Decision Snapshot: Needs Validation

The benchmark gives clear, reproducible tasks and metrics, but results show LLMs are not yet a drop‑in replacement for dedicated causal discovery methods at medium and large scales.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 30%

Authors

Yu Zhou, Xingyu Wu, Beicheng Huang, Jibin Wu, Liang Feng, Kay Chen Tan

Links

Abstract / PDF / Data

Why It Matters For Business

If you plan causal discovery at realistic scale, don't rely solely on LLMs — they can help with small problems and chain reasoning but miss structure on large sparse graphs and add many false edges.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

CausalBench is a public benchmark built from 15 real-world causal datasets (2–109 nodes) to test how well LLMs learn causal structure. It defines three tasks—correlation, causal skeleton, and causality identification—plus a Chain-of-Thought (CoT) style chain task. The authors evaluate 19 LLMs (open and closed source) and find: closed-source models (GPT family) beat open-source ones but still lag classic and SOTA causal algorithms at >50 nodes; LLMs are strong at long causal chains (CoT-like) and direct correlations but fail on collider patterns and large sparse graphs; background text and numeric training data help only when variable names are clear and models can read numbers. Practical up‑

Problem Statement

Current LLM causal evaluations are narrow: small networks, single prompt formats, and few models. That leaves unanswered whether LLMs can recover causal graphs at realistic scales and whether text, background knowledge, or raw training data help. CausalBench fills that gap with diverse datasets, four prompt formats, and three core tasks to compare many LLMs against classical causal algorithms.

Main Contribution

CausalBench: 15 real‑world causal datasets from bnlearn covering 2–109 nodes.

Three core tasks: correlation identification, causal skeleton (undirected graph), and causality identification (directed edges).
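To make the three tasks concrete, the following is a minimal sketch of how ground-truth labels for each task could be derived from one directed graph. The toy graph and all helper names are hypothetical, not taken from the benchmark. The correlation rule used here (two variables count as correlated when they share an ancestor, or one is an ancestor of the other) is an assumption based on standard d-separation reasoning, not a definition quoted from the paper.

```python
from itertools import combinations

# Hypothetical toy DAG with a collider at C: A -> C <- B, C -> D
edges = {("A", "C"), ("B", "C"), ("C", "D")}
nodes = sorted({v for e in edges for v in e})

def ancestors(node, edges):
    """Return the node itself plus every node with a directed path into it."""
    seen, stack = {node}, [node]
    while stack:
        cur = stack.pop()
        for parent, child in edges:
            if child == cur and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def causality_labels(edges):
    """Causality identification: the directed edges themselves."""
    return set(edges)

def skeleton_labels(edges):
    """Causal skeleton: the same adjacencies with direction dropped."""
    return {frozenset(e) for e in edges}

def correlation_labels(nodes, edges):
    """Correlation (assumed rule): pairs whose ancestor sets overlap."""
    anc = {n: ancestors(n, edges) for n in nodes}
    return {frozenset((a, b)) for a, b in combinations(nodes, 2) if anc[a] & anc[b]}

print(sorted(tuple(sorted(p)) for p in correlation_labels(nodes, edges)))
```

In this toy graph A and B are both parents of the collider C but share no ancestor, so the pair A, B is not labeled as correlated even though both are adjacent to C.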

Key Findings

LLMs underperform classical and SOTA causal algorithms on medium and large graphs.

Numbers: At >50 nodes LLM methods often achieve <50% of classical/SOTA performance (reported averages).

Practical Use: Do not replace classical causal discovery tools with LLMs for graphs larger than ~50 nodes; use LLMs only as a lightweight aid or for small subgraphs.

Evidence Ref: Section V-D, Table XI; Conclusion

Closed‑source GPT models outperform open‑source LLMs across tasks and scales.

Numbers: Best LLM accuracies: small 65.28%, medium 74.70%, large 68.06%; higher than most open models (Section VI).

Practical Use: Expect better causal answers from the latest closed APIs; if using open models, validate heavily and focus on small/medium problems.

Evidence Ref: Section VI (A); Section III-E results

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Best LLM F1 (by dataset scale)	small: 0.7017, medium: 0.5825, large: 0.4310	classical/SOTA algorithms	substantially lower at medium+ scales	aggregate by dataset scale (small/medium/large)	Section VI (A) summary; reported F1 declines with dataset size.	Section VI
Direct correlation F1 (example)	Earthquake 0.5673 vs Hailfinder 0.1257	higher on small datasets	drop ≈ 0.44	Table II (direct correlation)	Table II; Section III-C	Table II

What To Try In 7 Days

Run the CoT chain prompt on small subgraphs to validate whether LLMs can chain local relations in your data.

Test 'Does A cause B?' style prompts and compare outputs to simple statistical CI tests to find high‑recall candidate edges.

For sensitive decisions, combine LLM edge proposals with classical causal discovery (PC/MMHC) and prune LLM dense graphs.
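One way to start on the second and third suggestions is to screen LLM-proposed edges against a simple statistical dependence test. The sketch below is illustrative only: it assumes continuous, roughly linear-Gaussian data and uses a Fisher z-test on Pearson correlation as a stand-in, whereas discrete variables would need an analogue such as a chi-square or G-test. The synthetic data, the `proposed` edge list, and the function names are all hypothetical.

```python
import math
import numpy as np

def fisher_z_pvalue(x, y):
    """Two-sided p-value for zero Pearson correlation via the Fisher z-transform."""
    n = len(x)
    r = float(np.corrcoef(x, y)[0, 1])
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def prune_edges(proposed, data, alpha=0.01):
    """Keep only proposed edges whose endpoints look marginally dependent."""
    return [(a, b) for a, b in proposed
            if fisher_z_pvalue(data[a], data[b]) < alpha]

# Hypothetical synthetic data: B depends on A, C is independent noise.
rng = np.random.default_rng(0)
n = 2000
a = rng.normal(size=n)
b = a + 0.5 * rng.normal(size=n)
c = rng.normal(size=n)
data = {"A": a, "B": b, "C": c}

# Hypothetical edges returned by "Does A cause B?" style prompts.
proposed = [("A", "B"), ("A", "C")]
kept = prune_edges(proposed, data)
print(kept)
```

A marginal test like this only removes edges between variables that show no dependence at all. It says nothing about direction or confounding, so the surviving edges would still need a proper causal discovery method such as PC or MMHC.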

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://www.bnlearn.com/

Risks & Boundaries

Limitations

The main analysis uses bnlearn datasets with fewer than 100 nodes; Pathfinder (>100 nodes) is excluded from it.

Open‑source LLMs with large parameter counts (>30B) were sometimes excluded from numeric prompt experiments due to time/token limits.

When Not To Use

Do not use LLMs alone for causal discovery on graphs larger than ~50 nodes.

Do not feed raw large matrices to small open‑source LLMs expecting correct numeric reasoning.

Failure Modes

Dense, overconnected DAGs with many spurious edges (high in/out‑degree).

Sensitivity to prompt wording and variable name phrasing causing inconsistent outputs.

Core Entities

Models

GPT3.5-Turbo, GPT4, GPT4-Turbo, LLAMA-7B, LLAMA-13B, LLAMA-33B, OPT-1.3B, OPT-2.7B, OPT-6.7B, OPT-66B, Falcon-7B, Falcon-40B, InternLM-7B, InternLM-20B, BERT-large, RoBERTa-large, DeBERTa-large, DistilBERT-mnli

Metrics

F1 score, Accuracy, Structural Hamming Distance (SHD), Structural Intervention Distance (SID), Edge sparsity / network sparsity, Average in/out-degree
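As a rough illustration of how three of these metrics behave, the sketch below computes edge-level F1, SHD, and average in-degree for a predicted graph against a true graph, both given as sets of directed edges. It assumes both graphs are DAGs without bidirected edges and counts a reversed edge as a single SHD error; some implementations count a reversal as two, and the paper's exact convention is not stated here. The example graphs and function names are hypothetical.

```python
def f1_directed(pred, true):
    """F1 over directed edges: an edge counts only if its direction matches."""
    tp = len(pred & true)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)

def shd(pred, true):
    """Structural Hamming Distance: missing + extra adjacencies + reversals."""
    pred_skel = {frozenset(e) for e in pred}
    true_skel = {frozenset(e) for e in true}
    missing_or_extra = len(pred_skel ^ true_skel)
    # Adjacencies present in both graphs but oriented the other way round.
    reversed_edges = sum(1 for (u, v) in pred
                         if (v, u) in true and (u, v) not in true)
    return missing_or_extra + reversed_edges

def average_in_degree(edges, num_nodes):
    """Every directed edge adds one incoming edge to exactly one node."""
    return len(edges) / num_nodes

# Hypothetical graphs over nodes A, B, C, D.
true = {("A", "C"), ("B", "C"), ("C", "D")}
pred = {("A", "C"), ("C", "B"), ("A", "D")}

print(f1_directed(pred, true))      # one of three predicted edges is correct
print(shd(pred, true))              # one extra, one missing, one reversed
print(average_in_degree(pred, 4))
```

A dense, overconnected prediction shows up in these numbers as a higher average degree, lower precision, and a larger SHD from the extra adjacencies.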

Datasets

AsiaCancerEarthquakeSurveySachsChildInsuranceWaterMildewAlarmBarleyHailfinderHepar IIWin95PTSPathfinder
