CausalBench: a 15‑dataset benchmark to measure LLM causal learning from correlation to full causal graphs

April 9, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The benchmark gives clear, reproducible tasks and metrics, but results show LLMs are not yet a drop‑in replacement for dedicated causal discovery methods at medium and large scales.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 30%

Authors

Yu Zhou, Xingyu Wu, Beicheng Huang, Jibin Wu, Liang Feng, Kay Chen Tan

Why It Matters For Business

If you plan causal discovery at realistic scale, do not rely solely on LLMs: they can help with small problems and chained reasoning, but they miss structure on large sparse graphs and add many false edges.

Who Should Care

Summary TLDR

CausalBench is a public benchmark built from 15 real-world causal datasets (2–109 nodes) to test how well LLMs learn causal structure. It defines three tasks—correlation, causal skeleton, and causality identification—plus a Chain-of-Thought (CoT) style chain task. The authors evaluate 19 LLMs (open and closed source) and find: closed-source models (GPT family) beat open-source ones but still lag classic and SOTA causal algorithms at >50 nodes; LLMs are strong at long causal chains (CoT-like) and direct correlations but fail on collider patterns and large sparse graphs; background text and numeric training data help only when variable names are clear and models can read numbers. Practical up‑

Problem Statement

Current LLM causal evaluations are narrow: small networks, single prompt formats, and few models. That leaves unanswered whether LLMs can recover causal graphs at realistic scales and whether text, background knowledge, or raw training data help. CausalBench fills that gap with diverse datasets, four prompt formats, and three core tasks to compare many LLMs against classical causal algorithms.

Main Contribution

CausalBench: 15 real‑world causal datasets from bnlearn covering 2–109 nodes.

Three core tasks: correlation identification, causal skeleton (undirected graph), and causality identification (directed edges).
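The difference between the skeleton task and the causality task can be sketched on a toy graph. This is a minimal illustration, not the benchmark's code: the node names and the helper functions `skeleton`, `are_adjacent`, and `has_directed_edge` are hypothetical.

```python
# Hypothetical toy DAG (a collider A -> C <- B plus C -> D).
# Node names are illustrative and not taken from the benchmark datasets.
directed_edges = {("A", "C"), ("B", "C"), ("C", "D")}


def skeleton(edges):
    """Drop direction: each directed pair becomes an unordered pair."""
    return {frozenset(edge) for edge in edges}


def are_adjacent(edges, a, b):
    """Skeleton-style question: are the two variables linked at all?"""
    return frozenset((a, b)) in skeleton(edges)


def has_directed_edge(edges, cause, effect):
    """Causality-style question: does this specific direction exist?"""
    return (cause, effect) in edges


print(are_adjacent(directed_edges, "C", "A"))       # link exists in the skeleton
print(has_directed_edge(directed_edges, "C", "A"))  # but not in this direction
```

A model can therefore answer the skeleton question correctly for a pair and still get the direction wrong, which is why the two tasks are scored separately.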

Key Findings

LLMs underperform classical and SOTA causal algorithms on medium and large graphs.

Numbers: At >50 nodes LLM methods often achieve <50% of classical/SOTA performance (reported averages).

Practical Use: Do not replace classical causal discovery tools with LLMs for graphs larger than ~50 nodes; use LLMs only as a lightweight aid or for small subgraphs.

Evidence Ref: Section V-D, Table XI; Conclusion

Closed‑source GPT models outperform open‑source LLMs across tasks and scales.

Numbers: Best LLM accuracies: small 65.28%, medium 74.70%, large 68.06%; higher than most open models (Section VI).

Practical Use: Expect better causal answers from the latest closed APIs; if using open models, validate heavily and focus on small/medium problems.

Evidence Ref: Section VI (A); Section III-E results

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Best LLM F1 (by dataset scale) | small: 0.7017, medium: 0.5825, large: 0.4310 | classical/SOTA algorithms | substantially lower at medium+ scales | aggregate by dataset scale (small/medium/large) | Section VI (A) summary; reported F1 declines with dataset size. | Section VI
Direct correlation F1 (example) | Earthquake 0.5673 vs Hailfinder 0.1257 | higher on small datasets | drop ≈ 0.44 | Table II (direct correlation) | Table II; Section III-C | Table II

What To Try In 7 Days

Run the CoT chain prompt on small subgraphs to validate whether LLMs can chain local relations in your data.

Test 'Does A cause B?' style prompts and compare outputs to simple statistical CI tests to find high‑recall candidate edges.

For sensitive decisions, combine LLM edge proposals with classical causal discovery (PC/MMHC) and prune LLM dense graphs.
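One simple statistical check to set beside an LLM's "Does A cause B?" answer is a partial-correlation test with a Fisher z transform. The sketch below assumes roughly linear-Gaussian data and uses synthetic samples from a chain X -> Z -> Y. The function names and the data are illustrative assumptions, not part of CausalBench.

```python
import math
import random
from statistics import NormalDist


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after controlling for a single variable z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def fisher_z_pvalue(r, n, n_cond=0):
    """Two-sided p-value for H0: (partial) correlation is zero."""
    z = math.atanh(r) * math.sqrt(n - n_cond - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Synthetic chain X -> Z -> Y (assumed data, for illustration only).
random.seed(0)
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
z = [a + random.gauss(0, 1) for a in x]
y = [b + random.gauss(0, 1) for b in z]

r_marginal = pearson(x, y)          # X and Y are correlated overall
r_given_z = partial_corr(x, y, z)   # but nearly uncorrelated given Z
p_marginal = fisher_z_pvalue(r_marginal, n)
p_given_z = fisher_z_pvalue(r_given_z, n, n_cond=1)
print(r_marginal, p_marginal, r_given_z, p_given_z)
```

If an LLM proposes a direct edge X -> Y but the dependence disappears once Z is conditioned on, that edge is a candidate for pruning before it reaches a classical discovery step such as PC or MMHC.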

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Benchmark uses bnlearn datasets of <100 nodes; Pathfinder (>100) excluded from main analysis.

Open‑source LLMs with large parameter counts (>30B) were sometimes excluded from numeric prompt experiments due to time/token limits.

When Not To Use

Do not use LLMs alone for causal discovery on graphs larger than ~50 nodes.

Do not feed raw large matrices to small open‑source LLMs expecting correct numeric reasoning.

Failure Modes

LLM-generated DAGs tend to be dense and overconnected, with many spurious edges (high in/out‑degree).

Sensitivity to prompt wording and variable name phrasing causing inconsistent outputs.
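A quick way to spot the overconnected-graph failure mode is to compare degree and density statistics of an LLM-proposed graph against a reference graph. The sketch below is a hypothetical helper, not the paper's implementation. It takes density as the fraction of the n(n-1)/2 edges a DAG on n nodes can hold.

```python
from collections import Counter


def degree_summary(nodes, edges):
    """Basic sparsity statistics for a directed graph given as (parent, child) pairs."""
    n = len(nodes)
    in_deg = Counter(child for _, child in edges)
    out_deg = Counter(parent for parent, _ in edges)
    max_edges = n * (n - 1) / 2  # upper bound on edges in a DAG with n nodes
    return {
        # Average in-degree and average out-degree are both |E| / |V|.
        "avg_degree": len(edges) / n if n else 0.0,
        "max_in_degree": max((in_deg[v] for v in nodes), default=0),
        "max_out_degree": max((out_deg[v] for v in nodes), default=0),
        "density": len(edges) / max_edges if max_edges else 0.0,
    }


# Illustrative graphs (assumed, not benchmark data).
nodes = ["A", "B", "C", "D"]
reference_edges = {("A", "B"), ("B", "C"), ("C", "D")}
llm_edges = {("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")}

print(degree_summary(nodes, reference_edges))
print(degree_summary(nodes, llm_edges))
```

A proposed graph whose density or maximum degree sits far above what is plausible for the domain is a signal to prune before trusting any individual edge.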

Core Entities

Models

GPT3.5-Turbo, GPT4, GPT4-Turbo, LLAMA-7B, LLAMA-13B, LLAMA-33B, OPT-1.3B, OPT-2.7B, OPT-6.7B, OPT-66B, Falcon-7B, Falcon-40B, InternLM-7B, InternLM-20B, BERT-large, RoBERTa-large, DeBERTa-large, DistilBERT-mnli

Metrics

F1 score, Accuracy, Structural Hamming Distance (SHD), Structural Intervention Distance (SID), Edge sparsity / network sparsity, Average in/out-degree
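Two of these metrics are easy to sketch on edge sets. The code below is an assumed, minimal version: F1 is computed over directed edges, and SHD counts each node pair whose edge status differs, so a reversed edge costs 1 (some implementations count a reversal as 2). The paper's exact scoring conventions may differ.

```python
def edge_f1(predicted, reference):
    """F1 over directed edges, treating each (parent, child) pair as one item."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def shd(predicted, reference):
    """Structural Hamming Distance: node pairs whose edge status differs.

    Assumes no self-loops. A missing, extra, or reversed edge each counts as 1.
    """
    pairs = {frozenset(edge) for edge in predicted | reference}
    distance = 0
    for pair in pairs:
        a, b = tuple(pair)
        predicted_state = ((a, b) in predicted, (b, a) in predicted)
        reference_state = ((a, b) in reference, (b, a) in reference)
        if predicted_state != reference_state:
            distance += 1
    return distance


# Illustrative graphs (assumed, not benchmark data).
reference = {("A", "C"), ("B", "C"), ("C", "D")}
predicted = {("A", "C"), ("C", "B"), ("A", "D")}  # one correct, one reversed, one extra

print(edge_f1(predicted, reference))  # 1 of 3 predicted edges is correct
print(shd(predicted, reference))      # reversed B-C, missing C-D, extra A-D
```

In this example the F1 is 1/3 and the SHD is 3, which shows how a graph with the right number of edges can still score poorly when directions and placements are wrong.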

Datasets

Asia, Cancer, Earthquake, Survey, Sachs, Child, Insurance, Water, Mildew, Alarm, Barley, Hailfinder, Hepar II, Win95PTS, Pathfinder