Overview
The benchmark gives clear, reproducible tasks and metrics, but results show LLMs are not yet a drop‑in replacement for dedicated causal discovery methods at medium and large scales.
Citations: 4
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 3/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 40%
Novelty: 30%
Why It Matters For Business
If you plan causal discovery at realistic scale, do not rely solely on LLMs. They help with small problems and chain reasoning, but they miss structure on large sparse graphs and add many false edges.
Summary TLDR
CausalBench is a public benchmark built from 15 real-world causal datasets (2–109 nodes) to test how well LLMs learn causal structure. It defines three tasks—correlation, causal skeleton, and causality identification—plus a Chain-of-Thought (CoT) style chain task. The authors evaluate 19 LLMs (open and closed source) and find: closed-source models (GPT family) beat open-source ones but still lag classic and SOTA causal algorithms at >50 nodes; LLMs are strong at long causal chains (CoT-like) and direct correlations but fail on collider patterns and large sparse graphs; background text and numeric training data help only when variable names are clear and models can read numbers. Practical up‑
Problem Statement
Current LLM causal evaluations are narrow: small networks, single prompt formats, and few models. That leaves unanswered whether LLMs can recover causal graphs at realistic scales and whether text, background knowledge, or raw training data help. CausalBench fills that gap with diverse datasets, four prompt formats, and three core tasks to compare many LLMs against classical causal algorithms.
Main Contribution
CausalBench: 15 real‑world causal datasets from bnlearn covering 2–109 nodes.
Three core tasks: correlation identification, causal skeleton (undirected graph), and causality identification (directed edges).
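A minimal sketch of how the skeleton and directed-edge tasks relate, assuming a graph is stored as a set of (cause, effect) pairs. The node names and the `skeleton` and `colliders` helpers are hypothetical illustrations, not part of the benchmark code.

```python
# Illustrative four-node graph (hypothetical names, not a benchmark dataset).
# Each directed edge is a (cause, effect) pair.
directed_edges = {
    ("Rain", "WetGrass"),
    ("Sprinkler", "WetGrass"),
    ("WetGrass", "SlipperyPath"),
}

def skeleton(edges):
    """Causal skeleton: the same edges with direction removed."""
    return {frozenset(edge) for edge in edges}

def colliders(edges):
    """Triples (a, mid, b) with a -> mid <- b where a and b are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for cause, effect in edges:
        parents.setdefault(effect, set()).add(cause)
    found = set()
    for mid, causes in parents.items():
        for a in causes:
            for b in causes:
                # a < b avoids listing the same pair twice
                if a < b and frozenset((a, b)) not in skel:
                    found.add((a, mid, b))
    return found

skel = skeleton(directed_edges)
collider_triples = colliders(directed_edges)
```

The skeleton task only asks for `skel`; the causality task asks for `directed_edges` itself, which is what makes collider triples like the one above recoverable.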
Key Findings
LLMs underperform classical and SOTA causal algorithms on medium and large graphs.
Closed‑source GPT models outperform open‑source LLMs across tasks and scales.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Best LLM F1 (by dataset scale) | small: 0.7017, medium: 0.5825, large: 0.4310 | classical/SOTA algorithms | substantially lower at medium+ scales | aggregate by dataset scale (small/medium/large) | Section VI (A) summary; reported F1 declines with dataset size. | Section VI |
| Direct correlation F1 (example) | Earthquake 0.5673 vs Hailfinder 0.1257 | higher on small datasets | drop ≈ 0.44 | Table II (direct correlation) | Table II; Section III-C | Table II |
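The summary does not spell out how F1 is computed. A sketch assuming edge-level F1 over directed edge sets, with a toy example that is not benchmark data; the `edge_f1` helper is hypothetical. The last lines recompute the delta in the second row from the two reported values.

```python
def edge_f1(predicted, truth):
    """Edge-level F1: harmonic mean of precision and recall over edge sets."""
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)

# Toy graphs (illustrative only): the prediction is denser than the truth.
truth = {("A", "B"), ("B", "C"), ("C", "D")}
predicted = {("A", "B"), ("B", "C"), ("A", "C"), ("A", "D"), ("B", "D")}
toy_f1 = edge_f1(predicted, truth)  # 2 hits, precision 2/5, recall 2/3

# Delta from the table: Earthquake F1 minus Hailfinder F1.
earthquake_f1 = 0.5673
hailfinder_f1 = 0.1257
delta = round(earthquake_f1 - hailfinder_f1, 4)  # 0.4416, reported as ≈ 0.44
```

Extra spurious edges lower precision while leaving recall unchanged, which is how a dense prediction can still score poorly on F1.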
What To Try In 7 Days
Run the CoT chain prompt on small subgraphs to validate whether LLMs can chain local relations in your data.
Test 'Does A cause B?' style prompts and compare outputs to simple statistical conditional-independence (CI) tests to find high‑recall candidate edges.
For sensitive decisions, combine LLM edge proposals with classical causal discovery (PC/MMHC) and prune the dense graphs LLMs tend to produce.
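One way to start on the second and third items is sketched below. It assumes continuous data and uses a marginal Fisher z-test on Pearson correlation, which is the simplest CI test (empty conditioning set) and a swapped-in stand-in, not PC or MMHC. The data, the `llm_edges` list, and all helper names are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation; returns 0.0 for a constant column."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def fisher_z_pvalue(r, n):
    """Two-sided p-value for H0: correlation is zero (Fisher z-transform)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def prune(edges, data, alpha=0.01):
    """Keep only proposed edges whose endpoints are significantly correlated."""
    kept = []
    for a, b in edges:
        r = pearson(data[a], data[b])
        if fisher_z_pvalue(r, len(data[a])) < alpha:
            kept.append((a, b))
    return kept

# Synthetic data (assumption): Y depends on X, W is independent noise.
rng = random.Random(0)
n = 500
x = [rng.gauss(0, 1) for _ in range(n)]
y = [2 * v + rng.gauss(0, 1) for v in x]
w = [rng.gauss(0, 1) for _ in range(n)]
data = {"X": x, "Y": y, "W": w}

llm_edges = [("X", "Y"), ("W", "Y")]  # hypothetical LLM proposals
kept = prune(llm_edges, data)
```

A marginal test only removes edges between uncorrelated variables and says nothing about direction, so the surviving edges would still need a classical algorithm to orient and to handle confounding.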
Risks & Boundaries
Limitations
The main analysis covers only bnlearn datasets with fewer than 100 nodes; Pathfinder (>100 nodes) was excluded from it.
Open‑source LLMs with large parameter counts (>30B) were sometimes excluded from numeric prompt experiments due to time/token limits.
When Not To Use
Do not use LLMs alone for causal discovery on graphs larger than ~50 nodes.
Do not feed raw large matrices to small open‑source LLMs expecting correct numeric reasoning.
Failure Modes
Dense, overconnected DAGs with many spurious edges (high in/out‑degree).
Sensitivity to prompt wording and variable-name phrasing, which causes inconsistent outputs.
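The first failure mode can be screened for with a quick degree and density check on an LLM-produced edge list. This is a sketch on a toy graph; the `degree_report` helper and any cut-off you apply to its output are assumptions, not values from the benchmark.

```python
from collections import Counter

def degree_report(edges, n_nodes):
    """Summarize density and maximum in/out-degree of a directed edge list."""
    in_degree = Counter(effect for _, effect in edges)
    out_degree = Counter(cause for cause, _ in edges)
    max_dag_edges = n_nodes * (n_nodes - 1) / 2  # most edges a DAG can hold
    return {
        "density": len(edges) / max_dag_edges,
        "max_in_degree": max(in_degree.values(), default=0),
        "max_out_degree": max(out_degree.values(), default=0),
    }

# Toy LLM output over 4 variables (illustrative only).
llm_graph = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "D"), ("C", "D")]
report = degree_report(llm_graph, n_nodes=4)
```

A density near 1 on a problem expected to be sparse is a signal to prune before trusting the graph.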

