CausalBench: a 15‑dataset benchmark to measure LLM causal learning from correlation to full causal graphs

April 9, 2024 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.3

Cost Impact Score

0.5

Citation Count

4

Authors

Yu Zhou, Xingyu Wu, Beicheng Huang, Jibin Wu, Liang Feng, Kay Chen Tan

Links

Abstract / PDF

Why It Matters For Business

If you plan causal discovery at realistic scale, don't rely solely on LLMs. They help with small problems and chain reasoning, but they miss structure on large sparse graphs and add many spurious edges.

Summary TLDR

CausalBench is a public benchmark built from 15 real-world causal datasets (2–109 nodes) to test how well LLMs learn causal structure. It defines three tasks—correlation, causal skeleton, and causality identification—plus a Chain-of-Thought (CoT) style chain task. The authors evaluate 19 LLMs (open and closed source) and find: closed-source models (GPT family) beat open-source ones but still lag classic and SOTA causal algorithms at >50 nodes; LLMs are strong at long causal chains (CoT-like) and direct correlations but fail on collider patterns and large sparse graphs; background text and numeric training data help only when variable names are clear and models can read numbers. Practical up‑

Problem Statement

Current LLM causal evaluations are narrow: small networks, single prompt formats, and few models. That leaves unanswered whether LLMs can recover causal graphs at realistic scales and whether text, background knowledge, or raw training data help. CausalBench fills that gap with diverse datasets, four prompt formats, and three core tasks to compare many LLMs against classical causal algorithms.

Main Contribution

CausalBench: 15 real‑world causal datasets from bnlearn covering 2–109 nodes.

Three core tasks: correlation identification, causal skeleton (undirected graph), and causality identification (directed edges).

Four prompt formats: variable names; variable + background knowledge; variable + training data (matrices); and all three combined.

Large evaluation: 19 LLM instances (GPT3.5/GPT4/GPT4‑Turbo plus multiple open‑source families) and baseline comparison to classical and SOTA causal methods.

Analysis of structure types (chain, collider, confounder), prompt robustness, variable‑name effects, and numeric data handling.
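The four prompt formats listed above can be sketched as one builder with optional parts. This is a minimal illustration, not the benchmark's actual prompt text: the function name `build_prompt`, the section labels, and the toy variables are all assumptions. Only the 'Does A cause B?' question style comes from the summary.

```python
# Hypothetical prompt builder: wording and labels are illustrative assumptions,
# not the templates used in the paper.
def build_prompt(variables, question, background=None, data_rows=None):
    """Assemble a prompt from variable names plus optional background and data."""
    parts = ["Variables: " + ", ".join(variables)]
    if background:
        parts.append("Background: " + background)
    if data_rows:
        header = ", ".join(variables)
        body = "\n".join(", ".join(str(v) for v in row) for row in data_rows)
        parts.append("Training data (one row per sample):\n" + header + "\n" + body)
    parts.append(question)
    return "\n\n".join(parts)


variables = ["smoking", "cancer"]          # toy names, not a benchmark dataset
question = "Does smoking cause cancer?"    # 'Does A cause B?' style
background = "Both variables are binary patient attributes."
rows = [[1, 1], [0, 0], [1, 0]]

formats = {
    "variable": build_prompt(variables, question),
    "variable+background": build_prompt(variables, question, background=background),
    "variable+data": build_prompt(variables, question, data_rows=rows),
    "all": build_prompt(variables, question, background=background, data_rows=rows),
}
```

Keeping the question as the last part in every format means only the supplied context differs between runs, which makes per-format comparisons easier to interpret.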

Key Findings

LLMs underperform classical and SOTA causal algorithms on medium and large graphs.

Numbers: At >50 nodes, LLM methods often achieve <50% of classical/SOTA performance (reported averages).

Closed‑source GPT models outperform open‑source LLMs across tasks and scales.

Numbers: Best LLM accuracies: small 65.28%, medium 74.70%, large 68.06%; higher than most open models (Section VI).

LLMs are relatively good at chain (long sequential) causal patterns and Chain‑of‑Thought style tasks.

Numbers: Many LLMs (>6B params) achieve up to 100% on CoT‑analogous chain tasks with max inference length 24 (Table V).

LLMs struggle with collider structures and produce overly dense DAGs (many spurious edges).

Numbers: Collider F1 ≈ 0.21 (all LLMs); generated graphs show much higher average in/out‑degree than classical methods (Table IV,

Adding background knowledge or training data helps only under specific conditions.

Numbers: Variable+BG sometimes decreases F1 for open models; variable+BG+training data yields best results for GPT4‑Turbo (Table/

Prompt wording and variable name clarity materially change results.

Numbers: Performance varies across 5 prompt sentence templates; Type 3 ('Does A cause B?') gave highest F1 for causality (All LLM
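To make the chain, collider, and confounder patterns in these findings concrete, the sketch below enumerates them in a small DAG. The graph, its variable names, and the function `classify_triples` are illustrative assumptions, not taken from the benchmark.

```python
from itertools import combinations

# Hypothetical toy DAG as (cause, effect) pairs; not a CausalBench dataset.
edges = {
    ("smoking", "cancer"),
    ("pollution", "cancer"),
    ("cancer", "xray"),
    ("smoking", "bronchitis"),
}


def classify_triples(directed_edges):
    """Group three-node patterns by the role of the middle node."""
    parents, children = {}, {}
    for cause, effect in directed_edges:
        children.setdefault(cause, set()).add(effect)
        parents.setdefault(effect, set()).add(cause)
    nodes = set(parents) | set(children)
    found = {"chain": set(), "collider": set(), "confounder": set()}
    for mid in nodes:
        pa = parents.get(mid, set())
        ch = children.get(mid, set())
        # a -> mid <- c: two causes share one effect
        for a, c in combinations(sorted(pa), 2):
            found["collider"].add((a, mid, c))
        # a <- mid -> c: one cause drives two effects
        for a, c in combinations(sorted(ch), 2):
            found["confounder"].add((a, mid, c))
        # a -> mid -> c: sequential influence
        for a in pa:
            for c in ch:
                found["chain"].add((a, mid, c))
    return found


patterns = classify_triples(edges)
```

Here `cancer` is a collider between `pollution` and `smoking`, while `smoking` acts as a confounder for `bronchitis` and `cancer`. Scoring a model separately on each group is one way to reproduce the kind of per-structure breakdown the findings describe.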

Results

Best LLM F1 (by dataset scale)

Value: small 0.7017, medium 0.5825, large 0.4310

Baseline: classical/SOTA algorithms

Direct correlation F1 (example)

Value: Earthquake 0.5673 vs Hailfinder 0.1257

Baseline: higher on small datasets

Accuracy

Value: Many LLMs (models >6B) reach 100% on chain tasks up to length 24

Baseline: non‑CoT prompts

Who Should Care

What To Try In 7 Days

Run the CoT chain prompt on small subgraphs to validate whether LLMs can chain local relations in your data.

Test 'Does A cause B?' style prompts and compare the outputs to simple statistical conditional independence (CI) tests to find high‑recall candidate edges.

For sensitive decisions, combine LLM edge proposals with classical causal discovery (PC/MMHC) and prune the dense graphs that LLMs produce.
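One way to start on the last two steps is to filter LLM-proposed edges with a simple dependence test before handing them to a full algorithm. The sketch below is a simplification: it uses a marginal chi-square independence test on binary columns, not the conditional tests that PC or MMHC apply, and it says nothing about edge direction. The data, the proposed edges, and the function names are hypothetical.

```python
import math
from collections import Counter


def chi2_p_value_binary(xs, ys):
    """Pearson chi-square independence test for two 0/1 columns (df = 1)."""
    n = len(xs)
    counts = Counter(zip(xs, ys))
    row, col = Counter(xs), Counter(ys)
    stat = 0.0
    for x in (0, 1):
        for y in (0, 1):
            expected = row[x] * col[y] / n
            if expected == 0:
                return 1.0  # a constant column gives no evidence of dependence
            stat += (counts[(x, y)] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom
    return math.erfc(math.sqrt(stat / 2))


def prune_edges(proposed, data, alpha=0.05):
    """Keep only proposed edges whose endpoints look dependent in the data."""
    return [
        (cause, effect)
        for cause, effect in proposed
        if chi2_p_value_binary(data[cause], data[effect]) < alpha
    ]


# Hypothetical binary data: b mostly follows a, c is unrelated to a.
data = {
    "a": [0] * 20 + [1] * 20,
    "b": [0] * 18 + [1] * 2 + [1] * 18 + [0] * 2,
    "c": [0, 1] * 20,
}
llm_proposed = [("a", "b"), ("a", "c")]  # assumed LLM output, for illustration
kept = prune_edges(llm_proposed, data)
```

In this toy case the edge between `a` and `c` is dropped because the two columns are independent in the sample. A marginal test like this can still keep edges that a conditional test would remove, so it is a first filter rather than a replacement for PC or MMHC.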

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • The main analysis covers bnlearn datasets with fewer than 100 nodes; Pathfinder (more than 100 nodes) is excluded from it.
  • Open‑source LLMs with large parameter counts (>30B) were sometimes excluded from numeric prompt experiments due to time/token limits.
  • No public code link provided in paper; reproduction requires rebuilding prompts and data splits.

When Not To Use

  • Do not use LLMs alone for causal discovery on graphs larger than ~50 nodes.
  • Do not feed raw large matrices to small open‑source LLMs expecting correct numeric reasoning.
  • Avoid relying on background knowledge prompts when variable names are ambiguous.

Failure Modes

  • Dense, overconnected DAGs with many spurious edges (high in/out‑degree).
  • Sensitivity to prompt wording and variable name phrasing causing inconsistent outputs.
  • Open models failing to parse or remember split numerical training data during multi‑turn input.

Core Entities

Models

  • GPT3.5-Turbo
  • GPT4
  • GPT4-Turbo
  • LLAMA-7B
  • LLAMA-13B
  • LLAMA-33B
  • OPT-1.3B
  • OPT-2.7B
  • OPT-6.7B
  • OPT-66B
  • Falcon-7B
  • Falcon-40B
  • InternLM-7B
  • InternLM-20B
  • BERT-large
  • RoBERTa-large
  • DeBERTa-large
  • DistilBERT-mnli

Metrics

  • F1 score
  • Accuracy
  • Structural Hamming Distance (SHD)
  • Structural Intervention Distance (SID)
  • Edge sparsity / network sparsity
  • Average in/out-degree
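Three of these metrics are simple enough to sketch on edge sets. The code below is an illustration under stated assumptions, not the paper's evaluation code: F1 is computed over directed edges, SHD counts a reversed edge once (some definitions count it twice), and SID is omitted because it needs intervention-distribution reasoning beyond a short example. The graphs are toy inputs.

```python
def edge_f1(true_edges, pred_edges):
    """F1 over directed edges; an edge counts only with the right orientation."""
    tp = len(true_edges & pred_edges)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_edges)
    recall = tp / len(true_edges)
    return 2 * precision * recall / (precision + recall)


def shd(true_edges, pred_edges):
    """Structural Hamming Distance: missing + extra + reversed edges.

    Assumed convention: a reversed edge adds 1, not 2.
    """
    true_skel = {frozenset(e) for e in true_edges}
    pred_skel = {frozenset(e) for e in pred_edges}
    missing_or_extra = len(true_skel ^ pred_skel)
    reversed_count = sum(
        1
        for a, b in pred_edges
        if (b, a) in true_edges and (a, b) not in true_edges
    )
    return missing_or_extra + reversed_count


def average_degree(edges, nodes):
    """Average in-degree, which equals average out-degree: |E| / |V|."""
    return len(edges) / len(nodes)


nodes = ["A", "B", "C"]
true_edges = {("A", "B"), ("B", "C")}
pred_edges = {("A", "B"), ("C", "B"), ("A", "C")}  # one reversed, one extra

f1 = edge_f1(true_edges, pred_edges)
distance = shd(true_edges, pred_edges)
density_gap = average_degree(pred_edges, nodes) - average_degree(true_edges, nodes)
```

A positive `density_gap` signals a predicted graph denser than the reference, which is the overconnection pattern listed under Failure Modes.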

Datasets

  • Asia
  • Cancer
  • Earthquake
  • Survey
  • Sachs
  • Child
  • Insurance
  • Water
  • Mildew
  • Alarm
  • Barley
  • Hailfinder
  • Hepar II
  • Win95PTS
  • Pathfinder