Fine-tune an LLM to parse causal questions, call causal tools, and explain results end-to-end

December 28, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

1

Authors

Haitao Jiang, Lin Ge, Yuhe Gao, Jianian Wang, Rui Song

Links

Abstract / PDF

Why It Matters For Business

Fine-tuning an open LLM with targeted input-output examples turns it into a causal assistant that extracts task details, runs analysis code, and explains results. This reduces time to insight for analysts and lowers dependence on closed APIs.

Summary TLDR

The authors fine-tune an open LLaMA-2 (7B) model with LoRA on two synthetic instruction datasets so it can (1) parse user causal questions into structured JSON, (2) pick and run causal tool functions for five tasks: causal graph learning (CGL), average treatment effect estimation (ATE), heterogeneous treatment effect estimation (HTE), mediation analysis (MA), and off-policy optimization (OPO), and (3) translate numeric outputs into plain-language interpretations. Evaluated on synthetic tasks, the tuned model (LLM4Causal) outperforms GPT-4 in extracting entities and delivering correct end-to-end answers. The paper releases the data-generation pipeline and two fine-tuning corpora (Causal-Retrieval-Bench, Causal-Interpret-Bench).
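
As a rough illustration of step (1), the sketch below parses a structured request and checks that the fields a task needs are present. The key names (`causal_problem`, `dataset`, and so on) and the per-task required fields are assumptions made for this example; they loosely mirror the entities scored in the paper's extraction benchmark but are not the paper's actual schema.

```python
import json

# Hypothetical schema: which fields each task needs (an assumption, not the paper's spec).
REQUIRED_FIELDS = {
    "CGL": {"dataset", "nodes"},
    "ATE": {"dataset", "treatment", "response"},
    "HTE": {"dataset", "treatment", "response", "condition"},
    "MA": {"dataset", "treatment", "response", "mediator"},
    "OPO": {"dataset", "treatment", "response"},
}


def parse_causal_request(raw: str) -> dict:
    """Parse model output and check that the fields needed for the task are present."""
    request = json.loads(raw)
    task = request.get("causal_problem")
    if task not in REQUIRED_FIELDS:
        raise ValueError(f"unknown causal task: {task!r}")
    missing = sorted(f for f in REQUIRED_FIELDS[task] if not request.get(f))
    if missing:
        raise ValueError(f"{task} request is missing fields: {missing}")
    return request


# Made-up model output for a made-up query about a discount and revenue.
example_output = (
    '{"causal_problem": "ATE", "dataset": "sales.csv", '
    '"treatment": "discount", "response": "revenue"}'
)
parsed = parse_causal_request(example_output)
```

Rejecting incomplete requests before any tool is called keeps a parsing mistake from turning into a misleading analysis result.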

Problem Statement

General-purpose LLMs fail at data-backed causal tasks: they misidentify the task type, hallucinate, or produce code that does not run. The paper aims to fine-tune an open LLM so that it reliably (a) extracts the task and input variables from a natural-language query, (b) calls the appropriate causal-analysis functions, and (c) explains numerical results in plain language.

Main Contribution

Design of an end-to-end LLM pipeline (LLM4Causal) that parses causal queries, invokes causal algorithms, and explains outputs.

A three-stage synthetic data generation and human-annotation pipeline producing two fine-tuning corpora: Causal-Retrieval-Bench (1,500 pairs) and Causal-Interpret-Bench (400 human-refined interpretations).

Empirical evaluation on synthetic datasets showing large gains in entity extraction and end-to-end task accuracy versus GPT-4 baselines.
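
To make the two corpora concrete, here is a minimal sketch of how retrieval pairs (query to structured fields) and interpretation pairs (function output to plain-language text) might be serialized into one JSONL file for mixed fine-tuning. The record layout, the `prompt`/`completion` keys, and the sample values are all hypothetical; the summary does not describe the benchmarks' actual file format.

```python
import json

# Stage 1 style record: natural-language query -> structured fields (hypothetical schema).
retrieval_pairs = [
    {
        "input": "Does the discount increase revenue in sales.csv?",
        "output": {"causal_problem": "ATE", "dataset": "sales.csv",
                   "treatment": "discount", "response": "revenue"},
    },
]

# Stage 2 style record: function output -> interpretation (made-up numbers).
interpretation_pairs = [
    {
        "input": {"causal_problem": "ATE", "treatment": "discount",
                  "response": "revenue", "estimate": 2.5},
        "output": "Offering the discount raises revenue by about 2.5 units on average.",
    },
]


def as_text(value) -> str:
    """Keep strings as they are; serialize structured values as JSON text."""
    return value if isinstance(value, str) else json.dumps(value)


def to_jsonl(pairs) -> str:
    """Serialize pairs as one JSON object per line, with both sides as text."""
    return "\n".join(
        json.dumps({"prompt": as_text(p["input"]), "completion": as_text(p["output"])})
        for p in pairs
    )


mixed_corpus = to_jsonl(retrieval_pairs + interpretation_pairs)
```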

Key Findings

LLM4Causal-Mixed achieved much higher end-to-end accuracy (win rate) than GPT-4 on synthetic causal tasks

Numbers: win rate average 0.806 vs GPT-4 average ~0.12 (Table 2)

Entity extraction (task, dataset, treatment, outcome, etc.) improved substantially after fine-tuning

Numbers: causal task accuracy 0.98 vs GPT4-turbo 0.69 (Table 3)

Raw GPT-4 interpretations hallucinate frequently; the tuned model reduces but does not eliminate these errors

Numbers: GPT-4 produced ~25% hallucination in interpretations; tuned models show lower hallucination rates (varies by task)

Results

Pass Rate (LLM4Causal-Mixed)

Value: CGL 1.00, ATE 0.93, HTE 0.83, MA 0.86, OPO 0.83

Baseline: ChatGPT/GPT4: CGL 0.17, ATE 0.77, HTE 0.73, MA 0.27, OPO 0.87

Relevance Rate (LLM4Causal-Mixed)

Value: CGL 1.00, ATE 0.93, HTE 0.83, MA 0.80, OPO 0.83

Baseline: ChatGPT/GPT4: CGL 0.10, ATE 0.60, HTE 0.43, MA 0.20, OPO 0.43

Win Rate (final correct result)

Value: LLM4Causal-Mixed per-task: CGL 0.90, ATE 0.90, HTE 0.80, MA 0.70, OPO 0.73; avg 0.806

Baseline: ChatGPT/GPT4 per-task: CGL 0.00, ATE 0.37, HTE 0.07, MA 0.10, OPO 0.07; avg ~0.12
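
The reported averages can be checked directly from the per-task win rates listed above. The short sketch below takes the unweighted mean over the five tasks; unweighted averaging is an assumption here, though it reproduces the reported figures.

```python
from statistics import mean

tasks = ["CGL", "ATE", "HTE", "MA", "OPO"]
llm4causal_mixed = [0.90, 0.90, 0.80, 0.70, 0.73]
baseline = [0.00, 0.37, 0.07, 0.10, 0.07]

# Unweighted mean across the five tasks.
avg_tuned = mean(llm4causal_mixed)
avg_baseline = mean(baseline)

print(f"LLM4Causal-Mixed avg win rate: {avg_tuned:.3f}")
print(f"Baseline avg win rate:         {avg_baseline:.3f}")

# Per-task gap, to see where the tuned model gains the most.
for task, tuned, base in zip(tasks, llm4causal_mixed, baseline):
    print(f"{task}: +{tuned - base:.2f}")
```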

Accuracy

Value: Causal task 0.98, Dataset 1.00, Nodes 1.00, Treatment 0.96, Response 0.97, Mediator 1.00, Condition 1.00, All 0.98

Baseline: GPT4-turbo: All 0.77, Causal task 0.69

Interpretation hallucination

Value: GPT-4 ~0.25 hallucination (paper statement); tuned models show lower hallucination depending on task

Baseline: GPT-4 interpretation error noted as ~25%

Who Should Care

What To Try In 7 Days

Collect representative user queries and map them to desired JSON outputs; fine-tune a small LLaMA checkpoint with LoRA.

Build a two-stage dataset, (1) query → structured fields and (2) function output → plain-language interpretation, then fine-tune a single model on both.

Prototype a wrapper that maps parsed JSON to existing causal packages (econml, causal-learn) and returns templated summaries for user review.
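
The wrapper in the last step could start as a small dispatch table, sketched below under several assumptions. The request keys are hypothetical, only the ATE task is wired up, and the estimator is a plain difference in means used as a stand-in: it is a valid ATE estimate only under randomized treatment, and a real prototype would call a causal package at that point instead.

```python
from statistics import mean


def estimate_ate(rows, treatment, response):
    """Stand-in estimator: mean response of treated rows minus mean of control rows."""
    treated = [r[response] for r in rows if r[treatment] == 1]
    control = [r[response] for r in rows if r[treatment] == 0]
    return mean(treated) - mean(control)


def summarize_ate(request, estimate):
    """Templated summary intended for human review, not a final interpretation."""
    direction = "increases" if estimate > 0 else "decreases"
    return (f"On average, {request['treatment']} {direction} {request['response']} "
            f"by {abs(estimate):.2f} (difference in means; needs human review).")


# Task name -> (analysis function, summary template). Only ATE is wired up here.
HANDLERS = {"ATE": (estimate_ate, summarize_ate)}


def run_request(request, rows):
    task = request.get("causal_problem")
    if task not in HANDLERS:
        raise NotImplementedError(f"no handler for task {task!r}")
    analyze, summarize = HANDLERS[task]
    estimate = analyze(rows, request["treatment"], request["response"])
    return summarize(request, estimate)


# Tiny made-up dataset: two treated rows, two control rows.
rows = [
    {"discount": 1, "revenue": 12.0},
    {"discount": 1, "revenue": 14.0},
    {"discount": 0, "revenue": 10.0},
    {"discount": 0, "revenue": 11.0},
]
request = {"causal_problem": "ATE", "treatment": "discount", "response": "revenue"}
summary = run_request(request, rows)
```

Keeping the task-to-function mapping in one table makes an unsupported task fail loudly instead of silently falling back to a different analysis.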

Agent Features

Planning

  • map user intent → causal task → selected function

Tool Use

  • function calling
  • automated selection of causal packages (causal-learn, econml, causalml, CausalDM)

Frameworks

  • LoRA
  • templated outcome summaries

Is Agentic

true

Architectures

  • LLaMA-2 (7B) base

Optimization Features

Model Optimization

  • LoRA
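
To give a sense of why LoRA keeps fine-tuning cheap, the sketch below counts trainable parameters for a single square attention projection. LoRA freezes the base weight and learns two low-rank factors, so the trainable count is rank × (d_out + d_in) rather than d_out × d_in. The hidden size of 4096 matches LLaMA-2 7B; the rank of 8 is an assumed value for illustration, not one taken from the paper.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA learns B (d_out x rank) and A (rank x d_in); the base weight stays frozen."""
    return rank * (d_out + d_in)


d = 4096   # hidden size of a LLaMA-2 7B attention projection
rank = 8   # assumed rank, chosen only for illustration

full = d * d
lora = lora_trainable_params(d, d, rank)
ratio = lora / full

print(f"full weight: {full:,} params")
print(f"LoRA update: {lora:,} params ({ratio:.4%} of the full matrix)")
```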

Training Optimization

  • fine-tune on synthetic plus human-refined pairs

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Evaluations use synthetic datasets; real-world data and domain drift are not fully tested.
  • System depends on installed causal packages and a working execution environment; code errors break end-to-end delivery.
  • Interpretation still shows non-zero hallucination and incompleteness; human review remains needed for critical decisions.
  • Work focuses on five causal tasks; other causal analyses are not supported out of the box.

When Not To Use

  • When input datasets are proprietary and cannot be executed in the system's runtime environment.
  • When causal assumptions require domain-specific validations not encoded in the tool selection logic.
  • For high-stakes decisions where any hallucination is unacceptable without human verification.

Failure Modes

  • Incorrect task classification leads to wrong analysis (e.g., returning correlation instead of ATE).
  • Generated code fails at runtime due to missing packages or environment mismatch.
  • Interpretation uses causal language incorrectly or inserts unsupported claims (hallucination).
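
The runtime failure mode above can be partly contained with a preflight check and a guarded call, sketched here with the standard library only. The import names in `CAUSAL_PACKAGES` are assumptions about how the packages named in this summary are imported and should be verified against each package's documentation; `safe_run` is a hypothetical helper, not part of the paper's system.

```python
from importlib.util import find_spec

# Assumed distribution name -> import name mapping; verify before relying on it.
CAUSAL_PACKAGES = {
    "causal-learn": "causallearn",
    "econml": "econml",
    "causalml": "causalml",
    "CausalDM": "causaldm",
}


def missing_packages(packages=CAUSAL_PACKAGES):
    """Return the distribution names whose import module cannot be found."""
    return [name for name, module in packages.items() if find_spec(module) is None]


def safe_run(analysis, *args, **kwargs):
    """Run an analysis callable and turn runtime errors into a structured result."""
    try:
        return {"ok": True, "result": analysis(*args, **kwargs)}
    except Exception as exc:  # report the failure instead of crashing the pipeline
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}


if __name__ == "__main__":
    print("missing packages:", missing_packages())
```

Returning a structured error lets the calling layer tell the user that the analysis did not run, rather than passing an empty or stale result to the interpretation step.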

Core Entities

Models

  • Llama-2 (7B)
  • LLM4Causal-Mixed
  • LLM4Causal-Retrieve
  • LLM4Causal-Interpret
  • GPT-4
  • GPT4-turbo

Metrics

  • Pass Rate
  • Relevance Rate
  • Win Rate
  • Hallucination rate
  • Incompleteness rate
  • Non-fluency rate

Datasets

  • Causal-Retrieval-Bench
  • Causal-Interpret-Bench
  • synthetic causal datasets (150 test files per task)

Benchmarks

  • end-to-end evaluation (Pass / Relevance / Win rates)
  • causal entity extraction (exact/soft match)
  • interpretation error rates (hallucination, incompleteness, non-fluency)
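
The extraction and interpretation benchmarks listed above can be scored with a few lines of code. In the sketch below, the soft-match rule (equality after lowercasing and collapsing underscores and whitespace) is an assumed definition, since this summary does not state the paper's exact criterion, and the annotation flags are hypothetical field names for human labels.

```python
def exact_match(predicted: str, gold: str) -> bool:
    return predicted == gold


def soft_match(predicted: str, gold: str) -> bool:
    """Assumed rule: equal after lowercasing and collapsing underscores and whitespace."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().replace("_", " ").split())
    return normalize(predicted) == normalize(gold)


def accuracy(pairs, match) -> float:
    """Share of (predicted, gold) pairs accepted by the given match function."""
    return sum(match(p, g) for p, g in pairs) / len(pairs)


def error_rates(annotations) -> dict:
    """annotations: one dict of boolean flags per human-reviewed interpretation."""
    flags = ("hallucination", "incompleteness", "non_fluency")
    n = len(annotations)
    return {flag: sum(a[flag] for a in annotations) / n for flag in flags}


# Made-up predictions and labels, for illustration only.
pairs = [("Revenue", "revenue"), ("discount", "discount")]
annotations = [
    {"hallucination": True, "incompleteness": False, "non_fluency": False},
    {"hallucination": False, "incompleteness": False, "non_fluency": False},
    {"hallucination": False, "incompleteness": True, "non_fluency": False},
    {"hallucination": False, "incompleteness": False, "non_fluency": False},
]
rates = error_rates(annotations)
```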