Fine-tune an LLM to parse causal questions, call causal tools, and explain results end-to-end

December 28, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The method shows clear empirical gains on synthetic benchmarks and human-evaluated interpretation tasks; gains depend on a curated fine-tuning dataset and a runnable code environment.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Haitao Jiang, Lin Ge, Yuhe Gao, Jianian Wang, Rui Song

Why It Matters For Business

Fine-tuning an open LLM with targeted input-output examples turns it into a causal assistant that extracts task details, runs analysis code, and explains results. On the paper's synthetic benchmarks it did so reliably, which shortens time to insight for analysts and lowers dependence on closed APIs.

Summary TLDR

The authors fine-tune an open LLaMA-2 (7B) model with LoRA on two synthetic instruction datasets so it can (1) parse user causal questions into structured JSON, (2) select and run causal tool functions (graph learning, ATE/HTE, mediation, off-policy optimization), and (3) translate numeric outputs into plain-language interpretations. On synthetic tasks, the tuned model (LLM4Causal) outperforms GPT-4 at extracting entities and delivering correct end-to-end answers. The paper releases the data-generation pipeline and two fine-tuning corpora (Causal-Retrieval-Bench, Causal-Interpret-Bench).
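The structured JSON from step (1) can be pictured as a small record holding a task label and the variables named in the query. The sketch below is illustrative only: the key names (`task`, `dataset`, `treatment`, `outcome`), the helper `parse_model_output`, and the example query fields are assumptions, not the paper's schema. The task abbreviations follow the Results table.

```python
import json

# Task labels as abbreviated in the Results table. The key names are
# hypothetical, chosen for illustration, and not taken from the paper.
KNOWN_TASKS = {"CGL", "ATE", "HTE", "MA", "OPO"}
REQUIRED_KEYS = {"task", "dataset"}


def parse_model_output(raw: str) -> dict:
    """Parse the model's JSON string and reject malformed or unknown tasks."""
    record = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if record["task"] not in KNOWN_TASKS:
        raise ValueError(f"unknown task: {record['task']!r}")
    return record


# Made-up example of what a tuned model might emit for one query.
example = ('{"task": "ATE", "dataset": "sales.csv", '
           '"treatment": "discount", "outcome": "revenue"}')
parsed = parse_model_output(example)
print(parsed["task"], parsed["treatment"], "->", parsed["outcome"])
```

Validating the record before any tool is called keeps a malformed or misclassified query from silently triggering the wrong analysis.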

Problem Statement

General LLMs fail at data-backed causal tasks: they misidentify task type, hallucinate, or produce code that does not run. The paper aims to fine-tune an open LLM so it reliably (a) extracts task and input variables from a natural query, (b) calls appropriate causal-analysis functions, and (c) explains numerical results in easy language.

Main Contribution

Design of an end-to-end LLM pipeline (LLM4Causal) that parses causal queries, invokes causal algorithms, and explains outputs.

A three-stage synthetic data generation and human-annotation pipeline producing two fine-tuning corpora: Causal-Retrieval-Bench (1,500 pairs) and Causal-Interpret-Bench (400 human-refined interpretations).
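To make the two corpora concrete, here is a minimal sketch of how one record of each kind could be serialized for supervised fine-tuning. The prompt template, the `prompt`/`completion` field names, and both example pairs are invented for illustration; the paper's actual record format is not described in this summary.

```python
import json


def make_record(instruction: str, user_input: str, target: str) -> dict:
    """Build one prompt/completion pair for supervised fine-tuning."""
    prompt = f"{instruction}\n\nInput: {user_input}\n\nOutput:"
    return {"prompt": prompt, "completion": " " + target}


# Retrieval-style pair: natural-language query -> structured fields.
retrieval = make_record(
    "Extract the causal task and variables as JSON.",
    "Does the discount raise revenue in sales.csv?",
    json.dumps({"task": "ATE", "dataset": "sales.csv",
                "treatment": "discount", "outcome": "revenue"}),
)

# Interpretation-style pair: numeric tool output -> plain-language reading.
interpretation = make_record(
    "Explain the result in plain language.",
    "task=ATE, treatment=discount, outcome=revenue, estimate=2.5",
    "On average, applying the discount changes revenue by about 2.5 units.",
)

# One JSON object per line (JSONL), a format many fine-tuning scripts accept.
jsonl = "\n".join(json.dumps(r) for r in (retrieval, interpretation))
print(jsonl)
```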

Key Findings

LLM4Causal-Mixed achieved much higher end-to-end accuracy (win rate) than GPT-4 on synthetic causal tasks

Numbers: Win rate avg 0.806 vs GPT4 avg ~0.12 (Table 2)

Practical Use: Fine-tuning a local LLaMA-2 with task-specific data makes an LLM reliably produce correct causal answers on the evaluated synthetic tasks; use fine-tuning rather than vanilla prompting for production causal assistants.

Evidence Ref: Table 2

Entity extraction (task, dataset, treatment, outcome, etc.) improved substantially after fine-tuning

Numbers: Causal task accuracy 0.98 vs GPT4-turbo 0.69 (Table 3)

Practical Use: If your app needs exact JSON extraction from user queries, fine-tune on structured input-output pairs (Causal-Retrieval-Bench style) to cut extraction errors from ~30% to a few percent.

Evidence Ref: Table 3
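Extraction accuracy of this kind can be scored per field by comparing predicted and reference records. The sketch below is one possible scorer, not the paper's evaluation code: the "soft" rule used here (case-insensitive, whitespace-normalized equality) is an assumed definition, and the records are made up.

```python
def normalize(value) -> str:
    """Lowercase and collapse whitespace: an assumed definition of 'soft'."""
    return " ".join(str(value).lower().split())


def field_accuracy(predictions, references, field, soft=False):
    """Share of examples whose predicted field matches the reference."""
    hits = 0
    for pred, ref in zip(predictions, references):
        p, r = pred.get(field, ""), ref.get(field, "")
        if soft:
            p, r = normalize(p), normalize(r)
        hits += p == r
    return hits / len(references)


# Made-up reference and predicted records for two queries.
refs = [{"task": "ATE", "treatment": "discount"},
        {"task": "MA", "treatment": "training hours"}]
preds = [{"task": "ATE", "treatment": "Discount"},
         {"task": "CGL", "treatment": "training  hours"}]

print(field_accuracy(preds, refs, "task"))                  # 0.5
print(field_accuracy(preds, refs, "treatment"))             # 0.0
print(field_accuracy(preds, refs, "treatment", soft=True))  # 1.0
```

Reporting exact and soft scores side by side separates formatting slips (case, spacing) from genuinely wrong extractions.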

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Pass Rate (LLM4Causal-Mixed) | CGL 1.00, ATE 0.93, HTE 0.83, MA 0.86, OPO 0.83 | ChatGPT/GPT4: CGL 0.17, ATE 0.77, HTE 0.73, MA 0.27, OPO 0.87 | Notable increase in CGL and MA; mixed for OPO | end-to-end synthetic tasks (Table 2) | Table 2: Pass Rate | Table 2 |
| Relevance Rate (LLM4Causal-Mixed) | CGL 1.00, ATE 0.93, HTE 0.83, MA 0.80, OPO 0.83 | ChatGPT/GPT4: CGL 0.10, ATE 0.60, HTE 0.43, MA 0.20, OPO 0.43 | Large improvement in task identification | end-to-end synthetic tasks (Table 2) | Table 2: Relevance Rate | Table 2 |
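The Delta column is qualitative, so it can help to compute the per-task differences directly. The short script below uses only the numbers reported in the table and subtracts the baseline from the tuned model; the variable and function names are ours.

```python
TASKS = ["CGL", "ATE", "HTE", "MA", "OPO"]

# Pass Rate values copied from the Results table (Table 2).
pass_tuned = dict(zip(TASKS, [1.00, 0.93, 0.83, 0.86, 0.83]))
pass_base = dict(zip(TASKS, [0.17, 0.77, 0.73, 0.27, 0.87]))

# Relevance Rate values copied from the same table.
rel_tuned = dict(zip(TASKS, [1.00, 0.93, 0.83, 0.80, 0.83]))
rel_base = dict(zip(TASKS, [0.10, 0.60, 0.43, 0.20, 0.43]))


def deltas(tuned, base):
    """Per-task difference (tuned minus baseline), rounded to 2 decimals."""
    return {t: round(tuned[t] - base[t], 2) for t in TASKS}


pass_delta = deltas(pass_tuned, pass_base)
rel_delta = deltas(rel_tuned, rel_base)
print("Pass Rate delta:     ", pass_delta)
print("Relevance Rate delta:", rel_delta)
```

The only negative difference is the OPO Pass Rate (-0.04), which is what "mixed for OPO" refers to; the largest Pass Rate gains are CGL (+0.83) and MA (+0.59).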

What To Try In 7 Days

Collect representative user queries and map them to desired JSON outputs; fine-tune a small LLaMA checkpoint with LoRA.

Build a two-stage dataset: (1) query→structured fields, (2) function output→plain-language interpretation and then fine-tune a single model.

Prototype a wrapper that maps parsed JSON to existing causal packages (econml, causal-learn) and returns templated summaries for user review.
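The wrapper in the last step can start as a small registry that maps a task label to a function and a summary template. The sketch below is a stand-in under stated assumptions: `estimate_ate` is a plain difference in means (valid as an ATE estimate only under randomized treatment) used in place of a real causal package call, and `REGISTRY`, `run_query`, and the data rows are hypothetical.

```python
from statistics import mean


def estimate_ate(rows, treatment, outcome):
    """Stand-in estimator: mean outcome of treated minus mean of control.

    A real wrapper would call a causal package here instead.
    """
    treated = [r[outcome] for r in rows if r[treatment] == 1]
    control = [r[outcome] for r in rows if r[treatment] == 0]
    return mean(treated) - mean(control)


# Hypothetical registry: task label -> (function, summary template).
REGISTRY = {
    "ATE": (estimate_ate,
            "Estimated average effect of {treatment} on {outcome}: {value:.2f}."),
}


def run_query(parsed, rows):
    """Dispatch a parsed query to its tool and fill in the summary template."""
    if parsed["task"] not in REGISTRY:
        return f"Task {parsed['task']!r} is not supported yet."
    func, template = REGISTRY[parsed["task"]]
    value = func(rows, parsed["treatment"], parsed["outcome"])
    return template.format(value=value, **parsed)


# Made-up data: two treated and two control observations.
rows = [
    {"discount": 1, "revenue": 12.0},
    {"discount": 1, "revenue": 14.0},
    {"discount": 0, "revenue": 10.0},
    {"discount": 0, "revenue": 11.0},
]
parsed = {"task": "ATE", "treatment": "discount", "outcome": "revenue"}
summary = run_query(parsed, rows)
print(summary)
```

Keeping the summary templated rather than free-form makes the output easy to review before a fine-tuned interpreter replaces it.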

Agent Features

Planning
map user intent → causal task → selected function
Tool Use
function calling
automated selection of causal packages (causal-learn, econml, causalml, CausalDM)
Frameworks
LoRA
templated outcome summaries
Is Agentic

Yes

Architectures
LLaMA-2 (7B) base

Optimization Features

Model Optimization
LoRA
Training Optimization
fine-tune on synthetic plus human-refined pairs
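For context on why LoRA keeps fine-tuning cheap: it freezes a pretrained weight matrix and trains only two low-rank factors, B (d_out × r) and A (r × d_in), whose product is added to the frozen weight. The arithmetic below counts trainable parameters for one square projection at LLaMA-2 7B's hidden size of 4096. The rank r = 8 is an assumption for illustration; this summary does not state the rank the authors used.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


d = 4096  # hidden size of LLaMA-2 7B
r = 8     # assumed rank, chosen only for illustration

full = d * d                           # parameters in one full d x d matrix
lora = lora_trainable_params(d, d, r)  # parameters LoRA trains instead
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.4%}")
```

At these settings the adapter trains 65,536 parameters per matrix instead of 16,777,216, roughly 0.39% of the full count.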

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Evaluations use synthetic datasets; real-world data and domain drift are not fully tested.

System depends on installed causal packages and a working execution environment; code errors break end-to-end delivery.

When Not To Use

When input datasets are proprietary and cannot be executed in the system's runtime environment.

When causal assumptions require domain-specific validations not encoded in the tool selection logic.

Failure Modes

Incorrect task classification leads to wrong analysis (e.g., returning correlation instead of ATE).

Generated code fails at runtime due to missing packages or environment mismatch.
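The runtime failure mode can be partly contained with a preflight check and a guarded call. The sketch below is one possible safeguard, not part of the paper's system: `missing_packages` and `safe_run` are hypothetical helpers, and the import names passed in are assumptions that can differ from the pip package names.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def safe_run(func, *args, **kwargs):
    """Run a tool call and turn any exception into a reportable result."""
    try:
        return {"ok": True, "value": func(*args, **kwargs)}
    except Exception as exc:  # broad on purpose: every failure gets reported
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}


# Import names are assumptions; check them against your environment.
print("missing:", missing_packages(["econml", "causallearn"]))

# A failing call is captured instead of crashing the pipeline.
result = safe_run(lambda: 1 / 0)
print(result)
```

Returning a structured error lets the assistant tell the user that the analysis did not run, rather than presenting a partial or invented answer.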

Core Entities

Models

Llama-2 (7B), LLM4Causal-Mixed, LLM4Causal-Retrieve, LLM4Causal-Interpret, GPT-4, GPT4-turbo

Metrics

Pass Rate, Relevance Rate, Win Rate, Hallucination rate, Incompleteness rate, Non-fluency rate

Datasets

Causal-Retrieval-Bench, Causal-Interpret-Bench, synthetic causal datasets (150 test files per task)

Benchmarks

end-to-end evaluation (Pass / Relevance / Win rates); causal entity extraction (exact/soft match); interpretation error rates (hallucination, incompleteness, non-fluency)