Fine-tune an LLM to parse causal questions, call causal tools, and explain results end-to-end

Overview

Decision Snapshot: Needs Validation

The method shows clear empirical gains on synthetic benchmarks and human-evaluated interpretation tasks; gains depend on a curated fine-tuning dataset and a runnable code environment.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Haitao Jiang, Lin Ge, Yuhe Gao, Jianian Wang, Rui Song

Why It Matters For Business

Fine-tuning an open LLM with targeted input-output examples turns it into a causal assistant that extracts task details, runs analysis code, and explains results. On the evaluated synthetic tasks this worked reliably, which can reduce time to insight for analysts and lower dependence on closed APIs.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

The authors fine-tune an open LLaMA-2 (7B) model with LoRA on two synthetic instruction datasets so it can (1) parse user causal questions into structured JSON, (2) pick and run causal tool functions (graph learning, ATE/HTE, mediation, off-policy optimization), and (3) translate numeric outputs into plain-language interpretations. Evaluated on synthetic tasks, the tuned model (LLM4Causal) outperforms GPT-4 at extracting entities and delivering correct end-to-end answers. The paper also presents the data-generation pipeline and two fine-tuning corpora (Causal-Retrieval-Bench, Causal-Interpret-Bench); this page found no linked open assets for them.

Problem Statement

General LLMs fail at data-backed causal tasks: they misidentify task type, hallucinate, or produce code that does not run. The paper aims to fine-tune an open LLM so it reliably (a) extracts task and input variables from a natural query, (b) calls appropriate causal-analysis functions, and (c) explains numerical results in easy language.
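Step (a) above can be made concrete with a small validator for the model's JSON reply. This is a minimal sketch, not the paper's schema: the task codes come from the Results table on this page, while the field names (`task`, `dataset`, `treatment`, `outcome`, `mediator`) and the per-task required fields are assumptions chosen for illustration.

```python
import json

# Task codes as they appear in the Results table on this page.
TASKS = {"CGL", "ATE", "HTE", "MA", "OPO"}

# ASSUMPTION: which fields each task needs; the paper's exact keys may differ.
REQUIRED_FIELDS = {
    "CGL": {"dataset"},
    "ATE": {"dataset", "treatment", "outcome"},
    "HTE": {"dataset", "treatment", "outcome"},
    "MA": {"dataset", "treatment", "outcome", "mediator"},
    "OPO": {"dataset", "treatment", "outcome"},
}


def parse_model_output(raw: str) -> dict:
    """Parse the model's JSON reply; reject unknown tasks and missing fields."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("model output must be a JSON object")
    task = parsed.get("task")
    if task not in TASKS:
        raise ValueError(f"unknown causal task: {task!r}")
    missing = sorted(REQUIRED_FIELDS[task] - set(parsed))
    if missing:
        raise ValueError(f"missing fields for {task}: {missing}")
    return parsed
```

Rejecting a malformed reply before any analysis code runs keeps a task misclassification from silently turning into a wrong answer.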

Main Contribution

Design of an end-to-end LLM pipeline (LLM4Causal) that parses causal queries, invokes causal algorithms, and explains outputs.

A three-stage synthetic data generation and human-annotation pipeline producing two fine-tuning corpora: Causal-Retrieval-Bench (1,500 pairs) and Causal-Interpret-Bench (400 human-refined interpretations).

Key Findings

LLM4Causal-Mixed achieved much higher end-to-end accuracy (win rate) than GPT-4 on synthetic causal tasks

Numbers: Win rate avg 0.806 vs. GPT-4 avg ~0.12 (Table 2)

Practical Use: Fine-tuning a local LLaMA-2 with task-specific data makes an LLM reliably produce correct causal answers on the evaluated synthetic tasks; use fine-tuning rather than vanilla prompting for production causal assistants.

Evidence Ref: Table 2

Entity extraction (task, dataset, treatment, outcome, etc.) improved substantially after fine-tuning

Numbers: Causal task accuracy 0.98 vs. GPT4-turbo 0.69 (Table 3)

Practical Use: If your app needs exact JSON extraction from user queries, fine-tune on structured input-output pairs (Causal-Retrieval-Bench style). On the reported task-identification accuracy, this cuts the error rate from about 31% to about 2%.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta (Value − Baseline)	Split / Dataset	Evidence	Evidence Ref
Pass Rate (LLM4Causal-Mixed)	CGL 1.00, ATE 0.93, HTE 0.83, MA 0.86, OPO 0.83	ChatGPT/GPT4: CGL 0.17, ATE 0.77, HTE 0.73, MA 0.27, OPO 0.87	CGL +0.83, ATE +0.16, HTE +0.10, MA +0.59, OPO −0.04 (large gains in CGL and MA; slight drop in OPO)	end-to-end synthetic tasks (Table 2)	Table 2: Pass Rate	Table 2
Relevance Rate (LLM4Causal-Mixed)	CGL 1.00, ATE 0.93, HTE 0.83, MA 0.80, OPO 0.83	ChatGPT/GPT4: CGL 0.10, ATE 0.60, HTE 0.43, MA 0.20, OPO 0.43	CGL +0.90, ATE +0.33, HTE +0.40, MA +0.60, OPO +0.40 (large improvement in task identification)	end-to-end synthetic tasks (Table 2)	Table 2: Relevance Rate	Table 2
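The per-task gap between the tuned model and the baseline follows directly from the two rows above. The short sketch below recomputes it from the table's values and adds an unweighted mean across the five tasks; the mean is a convenience for comparison here, not a figure reported in the paper.

```python
# Per-task scores copied from the Results table (Table 2 of the paper).
TASKS = ["CGL", "ATE", "HTE", "MA", "OPO"]
PASS_RATE = {
    "model": [1.00, 0.93, 0.83, 0.86, 0.83],
    "baseline": [0.17, 0.77, 0.73, 0.27, 0.87],
}
RELEVANCE_RATE = {
    "model": [1.00, 0.93, 0.83, 0.80, 0.83],
    "baseline": [0.10, 0.60, 0.43, 0.20, 0.43],
}


def per_task_delta(scores):
    """Model score minus baseline score for each task, rounded to 2 places."""
    return {
        task: round(m - b, 2)
        for task, m, b in zip(TASKS, scores["model"], scores["baseline"])
    }


def mean_delta(scores):
    """Unweighted mean of the per-task deltas (not a paper-reported number)."""
    deltas = per_task_delta(scores)
    return round(sum(deltas.values()) / len(deltas), 3)


for name, scores in [("pass", PASS_RATE), ("relevance", RELEVANCE_RATE)]:
    print(name, per_task_delta(scores), "mean:", mean_delta(scores))
```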

What To Try In 7 Days

Collect representative user queries and map them to desired JSON outputs; fine-tune a small LLaMA checkpoint with LoRA.

Build a two-stage dataset, (1) query → structured fields and (2) function output → plain-language interpretation, then fine-tune a single model on both.

Prototype a wrapper that maps parsed JSON to existing causal packages (econml, causal-learn) and returns templated summaries for user review.
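The wrapper in the last step might start as a dispatch table keyed by task. The sketch below is illustrative only: `naive_ate` is a stand-in difference-in-means estimator (a valid ATE estimate only when treatment is randomized), not a call into econml or causal-learn, and the handler names, field names, and summary template are all hypothetical.

```python
from statistics import mean


def naive_ate(rows, treatment, outcome):
    """Stand-in estimator: mean outcome of treated rows minus control rows.

    Only a valid ATE estimate under randomized treatment; in a real system
    this slot would call a causal package instead.
    """
    treated = [r[outcome] for r in rows if r[treatment] == 1]
    control = [r[outcome] for r in rows if r[treatment] == 0]
    if not treated or not control:
        raise ValueError("need at least one treated and one control row")
    return mean(treated) - mean(control)


# Hypothetical dispatch table: parsed task code -> analysis function.
HANDLERS = {"ATE": naive_ate}

# Hypothetical templated summary, flagged for user review.
TEMPLATES = {
    "ATE": "Estimated average effect of {treatment} on {outcome}: "
           "{value:+.2f} (automated summary; review before use).",
}


def run_query(parsed, rows):
    """Route a parsed query to its handler and render a templated summary."""
    task = parsed["task"]
    if task not in HANDLERS:
        raise NotImplementedError(f"no handler registered for task {task!r}")
    value = HANDLERS[task](rows, parsed["treatment"], parsed["outcome"])
    return TEMPLATES[task].format(
        treatment=parsed["treatment"], outcome=parsed["outcome"], value=value
    )


rows = [
    {"discount": 1, "revenue": 12.0},
    {"discount": 1, "revenue": 14.0},
    {"discount": 0, "revenue": 10.0},
    {"discount": 0, "revenue": 10.0},
]
print(run_query({"task": "ATE", "treatment": "discount", "outcome": "revenue"}, rows))
```

Keeping the task-to-function mapping in one table makes it easy to see which tasks are actually supported and to swap the stand-in for a package-backed estimator later.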

Agent Features

Planning

map user intent → causal task → selected function

Tool Use

function calling; automated selection of causal packages (causal-learn, econml, causalml, CausalDM)

Frameworks

LoRA; templated outcome summaries

Is Agentic

Yes

Architectures

LLaMA-2 (7B) base

Optimization Features

Model Optimization

LoRA

Training Optimization

fine-tune on synthetic plus human-refined pairs
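One way to combine the two kinds of pairs into a single training file is sketched below. The prompt template, the instruction strings, and the `prompt`/`completion` record layout are assumptions for illustration; the paper's actual formatting is not described on this page.

```python
import json
import random

# ASSUMPTION: instruction wording and prompt layout are illustrative only.
RETRIEVE_INSTRUCTION = "Extract the causal task and its inputs as JSON."
INTERPRET_INSTRUCTION = "Explain this causal analysis output in plain language."


def to_record(instruction, source_text, target_text):
    """Format one input-output pair as a prompt/completion training record."""
    prompt = (
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{source_text}\n\n### Response:\n"
    )
    return {"prompt": prompt, "completion": target_text}


def build_mixed_corpus(retrieval_pairs, interpret_pairs, seed=0):
    """Merge (query, fields) and (function output, interpretation) pairs."""
    records = [
        to_record(RETRIEVE_INSTRUCTION, query, json.dumps(fields, sort_keys=True))
        for query, fields in retrieval_pairs
    ]
    records += [
        to_record(INTERPRET_INSTRUCTION, output, text)
        for output, text in interpret_pairs
    ]
    random.Random(seed).shuffle(records)  # deterministic interleaving
    return records


def write_jsonl(records, path):
    """Write one JSON object per line, a common input format for fine-tuning."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
```

Shuffling with a fixed seed interleaves both pair types so a single model sees extraction and interpretation examples throughout training, while keeping the file reproducible.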

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Evaluations use synthetic datasets; real-world data and domain drift are not fully tested.

System depends on installed causal packages and a working execution environment; code errors break end-to-end delivery.

When Not To Use

When input datasets are proprietary and cannot be executed in the system's runtime environment.

When causal assumptions require domain-specific validations not encoded in the tool selection logic.

Failure Modes

Incorrect task classification leads to wrong analysis (e.g., returning correlation instead of ATE).

Generated code fails at runtime due to missing packages or environment mismatch.
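The runtime failure mode can be caught before it reaches the user with a pre-flight dependency check and a guarded call. This is a minimal sketch using only the standard library; the import names listed in the comment are assumptions and should be checked against each package's documentation.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level modules that cannot be found in this environment."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


def run_guarded(fn, *args, **kwargs):
    """Run an analysis step and report failure as data instead of crashing."""
    try:
        return {"ok": True, "result": fn(*args, **kwargs), "error": None}
    except Exception as exc:  # broad on purpose: generated code can fail anyhow
        return {"ok": False, "result": None, "error": f"{type(exc).__name__}: {exc}"}


# ASSUMPTION: import names for the packages named on this page.
CAUSAL_MODULES = ["causallearn", "econml", "causalml", "causaldm"]

if __name__ == "__main__":
    absent = missing_modules(CAUSAL_MODULES)
    if absent:
        print("Install before serving requests:", ", ".join(absent))
```

Returning a structured error lets the assistant tell the user that the analysis did not run, rather than presenting an interpretation with no result behind it.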

Core Entities

Models

Llama-2 (7B), LLM4Causal-Mixed, LLM4Causal-Retrieve, LLM4Causal-Interpret, GPT-4, GPT4-turbo

Metrics

Pass Rate, Relevance Rate, Win Rate, Hallucination rate, Incompleteness rate, Non-fluency rate

Datasets

Causal-Retrieval-Bench, Causal-Interpret-Bench, synthetic causal datasets (150 test files per task)

Benchmarks

end-to-end evaluation (Pass / Relevance / Win rates); causal entity extraction (exact/soft match); interpretation error rates (hallucination, incompleteness, non-fluency)
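For an in-house version of the end-to-end evaluation, the three rates can be aggregated from logged runs. The sketch assumes each run is recorded with three boolean flags. The flag meanings used here (pass: the generated code ran; relevance: the correct task was chosen; win: the final answer was correct) are an interpretation of the metric names, not definitions taken from the paper.

```python
from collections import defaultdict

FLAGS = ("pass", "relevance", "win")


def rates_by_task(runs):
    """Aggregate per-task Pass / Relevance / Win rates from logged runs.

    Each run is a dict with a "task" code and one boolean per flag.
    ASSUMPTION: the flag semantics are this sketch's reading of the metrics.
    """
    counts = defaultdict(lambda: {"n": 0, **{flag: 0 for flag in FLAGS}})
    for run in runs:
        bucket = counts[run["task"]]
        bucket["n"] += 1
        for flag in FLAGS:
            bucket[flag] += bool(run[flag])
    return {
        task: {flag: bucket[flag] / bucket["n"] for flag in FLAGS}
        for task, bucket in counts.items()
    }


runs = [
    {"task": "ATE", "pass": True, "relevance": True, "win": True},
    {"task": "ATE", "pass": True, "relevance": False, "win": False},
    {"task": "CGL", "pass": True, "relevance": True, "win": True},
]
print(rates_by_task(runs))
```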
