Add causal graphs and what-if checks to RAG to reduce hallucinations and improve causal answers

September 17, 2025 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.7

Cost Impact Score

0.3

Citation Count

0

Authors

Harshad Khadilkar, Abhay Gupta

Links

Abstract / PDF

Why It Matters For Business

If your product needs trustworthy causal answers (for diagnostics, policy, medical reasoning, or financial analysis), adding causal graphs plus counterfactual checks can cut incorrect causal claims and improve interpretability. Expect higher compute and latency costs.

Summary TLDR

This paper builds a Retrieval-Augmented Generation (RAG) pipeline that stores cause-effect pairs in a causal knowledge graph (CKG), retrieves candidates with a two-stage check (vector search, then LLM verification), and then runs programmatic counterfactual simulations to test whether retrieved causes are truly necessary. In the authors' evaluations, this approach raises precision and causal reasoning scores over a standard semantic-similarity RAG, at the cost of extra LLM calls and higher latency.

Problem Statement

Standard RAG fetches text by semantic similarity, which often returns superficially relevant but causally incorrect information. RAG systems lack explicit causal grounding and rarely test counterfactuals, so they can produce plausible-looking but unreliable causal claims.

Main Contribution

A pipeline (Causal-Counterfactual RAG) that constructs a Causal Knowledge Graph (CKG) from documents and stores traceable cause-effect pairs.

A two-stage retrieval: fast vector search followed by LLM-based semantic+polarity verification to avoid context-mismatched matches.

A counterfactual validation loop that programmatically generates plausible opposites of causes, simulates downstream effects in the graph, and uses an LLM to synthesize whether a cause is necessary.
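The counterfactual validation idea can be illustrated with a minimal sketch. This is not the authors' implementation: the toy edges, the function names (`propagate`, `is_necessary`), and the OR semantics (an effect holds once any one of its causes holds) are all assumptions made for illustration. The LLM steps (generating a plausible opposite, synthesizing a verdict) are replaced here by simply blocking the cause and re-propagating through the graph.

```python
# Hypothetical toy graph: (cause, effect, source_doc) triples, so every
# pair stays traceable to the document it was extracted from.
EDGES = [
    ("heavy rain", "river flooding", "doc-1"),
    ("dam failure", "river flooding", "doc-2"),
    ("river flooding", "crop loss", "doc-3"),
    ("drought", "crop loss", "doc-4"),
]


def propagate(active_roots, edges, blocked=frozenset()):
    """Forward-chain until nothing changes.

    Assumed OR semantics: an effect holds once any one of its causes holds.
    Nodes in `blocked` are forced off (the counterfactual intervention).
    """
    holds = set(active_roots) - set(blocked)
    changed = True
    while changed:
        changed = False
        for cause, effect, _source in edges:
            if cause in holds and effect not in holds and effect not in blocked:
                holds.add(effect)
                changed = True
    return holds


def is_necessary(cause, effect, active_roots, edges):
    """True if the effect holds in the factual world but not once the cause is blocked."""
    factual = effect in propagate(active_roots, edges)
    counterfactual = effect in propagate(active_roots, edges, blocked={cause})
    return factual and not counterfactual


if __name__ == "__main__":
    # With two active causes of flooding, neither one alone is necessary.
    print(is_necessary("heavy rain", "river flooding", {"heavy rain", "dam failure"}, EDGES))
    # With heavy rain as the only active cause, it is necessary.
    print(is_necessary("heavy rain", "river flooding", {"heavy rain"}, EDGES))
```

In a real pipeline the blocked cause would be replaced by an LLM-generated alternative and the final judgment would come from an LLM, which is where the extra calls and latency come from.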

Key Findings

Causal-Counterfactual RAG yields substantially higher precision than Regular RAG on evaluated benchmarks.

Numbers: Precision 80.57 vs 60.13 (Regular RAG)

Causal-Counterfactual RAG improves causal reasoning metrics over Regular RAG.

Numbers: Causal Chain Integrity Score 75.58 vs 53.62; Counterfactual Robustness 69.90 vs 49.12

The counterfactual validation loop increases computational cost and latency and inherits LLM error risks.

Results

Precision

Value: 80.57

Baseline (Regular RAG): 60.13

Recall

Value: 78.18

Baseline (Regular RAG): 74.58

Causal Chain Integrity Score (CIS)

Value: 75.58

Baseline (Regular RAG): 53.62

Counterfactual Robustness Score (CRS)

Value: 69.90

Baseline (Regular RAG): 49.12
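To put the four results on a common footing, the short sketch below derives absolute and relative gains over the Regular RAG baseline. The scores are copied from the results listed here; the gain figures are plain arithmetic on them and are not reported in the source. The `gains` helper is a hypothetical name.

```python
# Scores as listed in the results: (Causal-Counterfactual RAG, Regular RAG).
RESULTS = {
    "Precision": (80.57, 60.13),
    "Recall": (78.18, 74.58),
    "CIS": (75.58, 53.62),
    "CRS": (69.90, 49.12),
}


def gains(results):
    """Return {metric: (absolute_gain_points, relative_gain_percent)}."""
    out = {}
    for metric, (value, baseline) in results.items():
        absolute = value - baseline
        out[metric] = (round(absolute, 2), round(100 * absolute / baseline, 1))
    return out


if __name__ == "__main__":
    for metric, (absolute, relative) in gains(RESULTS).items():
        print(f"{metric}: +{absolute:.2f} points ({relative:.1f}% relative)")
```

The derived figures show that the improvement is concentrated in precision and the two causal metrics (each roughly 20 points or more), while recall moves by under 4 points.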

Who Should Care

What To Try In 7 Days

Build a tiny Causal Knowledge Graph from 100 domain docs: extract (cause, effect) pairs with an LLM, encode them with an embedding model, and store them.

Add a two-stage retrieval: vector nearest neighbors then a small LLM prompt to verify polarity and semantic match.

Implement one counterfactual check per query: generate a plausible opposite of a top cause and re-run retrieval to see if the outcome persists.
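As a starting point for the two-stage retrieval step, here is a minimal, dependency-free sketch. Everything in it is a stand-in: `embed` uses bag-of-words counts instead of a real embedding model, `verify` uses a keyword polarity rule instead of an LLM prompt, and the stored pairs are made up for illustration. The point is only the control flow: a cheap similarity search first, then a stricter check that rejects matches with the opposite polarity.

```python
import math
from collections import Counter

# Hypothetical stored (cause, effect) pairs.
PAIRS = [
    ("interest rate increase", "lower housing demand"),
    ("interest rate decrease", "higher housing demand"),
    ("heavy rain", "river flooding"),
]

# Toy polarity vocabularies (assumption; a real system would ask an LLM).
NEGATIVE = {"decrease", "drop", "fall"}
POSITIVE = {"increase", "rise", "growth"}


def embed(text):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def vector_search(query, pairs, k=2):
    """Stage 1: rank pairs by similarity between the query and the cause text."""
    q = embed(query)
    return sorted(pairs, key=lambda p: cosine(q, embed(p[0])), reverse=True)[:k]


def polarity(text):
    """Return +1, -1, or 0 based on the toy vocabularies above."""
    words = set(text.lower().split())
    if words & NEGATIVE:
        return -1
    if words & POSITIVE:
        return 1
    return 0


def verify(query, cause):
    """Stage 2 stub for the LLM check: reject candidates whose polarity clashes."""
    q, c = polarity(query), polarity(cause)
    return q == 0 or c == 0 or q == c


def retrieve(query, pairs, k=2):
    """Run stage 1, then keep only candidates that pass stage 2."""
    return [p for p in vector_search(query, pairs, k) if verify(query, p[0])]


if __name__ == "__main__":
    print(retrieve("interest rate increase", PAIRS))
```

In this toy data, "interest rate decrease" ranks second in stage 1 because it shares two of three words with the query, which is exactly the kind of superficially similar but causally opposite match the second stage is meant to filter out.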

Agent Features

Memory

  • retrieval memory

Tool Use

  • LLMs for extraction and verification
  • graph database with a vector index (Neo4j) for fast nearest-neighbor search
  • embedding models for semantic encoding

Frameworks

  • RAG
  • Causal Knowledge Graph (CKG)

Architectures

  • retrieval+LLM pipeline
  • knowledge-graph-backed retrieval

Optimization Features

Infra Optimization

  • use of vector index (Neo4j) for fast search; judge LLM deployed via Groq

Reproducibility

Data Urls

  • OpenAlex (used as corpus)

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Relies on LLMs to construct the CKG; errors can enshrine false causal links.
  • Counterfactual generation can produce implausible alternatives, corrupting validation.
  • Counterfactual loop adds many extra LLM calls and increases query latency.
  • Evaluation uses custom dataset and LLM judge; external benchmark validation is limited.

When Not To Use

  • When strict low-latency, real-time responses are required.
  • For simple fact lookups where semantic retrieval suffices.
  • If you lack budget for repeated LLM calls and graph maintenance.

Failure Modes

  • Graph contains fabricated or misinterpreted cause-effect pairs -> wrong 'ground truth'.
  • LLM generates illogical counterfactuals -> wrong necessity judgments.
  • High computational cost makes the pipeline impractical for high-throughput systems.

Core Entities

Models

  • Gemini 1.5
  • SentenceTransformer all-MiniLM-L6-v2
  • LLaMA-3.1-8B-Instant

Metrics

  • Precision
  • Recall
  • Causal Chain Integrity Score (CCIS/CIS)
  • Counterfactual Robustness Score (CRS)

Datasets

  • OpenAlex corpus
  • custom causal QA dataset (generated per-document)