Use a personal causal graph so an LLM recommends foods that better lower your post-meal glucose

Overview

Decision Snapshot: Needs Validation

The approach is clear and interpretable: build per-user causal graphs, rank causal paths, retrieve foods, and run counterfactual checks. Evidence comes from simulated interventions on 34 users and blinded judge/human comparisons; clinical validation is still required.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.90

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Zhongqi Yang, Amir Rahmani

Why It Matters For Business

Personalized causal reasoning makes LLM-driven dietary advice measurably more tailored and stable for multi-hour glucose control; that can improve product trust and clinical usefulness compared to one-size-fits-all LLM responses.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

The paper presents Personalized Causal Graph Reasoning: build an individualized causal graph from a person's glucose, activity, and meal logs, let an LLM traverse and rank causal paths, retrieve candidate foods, and verify suggestions by simulating counterfactuals on a fuller personal graph. The framework is implemented for glucose control using CGM and meal records from 34 users. In counterfactual evaluation, mean glucose reduction (MGR) is higher than the retrieval baseline ChatDiet at 1h (158.21 vs 120.45) and 2h (411.56 vs 307.12), though not at 30 min. An LLM judge (Llama-3 70B) preferred the method 98.43% of the time and human raters 86.5%. Main caveats: the evaluation is simulation-based on observational data, the cohort is small, and there is no prospective clinical trial.

Problem Statement

General LLMs give generic dietary advice because they reason from population-level correlations. This fails when individuals have unique metabolic patterns. The paper aims to make LLM recommendations individualized by reasoning over a person-specific causal graph built from their longitudinal data.

Main Contribution

Introduce Personalized Causal Graph Reasoning: combine a person's causal graph with an LLM that traverses and ranks causal paths to generate tailored interventions.

Implement the framework for dietary recommendations using CGM, activity, and meal logs; verify recommendations via counterfactual simulation on a full-data personal graph.
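As an illustration of the graph-building step, the sketch below runs a simplified, correlation-threshold version of the PC algorithm's skeleton phase (the algorithm named under Failure Modes) on synthetic data. It is not the paper's implementation: the variable names, the 0.1 threshold, the synthetic chain carbs → insulin_demand → glucose, and the restriction to conditioning sets of size 0 and 1 with no edge orientation are all assumptions made for brevity.

```python
import random
from itertools import combinations
from math import sqrt

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)

def partial_corr(a, b, c):
    """Correlation of a and b after linearly controlling for c."""
    rab, rac, rbc = corr(a, b), corr(a, c), corr(b, c)
    return (rab - rac * rbc) / sqrt((1 - rac ** 2) * (1 - rbc ** 2))

def skeleton(data, threshold=0.1):
    """Keep an undirected edge x-y unless x and y look independent,
    either marginally or given any single other variable."""
    names = list(data)
    kept = set()
    for x, y in combinations(names, 2):
        if abs(corr(data[x], data[y])) < threshold:
            continue  # marginally independent: drop the edge
        others = [z for z in names if z not in (x, y)]
        if any(abs(partial_corr(data[x], data[y], data[z])) < threshold
               for z in others):
            continue  # independent given one other variable: drop the edge
        kept.add(frozenset((x, y)))
    return kept

# Synthetic stand-in for personal logs (hypothetical variables and effects).
rng = random.Random(0)
carbs = [rng.gauss(50, 15) for _ in range(2000)]
insulin_demand = [0.8 * c + rng.gauss(0, 5) for c in carbs]
glucose = [1.5 * i + rng.gauss(0, 5) for i in insulin_demand]

edges = skeleton({"carbs": carbs,
                  "insulin_demand": insulin_demand,
                  "glucose": glucose})
for edge in sorted(sorted(e) for e in edges):
    print(" - ".join(edge))
```

In this synthetic chain the direct carbs-glucose edge should be pruned because the two become nearly uncorrelated once insulin_demand is controlled for. A production system would more likely use a tested causal-discovery library with proper significance tests than a fixed threshold.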

Key Findings

Personalized causal-graph method outperforms RAG baselines on longer horizons.

Numbers: 1h MGR 158.21 vs ChatDiet 120.45 (p=0.046); 2h MGR 411.56 vs 307.12 (p≈1e-4).

Practical Use: If you target 1–2 hour post-meal glucose, use personal causal graphs plus an LLM rather than static retrieval systems to get better simulated glucose reductions.

Evidence Ref: Table I

Outputs judged far more personalized by automatic and human judges.

Numbers: LLM-as-a-judge win rate 98.43%; human win rate 86.50%.

Practical Use: The method produces reasoning and explanations that both models and humans perceive as substantially more personalized.

Evidence Ref: Table III

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
30 min MGR (mean [95% CI])	19.84 [9.12, 30.56]	ChatDiet 33.92 [20.70, 47.14]	-14.08 vs ChatDiet (proposed is lower at 30 min; difference reported as not significant)	34 participants; recommendation queries per participant	Table I: 30-min window	Table I
1 hour MGR (mean [95% CI])	158.21 [137.40, 179.02]	ChatDiet 120.45 [90.11, 150.79]	+37.76 vs ChatDiet (p=0.046)	34 participants	Table I: 1-hour window	Table I
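The deltas follow directly from the reported means. The short check below recomputes them in Python; the 2-hour values are taken from Key Findings, and the relative-gain figure is an added convenience rather than a number reported in the source.

```python
# Reported mean MGR values as (proposed method, ChatDiet baseline).
reported = {
    "30 min": (19.84, 33.92),
    "1 hour": (158.21, 120.45),
    "2 hour": (411.56, 307.12),
}

def delta_and_relative(proposed, baseline):
    """Return (absolute delta, relative change in percent) vs the baseline."""
    delta = proposed - baseline
    return round(delta, 2), round(100.0 * delta / baseline, 1)

deltas = {window: delta_and_relative(*values)
          for window, values in reported.items()}

for window, (delta, relative) in deltas.items():
    print(f"{window}: delta={delta:+.2f}, relative={relative:+.1f}%")
```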

What To Try In 7 Days

Build a tiny personal causal graph for one test user from a week of CGM+meal logs and see if LLM-guided recommendations differ from your current rules.

Add a path-ranking step (edge weights × historical usage) and compare top-5 nutrient drivers versus simple correlation ranking.

Implement a counterfactual simulator on full-user data to validate a small set of recommendations before rollout.
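For the counterfactual check, a minimal sketch is shown below. It assumes a linear structural equation for post-meal glucose rise and applies the standard abduction, action, prediction steps. The coefficients, nutrient names, and units are hypothetical; the paper's simulator runs on a full personal causal graph and may differ substantially.

```python
# Hypothetical linear structural equation fitted on a user's full history:
#   glucose_rise = sum(coef[n] * grams[n]) + noise
coef = {"carbs": 1.2, "fiber": -2.0, "protein": -0.3}

def predict(meal):
    """Model-predicted glucose rise for a meal given as grams per nutrient."""
    return sum(coef[n] * meal.get(n, 0.0) for n in coef)

def counterfactual_reduction(observed_meal, observed_rise, new_meal):
    """Estimated reduction in glucose rise had new_meal been eaten instead."""
    noise = observed_rise - predict(observed_meal)  # abduction: recover noise
    cf_rise = predict(new_meal) + noise             # action + prediction
    return observed_rise - cf_rise

observed = {"carbs": 60.0, "fiber": 2.0, "protein": 10.0}
swap = {"carbs": 45.0, "fiber": 8.0, "protein": 20.0}

reduction = counterfactual_reduction(observed, observed_rise=70.0, new_meal=swap)
print(f"Estimated reduction in glucose rise: {reduction:.1f}")
```

Keeping the recovered noise term fixed is what makes this a counterfactual for that specific meal event rather than a fresh population-level prediction.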

Agent Features

Memory

uses longitudinal personal data (CGM, meal logs, MET)

Planning

graph traversal to find causal paths

path ranking to prioritize interventions
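One possible reading of these two planning steps, using the "edge weights × historical usage" score suggested under What To Try In 7 Days, is sketched below. The graph, the weights, and the usage shares are hypothetical. Scoring a path as the product of absolute edge weights times the source nutrient's usage share is an assumption, not the paper's stated formula.

```python
from math import prod

# Hypothetical per-user graph: edges[src][dst] = estimated effect weight.
edges = {
    "fiber": {"glucose": -0.40},
    "carbs": {"insulin_demand": 0.70, "glucose": 0.50},
    "insulin_demand": {"glucose": 0.60},
    "protein": {"insulin_demand": -0.20},
}
# Hypothetical share of the user's logged meals containing each nutrient.
usage = {"fiber": 0.30, "carbs": 0.90, "protein": 0.60}

def paths_to(target, path):
    """Yield every simple directed path that extends `path` to `target`."""
    node = path[-1]
    if node == target:
        yield path
        return
    for nxt in edges.get(node, {}):
        if nxt not in path:  # never revisit a node
            yield from paths_to(target, path + (nxt,))

def rank_paths(target="glucose", top_k=5):
    """Score each path: product of |edge weights| times source usage share."""
    scored = []
    for source in usage:
        for path in paths_to(target, (source,)):
            strength = prod(abs(edges[a][b]) for a, b in zip(path, path[1:]))
            scored.append((strength * usage[source], path))
    return sorted(scored, reverse=True)[:top_k]

ranked = rank_paths()
for score, path in ranked:
    print(f"{score:.3f}  {' -> '.join(path)}")
```

The ranked paths could then be summarized in text for the LLM, which matches the "causal summary" mentioned under Failure Modes.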

Tool Use

external food nutrient database retrieval

counterfactual simulation using a full-data causal graph

Frameworks

Personalized Causal Graph Reasoning

Is Agentic

Yes

Architectures

LLM + per-user causal graph

RAG-style external retrieval

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

CGMacros (referenced as public dataset; exact URL not provided in paper)

Risks & Boundaries

Limitations

Evaluation uses counterfactual simulation on observational data, not prospective clinical trials.

Small cohort (34 users) limits generalizability to broader populations.

When Not To Use

Users with little or no longitudinal personal data available.

Acute clinical decision-making where randomized trial evidence is required.

Failure Modes

Incorrect causal edges from the PC algorithm, for example due to confounding, can produce bad recommendations.

LLM misreads or misapplies the causal summary and suggests inconsistent foods.

Core Entities

Models

GPT-4o

Llama-3 70B

Metrics

Mean Glucose Reduction (MGR)

iAUC (incremental Area Under Curve)
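iAUC is conventionally computed with the trapezoidal rule over the part of the glucose curve that lies above a pre-meal baseline. The sketch below follows that convention. Using the first reading as the baseline and clipping below-baseline readings to zero at each sample point are assumptions. This summary does not give the paper's exact iAUC or MGR formulas, so MGR is not sketched here.

```python
def iauc(times_min, glucose, baseline=None):
    """Trapezoidal area of glucose above baseline (glucose units x minutes)."""
    if baseline is None:
        baseline = glucose[0]  # assumption: first reading is the baseline
    # Clip readings below baseline to zero so dips do not offset the rise.
    excess = [max(g - baseline, 0.0) for g in glucose]
    area = 0.0
    for i in range(1, len(times_min)):
        dt = times_min[i] - times_min[i - 1]
        area += 0.5 * (excess[i] + excess[i - 1]) * dt
    return area

# Hypothetical CGM readings at 15-minute intervals after a meal.
times = [0, 15, 30, 45, 60]
readings = [100, 130, 150, 120, 95]
print(iauc(times, readings))
```

Clipping at the sample points is a simplification. A more careful version would interpolate the exact time at which the curve crosses the baseline.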

Datasets

CGMacros (personal CGM + meal + MET data, 34 participants used)
