Turn SHAP attributions into a reusable knowledge base and train an LLM to reason with it for more accurate, auditable sarcopenia diagnosis.

July 26, 2025 · 9 min

Overview

Decision Snapshot: Needs Validation

Proof-of-concept with clear engineering contributions. Results rely on a single dataset and limited repeated runs (n=10). Good signs for reproducibility and interpretability, but further multi-center validation and code release are needed before clinical production.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.72

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 65%

Authors

Yuqi Jin, Zhenhao Shuai, Zihan Hu, Weiteng Zhang, Weihao Xie, Jianwei Shuai, Xian Shen, Zhen Feng

Why It Matters For Business

CANDLE shows a practical path to combine stable, auditable ML explanations with LLM reasoning. That improves accuracy, produces human-readable rationales, and builds a reusable knowledge asset (ACPB + DKB) that can reduce repeated expensive explainability computations and speed downstream inference.

Who Should Care

Summary TLDR

CANDLE converts XGBoost+SHAP attributions into a compact, retrievable 'Average Contribution Probability Base' (ACPB), uses an actor-critic loop to train an LLM to adjust feature weights, stores distilled case texts in a Diagnosis Knowledge Base (DKB), and at test time retrieves precedents to produce an interpretable sarcopenia prediction. On the evaluated dataset it increased accuracy by ~6 percentage points vs. XGBoost while keeping high reproducibility.

Problem Statement

Clinical tasks with structured data need both accurate predictions and traceable, auditable reasoning. Standard ML gives traceable feature attributions but lacks semantic reasoning. LLMs provide reasoning but are opaque and unstable. The paper asks: can we translate ML attribution logic into a form LLMs can internalize and distill, to gain both improved accuracy and interpretable, reproducible clinical decisions?

Main Contribution

ACPB: a compact Average Contribution Probability Base that converts SHAP values into interval-based contribution probabilities that can be queried without re-running SHAP.

Actor-Critic cross-modal distillation: an RL loop in which an LLM (Actor) proposes adjusted feature weights and a Composite Reward (Critic) drives alignment with the XGBoost probabilities, stopping once the LLM-inferred probability is within 5% of the XGBoost output.
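The alignment loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the LLM Actor is replaced by a hypothetical `stub_actor` function, the Composite Reward is reduced to the negative absolute gap, the logistic aggregation of weighted contributions is assumed, "within 5%" is read as an absolute probability difference of 0.05, and the feature names and values are invented for the example.

```python
import math

def inferred_probability(weights, contributions):
    """Assumed aggregation: logistic of the weighted sum of contributions."""
    score = sum(weights[f] * contributions[f] for f in contributions)
    return 1.0 / (1.0 + math.exp(-score))

def critic_reward(p_infer, p_teacher):
    """Stand-in for the Composite Reward: negative absolute gap to the teacher."""
    return -abs(p_infer - p_teacher)

def stub_actor(weights, contributions, p_infer, p_teacher, step=0.5):
    """Placeholder for the LLM Actor: nudge each weight so the gap shrinks."""
    gap = p_teacher - p_infer
    return {f: w + step * gap * contributions[f] for f, w in weights.items()}

def align(contributions, p_teacher, tolerance=0.05, max_rounds=50):
    """Iterate until the inferred probability is within `tolerance` of the teacher."""
    weights = {f: 1.0 for f in contributions}
    history = []
    p_infer = inferred_probability(weights, contributions)
    for round_idx in range(max_rounds):
        p_infer = inferred_probability(weights, contributions)
        history.append((round_idx, p_infer, critic_reward(p_infer, p_teacher)))
        if abs(p_infer - p_teacher) <= tolerance:
            break  # early exit once aligned with the teacher probability
        weights = stub_actor(weights, contributions, p_infer, p_teacher)
    return weights, p_infer, history

# Hypothetical per-feature contributions and teacher (XGBoost) probability.
example = {"grip_strength": -0.8, "age": 0.6, "bmi": -0.3}
weights, p_final, history = align(example, p_teacher=0.30)
print(len(history), round(p_final, 3), weights)
```

In a real setup the stub would be replaced by an LLM call that returns a new weight set, and the logged `history` gives an audit trail of each proposal and its reward.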

Key Findings

CANDLE increased overall accuracy compared to the XGBoost baseline.

Numbers: Accuracy: CANDLE (LLM with DKB) 79.3% vs XGBoost 73.3% (+6.0 percentage points) (Table 1).

Practical Use: If you already use an XGBoost teacher model and SHAP attributions, applying ACPB+LLM distillation can yield modest accuracy gains (~6pp) while keeping attribution logic available for audits.

Evidence Ref: Table 1

Predictions from the distilled LLM are highly reproducible under fixed prompts.

Numbers: Probabilities ICC = 0.956; Cohen's kappa = 0.842; sample-level label consistency = 95.5% (Table 2).

Practical Use: For deployment requiring repeatability, this hybrid preserves very stable outputs under identical inputs and prompts, which helps with audits and clinical traceability.

Evidence Ref: Table 2
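To run the same kind of repeatability check on your own pipeline, the sketch below computes Cohen's kappa between two runs and a label-consistency rate across several runs. The definition of "sample-level label consistency" used here (the share of samples whose label is identical in every run) is an assumption, since the summary does not spell it out, and the label lists are toy values rather than data from the paper. ICC is omitted because the summary does not say which ICC form was used.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two label sequences of equal length."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the marginal label frequencies of each run.
    expected = sum(freq_a[k] * freq_b[k] for k in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both runs used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

def label_consistency(runs):
    """Assumed definition: fraction of samples labelled identically in all runs."""
    per_sample = list(zip(*runs))
    return sum(len(set(labels)) == 1 for labels in per_sample) / len(per_sample)

# Toy labels from two repeated runs under a fixed prompt (illustrative only).
run_1 = [1, 1, 0, 0, 1, 0, 1, 0]
run_2 = [1, 1, 0, 0, 1, 0, 0, 0]
print(cohens_kappa(run_1, run_2), label_consistency([run_1, run_2]))
```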

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 79.3% | XGBoost 73.3% | +6.0 percentage points | evaluated dataset (sarcopenia prediction) | Table 1 shows CANDLE (LLM with DKB) accuracy 79.3% vs XGBoost 73.3%. | Table 1 |
| AUC (mean ± sd) | 0.760 ± 0.010 | not explicitly compared for AUC | | repeated runs under fixed prompts (N=10) | Table 3 reports AUC 0.760 ± 0.010 across repeated runs. | Table 3 |

What To Try In 7 Days

Extract SHAP attributions from your existing XGBoost/ensemble model, aggregate them into simple interval buckets for frequent features (prototype ACPB).

Run a small LLM prompt that consumes those atomic facts and asks for adjusted feature weights; compare the LLM-inferred probability vs model output and log discrepancies.

Set up a tiny retrieval store (e.g., Qdrant) with a few distilled case texts and test retrieval-augmented prompts to observe changes in decision wording and threshold behavior.
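The first step above (a prototype ACPB) can be approximated with a small lookup table that maps value intervals of a feature to the mean SHAP value observed in that interval. This is a sketch under assumptions: quantile binning, the `build_acpb_entry`, `lookup`, and `atomic_fact` helpers, and the synthetic `grip` data are all illustrative, and the paper's conversion from SHAP values to contribution probabilities is not reproduced here. In practice `shap_grip` would be one column of the SHAP matrix computed for your model.

```python
import numpy as np

def build_acpb_entry(feature_values, shap_values, n_bins=4):
    """Map each quantile interval of one feature to its mean SHAP value."""
    feature_values = np.asarray(feature_values, dtype=float)
    shap_values = np.asarray(shap_values, dtype=float)
    edges = np.quantile(feature_values, np.linspace(0.0, 1.0, n_bins + 1))
    # Use interior edges only, so bin ids fall in 0..n_bins-1.
    bin_ids = np.digitize(feature_values, edges[1:-1])
    mean_shap = []
    for b in range(n_bins):
        in_bin = bin_ids == b
        mean_shap.append(float(shap_values[in_bin].mean()) if in_bin.any() else 0.0)
    return {"edges": edges, "mean_shap": mean_shap}

def lookup(entry, value):
    """Return the stored mean SHAP for the interval containing `value`."""
    b = int(np.digitize(value, entry["edges"][1:-1]))
    return entry["mean_shap"][b]

def atomic_fact(name, entry, value):
    """Render one interval as a short text fact suitable for a prompt."""
    b = int(np.digitize(value, entry["edges"][1:-1]))
    lo, hi = entry["edges"][b], entry["edges"][b + 1]
    return f"{name} in [{lo:.1f}, {hi:.1f}]: mean SHAP {entry['mean_shap'][b]:+.3f}"

# Synthetic stand-in for a feature column and its SHAP values.
rng = np.random.default_rng(0)
grip = rng.uniform(10.0, 50.0, size=200)
shap_grip = -0.05 * (grip - 30.0) + rng.normal(0.0, 0.1, size=200)

entry = build_acpb_entry(grip, shap_grip)
print(atomic_fact("grip_strength", entry, 18.0))
```

Once built, the table answers queries for new samples without re-running SHAP, and the rendered facts can be fed to the LLM prompt in the second step.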

Agent Features

Memory
Diagnosis Knowledge Base (DKB) storing distilled texts and weight sets
Planning
Iterative weight adjustment using Composite Reward until probability alignment (<=5% diff)
Tool Use
SHAP (explainability), Qdrant (vector DB), BGE embeddings + reranker
Frameworks
RL, Retrieval-Augmented Generation (RAG)
Is Agentic

Yes

Architectures
LLM as Actor in Actor-Critic loop, Retrieval-augmented LLM pipeline (LLM + DKB)
Collaboration

Single-agent LLM interacts with stored attribution library and critic; no multi-agent coordination r
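The retrieval side of the DKB memory can be prototyped before any vector database is set up. The sketch below swaps in a toy bag-of-words embedding and in-memory cosine similarity in place of BGE embeddings and Qdrant, so it only illustrates the retrieve-precedents step. The vocabulary, case texts, and ids are invented placeholders, not content from the paper.

```python
import numpy as np

def embed(text, vocab):
    """Toy bag-of-words embedding, a stand-in for a real embedding model."""
    tokens = text.lower().split()
    vec = np.array([tokens.count(term) for term in vocab], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def retrieve(query, cases, vocab, top_k=2):
    """Return the top_k (score, case) pairs ranked by cosine similarity."""
    q = embed(query, vocab)
    scored = [(float(q @ embed(case["text"], vocab)), case) for case in cases]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hypothetical vocabulary and distilled case texts (placeholders only).
vocab = ["grip", "gait", "age", "bmi", "low", "slow", "normal", "high"]
cases = [
    {"id": "case_a", "text": "low grip slow gait older age"},
    {"id": "case_b", "text": "normal grip normal gait high bmi"},
    {"id": "case_c", "text": "high bmi normal grip"},
]

for score, case in retrieve("low grip slow gait", cases, vocab):
    print(case["id"], round(score, 3))
```

The retrieved case texts would then be placed into the prompt as precedents. Moving to a real vector store changes only `embed` and the storage behind `retrieve`.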

Optimization Features

Token Efficiency
ACPB encodes SHAP as compact atomic facts to reduce token usage in prompts
Model Optimization
Distillation of attribution logic into compact weight sets and diagnostic texts stored in DKB
System Optimization
Early exit from the RL loop once the inferred probability is within 5% of the XGBoost probability
Training Optimization
Actor-Critic loop to align LLM-inferred probabilities to teacher (XGBoost) outputs
Inference Optimization

ACPB avoids per-sample SHAP recomputation; retrieval of distilled cases reduces prompt engineering c

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluation limited to sarcopenia; generalizability to other diseases untested.

Framework tested with XGBoost + SHAP only; behavior with other models or attribution methods (LIME, IG) is unproven.

When Not To Use

When regulatory or legal requirements forbid any LLM-generated intermediate reasoning (audit logs alone are insufficient).

For tasks where the teacher model is poor or highly biased, since distillation can propagate those failures.

Failure Modes

LLM adjusts weights in medically inconsistent ways if retrieval or ACPB signals are noisy.

Prompt variations shift the decision threshold, causing unstable clinical trade-offs.

Core Entities

Models

XGBoost, Qwen3-plus (LLM), Deepseek-R1-8B (LLM), ACPB (Average Contribution Probability Base), DKB (Diagnosis Knowledge Base)

Metrics

Accuracy, AUC, Precision, Recall, F1-score, Intraclass Correlation Coefficient (ICC), Cohen's kappa, Mean absolute error (bias)

Datasets

CHARLS (China Health and Retirement Longitudinal Study), NHANES (referenced for public availability in Data Availability)