Turn SHAP attributions into a reusable knowledge base and train an LLM to reason with it for more accurate, auditable sarcopenia diagnosis.

July 26, 2025 · 9 min

Overview

Production Readiness

0.4

Novelty Score

0.65

Cost Impact Score

0.5

Citation Count

0

Authors

Yuqi Jin, Zhenhao Shuai, Zihan Hu, Weiteng Zhang, Weihao Xie, Jianwei Shuai, Xian Shen, Zhen Feng

Links

Abstract / PDF

Why It Matters For Business

CANDLE shows a practical path to combine stable, auditable ML explanations with LLM reasoning. That improves accuracy, produces human-readable rationales, and builds a reusable knowledge asset (ACPB + DKB) that can reduce repeated expensive explainability computations and speed downstream inference.

Summary TLDR

CANDLE converts XGBoost+SHAP attributions into a compact, retrievable 'Average Contribution Probability Base' (ACPB), uses an actor-critic loop to train an LLM to adjust feature weights, stores distilled case texts in a Diagnosis Knowledge Base (DKB), and at test time retrieves precedents to produce an interpretable sarcopenia prediction. On the evaluated dataset it raised accuracy by 6.0 percentage points over XGBoost (79.3% vs. 73.3%) while keeping outputs highly reproducible under fixed prompts.

Problem Statement

Clinical tasks with structured data need both accurate predictions and traceable, auditable reasoning. Standard ML gives traceable feature attributions but lacks semantic reasoning. LLMs provide reasoning but are opaque and unstable. The paper asks: can we translate ML attribution logic into a form LLMs can internalize and distill, to gain both improved accuracy and interpretable, reproducible clinical decisions?

Main Contribution

ACPB: a compact Average Contribution Probability Base that converts SHAP values into interval-based contribution probabilities that can be queried without re-running SHAP.

Actor-Critic cross-modal distillation: an RL loop in which an LLM (Actor) proposes adjusted feature weights and a Composite Reward (Critic) drives alignment with XGBoost probabilities until the LLM-inferred probability is within 5% of the XGBoost output.

DKB + FGMR retrieval: store distilled diagnostic texts and weight sets; retrieve similar cases via Feature-Grouped Multi-Round Retrieval (FGMR) and use them as prompt context for final predictions.

Empirical proof-of-concept for sarcopenia: CANDLE improves accuracy and reports consistency metrics (ICC, Cohen's kappa, AUC) showing reproducible outputs under fixed prompts.
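
The ACPB idea described above can be prototyped in a few lines. The following is a minimal sketch, not the paper's implementation: it assumes SHAP values are expressed in log-odds units, that each feature is split at hand-picked interval boundaries, and that a bucket's stored value is simply the mean SHAP contribution of the samples that fall into it. The function names (`build_acpb`, `lookup_probability`), the feature names, and all numbers are made up for illustration.

```python
from bisect import bisect_right
from collections import defaultdict
from math import exp


def build_acpb(feature_values, shap_values, edges):
    """Average SHAP contributions per (feature, interval) bucket.

    feature_values, shap_values: one dict per sample, keyed by feature name.
    edges: feature name -> sorted list of interval boundaries (an assumption;
    the paper's bucketing rule is not reproduced here).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for values, contribs in zip(feature_values, shap_values):
        for name, value in values.items():
            bucket = bisect_right(edges[name], value)
            sums[(name, bucket)] += contribs[name]
            counts[(name, bucket)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def lookup_probability(acpb, edges, base_value, sample):
    """Approximate a probability from bucket averages, without re-running SHAP."""
    logit = base_value
    for name, value in sample.items():
        bucket = bisect_right(edges[name], value)
        logit += acpb.get((name, bucket), 0.0)  # unseen bucket adds nothing
    return 1.0 / (1.0 + exp(-logit))


# Illustrative data only: three samples, two features, one boundary each.
feature_values = [
    {"grip_strength": 18.0, "age": 72},
    {"grip_strength": 31.0, "age": 61},
    {"grip_strength": 20.0, "age": 75},
]
shap_values = [
    {"grip_strength": 0.8, "age": 0.4},
    {"grip_strength": -0.6, "age": -0.2},
    {"grip_strength": 0.6, "age": 0.2},
]
edges = {"grip_strength": [25.0], "age": [65]}

acpb = build_acpb(feature_values, shap_values, edges)
prob = lookup_probability(acpb, edges, 0.0, {"grip_strength": 19.0, "age": 70})
print(round(prob, 3))
```

Once the table is built, scoring a new sample is a dictionary lookup per feature, which is the sense in which the attribution base can be queried without recomputing SHAP.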

Key Findings

CANDLE increased overall accuracy compared to the XGBoost baseline.

Numbers: Accuracy: CANDLE (LLM with DKB) 79.3% vs XGBoost 73.3% (+6.0 percentage points) (Table 1).

Predictions from the distilled LLM are highly reproducible under fixed prompts.

Numbers: Probabilities ICC = 0.956; Cohen's kappa = 0.842; sample-level label consistency = 95.5% (Table 2).

Discrimination (ranking ability) stayed stable across runs, while the operating point shifted with prompt framing.

Numbers: AUC = 0.760 ± 0.010 across repeated runs; optimal threshold 0.697 ± 0.037 (Table 3); prompt styles changed FPR/TPR (e.g.

ACPB recreates XGBoost probabilities with very low bias when averaged.

Numbers: Mean absolute difference between XGBoost outputs and CACS-inferred probabilities = 0.017 (Table 5).

Results

Accuracy

Value: 79.3%

Baseline: XGBoost 73.3%

AUC (mean ± sd)

Value: 0.760 ± 0.010

Baseline: not explicitly compared for AUC

Probability output reproducibility

Value: ICC = 0.956

Binary classification reproducibility

Value: Cohen's kappa = 0.842; sample-level consistency = 95.5%

ACPB vs XGBoost probability bias

Value: Mean absolute difference = 0.017
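
For readers who want to check reproducibility on their own repeated runs, the two binary-label metrics reported above can be computed as follows. This is a sketch for a single pair of runs; how the paper aggregated agreement across its repeated runs is not stated here, and the labels below are invented for illustration.

```python
def label_consistency(run_a, run_b):
    """Fraction of samples that received the same label in both runs."""
    return sum(a == b for a, b in zip(run_a, run_b)) / len(run_a)


def cohens_kappa(run_a, run_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(run_a)
    observed = label_consistency(run_a, run_b)
    labels = set(run_a) | set(run_b)
    expected = sum((run_a.count(l) / n) * (run_b.count(l) / n) for l in labels)
    if expected == 1.0:  # both runs constant and identical
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented labels from two runs of the same prompt on ten samples.
run_a = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]
run_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]

consistency = label_consistency(run_a, run_b)
kappa = cohens_kappa(run_a, run_b)
print(consistency, round(kappa, 3))
```

Kappa is lower than raw consistency because it discounts the agreement two runs would reach by chance given their label frequencies.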

Who Should Care

What To Try In 7 Days

Extract SHAP attributions from your existing XGBoost/ensemble model, aggregate them into simple interval buckets for frequent features (prototype ACPB).

Run a small LLM prompt that consumes those atomic facts and asks for adjusted feature weights; compare the LLM-inferred probability vs model output and log discrepancies.

Set up a tiny retrieval store (e.g., Qdrant) with a few distilled case texts and test retrieval-augmented prompts to observe changes in decision wording and threshold behavior.
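
Before wiring up a vector database, the retrieval step can be dry-run entirely in memory. The sketch below is a stand-in, not FGMR and not the Qdrant API: it uses a toy bag-of-words vector in place of a real embedding model, ranks a handful of invented case texts by cosine similarity, and pastes the top matches into a prompt. All case texts, identifiers, and function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def embed(text):
    """Toy bag-of-words vector; a placeholder for a real embedding model."""
    return Counter(text.lower().split())


def cosine(u, v):
    dot = sum(u[token] * v[token] for token in u if token in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def retrieve(query, cases, k=2):
    """Return the k stored cases most similar to the query text."""
    q = embed(query)
    ranked = sorted(cases, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]


def build_prompt(query, precedents):
    """Assemble retrieved precedents and the new case into one prompt string."""
    lines = ["Similar distilled cases:"]
    lines += [f"- [{c['id']}] {c['text']}" for c in precedents]
    lines.append(f"New patient: {query}")
    return "\n".join(lines)


# Invented distilled case texts standing in for DKB entries.
cases = [
    {"id": "case-1", "text": "low grip strength and slow gait speed in an older adult"},
    {"id": "case-2", "text": "normal grip strength and normal gait speed"},
    {"id": "case-3", "text": "high body mass index with normal walking pace"},
]
query = "older adult with low grip strength"

top = retrieve(query, cases, k=2)
prompt = build_prompt(query, top)
print(prompt)
```

Swapping `embed` for a real embedding call and `retrieve` for a vector-store query keeps the rest of the flow unchanged, which makes it easy to compare decision wording with and without retrieved precedents.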

Agent Features

Memory

  • Diagnosis Knowledge Base (DKB) storing distilled texts and weight sets

Planning

  • Iterative weight adjustment using Composite Reward until probability alignment (<=5% diff)

Tool Use

  • SHAP (explainability)
  • Qdrant (vector DB)
  • BGE embeddings + reranker

Frameworks

  • RL
  • Retrieval-Augmented Generation (RAG)

Is Agentic

true

Architectures

  • LLM as Actor in Actor-Critic loop
  • Retrieval-augmented LLM pipeline (LLM + DKB)

Collaboration

  • Single-agent LLM interacts with stored attribution library and critic; no multi-agent coordination r

Optimization Features

Token Efficiency

  • ACPB encodes SHAP as compact atomic facts to reduce token usage in prompts

Model Optimization

  • Distillation of attribution logic into compact weight sets and diagnostic texts stored in DKB

System Optimization

  • Early exit from the RL loop once the inferred probability is within 5% of the XGBoost output

Training Optimization

  • Actor-Critic loop to align LLM-inferred probabilities to teacher (XGBoost) outputs
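
The alignment loop with early exit can be illustrated with a stub in place of the LLM. This sketch rests on several assumptions that are not from the paper: the "actor" is a simple numeric heuristic rather than a language model, the reward is just one minus the absolute probability gap rather than the Composite Reward, and "within 5%" is read as an absolute difference of 0.05. Feature names and values are invented.

```python
from math import exp


def infer_probability(weights, contributions, base_value=0.0):
    """Weighted sum of per-feature contributions, squashed to a probability."""
    logit = base_value + sum(weights[n] * contributions[n] for n in contributions)
    return 1.0 / (1.0 + exp(-logit))


def stub_actor(weights, contributions, gap, step=2.0):
    """Placeholder for the LLM actor: nudge each weight in the direction
    that shrinks the gap to the teacher probability."""
    return {n: w + step * gap * contributions[n] for n, w in weights.items()}


def align(contributions, teacher_prob, tolerance=0.05, max_rounds=20):
    """Adjust weights until the inferred probability is close to the teacher's."""
    weights = {n: 1.0 for n in contributions}
    history = []
    for round_idx in range(max_rounds):
        prob = infer_probability(weights, contributions)
        gap = teacher_prob - prob
        reward = 1.0 - abs(gap)  # stand-in reward, not the Composite Reward
        history.append((round_idx, prob, reward))
        if abs(gap) <= tolerance:  # early exit once aligned
            return weights, prob, history
        weights = stub_actor(weights, contributions, gap)
    return weights, infer_probability(weights, contributions), history


# Invented per-feature contributions and teacher (XGBoost) probability.
contributions = {"grip_strength": 0.7, "age": 0.3}
final_weights, final_prob, history = align(contributions, teacher_prob=0.85)
print(len(history), round(final_prob, 3))
```

Logging `history` per sample gives a cheap record of how many rounds each case needed and how the reward evolved, which is useful when the stub is replaced by real LLM calls.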

Inference Optimization

  • ACPB avoids per-sample SHAP recomputation; retrieval of distilled cases reduces prompt engineering c

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation limited to sarcopenia; generalizability to other diseases untested.
  • Framework tested with XGBoost + SHAP only; behavior with other models or attribution methods (LIME, IG) is unproven.
  • Relatively small experimental scale and a limited repeated-run count (n=10) restrict statistical certainty.
  • Prompt framing affects operating characteristics (FPR/TPR) even if AUC is stable; requires careful prompt design.
  • Potential to inherit or amplify biases from the teacher model; misclassified teacher cases were not analyzed.

When Not To Use

  • When regulatory or legal requirements forbid any LLM-generated intermediate reasoning (audit logs alone are insufficient).
  • For tasks where the teacher model is poor or highly biased, since distillation can propagate those failures.
  • When you cannot afford careful prompt tuning or retrieval quality control, because the operating point can shift with prompts.

Failure Modes

  • LLM adjusts weights in medically inconsistent ways if retrieval or ACPB signals are noisy.
  • Prompt variations shift the decision threshold, causing unstable clinical trade-offs.
  • DKB retrieval misses relevant precedents or returns spurious cases, degrading output correctness.
  • Framework inherits teacher-model biases (e.g., gender encoding errors) and may need external clinical constraints to correct them.

Core Entities

Models

  • XGBoost
  • Qwen3-plus (LLM)
  • Deepseek-R1-8B (LLM)
  • ACPB (Average Contribution Probability Base)
  • DKB (Diagnosis Knowledge Base)

Metrics

  • Accuracy
  • AUC
  • Precision
  • Recall
  • F1-score
  • Intraclass Correlation Coefficient (ICC)
  • Cohen's kappa
  • Mean absolute error (bias)

Datasets

  • CHARLS (China Health and Retirement Longitudinal Study)
  • NHANES (referenced for public availability in Data Availability)