Turn SHAP attributions into a reusable knowledge base and train an LLM to reason with it for more accurate, auditable sarcopenia diagnosis.

Overview

Decision Snapshot: Needs Validation

Proof-of-concept with clear engineering contributions. Results rely on a single dataset and limited repeated runs (n=10). Good signs for reproducibility and interpretability, but further multi-center validation and code release are needed before clinical production.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.72

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 65%

Authors

Yuqi Jin, Zhenhao Shuai, Zihan Hu, Weiteng Zhang, Weihao Xie, Jianwei Shuai, Xian Shen, Zhen Feng

Links

Abstract / PDF / Data

Why It Matters For Business

CANDLE shows a practical path to combine stable, auditable ML explanations with LLM reasoning. That improves accuracy, produces human-readable rationales, and builds a reusable knowledge asset (ACPB + DKB) that can reduce repeated expensive explainability computations and speed downstream inference.

Who Should Care

Product Manager, ML Engineer, Data Scientist

Summary TLDR

CANDLE converts XGBoost+SHAP attributions into a compact, retrievable 'Average Contribution Probability Base' (ACPB), uses an actor-critic loop to train an LLM to adjust feature weights, stores distilled case texts in a Diagnosis Knowledge Base (DKB), and at test time retrieves precedents to produce an interpretable sarcopenia prediction. On the evaluated dataset it increased accuracy by ~6 percentage points vs. XGBoost while keeping high reproducibility.

Problem Statement

Clinical tasks with structured data need both accurate predictions and traceable, auditable reasoning. Standard ML gives traceable feature attributions but lacks semantic reasoning. LLMs provide reasoning but are opaque and unstable. The paper asks: can we translate ML attribution logic into a form LLMs can internalize and distill, to gain both improved accuracy and interpretable, reproducible clinical decisions?

Main Contribution

ACPB: a compact Average Contribution Probability Base that converts SHAP values into interval-based contribution probabilities that can be queried without re-running SHAP.

Actor-Critic cross-modal distillation: an RL loop in which an LLM (Actor) proposes adjusted feature weights and a Composite Reward (Critic) drives alignment until the LLM-inferred probability is within 5% of the XGBoost probability.
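As an illustration of the ACPB idea, the sketch below buckets one feature's SHAP values into intervals and stores per-interval summaries that can be looked up later without re-running SHAP. The summary does not give the paper's exact formula, so the statistics stored here (mean SHAP value and the share of samples with a positive contribution), the function names `build_acpb` and `lookup`, the cut points, and the toy grip-strength data are all assumptions made for this example.

```python
from bisect import bisect_right
from collections import defaultdict


def build_acpb(values, shap_values, edges):
    """Aggregate one feature's per-sample SHAP values into interval buckets.

    values: raw feature values, one per sample
    shap_values: SHAP attribution of this feature for the same samples
    edges: sorted interior cut points; they define len(edges) + 1 buckets
    """
    sums = defaultdict(float)
    positives = defaultdict(int)
    counts = defaultdict(int)
    for x, s in zip(values, shap_values):
        bucket = bisect_right(edges, x)  # index of the interval containing x
        sums[bucket] += s
        positives[bucket] += s > 0
        counts[bucket] += 1
    return {
        bucket: {
            "mean_shap": sums[bucket] / counts[bucket],
            # Assumed definition: share of samples pushing toward the positive class
            "p_positive": positives[bucket] / counts[bucket],
            "n": counts[bucket],
        }
        for bucket in counts
    }


def lookup(acpb, edges, x):
    """Return the stored summary for the interval containing x (None if unseen)."""
    return acpb.get(bisect_right(edges, x))


# Toy data, made up for illustration: grip strength (kg) and its SHAP values
grip = [18.0, 21.5, 24.0, 29.0, 33.0, 36.5]
grip_shap = [0.40, 0.25, 0.10, -0.05, -0.20, -0.30]
edges = [22.0, 30.0]

acpb_grip = build_acpb(grip, grip_shap, edges)
print(lookup(acpb_grip, edges, 25.0))
```

In a real prototype the SHAP values would come from an explainer run once over the training set; after that, each lookup is a dictionary access.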

Key Findings

CANDLE increased overall accuracy compared to the XGBoost baseline.

Numbers: accuracy 79.3% for CANDLE (LLM with DKB) vs 73.3% for XGBoost (+6.0 percentage points) (Table 1).

Practical Use: If you already use an XGBoost teacher model and SHAP attributions, applying ACPB+LLM distillation can yield modest accuracy gains (~6pp) while keeping attribution logic available for audits.

Evidence Ref: Table 1

Predictions from the distilled LLM are highly reproducible under fixed prompts.

Numbers: probabilities ICC = 0.956; Cohen's kappa = 0.842; sample-level label consistency = 95.5% (Table 2).

Practical Use: For deployments that require repeatability, this hybrid keeps outputs very stable under identical inputs and prompts, which helps with audits and clinical traceability.

Evidence Ref: Table 2
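To make the agreement metrics concrete, here is a minimal sketch of how label consistency and Cohen's kappa could be computed between two repeated runs. The summary does not say how the paper aggregates agreement across its 10 runs, so the pairwise comparison and the toy labels below are assumptions for illustration only.

```python
def label_consistency(run_a, run_b):
    """Fraction of samples that receive the same label in both runs."""
    return sum(a == b for a, b in zip(run_a, run_b)) / len(run_a)


def cohens_kappa(run_a, run_b):
    """Chance-corrected agreement between two label sequences."""
    n = len(run_a)
    p_observed = label_consistency(run_a, run_b)
    labels = set(run_a) | set(run_b)
    # Expected agreement if both runs assigned labels independently
    p_expected = sum(
        (run_a.count(label) / n) * (run_b.count(label) / n) for label in labels
    )
    if p_expected == 1.0:
        return 1.0  # both runs are constant and identical
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy labels, made up: two runs of the same prompts over ten patients
run_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
run_2 = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]

print(label_consistency(run_1, run_2), cohens_kappa(run_1, run_2))
```

Kappa is lower than raw consistency because it discounts the agreement two runs would reach by chance given their label frequencies.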

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	79.3%	XGBoost 73.3%	+6.0 percentage points	evaluated dataset (sarcopenia prediction)	Table 1 shows CANDLE (LLM with DKB) accuracy 79.3% vs XGBoost 73.3%.	Table 1
AUC (mean ± sd)	0.760 ± 0.010	not explicitly compared for AUC	—	repeated runs under fixed prompts (N=10)	Table 3 reports AUC 0.760 ± 0.010 across repeated runs.	Table 3

What To Try In 7 Days

Extract SHAP attributions from your existing XGBoost/ensemble model, then aggregate them into simple interval buckets for frequent features (prototype ACPB).

Run a small LLM prompt that consumes those atomic facts and asks for adjusted feature weights; compare the LLM-inferred probability vs model output and log discrepancies.

Set up a tiny retrieval store (e.g., Qdrant) with a few distilled case texts and test retrieval-augmented prompts to observe changes in decision wording and threshold behavior.
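For the second step above, logging discrepancies can be as simple as the sketch below. It assumes "discrepancy" means the absolute difference between the teacher model's probability and the LLM-inferred probability, with a 0.05 tolerance borrowed from the 5% alignment criterion; the function name, sample IDs, and probabilities are hypothetical.

```python
def flag_discrepancies(teacher_probs, llm_probs, tolerance=0.05):
    """List samples whose LLM-inferred probability strays from the teacher's.

    Both arguments map a sample ID to a probability in [0, 1].
    Returns (sample_id, gap) pairs, largest gap first.
    """
    flagged = []
    for sample_id, p_teacher in teacher_probs.items():
        gap = abs(llm_probs[sample_id] - p_teacher)
        if gap > tolerance:
            flagged.append((sample_id, round(gap, 3)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Toy values, made up for illustration
teacher = {"p001": 0.72, "p002": 0.15, "p003": 0.48}
llm = {"p001": 0.70, "p002": 0.31, "p003": 0.39}

report = flag_discrepancies(teacher, llm)
for sample_id, gap in report:
    print(f"{sample_id}: gap {gap}")
```

Reviewing the largest gaps first is a cheap way to see whether the prompt or the bucketed facts are what the LLM is misreading.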

Agent Features

Memory

Diagnosis Knowledge Base (DKB) storing distilled texts and weight sets
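A minimal in-memory stand-in for the DKB is sketched below: each entry holds a distilled text, a weight set, and an embedding, and retrieval ranks entries by cosine similarity. The real pipeline is described as using Qdrant with BGE embeddings and a reranker; none of those APIs are used here, and the `Case` class, the three-dimensional toy embeddings, and the case texts are made up for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Case:
    text: str        # distilled diagnostic text
    weights: dict    # adjusted feature weights stored with the case
    embedding: list  # toy vector standing in for a real text embedding


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(dkb, query_embedding, k=2):
    """Return the k stored cases most similar to the query embedding."""
    ranked = sorted(
        dkb, key=lambda case: cosine(case.embedding, query_embedding), reverse=True
    )
    return ranked[:k]


# Toy knowledge base, made up for illustration
dkb = [
    Case("Low grip strength, older age: sarcopenia likely.",
         {"grip": 1.4, "age": 1.2}, [0.9, 0.1, 0.0]),
    Case("Normal grip strength, high BMI: sarcopenia unlikely.",
         {"grip": 0.8, "bmi": 1.1}, [0.1, 0.9, 0.1]),
    Case("Borderline grip strength, slow gait.",
         {"grip": 1.1, "gait": 1.3}, [0.6, 0.2, 0.5]),
]

top = retrieve(dkb, [0.8, 0.2, 0.1], k=2)
for case in top:
    print(case.text, case.weights)
```

The retrieved texts and weight sets would then be placed in the prompt as precedents for the new case.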

Planning

Iterative weight adjustment using Composite Reward until probability alignment (<=5% diff)
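The loop structure can be sketched as follows. Everything specific here is an assumption: the summary does not define the Composite Reward or how weights map to a probability, so this example uses a logistic over weighted contributions, a reward equal to the negative absolute gap, and a deterministic `propose_weights` function as a hypothetical stand-in for the LLM actor. The 5% criterion is read as an absolute difference of 0.05.

```python
import math


def infer_probability(weights, contributions):
    """Assumed scoring rule: logistic of the weighted sum of contributions."""
    z = sum(weights[f] * contributions[f] for f in contributions)
    return 1.0 / (1.0 + math.exp(-z))


def propose_weights(weights, contributions, gap, step=2.0):
    """Hypothetical stand-in for the LLM actor.

    Nudges each weight in the direction that shrinks the gap to the teacher.
    """
    return {f: w + step * gap * contributions[f] for f, w in weights.items()}


def align(contributions, teacher_prob, tolerance=0.05, max_rounds=50):
    """Adjust weights until the inferred probability is close to the teacher's."""
    weights = {f: 1.0 for f in contributions}
    rewards = []
    for round_idx in range(max_rounds):
        p = infer_probability(weights, contributions)
        gap = teacher_prob - p
        rewards.append(-abs(gap))  # assumed reward: smaller gap is better
        if abs(gap) <= tolerance:  # early exit once aligned
            return weights, p, round_idx, rewards
        weights = propose_weights(weights, contributions, gap)
    return weights, p, max_rounds, rewards


# Toy per-feature contributions (e.g. looked up from interval buckets), made up
contributions = {"grip": 0.325, "age": 0.20, "bmi": -0.10}
weights, p, rounds, rewards = align(contributions, teacher_prob=0.75)
print(f"aligned after {rounds} rounds: p={p:.3f}")
```

The early exit is what keeps the loop cheap: once the gap falls inside the tolerance, no further actor calls are made for that sample.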

Tool Use

SHAP (explainability), Qdrant (vector DB), BGE embeddings + reranker

Frameworks

RL, Retrieval-Augmented Generation (RAG)

Is Agentic

Yes

Architectures

LLM as Actor in Actor-Critic loop, Retrieval-augmented LLM pipeline (LLM + DKB)

Collaboration

Single-agent LLM interacts with stored attribution library and critic; no multi-agent coordination r

Optimization Features

Token Efficiency

ACPB encodes SHAP as compact atomic facts to reduce token usage in prompts

Model Optimization

Distillation of attribution logic into compact weight sets and diagnostic texts stored in DKB

System Optimization

Early exit from the RL loop when the inferred probability is within 5% of the XGBoost probability

Training Optimization

Actor-Critic loop to align LLM-inferred probabilities to teacher (XGBoost) outputs

Inference Optimization

ACPB avoids per-sample SHAP recomputation; retrieval of distilled cases reduces prompt engineering c

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://www.cdc.gov/nchs/nhanes/index.htm

Risks & Boundaries

Limitations

Evaluation limited to sarcopenia; generalizability to other diseases untested.

Framework tested with XGBoost + SHAP only; behavior with other models or attribution methods (LIME, IG) is unproven.

When Not To Use

When regulatory or legal requirements forbid any LLM-generated intermediate reasoning (audit logs alone are insufficient).

For tasks where the teacher model is poor or highly biased, since distillation can propagate those failures.

Failure Modes

The LLM adjusts weights in medically inconsistent ways if retrieval or ACPB signals are noisy.

Prompt variations shift the decision threshold, causing unstable clinical trade-offs.

Core Entities

Models

XGBoost, Qwen3-plus (LLM), Deepseek-R1-8B (LLM), ACPB (Average Contribution Probability Base), DKB (Diagnosis Knowledge Base)

Metrics

Accuracy, AUC, Precision, Recall, F1-score, Intraclass Correlation Coefficient (ICC), Cohen's kappa, Mean absolute error (bias)

Datasets

CHARLS (China Health and Retirement Longitudinal Study), NHANES (referenced for public availability in Data Availability)

