Overview
A proof of concept with clear engineering contributions. Results rely on a single dataset and a small number of repeated runs (n=10). Reproducibility and interpretability look promising, but multi-center validation and a code release are needed before clinical production use.
Citations: 0
Evidence Strength: 0.60
Confidence: 0.72
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 1/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 40%
Novelty: 65%
Why It Matters For Business
CANDLE shows a practical path to combine stable, auditable ML explanations with LLM reasoning. That improves accuracy, produces human-readable rationales, and builds a reusable knowledge asset (ACPB + DKB) that can reduce repeated expensive explainability computations and speed downstream inference.
Who Should Care
Summary TLDR
CANDLE converts XGBoost+SHAP attributions into a compact, retrievable 'Average Contribution Probability Base' (ACPB), uses an actor-critic loop to train an LLM to adjust feature weights, stores distilled case texts in a Diagnosis Knowledge Base (DKB), and at test time retrieves precedents to produce an interpretable sarcopenia prediction. On the evaluated dataset it increased accuracy by ~6 percentage points vs. XGBoost while keeping high reproducibility.
Problem Statement
Clinical tasks with structured data need both accurate predictions and traceable, auditable reasoning. Standard ML gives traceable feature attributions but lacks semantic reasoning. LLMs provide reasoning but are opaque and unstable. The paper asks: can we translate ML attribution logic into a form LLMs can internalize and distill, to gain both improved accuracy and interpretable, reproducible clinical decisions?
Main Contribution
ACPB: a compact Average Contribution Probability Base that converts SHAP values into interval-based contribution probabilities that can be queried without re-running SHAP.
Actor-Critic cross-modal distillation: an RL loop in which an LLM (Actor) proposes adjusted feature weights and a Composite Reward (Critic) drives the LLM's output toward the XGBoost probabilities until the two agree to within 5%.
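The ACPB idea above can be sketched as a small lookup table. This is a minimal illustration and not the paper's implementation: the summary does not define how a "contribution probability" is computed, so the sketch assumes it is the fraction of samples in a feature interval whose SHAP value is positive, stored alongside the mean SHAP value. The function names `build_acpb` and `query_acpb`, the cut points, and the demo numbers are all hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def build_acpb(feature_values, shap_values, edges):
    """Aggregate per-sample SHAP values for ONE feature into interval buckets.

    edges: sorted interior cut points; k edges define k + 1 intervals.
    Assumption: "contribution probability" = share of samples in the
    interval with a positive SHAP value (the source does not define it).
    """
    buckets = defaultdict(list)
    for x, s in zip(feature_values, shap_values):
        buckets[bisect_right(edges, x)].append(s)

    table = {}
    for idx, vals in buckets.items():
        table[idx] = {
            "n": len(vals),
            "mean_shap": sum(vals) / len(vals),
            "p_positive": sum(v > 0 for v in vals) / len(vals),
        }
    return table


def query_acpb(table, edges, x):
    """Look up the stored statistics for value x without re-running SHAP.

    Returns None when no training sample fell into that interval.
    """
    return table.get(bisect_right(edges, x))


# Hypothetical demo data (not from the paper).
demo_edges = [20.0, 30.0]
demo_table = build_acpb(
    [18.0, 22.0, 25.0, 31.0, 35.0, 40.0],
    [0.40, 0.30, -0.10, -0.20, -0.30, -0.50],
    demo_edges,
)
demo_hit = query_acpb(demo_table, demo_edges, 33.0)
```

In practice the SHAP values would come from an explainer run once over the training set; after that, each lookup costs only a binary search over the interval edges.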
Key Findings
CANDLE increased overall accuracy from 73.3% (XGBoost baseline) to 79.3% on the evaluated dataset (Table 1).
Predictions from the distilled LLM are highly reproducible under fixed prompts: AUC 0.760 ± 0.010 across 10 repeated runs (Table 3).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 79.3% | XGBoost 73.3% | +6.0 percentage points | evaluated dataset (sarcopenia prediction) | Table 1 shows CANDLE (LLM with DKB) accuracy 79.3% vs XGBoost 73.3%. | Table 1 |
| AUC (mean ± sd) | 0.760 ± 0.010 | not explicitly compared for AUC | — | repeated runs under fixed prompts (N=10) | Table 3 reports AUC 0.760 ± 0.010 across repeated runs. | Table 3 |
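The accuracy delta in the table is an absolute difference in percentage points. For comparison, the same two Table 1 figures can also be expressed as a relative gain; the relative figure below is derived here from those numbers and is not reported in the source.

```latex
\Delta_{\text{abs}} = 79.3\% - 73.3\% = 6.0 \text{ percentage points},
\qquad
\Delta_{\text{rel}} = \frac{79.3 - 73.3}{73.3} \approx 0.082 = 8.2\%
```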
What To Try In 7 Days
Extract SHAP attributions from your existing XGBoost/ensemble model and aggregate them into simple interval buckets for frequent features (a prototype ACPB).
Run a small LLM prompt that consumes those atomic facts and asks for adjusted feature weights; compare the LLM-inferred probability against the model output and log the discrepancies.
Set up a tiny retrieval store (e.g., Qdrant) with a few distilled case texts and test retrieval-augmented prompts to observe changes in decision wording and threshold behavior.
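The second step above (compare the LLM-derived probability with the model output and log the gap) could be prototyped roughly as follows. This is a sketch under several assumptions that are not confirmed by the source: the probability is taken as a sigmoid over weighted per-feature contributions, "within 5%" is read as an absolute gap of 0.05, the reward is simply the negative absolute gap rather than the paper's Composite Reward, and `stub_actor` is a hypothetical deterministic stand-in for the real LLM call. The feature names and numbers are illustrative only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predicted_probability(weights, contributions):
    """Probability implied by weighted contributions (assumed log-odds scale)."""
    return sigmoid(sum(weights[f] * contributions[f] for f in contributions))


def stub_actor(weights, contributions, gap, step=1.0):
    """Hypothetical stand-in for the LLM: nudge each weight toward the teacher.

    A real actor would be an LLM prompt returning adjusted weights.
    """
    return {f: w + step * gap * contributions[f] for f, w in weights.items()}


def align(teacher_p, contributions, actor, tol=0.05, max_rounds=20):
    """Ask the actor for new weights until the gap to the teacher is <= tol."""
    weights = {f: 1.0 for f in contributions}
    log = []
    for round_idx in range(max_rounds):
        p = predicted_probability(weights, contributions)
        gap = teacher_p - p
        # Simplified critic: reward is the negative absolute discrepancy.
        log.append({"round": round_idx, "p": p, "gap": gap, "reward": -abs(gap)})
        if abs(gap) <= tol:
            return weights, log, True
        weights = actor(weights, contributions, gap)
    return weights, log, False


# Hypothetical per-feature contributions and teacher (XGBoost) probability.
demo_contributions = {"grip_strength": 0.8, "gait_speed": 0.5, "age": -0.3}
demo_weights, demo_log, demo_converged = align(0.62, demo_contributions, stub_actor)
```

Keeping the per-round log makes it easy to see which cases never converge, which is a useful signal before investing in a retrieval store.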
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Single-agent LLM interacts with stored attribution library and critic; no multi-agent coordination r
Optimization Features
Token Efficiency
Model Optimization
System Optimization
Training Optimization
Inference Optimization
ACPB avoids per-sample SHAP recomputation; retrieval of distilled cases reduces prompt engineering c
Reproducibility
Risks & Boundaries
Limitations
Evaluation limited to sarcopenia; generalizability to other diseases untested.
Framework tested with XGBoost + SHAP only; behavior with other models or attribution methods (LIME, IG) is unproven.
When Not To Use
When regulatory or legal requirements forbid any LLM-generated intermediate reasoning (audit logs alone are insufficient).
For tasks where the teacher model is poor or highly biased; distillation can propagate those failures.
Failure Modes
The LLM may adjust weights in medically inconsistent ways if retrieval or ACPB signals are noisy.
Prompt variations can shift the decision threshold, causing unstable clinical trade-offs.

