Overview
Production Readiness
0.4
Novelty Score
0.65
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
CANDLE shows a practical path to combine stable, auditable ML explanations with LLM reasoning. That improves accuracy, produces human-readable rationales, and builds a reusable knowledge asset (ACPB + DKB) that can reduce repeated expensive explainability computations and speed downstream inference.
Summary TLDR
CANDLE converts XGBoost+SHAP attributions into a compact, retrievable 'Average Contribution Probability Base' (ACPB), uses an actor-critic loop to train an LLM to adjust feature weights, stores distilled case texts in a Diagnosis Knowledge Base (DKB), and at test time retrieves precedents to produce an interpretable sarcopenia prediction. On the evaluated dataset it increased accuracy by ~6 percentage points vs. XGBoost while keeping high reproducibility.
Problem Statement
Clinical tasks with structured data need both accurate predictions and traceable, auditable reasoning. Standard ML gives traceable feature attributions but lacks semantic reasoning. LLMs provide reasoning but are opaque and unstable. The paper asks: can we translate ML attribution logic into a form LLMs can internalize and distill, to gain both improved accuracy and interpretable, reproducible clinical decisions?
Main Contribution
ACPB: a compact Average Contribution Probability Base that converts SHAP values into interval-based contribution probabilities that can be queried without re-running SHAP.
Actor-Critic cross-modal distillation: an RL loop where an LLM (Actor) proposes adjusted feature weights and a Composite Reward (Critic) drives alignment until the LLM-inferred probability is within 5% of the XGBoost probability.
DKB + FGMR retrieval: store distilled diagnostic texts and weight sets; retrieve similar cases via Feature-Grouped Multi-Round Retrieval (FGMR) and use them as prompt context for final predictions.
Empirical proof-of-concept for sarcopenia: CANDLE improves accuracy and reports consistency metrics (ICC, Cohen's kappa, AUC) showing reproducible outputs under fixed prompts.
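A minimal sketch of the ACPB idea, under stated assumptions. The summary does not spell out how SHAP values become interval-based contribution probabilities, so this sketch assumes one simple reading: average the SHAP values that fall into each feature interval, then, at lookup time, add the averaged contributions to a base logit and apply a sigmoid. The function names (`build_acpb`, `acpb_probability`), the feature names, the interval edges, and all numbers are hypothetical; in practice the SHAP values would come from a SHAP explainer run once over the training set.

```python
import math
from bisect import bisect_right
from collections import defaultdict

def build_acpb(rows, shap_rows, edges):
    """Average SHAP contributions per (feature, interval) bucket.

    rows      : list of {feature: value}
    shap_rows : list of {feature: SHAP value}, aligned with rows
    edges     : {feature: sorted interval boundaries} (assumed, hand-picked)
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row, shap_row in zip(rows, shap_rows):
        for feat, value in row.items():
            bucket = bisect_right(edges[feat], value)  # index of the interval
            sums[(feat, bucket)] += shap_row[feat]
            counts[(feat, bucket)] += 1
    return {key: sums[key] / counts[key] for key in sums}

def acpb_probability(acpb, row, edges, base_logit):
    """Approximate the model probability from the lookup table alone."""
    logit = base_logit
    for feat, value in row.items():
        bucket = bisect_right(edges[feat], value)
        logit += acpb.get((feat, bucket), 0.0)  # unseen bucket contributes 0
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical toy data: grip strength (kg) and age (years).
rows = [{"grip": 18.0, "age": 72}, {"grip": 22.0, "age": 75}, {"grip": 35.0, "age": 61}]
shap_rows = [{"grip": 0.8, "age": 0.3}, {"grip": 0.6, "age": 0.5}, {"grip": -0.9, "age": -0.4}]
edges = {"grip": [26.0], "age": [65]}

acpb = build_acpb(rows, shap_rows, edges)
prob = acpb_probability(acpb, {"grip": 20.0, "age": 70}, edges, base_logit=-1.0)
```

Once the table is built, a new case needs only bucket lookups, which is the sense in which per-sample SHAP recomputation can be avoided.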
Key Findings
CANDLE increased overall accuracy compared to the XGBoost baseline.
Predictions from the distilled LLM are highly reproducible under fixed prompts.
Discrimination (ranking ability) stayed stable across runs, while the operating point shifted with prompt framing.
ACPB recreates XGBoost probabilities with very low bias when averaged.
Results
Accuracy
AUC (mean ± sd)
Probability output reproducibility
Binary classification reproducibility
ACPB vs XGBoost probability bias
Who Should Care
What To Try In 7 Days
Extract SHAP attributions from your existing XGBoost/ensemble model, then aggregate them into simple interval buckets for frequent features (prototype ACPB).
Run a small LLM prompt that consumes those atomic facts and asks for adjusted feature weights; compare the LLM-inferred probability with the model output and log discrepancies.
Set up a tiny retrieval store (e.g., Qdrant) with a few distilled case texts and test retrieval-augmented prompts to observe changes in decision wording and threshold behavior.
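The discrepancy-logging step in the second experiment can be sketched as follows. This is only an illustration: `flag_discrepancies`, the record layout, and the sample probabilities are hypothetical, and the 0.05 tolerance assumes the "within 5%" criterion means an absolute difference in probability, which the summary does not confirm.

```python
def flag_discrepancies(records, tol=0.05):
    """Return cases where the LLM-inferred probability strays from the model.

    records : iterable of (case_id, model_prob, llm_prob)
    tol     : absolute probability gap treated as acceptable (assumed)
    """
    flagged = []
    for case_id, model_prob, llm_prob in records:
        diff = llm_prob - model_prob
        if abs(diff) > tol:
            flagged.append({
                "case_id": case_id,
                "model_prob": model_prob,
                "llm_prob": llm_prob,
                "diff": round(diff, 4),  # sign shows the direction of the drift
            })
    return flagged

# Hypothetical logged outputs for three cases.
records = [("c1", 0.62, 0.60), ("c2", 0.30, 0.48), ("c3", 0.81, 0.74)]
flagged = flag_discrepancies(records)
for entry in flagged:
    print(entry)
```

Keeping the signed difference makes it easy to see whether the LLM systematically over- or under-estimates risk relative to the model.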
Agent Features
Memory
- Diagnosis Knowledge Base (DKB) storing distilled texts and weight sets
Planning
- Iterative weight adjustment using Composite Reward until probability alignment (<=5% diff)
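A rough sketch of the alignment loop with its early-exit condition. Everything here is a stand-in: the real Actor is an LLM and the real Critic is a Composite Reward whose form is not given in this summary. The sketch replaces the Actor with a deterministic `stub_actor` and uses the signed probability gap as the only feedback signal. The function names, the step size, and the sample contributions are hypothetical, and the 5% tolerance is assumed to be an absolute probability difference.

```python
import math

def weighted_probability(contribs, weights, base_logit):
    """Sigmoid of the base logit plus weighted per-feature contributions."""
    logit = base_logit + sum(weights[f] * c for f, c in contribs.items())
    return 1.0 / (1.0 + math.exp(-logit))

def stub_actor(weights, contribs, gap, step=2.0):
    """Stand-in for the LLM Actor: nudge each weight so the gap shrinks.

    If the probability is too high (gap > 0), risk-raising features are
    down-weighted and risk-lowering features are up-weighted.
    """
    return {
        f: w - step * gap * (1 if contribs[f] > 0 else -1)
        for f, w in weights.items()
    }

def align(contribs, teacher_prob, propose, base_logit=0.0, tol=0.05, max_rounds=20):
    """Iterate until the inferred probability is within tol of the teacher."""
    weights = {f: 1.0 for f in contribs}
    for round_no in range(max_rounds):
        prob = weighted_probability(contribs, weights, base_logit)
        gap = prob - teacher_prob
        if abs(gap) <= tol:  # early exit once aligned
            return weights, prob, round_no, True
        weights = propose(weights, contribs, gap)
    prob = weighted_probability(contribs, weights, base_logit)
    return weights, prob, max_rounds, abs(prob - teacher_prob) <= tol

# Hypothetical contributions for one case and a teacher probability of 0.40.
contribs = {"grip": 0.7, "age": 0.4}
weights, prob, rounds, converged = align(contribs, 0.40, stub_actor, base_logit=-1.0)
```

The `max_rounds` cap is a design choice of this sketch, so that a non-converging Actor cannot loop forever.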
Tool Use
- SHAP (explainability)
- Qdrant (vector DB)
- BGE embeddings + reranker
Frameworks
- RL
- Retrieval-Augmented Generation (RAG)
Is Agentic
true
Architectures
- LLM as Actor in Actor-Critic loop
- Retrieval-augmented LLM pipeline (LLM + DKB)
Collaboration
- Single-agent LLM interacts with stored attribution library and critic; no multi-agent coordination r
Optimization Features
Token Efficiency
- ACPB encodes SHAP as compact atomic facts to reduce token usage in prompts
Model Optimization
- Distillation of attribution logic into compact weight sets and diagnostic texts stored in DKB
System Optimization
- Early exit from the RL loop when the inferred probability is within 5% of the XGBoost probability
Training Optimization
- Actor-Critic loop to align LLM-inferred probabilities to teacher (XGBoost) outputs
Inference Optimization
- ACPB avoids per-sample SHAP recomputation; retrieval of distilled cases reduces prompt engineering c
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation limited to sarcopenia; generalizability to other diseases untested.
- Framework tested with XGBoost + SHAP only; behavior with other models or attribution methods (LIME, IG) is unproven.
- Relatively small experimental scale and limited repeated-run count (n=10) restricts statistical certainty.
- Prompt framing affects operating characteristics (FPR/TPR) even if AUC is stable; requires careful prompt design.
- Potential to inherit or amplify biases from the teacher model; misclassified teacher cases were not analyzed.
When Not To Use
- When regulatory or legal requirements forbid any LLM-generated intermediate reasoning (audit logs alone are insufficient).
- For tasks where the teacher model is poor or highly biased, since distillation can propagate those failures.
- When you cannot afford careful prompt tuning or retrieval quality control, because the operating point can shift with prompts.
Failure Modes
- LLM adjusts weights in medically inconsistent ways if retrieval or ACPB signals are noisy.
- Prompt variations shift the decision threshold, causing unstable clinical trade-offs.
- DKB retrieval misses relevant precedents or returns spurious cases, degrading output correctness.
- Framework inherits teacher-model biases (e.g., gender encoding errors) and may need external clinical constraints to correct them.
Core Entities
Models
- XGBoost
- Qwen3-plus (LLM)
- Deepseek-R1-8B (LLM)
- ACPB (Average Contribution Probability Base)
- DKB (Diagnosis Knowledge Base)
Metrics
- Accuracy
- AUC
- Precision
- Recall
- F1-score
- Intraclass Correlation Coefficient (ICC)
- Cohen's kappa
- Mean absolute error (bias)
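Two of the listed metrics are simple enough to illustrate directly. The sketch below computes Cohen's kappa between two repeated runs of binary predictions and the mean absolute error between two probability vectors, using the standard textbook definitions. The sample labels and probabilities are made up for illustration and are not results from the paper.

```python
def cohens_kappa(run_a, run_b):
    """Chance-corrected agreement between two label sequences."""
    n = len(run_a)
    observed = sum(x == y for x, y in zip(run_a, run_b)) / n
    labels = set(run_a) | set(run_b)
    # Agreement expected if both runs labeled independently at their own rates.
    expected = sum((run_a.count(l) / n) * (run_b.count(l) / n) for l in labels)
    if expected == 1.0:  # both runs constant and identical
        return 1.0
    return (observed - expected) / (1.0 - expected)

def mean_absolute_error(probs_a, probs_b):
    """Average absolute gap between two aligned probability lists."""
    return sum(abs(a - b) for a, b in zip(probs_a, probs_b)) / len(probs_a)

# Hypothetical binary outputs from two repeated runs on the same 8 cases.
run1 = [1, 1, 0, 0, 1, 0, 1, 0]
run2 = [1, 1, 0, 0, 1, 0, 0, 0]
kappa = cohens_kappa(run1, run2)

# Hypothetical lookup-table vs. model probabilities for two cases.
mae = mean_absolute_error([0.20, 0.50], [0.25, 0.45])
```

Kappa is the more conservative reproducibility measure here because it discounts agreement that would occur by chance given each run's label rates.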
Datasets
- CHARLS (China Health and Retirement Longitudinal Study)
- NHANES (referenced for public availability in Data Availability)

