Plan LLM 'thoughts' with MCTS to answer questions over private medical records safely

February 12, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The method is practically sound and was tested on realistic private EMR QA. Gains are large, but they come with higher inference cost and the need for a held-out dataset to train the estimators.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 75%

Authors

Thomas Pouplin, Hao Sun, Samuel Holt, Mihaela van der Schaar

Why It Matters For Business

RATP lets organizations use private patient records at inference time without training LLMs on the data, improving accuracy and traceability while avoiding training-time privacy leakage and large retraining costs.

Summary TLDR

RATP turns LLM reasoning into a multi-step decision process that plans which documents to retrieve and which intermediate "thoughts" to generate. It uses Monte‑Carlo Tree Search (MCTS) plus a scoring model (oracle, learned estimator, or LLM self-critic) to guide retrieval and reasoning. On private EMR QA (emrQA), RATP reaches 71% exact-match vs 24% for a single-document RAG baseline (and 34% for no retrieval), while an ideal oracle MCTS reaches 88%. RATP keeps private data out of model training and exposes the full reasoning trace, but it raises inference cost because it needs many LLM calls.
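To make the planning loop concrete, here is a minimal sketch of MCTS over thoughts with a pluggable scoring function. It is not the authors' implementation: the `Node` class, the `plan` function, the UCB1 selection rule, and the `generate` and `score` callables (stand-ins for the LLM call and the scoring model) are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(eq=False)
class Node:
    """One thought in the search tree (hypothetical structure)."""
    text: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total: float = 0.0


def ucb1(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """UCB1: mean score plus an exploration bonus for rarely visited nodes."""
    if child.visits == 0:
        return math.inf
    mean = child.total / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def plan(
    question: str,
    generate: Callable[[str, int], str],  # stub for the LLM thought generator
    score: Callable[[str], float],        # stub for oracle / estimator / self-critic
    iterations: int = 8,
    branching: int = 2,
) -> Tuple[str, float]:
    """Run a simplified MCTS and return the best-scoring thought and its score."""
    root = Node(question)
    best_text, best_score = question, float("-inf")
    for _ in range(iterations):
        # Selection: descend to a leaf, always taking the highest-UCB1 child.
        node = root
        while node.children:
            parent = node
            node = max(parent.children, key=lambda ch: ucb1(ch, parent.visits))
        # Expansion: generate new thoughts from the selected leaf.
        for i in range(branching):
            child = Node(generate(node.text, i), parent=node)
            node.children.append(child)
            # Evaluation: the scoring model rates the new thought.
            s = score(child.text)
            if s > best_score:
                best_text, best_score = child.text, s
            # Backpropagation: update the child and all of its ancestors.
            walker: Optional[Node] = child
            while walker is not None:
                walker.visits += 1
                walker.total += s
                walker = walker.parent
    return best_text, best_score


if __name__ == "__main__":
    def toy_generate(parent_text: str, i: int) -> str:
        return f"{parent_text} > thought {i}"

    def toy_score(text: str) -> float:
        return min(1.0, text.count("thought 1") / 3)

    print(plan("Q: which dose was prescribed?", toy_generate, toy_score))
```

Each iteration costs `branching` generator calls and `branching` scorer calls, which is where the extra inference cost noted above comes from.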

Problem Statement

Healthcare data must remain private and is often updated, but LLMs trained without those private records cannot reliably answer EMR questions. Simple retrieval-augmented prompts (RAG) can confuse LLMs or worsen answers when retrieval is imperfect. We need a retrieval-plus-reasoning method that (1) avoids training on private data, (2) handles large document collections beyond the context window, (3) filters noisy/irrelevant documents, and (4) keeps outputs auditable for clinicians.

Main Contribution

Formalized open-book question answering as a multi-step Markov decision process in which intermediate "thoughts" and retrieved documents are the states.

Introduced RATP: an MCTS-based planner that builds thought graphs combining retrieved documents and LLM-generated thoughts, guided by a scoring model.
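One way to picture the thought graph is as an append-only list of nodes, where each transition combines existing nodes into a new thought and records which nodes it came from. The sketch below is an illustrative reading of the two contributions above, not the paper's formal definition: `Thought`, `step`, `trace`, and the `combine` callable (a stand-in for an LLM call) are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Thought:
    """A node in the thought graph (hypothetical structure)."""
    text: str
    kind: str                      # "question", "document", or "thought"
    parents: Tuple[int, ...] = ()  # indices of the nodes it was built from


def step(
    graph: List[Thought],
    parents: Tuple[int, ...],
    combine: Callable[[List[str]], str],
) -> List[Thought]:
    """One transition: the action picks parent nodes, and `combine`
    (a stubbed LLM call) turns their texts into a new thought.
    Returns a new graph and leaves the input graph unchanged."""
    new_text = combine([graph[i].text for i in parents])
    return graph + [Thought(new_text, "thought", parents)]


def trace(graph: List[Thought], index: int) -> List[int]:
    """Return the indices of every node that contributed to graph[index]."""
    seen: List[int] = []
    stack = [index]
    while stack:
        i = stack.pop()
        if i not in seen:
            seen.append(i)
            stack.extend(graph[i].parents)
    return sorted(seen)


if __name__ == "__main__":
    graph = [
        Thought("Q: which dose was prescribed?", "question"),
        Thought("Note A (retrieved chunk)", "document"),
    ]
    graph = step(graph, (0, 1), lambda texts: " | ".join(texts))
    print(graph[-1].text, trace(graph, len(graph) - 1))
```

Storing parent indices on every node is what makes the final answer auditable: `trace` recovers the documents and intermediate thoughts behind any given node.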

Key Findings

RATP (MCTS + model estimator) greatly improves QA accuracy on private EMRs compared to standard RAG.

Numbers: emrQA exact-match: RAG 24% → RATP 71% (+47 pp)

Practical Use: If you must answer private EMR questions without training an LLM, use a planned multi-step retrieval+reasoning pipeline (RATP) to improve accuracy substantially.

Evidence Ref: Table 4; Table 3

Oracle scoring shows the upper bound of the approach when perfect feedback is available.

Numbers: emrQA exact-match: MCTS oracle 88% vs MCTS oracle w/o IR 52% (+36 pp)

Practical Use: Investing in accurate scoring signals (human or high-quality estimator) materially raises final accuracy; build a good scoring model if possible.

Evidence Ref: Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | RATP 71% | RAG 24% | +47 pp | emrQA test | Table 4 reports RAG 24% and RATP 71% on emrQA | Table 4
Accuracy | MCTS oracle 88% | MCTS oracle w/o IR 52% | +36 pp | emrQA test (oracle) | Table 3 shows oracle ablation with and without IR | Table 3

What To Try In 7 Days

Prototype RATP with a local LLM + Contriever on a small private dataset and collect thought traces for 100–300 queries.

Train a lightweight estimator (XGBoost/MLP) on collected runs to act as the scoring model and compare LLM-call counts and accuracy vs an LLM self-critic.

Tune max-thoughts and early-stopping thresholds to hit your cost/latency budget before production.
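For the tuning step, the two knobs can be prototyped independently of the planner. The sketch below assumes thoughts arrive from some iterable and that a scoring function returns a float; `run_with_budget`, `max_thoughts`, and `stop_threshold` are hypothetical names, and the default values are placeholders rather than settings from the paper.

```python
from itertools import islice
from typing import Callable, Iterable, Tuple


def run_with_budget(
    candidates: Iterable[str],
    score: Callable[[str], float],
    max_thoughts: int = 20,       # hard cap on thoughts consumed (placeholder value)
    stop_threshold: float = 0.9,  # stop once a thought scores this high (placeholder)
) -> Tuple[str, float, int]:
    """Consume thoughts until the budget is spent or the score is good enough.

    Returns (best_text, best_score, thoughts_used). If `candidates` is empty,
    returns ("", -inf, 0).
    """
    best_text, best_score, used = "", float("-inf"), 0
    # islice never pulls more than max_thoughts items from a lazy generator,
    # so no extra (costly) thought is produced past the budget.
    for text in islice(candidates, max_thoughts):
        used += 1
        s = score(text)
        if s > best_score:
            best_text, best_score = text, s
        if best_score >= stop_threshold:
            break  # early stopping by score threshold
    return best_text, best_score, used


if __name__ == "__main__":
    thoughts = ["a", "bb", "ccc", "dddd"]
    for threshold in (0.5, 0.75, 1.0):
        print(threshold, run_with_budget(thoughts, lambda t: len(t) / 4,
                                         stop_threshold=threshold))
```

Sweeping `stop_threshold` and `max_thoughts` over logged runs and recording `thoughts_used` alongside accuracy gives a simple cost/accuracy curve to pick an operating point from.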

Agent Features

Memory
Retrieval memory (external DB chunks treated as thoughts)
Planning
Monte‑Carlo Tree Search; LoRA
Tool Use
Dense retriever (Contriever); Scoring models (MLP/XGBoost or LLM self-critic)
Frameworks
MCTS
Is Agentic

Yes

Architectures
LLM + MCTS planning

Optimization Features

Token Efficiency
Multi-step retrieval keeps prompts smaller than dumping full KB
Infra Optimization
Local LLM deployment to avoid sending private text to external APIs
Model Optimization
Learned scoring model (estimator) for cheaper inference
System Optimization
Batch retrievals and limit thought depth (T)
Training Optimization
Train estimator from offline MCTS-oracle runs
Inference Optimization
Use estimator to reduce LLM self-critique calls; Early stopping by score threshold

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Access to private EMR datasets is rare; the results rely on data guaranteed to be absent from LLM training sets.

Self-critic doubles LLM calls; model-based estimator needs held-out oracle runs to train.
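The cost difference between the two scoring options can be estimated with a back-of-the-envelope model. This sketch assumes one LLM generation call per thought, plus one extra LLM critique call per thought when the self-critic is used; that per-thought model is a simplification, and `llm_calls` is a hypothetical helper.

```python
def llm_calls(num_thoughts: int, scorer: str) -> int:
    """Estimate LLM calls for one query under a simple per-thought cost model.

    Assumption: each thought needs one generation call; the self-critic adds
    one critique call per thought, while a learned estimator adds none.
    """
    if scorer not in {"self_critic", "estimator"}:
        raise ValueError("scorer must be 'self_critic' or 'estimator'")
    generation = num_thoughts
    critique = num_thoughts if scorer == "self_critic" else 0
    return generation + critique


if __name__ == "__main__":
    for n in (10, 20, 50):
        print(n, llm_calls(n, "self_critic"), llm_calls(n, "estimator"))
```

Under this model a 20-thought budget costs 40 LLM calls with the self-critic and 20 with an estimator, which is the trade-off against the up-front cost of collecting oracle runs to train the estimator.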

When Not To Use

When low-latency or strict compute budgets prevent many LLM calls.

When you cannot obtain an offline held-out dataset to train an estimator and must avoid expensive self-critique.

Failure Modes

LLM fixates on an incorrect answer and cannot pivot within the allowed thought budget.

Self-critic hallucination gives overconfident wrong scores, misleading the search.

Core Entities

Models

Mixtral 8x7B; Llama-2 70B; Gemma 2B; GPT-3.5-turbo; GPT-4

Metrics

Exact Match (SQuAD style); Accuracy; MSE (scoring models)
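For reference, SQuAD-style exact match compares answers after normalization: lowercase, strip punctuation, drop the articles "a", "an", and "the", and collapse whitespace. The sketch below follows that convention; the function names are illustrative, and the paper's exact evaluation script may differ in detail.

```python
import re
import string
from typing import Iterable, List, Sequence

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Iterable[str]) -> bool:
    """True if the prediction equals any gold answer after normalization."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def em_accuracy(predictions: Sequence[str], golds: Sequence[List[str]]) -> float:
    """Percentage of predictions that exactly match one of their gold answers."""
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(predictions)


if __name__ == "__main__":
    print(exact_match("The Aspirin.", ["aspirin"]))
    print(em_accuracy(["Aspirin", "ibuprofen"], [["aspirin"], ["aspirin"]]))
```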

Datasets

emrQA; EHRQA; BoolQ; MIMIC-IV

Benchmarks

Private EMR QA (emrQA); Discharge QA (EHRQA); BoolQ (public)