Plan LLM 'thoughts' with MCTS to answer questions over private medical records safely

Overview

Decision Snapshot: Needs Validation

The method is practically sound and was tested on realistic private EMR QA. Gains are large (71% vs 24% exact match on emrQA) but come with higher inference cost and the need for a held-out dataset to train the estimator.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 75%

Authors

Thomas Pouplin, Hao Sun, Samuel Holt, Mihaela van der Schaar

Links

Abstract / PDF

Why It Matters For Business

RATP lets organizations use private patient records at inference time without training LLMs on the data, improving accuracy and traceability while avoiding training-time privacy leakage and large retraining costs.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead, Founder

Summary TLDR

RATP turns LLM reasoning into a multi-step decision process that plans which documents to retrieve and which intermediate "thoughts" to generate. It uses Monte‑Carlo Tree Search (MCTS) plus a scoring model (oracle, learned estimator, or LLM self-critic) to guide retrieval and reasoning. On private EMR QA (emrQA), RATP reaches 71% exact match vs 24% for a single-document RAG baseline and 34% with no retrieval, while an ideal oracle-guided MCTS reaches 88%. RATP keeps private data out of model training and exposes the full reasoning trace, but it raises inference cost because it needs many LLM calls.

Problem Statement

Healthcare data must remain private and is often updated, but LLMs trained without those private records cannot reliably answer EMR questions. Simple retrieval-augmented prompts (RAG) can confuse LLMs or worsen answers when retrieval is imperfect. We need a retrieval-plus-reasoning method that (1) avoids training on private data, (2) handles large document collections beyond the context window, (3) filters noisy/irrelevant documents, and (4) keeps outputs auditable for clinicians.

Main Contribution

Formalized open-book question answering as a multi-step Markov decision process in which each intermediate "thought" and each retrieved document is a state.

Introduced RATP: an MCTS-based planner that builds thought graphs combining retrieved documents and LLM-generated thoughts, guided by a scoring model.
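The planner described above can be illustrated with a minimal sketch. It is not the authors' implementation: the document list, `retrieve`, `generate_thought`, and `score` are hypothetical toy stand-ins for a dense retriever, an LLM, and a scoring model, and the search builds a simple tree (one document per step) rather than a full thought graph.

```python
import math

# Hypothetical knowledge base; a real system would hold private EMR chunks.
DOCS = ["patient takes metformin", "allergy penicillin", "visit on monday"]

def retrieve(query, k=2):
    # Toy retriever (assumption): rank documents by word overlap with the query.
    words = set(query.lower().split())
    return sorted(DOCS, key=lambda d: len(words & set(d.split())), reverse=True)[:k]

def generate_thought(parent_text, doc):
    # Toy "LLM" (assumption): a new thought is the parent thought plus the document.
    return f"{parent_text} | {doc}"

def score(thought, keyword="penicillin"):
    # Toy scoring model in [0, 1]; stands in for an oracle, estimator, or self-critic.
    return 1.0 if keyword in thought else 0.1

class Node:
    def __init__(self, text, parent=None):
        self.text, self.parent = text, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(child, parent_visits, c=1.4):
    # Upper confidence bound for trees: unvisited children are tried first.
    if child.visits == 0:
        return float("inf")
    exploit = child.value / child.visits
    return exploit + c * math.sqrt(math.log(parent_visits) / child.visits)

def plan(question, iterations=20, max_depth=3):
    root = Node(question)
    best = (0.0, question)
    for _ in range(iterations):
        node, depth = root, 0
        # Selection: descend by UCT until reaching a leaf or the depth limit.
        while node.children and depth < max_depth:
            node = max(node.children, key=lambda ch: uct(ch, node.visits))
            depth += 1
        # Expansion: only expand a leaf that has already been evaluated once.
        if node.visits > 0 and not node.children and depth < max_depth:
            for doc in retrieve(node.text):
                node.children.append(Node(generate_thought(node.text, doc), node))
            node = node.children[0]
        # Evaluation: the scoring model replaces a random rollout.
        reward = score(node.text)
        if reward > best[0]:
            best = (reward, node.text)
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return best

best_score, best_thought = plan("What allergy does the patient have")
print(best_score, best_thought)
```

The returned thought string doubles as a reasoning trace, since it records which documents were combined to reach the answer.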

Key Findings

RATP (MCTS + model estimator) greatly improves QA accuracy on private EMRs compared to standard RAG.

Numbers: emrQA exact-match: RAG 24% → RATP 71% (+47 pp)

Practical Use: If you must answer private EMR questions without training an LLM, use a planned multi-step retrieval+reasoning pipeline (RATP) to improve accuracy substantially.

Evidence Ref: Table 4; Table 3

Oracle scoring shows the upper bound of the approach when perfect feedback is available.

Numbers: emrQA exact-match: MCTS oracle 88% vs MCTS oracle w/o IR 52% (+36 pp)

Practical Use: Investing in accurate scoring signals (human or high-quality estimator) materially raises final accuracy; build a good scoring model if possible.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	RATP 71% (±0.5)	RAG 24% (±0.4)	+47pp	emrQA test	Table 4 reports RAG 24% and RATP 71% on emrQA	Table 4
Accuracy	MCTS oracle 88% (±0.3)	MCTS oracle w/o IR 52% (±0.4)	+36pp	emrQA test (oracle)	Table 3 shows oracle ablation with and without IR	Table 3
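The deltas in the table are percentage-point differences, not relative improvements. The short sketch below recomputes both from the reported values so the two are not confused; the function names are illustrative only.

```python
def pp_delta(value, baseline):
    # Absolute difference in percentage points.
    return value - baseline

def relative_gain(value, baseline):
    # Improvement relative to the baseline, as a fraction.
    return (value - baseline) / baseline

# (value %, baseline %) pairs taken from the Results table.
rows = {
    "RATP vs RAG": (71, 24),
    "MCTS oracle vs MCTS oracle w/o IR": (88, 52),
}

for name, (value, baseline) in rows.items():
    print(f"{name}: +{pp_delta(value, baseline)} pp, "
          f"{relative_gain(value, baseline):.0%} relative")
```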

What To Try In 7 Days

Prototype RATP with a local LLM + Contriever on a small private dataset and collect thought traces for 100–300 queries.

Train a lightweight estimator (XGBoost/MLP) on collected runs to act as the scoring model and compare LLM-call counts and accuracy vs an LLM self-critic.

Tune max-thoughts and early-stopping thresholds to hit your cost/latency budget before production.
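As a rough illustration of the estimator step, the sketch below fits a scoring model on logged (thought features, oracle score) pairs and reports MSE on held-out runs. To stay dependency-free it swaps in a plain linear model trained by gradient descent instead of XGBoost or an MLP; the two features and all data values are invented for illustration.

```python
def predict(weights, bias, x):
    return bias + sum(w * xi for w, xi in zip(weights, x))

def mse(weights, bias, data):
    # Mean squared error between predicted and oracle scores.
    return sum((predict(weights, bias, x) - y) ** 2 for x, y in data) / len(data)

def fit(data, lr=0.1, epochs=500):
    # Batch gradient descent on the MSE loss of a linear model.
    n_features = len(data[0][0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in data:
            err = predict(weights, bias, x) - y
            grad_b += 2 * err / len(data)
            for i, xi in enumerate(x):
                grad_w[i] += 2 * err * xi / len(data)
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

# Hypothetical logged runs: ([retriever similarity, normalized thought length], oracle score).
runs = [
    ([0.9, 0.2], 1.0), ([0.8, 0.4], 1.0), ([0.2, 0.9], 0.0),
    ([0.1, 0.7], 0.0), ([0.5, 0.5], 0.5), ([0.7, 0.3], 1.0),
]
train, held_out = runs[:4], runs[4:]

weights, bias = fit(train)
print("train MSE:", round(mse(weights, bias, train), 4))
print("held-out MSE:", round(mse(weights, bias, held_out), 4))
```

In a real prototype the features would come from the collected thought traces, and the held-out MSE would be compared against the LLM self-critic on the same runs.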

Agent Features

Memory

Retrieval memory (external DB chunks treated as thoughts)

Planning

Monte‑Carlo Tree Search

LoRA

Tool Use

Dense retriever (Contriever)

Scoring models (MLP/XGBoost or LLM self-critic)

Frameworks

MCTS

Is Agentic

Yes

Architectures

LLM + MCTS planning

Optimization Features

Token Efficiency

Multi-step retrieval keeps prompts smaller than dumping full KB

Infra Optimization

Local LLM deployment to avoid sending private text to external APIs

Model Optimization

Learned scoring model (estimator) for cheaper inference

System Optimization

Batch retrievals and limit thought depth (T)

Training Optimization

Train estimator from offline MCTS-oracle runs

Inference Optimization

Use estimator to reduce LLM self-critique calls

Early stopping by score threshold
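The cost levers listed here (a thought budget, a score threshold, and the choice of scorer) can be sketched as a simple accounting loop. This is an assumption-laden illustration: `candidate_scores` is a hypothetical sequence of scores the search would produce, each thought is assumed to cost one LLM generation call, and the self-critic is assumed to add one extra LLM call per thought, while a learned estimator adds none.

```python
def run_search(candidate_scores, threshold=0.9, max_thoughts=10, self_critic=False):
    """Return (best_score, thoughts_generated, llm_calls) for one query."""
    best, generated, llm_calls = 0.0, 0, 0
    for s in candidate_scores[:max_thoughts]:
        generated += 1
        # One call to generate the thought, plus one to critique it if enabled.
        llm_calls += 2 if self_critic else 1
        best = max(best, s)
        # Early stopping: quit as soon as a thought clears the threshold.
        if best >= threshold:
            break
    return best, generated, llm_calls

scores = [0.2, 0.5, 0.95, 0.99]  # hypothetical per-thought scores
print(run_search(scores))                    # estimator-style scoring
print(run_search(scores, self_critic=True))  # self-critic scoring
print(run_search(scores, max_thoughts=2))    # tighter thought budget
```

Sweeping `threshold` and `max_thoughts` over logged runs gives a quick view of the accuracy versus LLM-call trade-off before committing to a latency budget.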

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Private EMR dataset access is rare; results rely on data guaranteed not to be in LLM training sets.

Self-critic doubles LLM calls; model-based estimator needs held-out oracle runs to train.

When Not To Use

When low-latency or strict compute budgets prevent many LLM calls.

When you cannot obtain an offline held-out dataset to train an estimator and must avoid expensive self-critique.

Failure Modes

LLM fixates on an incorrect answer and cannot pivot within the allowed thought budget.

Self-critic hallucination gives overconfident wrong scores, misleading the search.

Core Entities

Models

Mixtral 8x7B

Llama-2 70B

Gemma 2B

GPT-3.5-turbo

GPT-4

Metrics

Exact Match (SQuAD style)

Accuracy

MSE (scoring models)
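For reference, SQuAD-style exact match is usually computed after normalizing both strings. The sketch below follows that commonly used normalization (lowercase, strip punctuation, drop English articles, collapse whitespace); the summary only says "SQuAD style", so whether the paper applies exactly these steps is an assumption.

```python
import re
import string

PUNCTUATION = set(string.punctuation)

def normalize_answer(text):
    # Lowercase, remove punctuation, remove articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    # 1.0 if the normalized prediction equals any normalized gold answer.
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(g) for g in gold_answers))

def accuracy(predictions, gold_lists):
    # Mean exact match over a set of questions.
    scores = [exact_match(p, g) for p, g in zip(predictions, gold_lists)]
    return sum(scores) / len(scores)

print(exact_match("The Metformin.", ["metformin"]))
print(accuracy(["Metformin", "aspirin"], [["metformin"], ["penicillin"]]))
```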

Datasets

emrQA

EHRQA

BoolQ

MIMIC-IV

Benchmarks

Private EMR QA (emrQA)

Discharge QA (EHRQA)

BoolQ (public)
