Plan LLM 'thoughts' with MCTS to answer questions over private medical records safely

February 12, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.75

Cost Impact Score

0.6

Citation Count

3

Authors

Thomas Pouplin, Hao Sun, Samuel Holt, Mihaela van der Schaar

Why It Matters For Business

RATP lets organizations use private patient records at inference time without training LLMs on the data, improving accuracy and traceability while avoiding training-time privacy leakage and large retraining costs.

Summary TLDR

RATP turns LLM reasoning into a multi-step decision process that plans which documents to retrieve and which intermediate "thoughts" to generate. It uses Monte‑Carlo Tree Search (MCTS) plus a scoring model (oracle, learned estimator, or LLM self-critic) to guide retrieval and reasoning. On private EMR QA (emrQA), RATP reaches 71% exact-match vs 24% for a single-document RAG baseline (and 34% for no retrieval), while an ideal oracle MCTS reaches 88%. RATP keeps private data out of model training and exposes the full reasoning trace, but it raises inference cost because each answer needs many LLM calls.

Problem Statement

Healthcare data must remain private and is often updated, but LLMs trained without those private records cannot reliably answer EMR questions. Simple retrieval-augmented prompts (RAG) can confuse LLMs or worsen answers when retrieval is imperfect. We need a retrieval-plus-reasoning method that (1) avoids training on private data, (2) handles large document collections beyond the context window, (3) filters noisy/irrelevant documents, and (4) keeps outputs auditable for clinicians.

Main Contribution

Formalized open-book question answering as a multi-step Markov decision process in which each intermediate "thought" and each retrieved document is a state.

Introduced RATP: an MCTS-based planner that builds thought graphs combining retrieved documents and LLM-generated thoughts, guided by a scoring model.

Evaluated on private EMR datasets (emrQA, EHRQA) and public BoolQ; showed large accuracy gains on private EMRs and provided a practical guide for deployment.
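To make the planner idea concrete, here is a minimal sketch of MCTS over thoughts. It is not the authors' implementation: `stub_llm` and `stub_score` are hypothetical stand-ins for the LLM and the scoring model, and each action simply combines one thought with one retrieved document, which is a simplification of the thought graph described above.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    thought: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

    def uct(self, c: float = 1.4) -> float:
        # Mean score (exploitation) plus a bonus for rarely visited nodes.
        mean = self.value_sum / self.visits
        bonus = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return mean + bonus


def stub_llm(thought: str, document: str) -> str:
    """Hypothetical LLM call: fuse a thought with a retrieved document."""
    return f"{thought} + {document}"


def stub_score(thought: str, answer_hint: str) -> float:
    """Hypothetical scoring model: high score if the hint appears."""
    return 1.0 if answer_hint in thought else 0.1


def trace(node: Optional[Node]) -> List[str]:
    # Walk back to the root so the full reasoning path can be audited.
    steps = []
    while node is not None:
        steps.append(node.thought)
        node = node.parent
    return steps[::-1]


def plan(question, documents, answer_hint, max_thoughts=25, stop_score=0.9):
    if not documents:
        raise ValueError("need at least one retrieved document")
    root = Node(thought=question)
    best_node, best_score = root, 0.0
    for _ in range(max_thoughts):
        # Selection: descend while every document was already tried here.
        node = root
        while len(node.children) == len(documents):
            node = max(node.children, key=lambda child: child.uct())
        # Expansion: create one new thought from the next untried document.
        doc = documents[len(node.children)]
        child = Node(thought=stub_llm(node.thought, doc), parent=node)
        node.children.append(child)
        # Evaluation, then backpropagation of the score to the root.
        score = stub_score(child.thought, answer_hint)
        walker = child
        while walker is not None:
            walker.visits += 1
            walker.value_sum += score
            walker = walker.parent
        if score > best_score:
            best_node, best_score = child, score
        if score >= stop_score:  # early stop once a thought scores well
            break
    return trace(best_node), best_score


docs = ["note: BP 120/80", "note: allergy to penicillin"]
steps, score = plan("Any drug allergies?", docs, "penicillin")
print(score, steps)
```

The returned list of thoughts is the kind of reasoning trace a clinician could inspect; in a real system the two stubs would be replaced by an LLM call and one of the scoring models.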

Key Findings

RATP (MCTS + model estimator) greatly improves QA accuracy on private EMRs compared to standard RAG.

Numbers: emrQA exact-match: RAG 24% → RATP 71% (+47 pp)

Oracle scoring shows the upper bound of the approach when perfect feedback is available.

Numbers: emrQA exact-match: MCTS oracle 88% vs MCTS oracle w/o IR 52% (+36 pp)

Simple RAG can hurt performance in private EMR settings.

Numbers: emrQA: LLM 34% vs RAG 24% (RAG lower than no retrieval)

Learned model-based scoring predicts thought value better than LLM self-critique.

Numbers: Score prediction: Estimation accuracy 73% vs Self-critic 42%; MSE 0.12 vs 0.60

MCTS planning + IR increases robustness to bad retrieval and enables interpretability.

Numbers: Oracle w/ IR 88% vs oracle w/o IR 52%; thought traces available for each answer

Inference cost scales roughly linearly with the number of thoughts; successful runs typically finish within about 20 thoughts.

Numbers: Thought-process limit T set to 25; accuracy plateaus near 20 thoughts; number of LLM queries grows linearly (Figure 4)
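The linear cost relationship can be sketched as a back-of-the-envelope estimate. This assumes exactly one generation call per thought and, following the limitation noted later that self-critique doubles LLM calls, one extra call per thought when the LLM scores itself; `estimate_llm_calls` is a hypothetical helper, not part of RATP.

```python
def estimate_llm_calls(num_thoughts: int, use_self_critic: bool) -> int:
    """Rough LLM-call count per question (assumed: 1 generation per thought)."""
    calls_per_thought = 2 if use_self_critic else 1
    return num_thoughts * calls_per_thought


for thoughts in (5, 10, 20, 25):
    with_estimator = estimate_llm_calls(thoughts, use_self_critic=False)
    with_critic = estimate_llm_calls(thoughts, use_self_critic=True)
    print(f"T={thoughts}: estimator={with_estimator} calls, self-critic={with_critic} calls")
```

Multiplying the call count by your measured per-call latency or price gives a first budget check before running anything at scale.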

Results

Accuracy

Value: RATP 71% (±0.5)

Baseline: RAG 24% (±0.4)

Accuracy

Value: MCTS oracle 88% (±0.3)

Baseline: MCTS oracle w/o IR 52% (±0.4)

Accuracy

Value: RATP 72% (±0.8)

Baseline: RAG 67% (±0.6)

Scoring model prediction

Value: Estimation accuracy 73%, MSE 0.12

Baseline: Self-critic accuracy 42%, MSE 0.60

Discharge QA (EHRQA)

Value: RATP 60% (±0.7)

Baseline: RAG 56% (±0.7)
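The scoring-model comparison above reports an accuracy and an MSE for each scorer. A minimal sketch of how such numbers could be computed on your own logged runs follows; the 0.5 threshold used to turn scores into correct/incorrect labels and the toy score lists are assumptions, since the exact evaluation protocol is not given here.

```python
def mse(predicted, target):
    """Mean squared error between predicted and reference scores."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def binary_accuracy(predicted, target, threshold=0.5):
    """Share of thoughts where scorer and reference agree on good vs bad."""
    hits = sum((p >= threshold) == (t >= threshold) for p, t in zip(predicted, target))
    return hits / len(target)


# Toy, made-up scores for four thoughts (not results from the paper).
oracle = [1.0, 0.0, 1.0, 0.0]
estimator = [0.9, 0.2, 0.6, 0.4]
self_critic = [1.0, 1.0, 1.0, 0.0]

for name, scores in (("estimator", estimator), ("self-critic", self_critic)):
    print(name, binary_accuracy(scores, oracle), round(mse(scores, oracle), 4))
```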

Who Should Care

What To Try In 7 Days

Prototype RATP with a local LLM + Contriever on a small private dataset and collect thought traces for 100–300 queries.

Train a lightweight estimator (XGBoost/MLP) on collected runs to act as the scoring model and compare LLM-call counts and accuracy vs an LLM self-critic.

Tune max-thoughts and early-stopping thresholds to hit your cost/latency budget before production.
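One way to approach the tuning step is to replay logged thought scores offline before touching production. The sketch below assumes each logged run is a list of per-thought scores in generation order; `replay`, `sweep`, and the toy runs are hypothetical.

```python
def replay(run_scores, threshold, max_thoughts):
    """Return (thoughts used, stopped early?) for one logged run."""
    for used, score in enumerate(run_scores[:max_thoughts], start=1):
        if score >= threshold:
            return used, True
    return min(len(run_scores), max_thoughts), False


def sweep(runs, thresholds, max_thoughts):
    """For each threshold, report average thoughts used and early-stop rate."""
    rows = []
    for threshold in thresholds:
        results = [replay(run, threshold, max_thoughts) for run in runs]
        avg_used = sum(used for used, _ in results) / len(results)
        stop_rate = sum(stopped for _, stopped in results) / len(results)
        rows.append((threshold, avg_used, stop_rate))
    return rows


# Toy, made-up score logs for three queries.
logged_runs = [[0.2, 0.5, 0.95], [0.1, 0.3, 0.4, 0.6], [0.8, 0.9]]
for threshold, avg_used, stop_rate in sweep(logged_runs, [0.5, 0.9], max_thoughts=25):
    print(f"threshold={threshold}: avg thoughts={avg_used:.2f}, early-stop rate={stop_rate:.2f}")
```

A lower threshold stops sooner and saves calls but accepts weaker thoughts, so the sweep should be read alongside answer accuracy on the same runs.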

Agent Features

Memory

  • Retrieval memory (external DB chunks treated as thoughts)

Planning

  • Monte‑Carlo Tree Search
  • LoRA

Tool Use

  • Dense retriever (Contriever)
  • Scoring models (MLP/XGBoost or LLM self-critic)
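Dense retrieval of the kind listed above ranks stored chunks by embedding similarity to the query. The sketch below uses hand-written toy vectors and cosine similarity as assumptions; it does not call Contriever, whose encoder would produce the embeddings in a real setup.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]


# Hypothetical 3-dimensional embeddings for three record chunks.
chunks = {
    "allergy_note": [0.9, 0.1, 0.0],
    "vitals_note": [0.1, 0.9, 0.0],
    "billing_note": [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.2, 0.0], chunks, k=2))
```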

Frameworks

  • MCTS

Is Agentic

true

Architectures

  • LLM + MCTS planning

Optimization Features

Token Efficiency

  • Multi-step retrieval keeps prompts smaller than dumping the full knowledge base into the context

Infra Optimization

  • Local LLM deployment to avoid sending private text to external APIs

Model Optimization

  • Learned scoring model (estimator) for cheaper inference

System Optimization

  • Batch retrievals and limit thought depth (T)

Training Optimization

  • Train estimator from offline MCTS-oracle runs

Inference Optimization

  • Use estimator to reduce LLM self-critique calls
  • Early stopping by score threshold

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Private EMR dataset access is rare; the reported results depend on evaluation data guaranteed to be absent from LLM training sets.
  • Self-critic doubles LLM calls; model-based estimator needs held-out oracle runs to train.
  • Inference cost and latency grow with number of thought steps (typical T ≈ 20–25).
  • Other privacy risks (interception, profiling) are out of scope.

When Not To Use

  • When low-latency or strict compute budgets prevent many LLM calls.
  • When you cannot obtain an offline held-out dataset to train an estimator and must avoid expensive self-critique.
  • When retriever quality is extremely poor and you cannot improve retrieval at all.

Failure Modes

  • LLM fixates on an incorrect answer and cannot pivot within the allowed thought budget.
  • Self-critic hallucination gives overconfident wrong scores, misleading the search.
  • Frequent irrelevant retrievals waste budget and mislead the planner.
  • Performance depends on LLM reasoning ability; small models may not benefit.

Core Entities

Models

  • Mixtral 8x7B
  • Llama-2 70B
  • Gemma 2B
  • GPT-3.5-turbo
  • GPT-4

Metrics

  • Exact Match (SQuAD style)
  • Accuracy
  • MSE (scoring models)
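For reference, SQuAD-style exact match normalizes both strings before comparing them: lowercase, strip punctuation, drop the articles "a", "an", and "the", and collapse whitespace. The sketch below follows that convention; the function names are illustrative and the example answers are made up.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers) -> bool:
    """True if the prediction equals any gold answer after normalization."""
    normalized = normalize_answer(prediction)
    return any(normalized == normalize_answer(gold) for gold in gold_answers)


print(exact_match("The Penicillin.", ["penicillin"]))
print(exact_match("aspirin", ["penicillin"]))
```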

Datasets

  • emrQA
  • EHRQA
  • BoolQ
  • MIMIC-IV

Benchmarks

  • Private EMR QA (emrQA)
  • Discharge QA (EHRQA)
  • BoolQ (public)