Overview
The method is practically sound and was tested on realistic private EMR QA. Gains are large but come with higher inference cost and the need for a held-out dataset to train estimators.
Citations: 3
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 5/5
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 75%
Why It Matters For Business
RATP lets organizations use private patient records at inference time without training LLMs on the data, improving accuracy and traceability while avoiding training-time privacy leakage and large retraining costs.
Who Should Care
Summary TLDR
RATP turns LLM reasoning into a multi-step decision process that plans which documents to retrieve and which intermediate "thoughts" to generate. It uses Monte‑Carlo Tree Search (MCTS) plus a scoring model (oracle, learned estimator, or LLM self-critic) to guide retrieval and reasoning. On private EMR QA (emrQA), RATP reaches 71% exact match vs 24% for a single-document RAG baseline and 34% with no retrieval, while an ideal oracle-guided MCTS reaches 88%. RATP keeps private data out of model training and exposes the full reasoning trace, but it raises inference cost because many LLM calls are needed.
Problem Statement
Healthcare data must remain private and is often updated, but LLMs trained without those private records cannot reliably answer EMR questions. Simple retrieval-augmented prompts (RAG) can confuse LLMs or worsen answers when retrieval is imperfect. We need a retrieval-plus-reasoning method that (1) avoids training on private data, (2) handles large document collections beyond the context window, (3) filters noisy/irrelevant documents, and (4) keeps outputs auditable for clinicians.
Main Contribution
Formalized open-book question answering as a multi-step Markov decision process in which each intermediate "thought" and each retrieved document is a state.
Introduced RATP: an MCTS-based planner that builds thought graphs combining retrieved documents and LLM-generated thoughts, guided by a scoring model.
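The planner described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it searches a simple tree rather than a full thought graph, and every name in it (`Thought`, `plan`, `retrieve`, `generate`, `score`, the toy `DOCS`) is hypothetical. The retriever, the LLM, and the scoring model are replaced with deterministic stubs so the search loop can be read and run in isolation.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(eq=False)
class Thought:
    """One search node: the question itself or a generated thought."""
    text: str
    parent: Optional["Thought"] = field(default=None, repr=False)
    children: List["Thought"] = field(default_factory=list, repr=False)
    visits: int = 0
    value_sum: float = 0.0


def uct(node: Thought, c: float = 1.4) -> float:
    """Mean value plus an exploration bonus (only called on non-root nodes)."""
    if node.visits == 0:
        return math.inf
    parent_visits = max(node.parent.visits, 1)
    bonus = c * math.sqrt(math.log(parent_visits) / node.visits)
    return node.value_sum / node.visits + bonus


def select(root: Thought) -> Thought:
    """Walk down the tree, always taking the child with the highest UCT."""
    node = root
    while node.children:
        node = max(node.children, key=uct)
    return node


def backpropagate(node: Optional[Thought], score: float) -> None:
    """Add the score to the node and to every ancestor up to the root."""
    while node is not None:
        node.visits += 1
        node.value_sum += score
        node = node.parent


def plan(question: str,
         retrieve: Callable[[str, int], List[str]],
         generate: Callable[[str, str], str],
         score: Callable[[str, str], float],
         iterations: int = 20, k: int = 2,
         stop_at: float = 0.9) -> Tuple[Thought, float]:
    """Run the search and return the best-scoring thought and its score."""
    root = Thought(text=question)
    best, best_score = root, -math.inf
    for _ in range(iterations):
        leaf = select(root)
        for doc in retrieve(leaf.text, k):      # one new thought per document
            child = Thought(text=generate(leaf.text, doc), parent=leaf)
            leaf.children.append(child)
            s = score(question, child.text)     # oracle, estimator, or critic
            if s > best_score:
                best, best_score = child, s
            backpropagate(child, s)
        if best_score >= stop_at:               # early stopping
            break
    return best, best_score


def trace(node: Optional[Thought]) -> List[str]:
    """Return the chain of thoughts from the question to this node."""
    steps = []
    while node is not None:
        steps.append(node.text)
        node = node.parent
    return steps[::-1]


# --- Toy stand-ins (assumptions, not real components) ---
DOCS = ["Patient was prescribed metformin 500 mg.",
        "Patient reports no known allergies.",
        "Follow-up visit scheduled in two weeks."]
QUESTION = "What medication is the patient on?"


def retrieve(query: str, k: int) -> List[str]:
    words = set(query.lower().split())          # rank by word overlap
    return sorted(DOCS, key=lambda d: -len(words & set(d.lower().split())))[:k]


def generate(thought: str, doc: str) -> str:
    return f"{thought} Evidence: {doc}"         # stands in for an LLM call


def score(question: str, thought: str) -> float:
    return 1.0 if "metformin" in thought.lower() else 0.1  # oracle-like stub


best, best_score = plan(QUESTION, retrieve, generate, score)
```

Calling `trace(best)` returns the chain from the question to the selected thought, which is the kind of auditable reasoning trace the summary refers to. In a real prototype the three stubs would be swapped for a retriever, an LLM, and one of the three scoring options.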
Key Findings
RATP (MCTS + model estimator) greatly improves QA accuracy on private EMRs compared to standard RAG.
Oracle scoring shows the upper bound of the approach when perfect feedback is available.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | RATP 71% (±0.5) | RAG 24% (±0.4) | +47pp | emrQA test | Table 4 reports RAG 24% and RATP 71% on emrQA | Table 4 |
| Accuracy | MCTS oracle 88% (±0.3) | MCTS oracle w/o IR 52% (±0.4) | +36pp | emrQA test (oracle) | Table 3 shows oracle ablation with and without IR | Table 3 |
What To Try In 7 Days
Prototype RATP with a local LLM + Contriever on a small private dataset and collect thought traces for 100–300 queries.
Train a lightweight estimator (XGBoost/MLP) on collected runs to act as the scoring model and compare LLM-call counts and accuracy vs an LLM self-critic.
Tune max-thoughts and early-stopping thresholds to hit your cost/latency budget before production.
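The tuning step in the list above could be run offline by replaying logged thought traces under different budgets. The sketch below assumes a hypothetical log format, one list per query of `(score, is_correct)` pairs in the order the thoughts were generated, and hypothetical helpers `replay` and `sweep`. The toy `RUNS` are invented for illustration and are not results from the paper.

```python
from typing import Dict, List, Tuple

Run = List[Tuple[float, bool]]  # (scoring-model score, answer matched gold)


def replay(run: Run, max_thoughts: int, stop_at: float) -> Tuple[int, bool]:
    """Replay one logged run under a budget.

    Returns (thoughts used, whether the highest-scored thought was correct).
    """
    best_score, best_correct, used = float("-inf"), False, 0
    for s, correct in run[:max_thoughts]:
        used += 1
        if s > best_score:
            best_score, best_correct = s, correct
        if best_score >= stop_at:   # early stop once a thought scores high enough
            break
    return used, best_correct


def sweep(runs: List[Run], budgets: List[int],
          thresholds: List[float]) -> List[Dict[str, float]]:
    """Average cost and accuracy for every (max_thoughts, stop_at) pair."""
    rows = []
    for m in budgets:
        for t in thresholds:
            results = [replay(r, m, t) for r in runs]
            rows.append({
                "max_thoughts": m,
                "stop_at": t,
                "avg_thoughts": sum(u for u, _ in results) / len(results),
                "accuracy": sum(1 for _, ok in results if ok) / len(results),
            })
    return rows


# Toy logged runs (made up for illustration).
RUNS: List[Run] = [
    [(0.2, False), (0.95, True)],
    [(0.3, False), (0.4, False), (0.85, True)],
    [(0.9, False), (0.95, True)],   # scorer is overconfident on a wrong thought
]

rows = sweep(RUNS, budgets=[2, 3], thresholds=[0.9])
```

Each row pairs a cost proxy (`avg_thoughts`) with accuracy, so a configuration can be picked against a latency or spend budget. The third toy run also shows how an early-stopping threshold can lock in a confidently wrong thought.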
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
Access to private EMR datasets is rare; the results rely on data guaranteed not to be in LLM training sets.
Self-critic doubles LLM calls; model-based estimator needs held-out oracle runs to train.
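The cost trade-off between scoring options can be made concrete with a small call-count estimate. The sketch assumes one LLM generation call per thought, one additional critique call per thought for the self-critic, and no LLM calls for the learned estimator or the oracle. The function name `llm_calls` and the scorer labels are hypothetical.

```python
def llm_calls(num_thoughts: int, scorer: str) -> int:
    """Estimate LLM calls for one query under the assumptions stated above."""
    extra_calls_per_thought = {
        "self_critic": 1,  # one critique call for each generated thought
        "estimator": 0,    # learned scorer runs without calling the LLM
        "oracle": 0,       # ground-truth feedback, available only offline
    }
    if scorer not in extra_calls_per_thought:
        raise ValueError(f"unknown scorer: {scorer}")
    return num_thoughts * (1 + extra_calls_per_thought[scorer])


self_critic_calls = llm_calls(10, "self_critic")
estimator_calls = llm_calls(10, "estimator")
```

Under these assumptions the self-critic needs twice the calls of the estimator for the same thought budget, which is the doubling noted above. The estimator's cost moves to the one-time collection of held-out oracle runs.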
When Not To Use
When low-latency or strict compute budgets prevent many LLM calls.
When you cannot obtain an offline held-out dataset to train an estimator and must avoid expensive self-critique.
Failure Modes
LLM fixates on an incorrect answer and cannot pivot within the allowed thought budget.
Self-critic hallucination gives overconfident wrong scores, misleading the search.

