Overview
Production Readiness
0.6
Novelty Score
0.75
Cost Impact Score
0.6
Citation Count
3
Why It Matters For Business
RATP lets organizations use private patient records at inference time without training LLMs on the data, improving accuracy and traceability while avoiding training-time privacy leakage and large retraining costs.
Summary TLDR
RATP turns LLM reasoning into a multi-step decision process that plans which documents to retrieve and which intermediate "thoughts" to generate. It uses Monte‑Carlo Tree Search (MCTS) plus a scoring model (oracle, learned estimator, or LLM self-critic) to guide retrieval and reasoning. On private EMR QA (emrQA), RATP reaches 71% exact match vs 24% for a single-document RAG baseline (and 34% for no retrieval), while an ideal oracle MCTS reaches 88%. RATP keeps private data out of model training and exposes the full reasoning trace, but it raises inference cost because each query needs many LLM calls.
Problem Statement
Healthcare data must remain private and is often updated, but LLMs trained without those private records cannot reliably answer EMR questions. Simple retrieval-augmented prompts (RAG) can confuse LLMs or worsen answers when retrieval is imperfect. We need a retrieval-plus-reasoning method that (1) avoids training on private data, (2) handles large document collections beyond the context window, (3) filters noisy/irrelevant documents, and (4) keeps outputs auditable for clinicians.
Main Contribution
Formalized open-book question answering as a multi-step Markov decision process in which each intermediate "thought" and each retrieved document is a state.
Introduced RATP: an MCTS-based planner that builds thought graphs combining retrieved documents and LLM-generated thoughts, guided by a scoring model.
Evaluated RATP on private EMR datasets (emrQA, EHRQA) and public BoolQ; it shows large accuracy gains on private EMRs, and the work includes a practical guide for deployment.
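The planner described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Node`, `plan`, `toy_generate`, and `toy_score` are hypothetical, the LLM call is replaced by string concatenation, and the scoring model is replaced by a keyword check. It also assumes one candidate thought per document at each expansion, which is a simplification chosen for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    """One state in the thought graph: the question or a generated thought."""
    text: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0


def ucb(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Upper confidence bound: mean value plus an exploration bonus."""
    if child.visits == 0:
        return math.inf
    mean = child.value_sum / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def select(root: Node) -> Node:
    """Walk down from the root, taking the child with the highest UCB each time."""
    node = root
    while node.children:
        parent_visits = node.visits
        node = max(node.children, key=lambda ch: ucb(ch, parent_visits))
    return node


def backpropagate(node: Optional[Node], value: float) -> None:
    """Add the score to the node and to every ancestor up to the root."""
    while node is not None:
        node.visits += 1
        node.value_sum += value
        node = node.parent


def plan(
    question: str,
    documents: List[str],
    generate_thought: Callable[[str, str], str],
    score: Callable[[str], float],
    iterations: int = 5,
    stop_at: float = 0.95,
) -> Tuple[str, float]:
    """Return the best-scoring thought found within the iteration budget."""
    root = Node(text=question)
    best_text, best_score = question, score(question)
    for _ in range(iterations):
        leaf = select(root)
        # Expansion: combine the selected thought with each retrieved document.
        for doc in documents:
            child = Node(text=generate_thought(leaf.text, doc), parent=leaf)
            leaf.children.append(child)
            value = score(child.text)  # the scoring model stands in for a rollout
            backpropagate(child, value)
            if value > best_score:
                best_text, best_score = child.text, value
        if best_score >= stop_at:  # early stopping by score threshold
            break
    return best_text, best_score


def toy_generate(thought: str, doc: str) -> str:
    """Hypothetical stand-in for an LLM call that merges a thought and a document."""
    return f"{thought} | {doc}"


def toy_score(text: str) -> float:
    """Hypothetical stand-in for the scoring model: fraction of key facts present."""
    return (("metformin" in text) + ("500 mg" in text)) / 2
```

Because every node keeps a pointer to its parent, the chain of thoughts behind the final answer can be read back from the tree, which is the property that makes the trace auditable.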
Key Findings
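RATP (MCTS + model estimator) greatly improves QA accuracy on private EMRs compared to standard RAG (on emrQA, 71% exact match for RATP vs 24% for single-document RAG).
Oracle scoring gives the upper bound of the approach when perfect feedback is available (88% on emrQA).
Simple RAG can hurt performance in private EMR settings: on emrQA, single-document RAG scores 24%, below the 34% reached with no retrieval.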
Learned model-based scoring predicts thought value better than LLM self-critique.
MCTS planning + IR increases robustness to bad retrieval and enables interpretability.
Inference cost scales roughly linearly with the number of thoughts; typical success occurs by ~20 thoughts.
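The linear cost relationship can be made concrete with a small back-of-the-envelope calculation. This sketch assumes one LLM generation call per thought and one extra LLM call per thought when a self-critic scores it (the Limitations section notes that self-critique doubles LLM calls). The function names and the per-call latency are hypothetical.

```python
def llm_calls(thoughts: int, scorer: str) -> int:
    """Estimate LLM calls for one query under the stated assumptions."""
    generation_calls = thoughts  # assumption: one generation call per thought
    if scorer == "self_critic":
        return generation_calls + thoughts  # one scoring call per thought
    if scorer in ("estimator", "oracle"):
        return generation_calls  # scoring happens outside the LLM
    raise ValueError(f"unknown scorer: {scorer}")


def latency_seconds(thoughts: int, scorer: str, seconds_per_call: float) -> float:
    """Sequential latency estimate; seconds_per_call is a value you measure."""
    return llm_calls(thoughts, scorer) * seconds_per_call
```

With a budget of 20 thoughts this gives 20 LLM calls with a learned estimator and 40 with a self-critic, so the choice of scorer matters as much as the thought budget.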
Results
Accuracy
Scoring model prediction
Discharge QA (EHRQA)
Who Should Care
What To Try In 7 Days
Prototype RATP with a local LLM + Contriever on a small private dataset and collect thought traces for 100–300 queries.
Train a lightweight estimator (XGBoost/MLP) on collected runs to act as the scoring model and compare LLM-call counts and accuracy vs an LLM self-critic.
Tune max-thoughts and early-stopping thresholds to hit your cost/latency budget before production.
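One way to approach the tuning step is to replay logged thought traces offline instead of rerunning the LLM. The sketch below assumes each trace is a list of `(score, is_correct)` pairs in generation order, which is a hypothetical logging format, and that the answer returned is the highest-scoring thought seen before stopping. Replay is valid here because truncation and thresholding only shorten a trace and do not change its earlier steps.

```python
from typing import List, Tuple

Trace = List[Tuple[float, bool]]  # (scoring-model output, answer correct?) per thought


def replay(trace: Trace, max_thoughts: int, threshold: float) -> Tuple[int, bool]:
    """Return (thoughts used, correctness of the best-scoring thought)."""
    best_score, best_correct = float("-inf"), False
    used = 0
    for score, correct in trace[:max_thoughts]:
        used += 1
        if score > best_score:
            best_score, best_correct = score, correct
        if score >= threshold:  # early stop once a thought clears the threshold
            break
    return used, best_correct


def sweep(
    traces: List[Trace], max_thoughts_grid: List[int], threshold_grid: List[float]
) -> List[Tuple[int, float, float, float]]:
    """Return (max_thoughts, threshold, accuracy, mean thoughts used) per setting."""
    rows = []
    for max_thoughts in max_thoughts_grid:
        for threshold in threshold_grid:
            results = [replay(t, max_thoughts, threshold) for t in traces]
            accuracy = sum(correct for _, correct in results) / len(results)
            mean_used = sum(used for used, _ in results) / len(results)
            rows.append((max_thoughts, threshold, accuracy, mean_used))
    return rows
```

Picking the cheapest row that still meets the accuracy target gives a starting configuration for the cost and latency budget.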
Agent Features
Memory
- Retrieval memory (external DB chunks treated as thoughts)
Planning
- Monte‑Carlo Tree Search
- LoRA
Tool Use
- Dense retriever (Contriever)
- Scoring models (MLP/XGBoost or LLM self-critic)
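The retrieval tool can be pictured as a top-k search over chunk embeddings. The following sketch uses hand-written toy vectors and inner-product scoring purely for illustration; in a real setup the vectors would come from the dense retriever model, and the chunk names here are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k chunk ids with the highest similarity to the query."""
    scored = [(chunk_id, dot(query_vec, vec)) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Each returned chunk would then enter the search as a candidate thought, so `k` directly controls how much retrieved text the planner has to filter.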
Frameworks
- MCTS
Is Agentic
true
Architectures
- LLM + MCTS planning
Optimization Features
Token Efficiency
- Multi-step retrieval keeps prompts smaller than dumping the full knowledge base into the context
Infra Optimization
- Local LLM deployment to avoid sending private text to external APIs
Model Optimization
- Learned scoring model (estimator) for cheaper inference
System Optimization
- Batch retrievals and limit thought depth (T)
Training Optimization
- Train estimator from offline MCTS-oracle runs
Inference Optimization
- Use estimator to reduce LLM self-critique calls
- Early stopping by score threshold
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Private EMR dataset access is rare; the reported results rely on data guaranteed to be absent from LLM training sets.
- Self-critic doubles LLM calls; model-based estimator needs held-out oracle runs to train.
- Inference cost and latency grow with number of thought steps (typical T ≈ 20–25).
- Other privacy risks (interception, profiling) are out of scope.
When Not To Use
- When low-latency or strict compute budgets prevent many LLM calls.
- When you cannot obtain an offline held-out dataset to train an estimator and must avoid expensive self-critique.
- When retriever quality is extremely poor and you cannot improve retrieval at all.
Failure Modes
- LLM fixates on an incorrect answer and cannot pivot within the allowed thought budget.
- Self-critic hallucination gives overconfident wrong scores, misleading the search.
- Frequent irrelevant retrievals waste budget and mislead the planner.
- Performance depends on LLM reasoning ability; small models may not benefit.
Core Entities
Models
- Mixtral8x7B
- Llama-2 70B
- Gemma 2B
- GPT-3.5-turbo
- GPT-4
Metrics
- Exact Match (SQuAD style)
- Accuracy
- MSE (scoring models)
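For reference, the two less self-explanatory metrics can be sketched as follows. The normalization follows the familiar SQuAD convention (lowercase, strip punctuation, drop English articles, collapse whitespace); whether the evaluation here used exactly these steps is an assumption, and the function names are illustrative.

```python
import re
import string
from typing import Sequence

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized prediction equals the normalized gold answer."""
    return normalize_answer(prediction) == normalize_answer(gold)


def mse(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between scoring-model outputs and reference values."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)
```

Exact match is strict: a prediction that adds or omits any content word counts as wrong even when it is clinically equivalent.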
Datasets
- emrQA
- EHRQA
- BoolQ
- MIMIC-IV
Benchmarks
- Private EMR QA (emrQA)
- Discharge QA (EHRQA)
- BoolQ (public)

