Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.5
Citation Count
3
Why It Matters For Business
IntroPlan reduces unnecessary user queries and unsafe actions by aligning model uncertainty with task ambiguity. It improves action precision and safety at the cost of modest extra compute and a small curated knowledge base.
Summary TLDR
The paper introduces IntroPlan: a retrieval-augmented planning pipeline that stores human-aligned, post-hoc reasoning examples (introspective rationales) in a small knowledge base and uses them to prompt LLM planners at runtime. When combined with conformal prediction (a statistical calibration method), this reduces unnecessary clarification requests while keeping a guaranteed probability that the correct action is included. Evaluations on three robot-style datasets (including a new Safe Mobile Manipulation benchmark) show large gains in precise prediction sets and safety metrics. The method works with modest KB sizes (≈100–200 entries) and off-the-shelf LLMs (GPT-3.5 / GPT-4 Turbo).
Problem Statement
LLM planners can hallucinate or be overconfident on ambiguous natural-language robot tasks. Robots need reliable uncertainty estimates so they either act safely or ask the right follow-up question. Existing retrieval and calibration methods either hallucinate when grounding is weak or become over-conservative and over-ask.
Main Contribution
Introspective planning: build a small knowledge base of LLM-generated, human-aligned post-hoc rationales and retrieve them as few-shot examples to make LLM planners reason about uncertainty and safety.
Integration of introspective retrieval with conformal prediction to tighten statistical coverage bounds and reduce unnecessary user queries while maintaining a coverage guarantee.
A new Safe Mobile Manipulation benchmark and safety-focused metrics (e.g., Unsafe Contamination Rate) to evaluate compliance and safety in LLM-based planners.
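The knowledge-base construction described in the first contribution can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `KBEntry` fields, the prompt wording, and the `llm` callable (any function that maps a prompt string to a completion string) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class KBEntry:
    """One knowledge-base record (hypothetical schema)."""
    task: str
    options: List[str]
    valid_options: List[str]  # human-labeled ground truth
    rationale: str            # post-hoc explanation written by the LLM


def build_kb(examples: List[Dict], llm: Callable[[str], str]) -> List[KBEntry]:
    """Ask the LLM to explain, after the fact, why the labeled options are valid."""
    kb = []
    for ex in examples:
        # The prompt text below is an assumption made for illustration.
        prompt = (
            f"Task: {ex['task']}\n"
            f"Options: {', '.join(ex['options'])}\n"
            f"Valid options (human-labeled): {', '.join(ex['valid_options'])}\n"
            "Explain why these options are valid and the others are not."
        )
        kb.append(
            KBEntry(ex["task"], ex["options"], ex["valid_options"], llm(prompt))
        )
    return kb
```

Passing the LLM in as a plain callable keeps the sketch independent of any particular API client and makes it easy to stub out in tests.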
Key Findings
Direct IntroPlan (no conformal) yields much more precise prediction sets on Safe Mobile Manipulation with GPT-4.
IntroPlan + conformal prediction keeps statistical coverage while reducing over-asking and contamination versus the prior conformal baseline, KnowNo.
Small knowledge bases are sufficient: performance saturates near 100–200 examples.
Results
Safe Mobile Manipulation — GPT-4 (Direct IntroPlan)
Safe Mobile Manipulation — GPT-4 (IntroPlan + Conformal)
KB size sensitivity
Who Should Care
What To Try In 7 Days
Build a 50–200 example KB of common tasks and post-hoc rationales for your robot workflow.
Add retrieved rationale examples to your LLM prompts and compare direct vs conformal outputs on a small safety test set.
Calibrate LLM confidence with a 400-instance calibration set and measure help rate and unsafe contamination.
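The calibration step in the last item can be prototyped with standard split conformal prediction. The sketch below assumes you already have, for each calibration instance, the probability the LLM assigned to the correct option; the function names and the `epsilon` parameter (target error rate) are illustrative, and the paper's exact procedure may differ.

```python
import math


def conformal_threshold(true_option_probs, epsilon):
    """Return qhat, the calibrated nonconformity threshold.

    true_option_probs: probability the model gave to the correct option,
    one value per calibration instance.
    epsilon: target error rate (coverage target is 1 - epsilon).
    """
    n = len(true_option_probs)
    scores = sorted(1.0 - p for p in true_option_probs)  # nonconformity scores
    # Finite-sample corrected rank (1-indexed): ceil((n + 1) * (1 - epsilon)).
    rank = math.ceil((n + 1) * (1.0 - epsilon))
    if rank > n:
        return 1.0  # too few calibration points: keep every option
    return scores[rank - 1]


def prediction_set(option_probs, qhat):
    """Keep every option whose nonconformity score is within the threshold."""
    return {opt for opt, p in option_probs.items() if 1.0 - p <= qhat}
```

With a 400-instance calibration set, `true_option_probs` would hold 400 values. A smaller `epsilon` raises the coverage target and generally produces larger prediction sets.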
Agent Features
Memory
- Retrieval memory (small KB of reasoning examples)
Planning
- Planning with LLMs
- Introspective reasoning (post-hoc rationales)
Tool Use
- Knowledge retrieval (SentenceBERT embeddings)
- In-context few-shot prompting (retrieved rationales)
- Conformal predictor for calibrated sets
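The retrieval step listed above amounts to a nearest-neighbor lookup over task embeddings. The sketch below is an assumption-laden illustration: it takes embeddings as plain lists of floats (in practice they would come from a SentenceBERT-style encoder), and the entry keys `task`, `rationale`, and `embedding` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_m(query_vec, kb, m=3):
    """Return the m KB entries whose embedding is most similar to the query."""
    ranked = sorted(
        kb, key=lambda entry: cosine(query_vec, entry["embedding"]), reverse=True
    )
    return ranked[:m]


# Toy 3-d vectors stand in for real sentence embeddings.
toy_kb = [
    {"task": "Bring me a soda", "rationale": "(stored rationale)", "embedding": [1.0, 0.0, 0.0]},
    {"task": "Bring me a drink", "rationale": "(stored rationale)", "embedding": [0.9, 0.1, 0.0]},
    {"task": "Put the bowl in the microwave", "rationale": "(stored rationale)", "embedding": [0.0, 1.0, 0.0]},
]
nearest = retrieve_top_m([1.0, 0.0, 0.0], toy_kb, m=2)
```

A linear scan like this is adequate for a KB of a few hundred entries; an index structure only becomes worthwhile at much larger sizes.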
Frameworks
- RAG (retrieval-augmented generation)
- Conformal prediction
Is Agentic
true
Architectures
- LLM planner + retrieval KB
- Conformal prediction wrapper
Collaboration
- Ask the user for clarification when the conformal set contains multiple valid options
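The ask-or-act rule above can be written as a small dispatch function. This is a sketch, not the paper's implementation: the return format, the question wording, and the handling of an empty set (treated here as a reason to ask) are assumptions.

```python
def next_step(prediction_set):
    """Decide whether to act or ask, based on the size of the prediction set."""
    options = sorted(prediction_set)  # sorted for a deterministic question
    if len(options) == 1:
        return ("act", options[0])
    if not options:
        # Assumption: an empty set is treated as "cannot act safely".
        return ("ask", "I could not find a suitable action. Could you rephrase the request?")
    return ("ask", f"Which of these did you mean: {', '.join(options)}?")
```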
Optimization Features
Token Efficiency
- KB sizes kept modest (100–200) to control prompt token cost
Inference Optimization
- Use top-m retrieval (m=3) to limit prompt size
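Capping the number of retrieved examples bounds the prompt length directly. The sketch below shows one way to assemble a few-shot prompt from already-retrieved entries; the `Task:` / `Reasoning:` layout and the dictionary keys are hypothetical, not the prompt format used in the paper.

```python
def build_prompt(task, retrieved, m=3):
    """Assemble a few-shot prompt from at most m retrieved rationale examples."""
    parts = []
    for example in retrieved[:m]:  # hard cap keeps token cost predictable
        parts.append(f"Task: {example['task']}\nReasoning: {example['rationale']}\n")
    parts.append(f"Task: {task}\nReasoning:")
    return "\n".join(parts)
```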
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Single-label conformal prediction assumes mutually exclusive options; the multi-label conformal variant that was tried proved overly conservative and underperformed.
- Experiments use closed-source LLM APIs (GPT-3.5 / GPT-4), so behavior depends on those models and may vary with other LLMs.
- No statistical error bars are reported; each result is an average from a single run because of API cost constraints.
When Not To Use
- When you need provable multi-label calibration (the paper finds that single-label works better here).
- If you cannot afford LLM API costs for retrieval and calibration prompts at inference scale.
- If your task domain lacks clear human-labeled valid-option examples to build the KB.
Failure Modes
- KB generation can inherit LLM hallucinations if ground-truth labels are noisy, producing misleading rationales.
- Conformal sets can become overly conservative if calibration data does not match the test distribution.
- The method may still include unsafe options if the LLM cannot reason about a safety rule that is not represented in the KB.
Core Entities
Models
- GPT-4 Turbo (gpt-4-1106-preview)
- GPT-3.5 (text-davinci-003)
Metrics
- Success Rate
- Help Rate
- Exact Set Rate
- Non-compliant Contamination Rate
- Unsafe Contamination Rate
- Overask Rate
- Overstep Rate
- Unsafe Rate
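Three of the set-based metrics above can be computed from prediction sets as sketched below. The definitions used here are assumptions inferred from the metric names, not formulas taken from the paper: help rate is the share of sets with more than one option, exact set rate is the share of sets equal to the labeled valid set, and unsafe contamination rate is the share of sets containing at least one unsafe option. The record keys are hypothetical.

```python
def rate(flags):
    """Fraction of True values in an iterable of booleans (0.0 if empty)."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def evaluate(records):
    """Compute set-based metrics; each record holds three sets of option labels."""
    return {
        # Assumed definition: the planner asks whenever the set is not a singleton.
        "help_rate": rate(len(r["predicted"]) > 1 for r in records),
        # Assumed definition: predicted set matches the labeled valid set exactly.
        "exact_set_rate": rate(r["predicted"] == r["valid"] for r in records),
        # Assumed definition: predicted set contains at least one unsafe option.
        "unsafe_contamination_rate": rate(
            bool(r["predicted"] & r["unsafe"]) for r in records
        ),
    }
```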
Datasets
- Mobile Manipulation
- Safe Mobile Manipulation (new benchmark)
- Tabletop Rearrangement
Benchmarks
- Safe Mobile Manipulation benchmark

