Use retrieved introspective examples + conformal calibration so robots ask for clarification only when tasks are truly ambiguous

February 9, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

3

Authors

Kaiqu Liang, Zixu Zhang, Jaime Fernández Fisac

Links

Abstract / PDF

Why It Matters For Business

IntroPlan reduces unnecessary user queries and unsafe actions by aligning model uncertainty to task ambiguity; it improves action precision and safety while using modest extra compute and a small curated knowledge base.

Summary TLDR

The paper introduces IntroPlan: a retrieval-augmented planning pipeline that stores human-aligned, post-hoc reasoning examples (introspective rationales) in a small knowledge base and uses them to prompt LLM planners at runtime. When combined with conformal prediction (a statistical calibration method), this reduces unnecessary clarification requests while keeping a guaranteed probability that the correct action is included. Evaluations on three robot-style datasets (including a new Safe Mobile Manipulation benchmark) show large gains in precise prediction sets and safety metrics. The method works with modest KB sizes (≈100–200 entries) and off-the-shelf LLMs (GPT-3.5 / GPT-4 Turbo).

Problem Statement

LLM planners can hallucinate or be overconfident on ambiguous natural-language robot tasks. Robots need reliable uncertainty estimates so they either act safely or ask the right follow-up question. Existing retrieval and calibration methods either hallucinate when grounding is weak or become over-conservative and over-ask.

Main Contribution

Introspective planning: build a small knowledge base of LLM-generated, human-aligned post-hoc rationales and retrieve them as few-shot examples to make LLM planners reason about uncertainty and safety.
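A minimal sketch of what a knowledge-base entry and the resulting few-shot prompt could look like. The field names, the prompt layout, and the microwave example are illustrative assumptions, not the paper's actual schema or data.

```python
from dataclasses import dataclass


@dataclass
class KBEntry:
    """One knowledge-base record (hypothetical field names)."""
    task: str        # natural-language instruction
    options: list    # candidate actions shown to the planner
    rationale: str   # post-hoc reasoning that justifies the labeled answer
    valid: list      # labels of the options a human marked as valid


def format_example(task, options):
    """Render a task and its lettered options (A, B, C, ...)."""
    lines = [f"Task: {task}"]
    lines += [f"{chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)


def build_prompt(retrieved, task, options):
    """Assemble a few-shot prompt: retrieved entries first, then the new task."""
    parts = []
    for ex in retrieved:
        parts.append(format_example(ex.task, ex.options))
        parts.append(f"Reasoning: {ex.rationale}")
        parts.append(f"Valid options: {', '.join(ex.valid)}\n")
    parts.append(format_example(task, options))
    parts.append("Reasoning:")  # the planner continues from here
    return "\n".join(parts)


# Invented example entry, for illustration only.
example_entry = KBEntry(
    task="Put the bowl in the microwave",
    options=["metal bowl", "plastic bowl"],
    rationale="Metal must not go in a microwave, so only the plastic bowl is safe.",
    valid=["B"],
)
prompt = build_prompt([example_entry], "Bring me a drink", ["cola", "water"])
```

Ending the prompt with an open "Reasoning:" line nudges the planner to produce a rationale in the same style as the retrieved examples before it commits to an option.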

Integration of introspective retrieval with conformal prediction to tighten statistical coverage bounds and reduce unnecessary user queries while maintaining a coverage guarantee.
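The coverage guarantee comes from conformal prediction. The following is a sketch of standard split conformal prediction over multiple-choice options, assuming the planner exposes a confidence score per option. The function names and the toy numbers are hypothetical, and the paper's exact scoring may differ.

```python
import math


def conformal_threshold(true_option_probs, epsilon):
    """Return the threshold q_hat for target coverage 1 - epsilon.

    true_option_probs: the planner's confidence in the correct option
    on each calibration task.
    """
    # Nonconformity score: low confidence in the truth gives a high score.
    scores = sorted(1.0 - p for p in true_option_probs)
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - epsilon))  # finite-sample correction
    if rank > n:
        return 1.0  # too few calibration points: keep every option
    return scores[rank - 1]


def prediction_set(option_probs, q_hat):
    """Keep every option whose nonconformity score 1 - p is at most q_hat."""
    return [opt for opt, p in option_probs.items() if 1.0 - p <= q_hat]


# Toy calibration data (invented), targeting 75% coverage.
calibration = [0.9, 0.8, 0.7, 0.6, 0.5, 0.45, 0.4, 0.35, 0.3]
q_hat = conformal_threshold(calibration, epsilon=0.25)

clear_task = prediction_set({"A": 0.90, "B": 0.06, "C": 0.04}, q_hat)
ambiguous_task = prediction_set({"A": 0.48, "B": 0.47, "C": 0.05}, q_hat)
```

A sharper planner puts more confidence on the right option during calibration, which lowers `q_hat` and shrinks the sets at test time. This is the mechanism by which better prompting can reduce clarification requests without giving up coverage.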

A new Safe Mobile Manipulation benchmark and safety-focused metrics (e.g., Unsafe Contamination Rate) to evaluate compliance and safety in LLM-based planners.

Key Findings

Direct IntroPlan (no conformal) yields much more precise prediction sets on Safe Mobile Manipulation with GPT-4.

Numbers: Success Rate 96.5%, Exact Set Rate 93.0% (Table 1)

IntroPlan + conformal prediction keeps statistical coverage while reducing over-asking and contamination versus prior conformal baseline KnowNo.

Numbers: Conformal IntroPlan SR 87.5% vs KnowNo SR 84.5%; Exact Set Rate 58.0% vs 37.5%; Help Rate 63% vs 77.5% (Table 1)

Small knowledge bases are sufficient: performance saturates near 100–200 examples.

Numbers: KB sizes of 100–200 give near-peak SR/ESR (Appendix C / Fig. 8)

Results

Safe Mobile Manipulation — GPT-4 (Direct IntroPlan)

Value: SR 96.5%, ESR 93.0%, NCR 5.5%, UCR 0.5%, UR 0.5%

Baseline: Best non-introspective baseline (Retrieval-Q-CoT) SR 88.0%, ESR 81.5%

Safe Mobile Manipulation — GPT-4 (IntroPlan + Conformal)

Value: SR 87.5% (target 85%), ESR 58.0%, HR 63.0%

Baseline: KnowNo (Conformal) SR 84.5%, ESR 37.5%, HR 77.5%

KB size sensitivity

Value: Good performance with KB ≈100–200 examples

Baseline: Larger KBs (200+) show diminishing returns

What To Try In 7 Days

Build a 50–200 example KB of common tasks and post-hoc rationales for your robot workflow.

Add retrieved rationale examples to your LLM prompts and compare direct vs conformal outputs on a small safety test set.

Calibrate LLM confidence with a 400-instance calibration set and measure help rate and unsafe contamination.
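For the 400-instance calibration step, it helps to know which sorted calibration score ends up as the threshold. The sketch below applies the standard split conformal finite-sample correction with the 85% coverage target reported in the Results section. Whether the paper uses exactly this correction is an assumption, and `calibration_rank` is a hypothetical helper name.

```python
import math


def calibration_rank(n, target_coverage):
    """1-based rank of the sorted nonconformity score used as the threshold."""
    return math.ceil((n + 1) * target_coverage)


n = 400
rank = calibration_rank(n, 0.85)  # ceil(401 * 0.85) = ceil(340.85) = 341
effective_level = rank / n        # 341 / 400 = 0.8525
```

With 400 instances the threshold is the 341st smallest score, so the empirical quantile sits only slightly above the nominal 85%. With far fewer instances the correction is larger and the sets become more conservative.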

Agent Features

Memory

  • Retrieval memory (small KB of reasoning examples)

Planning

  • Planning with LLMs
  • Introspective reasoning (post-hoc rationales)

Tool Use

  • Knowledge retrieval (SentenceBERT embeddings)
  • In-context few-shot prompting (retrieved rationales)
  • Conformal predictor for calibrated sets
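The retrieval step can be sketched as a cosine-similarity search over embedded KB tasks. To stay dependency-free, the example uses toy 3-dimensional vectors as stand-ins for real sentence embeddings; the entry ids and vectors are invented, and `m=3` follows the top-m setting listed under Inference Optimization.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_m(query_vec, kb, m=3):
    """kb: list of (entry_id, embedding) pairs. Returns the m most similar ids."""
    ranked = sorted(kb, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [entry_id for entry_id, _ in ranked[:m]]


# Toy vectors standing in for sentence embeddings of each KB task.
kb = [
    ("bring-drink", [0.9, 0.1, 0.0]),
    ("heat-food", [0.1, 0.9, 0.1]),
    ("stack-blocks", [0.0, 0.1, 0.9]),
    ("bring-snack", [0.8, 0.3, 0.0]),
]
top = retrieve_top_m([1.0, 0.2, 0.0], kb, m=3)
```

In a real pipeline the vectors would come from a sentence-embedding model, and the returned ids would select the rationales that go into the planner prompt.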

Frameworks

  • RAG (retrieval-augmented generation)
  • Conformal prediction

Is Agentic

true

Architectures

  • LLM planner + retrieval KB
  • Conformal prediction wrapper

Collaboration

  • Ask the user for clarification when the conformal set contains multiple valid options
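The ask-or-act rule can be sketched as a small dispatch on the size of the prediction set. The function name, the question wording, and the handling of an empty set are assumptions made for illustration.

```python
def next_step(prediction_set):
    """Map a calibrated prediction set to a robot behavior."""
    if len(prediction_set) == 1:
        # Exactly one option survived calibration: act without asking.
        return ("act", prediction_set[0])
    if not prediction_set:
        # Assumed fallback: nothing passed the threshold, so ask openly.
        return ("ask", "I am not sure what to do. Could you rephrase the task?")
    # Several options remain plausible: ask the user to choose.
    choices = " or ".join(prediction_set)
    return ("ask", f"Did you mean: {choices}?")
```

For example, `next_step(["pick up the cola"])` returns an `"act"` decision, while a set with two options returns an `"ask"` decision that lists both.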

Optimization Features

Token Efficiency

  • KB sizes kept modest (100–200) to control prompt token cost

Inference Optimization

  • Use top-m retrieval (m=3) to limit prompt size

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Single-label conformal prediction assumes mutually exclusive options; the paper's multi-label conformal variant was overly conservative and underperformed.
  • Experiments use closed-source LLM APIs (GPT-3.5 / GPT-4), so behavior depends on those models and may vary with other LLMs.
  • No statistical error bars are reported; results are averages over single runs because of API cost constraints.

When Not To Use

  • When you need provable multi-label calibration (the paper finds that single-label calibration works better here).
  • If you cannot afford LLM API costs for retrieval and calibration prompts at inference scale.
  • If your task domain lacks clear human-labeled valid-option examples to build the KB.

Failure Modes

  • KB generation can inherit LLM hallucinations if ground-truth labels are noisy, producing misleading rationales.
  • Conformal sets can become overly conservative if calibration data does not match the test distribution.
  • The method may still include unsafe options if the LLM cannot reason about a safety rule that is not represented in the KB.

Core Entities

Models

  • GPT-4 Turbo (gpt-4-1106-preview)
  • GPT-3.5 (text-davinci-003)

Metrics

  • Success Rate
  • Help Rate
  • Exact Set Rate
  • Non-compliant Contamination Rate
  • Unsafe Contamination Rate
  • Overask Rate
  • Overstep Rate
  • Unsafe Rate
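This page lists the metric names without defining them. The sketch below computes four of them under assumed definitions inferred from the names: success means the intended option is in the predicted set, help means the set has more than one option, exact means the set equals the set of valid options, and unsafe contamination means the set contains any unsafe option. Check these against the paper before relying on them; the record layout is hypothetical.

```python
def evaluate(records):
    """Compute set-level rates over evaluation records.

    Each record is a dict with keys:
      'predicted': set of option labels returned by the planner
      'intended':  the option the user actually wanted
      'valid':     set of all options consistent with the instruction
      'unsafe':    set of options that violate a safety rule
    """
    n = len(records)
    success = sum(r["intended"] in r["predicted"] for r in records)
    asked = sum(len(r["predicted"]) > 1 for r in records)
    exact = sum(r["predicted"] == r["valid"] for r in records)
    contaminated = sum(bool(r["predicted"] & r["unsafe"]) for r in records)
    return {
        "success_rate": success / n,
        "help_rate": asked / n,
        "exact_set_rate": exact / n,
        "unsafe_contamination_rate": contaminated / n,
    }


# Four invented records, for illustration only.
records = [
    {"predicted": {"A"}, "intended": "A", "valid": {"A"}, "unsafe": {"C"}},
    {"predicted": {"A", "B"}, "intended": "B", "valid": {"A", "B"}, "unsafe": set()},
    {"predicted": {"A", "C"}, "intended": "A", "valid": {"A"}, "unsafe": {"C"}},
    {"predicted": {"B"}, "intended": "A", "valid": {"A"}, "unsafe": set()},
]
rates = evaluate(records)
```

Under these definitions a planner can score well on success rate simply by returning large sets, which is why the exact set rate and the contamination rates are worth tracking alongside it.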

Datasets

  • Mobile Manipulation
  • Safe Mobile Manipulation (new benchmark)
  • Tabletop Rearrangement

Benchmarks

  • Safe Mobile Manipulation benchmark