Overview
Production Readiness
0.35
Novelty Score
0.7
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
Robots and service agents that ask the user or consult memory before moving can save time and energy. Cutting average task execution cost by roughly half lowers operating expense and may reduce wear on the robot. A better trade-off between asking and searching also reduces user annoyance from frequent interruptions.
Summary TLDR
ESearch-R1 teaches a multimodal LLM agent to trade off asking the user, checking episodic memory, and navigating so it avoids costly physical search. A new RL algorithm (HC-GRPO) trains the agent by sampling groups of reasoning trajectories and reinforcing those that achieve high information gain at low heterogeneous cost. In AI2-THOR simulation, the approach keeps or improves success while cutting average task execution cost by roughly half compared to strong ReAct baselines.
Problem Statement
Standard MLLM agents treat reasoning steps and physical actions as if they cost the same. On real robots, movement is expensive and interrupting a human carries a social cost. Agents therefore tend either to brute-force search or to over-question, wasting time and attention. The problem is to learn a policy that explicitly trades off information gain against heterogeneous costs (navigation, asking, memory) so that the agent resolves ambiguity cheaply.
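One way to make this trade-off concrete is to score a trajectory as a task reward minus the summed cost of its actions. The sketch below is illustrative only: the cost coefficients, the success reward, and the name `trajectory_return` are assumptions chosen for the example, not values or functions from the paper.

```python
from typing import Sequence

# Hypothetical per-action costs. The source only states that costs are
# heterogeneous and that memory retrieval is zero or low cost.
ACTION_COST = {"Navigate": 1.0, "Ask": 0.3, "GetMemory": 0.0}


def trajectory_return(actions: Sequence[str], success: bool,
                      success_reward: float = 5.0) -> float:
    """Return the task reward minus the summed cost of the actions taken."""
    total_cost = sum(ACTION_COST[a] for a in actions)
    return (success_reward if success else 0.0) - total_cost


# Brute-force search: visit three candidate locations before finding the target.
brute_force = trajectory_return(["Navigate", "Navigate", "Navigate"], success=True)

# Disambiguate first: check memory, ask one question, then navigate once.
ask_first = trajectory_return(["GetMemory", "Ask", "Navigate"], success=True)
```

Under these assumed costs, the ask-first trajectory earns the higher return even though both succeed, which is the behavior a cost-aware policy should prefer.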
Main Contribution
ESearch-R1: a unified decision framework that treats Ask, GetMemory, and Navigate as actions with explicit costs.
HC-GRPO: a group-relative RL algorithm for MLLMs that removes the need for a learned value critic and optimizes reasoning trajectories for cost-aware behavior.
ESearch-Bench: a simulator benchmark (AI2-THOR) with ambiguous, partially observable tasks, a simulated user with fatigue, and automated task generation for evaluating cost-aware disambiguation.
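The critic-free idea behind HC-GRPO can be illustrated with the group-relative advantage used in standard GRPO: several trajectories are sampled for the same task, and each one's return is normalized against the group's mean and standard deviation. This is a sketch of plain GRPO normalization, not the paper's exact HC-GRPO update; the function name and the example returns (reward minus cost) are hypothetical.

```python
import statistics
from typing import List


def group_relative_advantages(returns: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each sampled trajectory's return against its own group.

    No learned value critic is needed: the group mean serves as the baseline.
    """
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)  # population std over the sampled group
    return [(r - mean) / (std + eps) for r in returns]


# Hypothetical returns of four trajectories sampled for one task.
group_returns = [3.7, 2.0, -1.0, 3.7]
advantages = group_relative_advantages(group_returns)
```

Trajectories with above-average return get a positive advantage and are reinforced; those below the group mean get a negative one.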
Key Findings
ESearch-R1 cuts average operational cost by about half compared to a strong ReAct baseline.
Under high ambiguity (3–4 distractors), ESearch-R1 substantially raises success rate.
Interactive dialogue and episodic memory are essential for efficiency.
Results
Avg. Success Rate (SR)
Avg. Total Task Cost (TTC)
Success Rate under high ambiguity (3–4 distractors)
Success Weighted by Cost (SwC)
Who Should Care
What To Try In 7 Days
Implement a simple episodic memory (timestamped observations) and a one-question Ask action in your simulator.
Add per-action costs (navigate, ask, memory) to your reward and run short RL or policy search experiments to see behavior shifts.
Warm-start the policy with supervised CoT traces and then do a small group-sampling optimization loop like GRPO to prefer low-cost trajectories.
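The first step above can be sketched as a small timestamped store with a retrieval call. This is a minimal sketch under assumptions: the class and field names are hypothetical, and retrieval here simply returns the most recent sighting of an object, which may differ from how the paper's GetMemory action works.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Observation:
    timestamp: float   # when the object was seen
    object_name: str   # what was seen
    location: str      # where it was seen


@dataclass
class EpisodicMemory:
    records: List[Observation] = field(default_factory=list)

    def add(self, timestamp: float, object_name: str, location: str) -> None:
        """Log one timestamped observation."""
        self.records.append(Observation(timestamp, object_name, location))

    def get_memory(self, object_name: str) -> Optional[Observation]:
        """Return the most recent sighting of the object, or None if never seen."""
        matches = [r for r in self.records if r.object_name == object_name]
        return max(matches, key=lambda r: r.timestamp) if matches else None


memory = EpisodicMemory()
memory.add(1.0, "mug", "kitchen counter")
memory.add(5.0, "mug", "office desk")
latest = memory.get_memory("mug")
```

An agent can call `get_memory` before deciding whether to ask the user or navigate, since a recent sighting may resolve the ambiguity at little or no cost.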
Agent Features
Memory
- Episodic memory retrieval (zero/low cost)
Planning
- Cost-aware planning with group-sampled reasoning trajectories
- GRPO
Tool Use
- Ask (user query)
- GetMemory (episodic retrieval)
- Navigate (semantic navigation primitive)
Frameworks
- GRPO
- ReAct (baseline)
- SFT
Is Agentic
true
Architectures
- Multimodal Large Language Model (MLLM) policy
- Chain-of-Thought reasoning traces
Collaboration
- Human-in-the-loop via Ask (simulated user with fatigue model)
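The source does not specify how the simulated user's fatigue is modeled. As one hypothetical illustration, the cost of each question could grow with the number of questions already asked; the linear form, the function name `ask_cost`, and both coefficients below are assumptions.

```python
def ask_cost(questions_asked: int, base_cost: float = 0.3,
             fatigue_rate: float = 0.5) -> float:
    """Cost of the next question, growing linearly with questions already asked.

    Hypothetical fatigue model: not taken from the paper.
    """
    if questions_asked < 0:
        raise ValueError("questions_asked must be non-negative")
    return base_cost * (1.0 + fatigue_rate * questions_asked)


# Cost of the first, second, and third question in one episode.
costs = [ask_cost(k) for k in range(3)]
```

With a rising cost like this, repeated interruptions become progressively less attractive relative to memory lookups or a single navigation step.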
Optimization Features
Token Efficiency
- GRPO
Infra Optimization
- Experiments run on 8 × NVIDIA H20 GPUs
Model Optimization
- GRPO
System Optimization
- Semantic navigation primitives (A* path planner) to isolate reasoning
Training Optimization
- Group-relative advantage estimation (no critic)
- SFT
Inference Optimization
- GRPO
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Results are in simulation (AI2-THOR); real sensors and occlusion may reduce performance.
- Cost function uses fixed coefficients; real deployments need dynamic or learned cost models.
- Inference latency of a 7B MLLM may be too high for some real-time applications.
When Not To Use
- Settings where perception is unreliable (occlusion, noisy sensors), unless additional robustness measures are in place.
- Tasks that cannot accept any human queries or memory logging.
- Edge hardware with strict latency and memory limits for large MLLMs.
Failure Modes
- Premature navigation after partial or vague user answers, leading to wrong picks.
- Over-reliance on simulated oracle behavior that does not match real human responses.
- Suboptimal cost hyperparameters can lead to excessive questioning or wasted navigation.
Core Entities
Models
- Qwen2.5-VL-7B
- Qwen2.5-VL-32B
- Gemini-2.5-Pro
- Gemini-2.5-Flash
Metrics
- Success Rate (SR)
- Total Task Cost (TTC)
- Success Weighted by Cost (SwC)
- LLM-based Decision Quality Score
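The first three metrics can be sketched as simple aggregates over episodes. SR and TTC are plain averages; the SwC formula below is an assumption modeled on the SPL metric from navigation benchmarks (success scaled by the ratio of a reference cost to the cost actually incurred), because the source does not give its definition. The episode tuple layout and function names are also hypothetical.

```python
from typing import List, Tuple

# (success, cost incurred, reference cost); all costs assumed positive.
Episode = Tuple[bool, float, float]


def success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes that succeeded."""
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def avg_total_task_cost(episodes: List[Episode]) -> float:
    """Mean cost incurred per episode, regardless of outcome."""
    return sum(cost for _, cost, _ in episodes) / len(episodes)


def success_weighted_by_cost(episodes: List[Episode]) -> float:
    """Assumed SPL-style score: success scaled by reference / max(cost, reference)."""
    total = 0.0
    for success, cost, reference in episodes:
        if success:
            total += reference / max(cost, reference)
    return total / len(episodes)


episodes = [(True, 2.0, 2.0), (True, 4.0, 2.0), (False, 6.0, 2.0), (True, 2.0, 1.0)]
sr = success_rate(episodes)
ttc = avg_total_task_cost(episodes)
swc = success_weighted_by_cost(episodes)
```

Under this assumed definition, an agent that succeeds at exactly the reference cost scores 1 for that episode, and costlier successes score proportionally less.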
Datasets
- ESearch-Bench (AI2-THOR simulated tasks)
Benchmarks
- ESearch-Bench
- ObjectNav
- InstanceNav
Context Entities
Models
- SFT
- Gemini-1.5-Pro (used for task synthesis)
Metrics
- Mean return (reward minus cost)
- Average CoT length
Datasets
- Synthetic ambiguous instructions (LLM-generated)
Benchmarks
- TEACh, DialFRED (related interactive datasets)

