Train multimodal LLM agents to ask or recall before moving, halving physical search cost in simulation.

December 21, 2025 · 7 min

Overview

Production Readiness

0.35

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

0

Authors

Weijie Zhou, Xuangtang Xiong, Ye Tian, Lijun Yue, Xinyu Wu, Wei Li, Chaoyang Zhao, Honghui Dong, Ming Tang, Jinqiao Wang, Zhengyou Zhang

Links

Abstract / PDF

Why It Matters For Business

Robots or service agents that ask and recall before moving save time and energy. Halving navigation cost directly lowers operational expense and extends robot lifespan. Better trade-offs also reduce user annoyance from frequent interruptions.

Summary TLDR

ESearch-R1 teaches a multimodal LLM agent to trade off asking the user, checking episodic memory, and navigating so it avoids costly physical search. A new RL algorithm (HC-GRPO) trains the agent by sampling groups of reasoning trajectories and reinforcing those that achieve high information gain at low heterogeneous cost. In AI2-THOR simulation, the approach keeps or improves success while cutting average task execution cost by roughly half compared to strong ReAct baselines.

Problem Statement

Standard MLLM agents treat reasoning and physical movement as if they cost the same. On real robots, movement is expensive and interrupting a human carries a social cost. As a result, agents often either brute-force the search or over-question, wasting time and user attention. The problem is to learn a policy that explicitly trades off information gain against heterogeneous costs (navigation, asking, memory retrieval) so that the agent resolves ambiguity cheaply.
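To make the trade-off concrete, here is a minimal sketch of a cost-aware return, assuming an additive "reward minus cost" form. The cost values and the names `ACTION_COST`, `trajectory_cost`, and `cost_aware_return` are illustrative assumptions, not the paper's coefficients or API.

```python
# Hypothetical per-action costs. The three action names come from the summary;
# the numbers are illustrative assumptions, not the paper's coefficients.
ACTION_COST = {"Navigate": 1.0, "Ask": 0.3, "GetMemory": 0.0}


def trajectory_cost(actions):
    """Sum the heterogeneous costs of a sequence of action names."""
    return sum(ACTION_COST[a] for a in actions)


def cost_aware_return(actions, success, success_reward=1.0):
    """Task reward minus accumulated action cost (assumed additive form)."""
    reward = success_reward if success else 0.0
    return reward - trajectory_cost(actions)


# Two ways to solve the same ambiguous task.
brute_force = ["Navigate", "Navigate", "Navigate"]
ask_first = ["GetMemory", "Ask", "Navigate"]

print(cost_aware_return(brute_force, success=True))
print(cost_aware_return(ask_first, success=True))
```

Under these assumed costs, the trajectory that recalls and asks before moving scores higher than the brute-force search even though both succeed, which is the behavior the policy is meant to learn.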

Main Contribution

ESearch-R1: a unified decision framework that treats Ask, GetMemory, and Navigate as actions with explicit costs.

HC-GRPO: a group-relative RL algorithm for MLLMs that removes the need for a learned value critic and optimizes reasoning trajectories for cost-aware behavior.

ESearch-Bench: a simulator benchmark (AI2-THOR) with ambiguous, partially observable tasks, a simulated user with fatigue, and automated task generation for evaluating cost-aware disambiguation.
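The critic-free idea behind group-relative methods can be sketched as follows: sample several trajectories for the same task, then score each one relative to its own group instead of against a learned value function. This follows the standard GRPO normalization (subtract the group mean, divide by the group standard deviation); HC-GRPO's exact formulation may differ, and the function name and example returns are assumptions.

```python
import statistics


def group_relative_advantages(returns, eps=1e-8):
    """Normalize each trajectory's return against its own sampled group.

    No value critic is needed: the group mean acts as the baseline.
    """
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in returns]


# Hypothetical cost-aware returns for four sampled trajectories of one task.
group_returns = [-2.0, -0.3, -1.3, 0.7]
advantages = group_relative_advantages(group_returns)
print(advantages)
```

Trajectories with above-average return in their group receive positive advantages and are reinforced; the rest are pushed down.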

Key Findings

ESearch-R1 cuts average operational cost by about half compared to a strong ReAct baseline.

Numbers: TTC reduced from 3.3 to 1.6 (≈50%) vs ReAct Qwen2.5-VL-32B on ESearch-Bench

Under high ambiguity (3–4 distractors), ESearch-R1 substantially raises success rate.

Numbers: Success Rate of ESearch-R1 60.0% vs ReAct (typical) ≈22% on high-ambiguity tasks

Interactive dialogue and episodic memory are essential for efficiency.

Numbers (ablation): removing Ask drops SR to 10.5%; removing GetMemory lowers SR to 52.0% and raises cost (TTC 2.3)

Results

Avg. Success Rate (SR)

Value: 61.5%

Baseline: ReAct (Qwen2.5-VL-32B) 60.0%

Avg. Total Task Cost (TTC)

Value: 1.6

Baseline: ReAct (Qwen2.5-VL-32B) 3.3

Success Rate under high ambiguity (3–4 distractors)

Value: 60.0%

Baseline: ReAct (typical) ≈22%

Success Weighted by Cost (SwC)

Value: 0.59

Baseline: ReAct (Qwen2.5-VL-32B) 0.36
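The headline comparisons can be recomputed from the reported figures. The snippet below is plain arithmetic on the numbers listed in this section; the helper name `relative_reduction` is an assumption for illustration.

```python
def relative_reduction(baseline, value):
    """Fraction by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline


ttc_cut = relative_reduction(3.3, 1.6)  # cost reduction vs the ReAct baseline
swc_ratio = 0.59 / 0.36                 # SwC relative to the ReAct baseline
sr_gain_avg = 61.5 - 60.0               # average SR gain, percentage points
sr_gain_hard = 60.0 - 22.0              # high-ambiguity SR gain, percentage points

print(f"TTC reduction: {ttc_cut:.1%}")
print(f"SwC ratio: {swc_ratio:.2f}x")
print(f"SR gain (avg): {sr_gain_avg:.1f} points")
print(f"SR gain (high ambiguity): about {sr_gain_hard:.0f} points")
```

The cost reduction works out to roughly 51.5%, consistent with the "about half" claim, while average success rate moves by only 1.5 points. The gain is therefore mostly in efficiency, except under high ambiguity, where success rate also rises sharply.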

Who Should Care

What To Try In 7 Days

Implement a simple episodic memory (timestamped observations) and a one-question Ask action in your simulator.

Add per-action costs (navigate, ask, memory) to your reward and run short RL or policy search experiments to see behavior shifts.

Warm-start the policy with supervised CoT traces and then do a small group-sampling optimization loop like GRPO to prefer low-cost trajectories.
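A starting point for the first step might look like the sketch below: a timestamped episodic memory plus a toy rule that recalls first, asks one question if the target is still ambiguous, and only then navigates. All class and function names here are hypothetical, and the hand-written rule stands in for the learned policy described in the paper.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Observation:
    timestamp: float
    object_name: str
    location: str


@dataclass
class EpisodicMemory:
    records: list = field(default_factory=list)

    def add(self, object_name, location, timestamp=None):
        """Store a timestamped sighting of an object."""
        ts = time.time() if timestamp is None else timestamp
        self.records.append(Observation(ts, object_name, location))

    def get_memory(self, object_name):
        """Return the most recent observation of object_name, or None."""
        matches = [r for r in self.records if r.object_name == object_name]
        return max(matches, key=lambda r: r.timestamp) if matches else None


def choose_action(memory, target, candidates):
    """Toy rule: recall first, ask once if still ambiguous, else navigate."""
    hit = memory.get_memory(target)
    if hit is not None:
        return ("Navigate", hit.location)
    if len(candidates) > 1:
        return ("Ask", f"Which {target} do you mean?")
    return ("Navigate", candidates[0] if candidates else "explore")


memory = EpisodicMemory()
memory.add("mug", "kitchen", timestamp=1.0)
memory.add("mug", "desk", timestamp=2.0)
print(choose_action(memory, "mug", ["kitchen", "desk"]))
print(choose_action(EpisodicMemory(), "mug", ["kitchen", "desk"]))
```

Once per-action costs are added to the reward, this fixed rule can be replaced by a policy that learns when each of the three actions is worth its cost.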

Agent Features

Memory

  • Episodic memory retrieval (zero/low cost)

Planning

  • Cost-aware planning with group-sampled reasoning trajectories
  • GRPO

Tool Use

  • Ask (user query)
  • GetMemory (episodic retrieval)
  • Navigate (semantic navigation primitive)

Frameworks

  • GRPO
  • ReAct (baseline)
  • SFT

Is Agentic

true

Architectures

  • Multimodal Large Language Model (MLLM) policy
  • Chain-of-Thought reasoning traces

Collaboration

  • Human-in-the-loop via Ask (simulated user with fatigue model)

Optimization Features

Token Efficiency

  • GRPO

Infra Optimization

  • Experiments run on 8 × NVIDIA H20 GPUs

Model Optimization

  • GRPO

System Optimization

  • Semantic navigation primitives (A* path planner) to isolate reasoning
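For readers unfamiliar with what sits behind a navigation primitive, here is a minimal A* sketch on a 4-connected occupancy grid with a Manhattan-distance heuristic. It only illustrates the general technique; the grid representation and the function name `astar` are assumptions, and the planner used in the paper's AI2-THOR setup may be implemented differently.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on a grid of 0 (free) / 1 (blocked) cells.

    Returns a list of (row, col) cells from start to goal, or None.
    """
    def h(cell):
        # Manhattan distance: admissible and consistent for unit-cost moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start)]  # entries are (f = g + h, g, cell)
    came_from = {start: None}
    best_cost = {start: 0}
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale queue entry, a cheaper route was already found
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    came_from[nxt] = cell
                    heapq.heappush(frontier, (new_cost + h(nxt), new_cost, nxt))
    return None


grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(astar(grid, (0, 0), (2, 0)))
```

Delegating low-level pathfinding to a planner like this lets the MLLM policy emit only a high-level Navigate decision, so experiments measure reasoning quality instead of motor control.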

Training Optimization

  • Group-relative advantage estimation (no critic)
  • SFT

Inference Optimization

  • GRPO

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Results are in simulation (AI2-THOR); real sensors and occlusion may reduce performance.
  • Cost function uses fixed coefficients; real deployments need dynamic or learned cost models.
  • Inference latency of a 7B MLLM may be too high for some real-time applications.

When Not To Use

  • Settings where perception is unreliable (occlusion, noisy sensors) without additional robustness.
  • Tasks that cannot accept any human queries or memory logging.
  • Edge hardware with strict latency and memory limits for large MLLMs.

Failure Modes

  • Premature navigation after partial or vague user answers, leading to wrong picks.
  • Over-reliance on simulated oracle behavior that mismatches real human responses.
  • Suboptimal cost hyperparameters can lead to excessive questioning or wasted navigation.

Core Entities

Models

  • Qwen2.5-VL-7B
  • Qwen2.5-VL-32B
  • Gemini-2.5-Pro
  • Gemini-2.5-Flash

Metrics

  • Success Rate (SR)
  • Total Task Cost (TTC)
  • Success Weighted by Cost (SwC)
  • LLM-based Decision Quality Score

Datasets

  • ESearch-Bench (AI2-THOR simulated tasks)

Benchmarks

  • ESearch-Bench
  • ObjectNav
  • InstanceNav

Context Entities

Models

  • SFT
  • Gemini-1.5-Pro (used for task synthesis)

Metrics

  • Mean return (reward minus cost)
  • Average CoT length

Datasets

  • Synthetic ambiguous instructions (LLM-generated)

Benchmarks

  • TEACh, DialFRED (related interactive datasets)