1 CFCS, Peking University
2 BAAI
3 CEIE, Tongji University
4 Huazhong University of Science and Technology
5 Tencent AI Lab
6 National University of Defense Technology
* equal contributions
† corresponding author
Teaser. We present a 3D-aware ObjectNav framework along with simultaneous exploration and identification policies: A$\rightarrow$B, the agent is guided by an exploration policy to search for its target; B$\rightarrow$C, the agent consistently identifies the target object and finally calls STOP.
Object goal navigation (ObjectNav) in unseen environments is a fundamental task for Embodied AI. Agents in existing works learn ObjectNav policies based on 2D maps, scene graphs, or image sequences. Since this task takes place in 3D space, a 3D-aware agent can advance its ObjectNav capability by learning from fine-grained spatial information. However, leveraging a 3D scene representation can be prohibitively impractical for policy learning in this floor-level task, due to low sample efficiency and expensive computational cost. In this work, we propose a framework for the challenging 3D-aware ObjectNav based on two straightforward sub-policies. The two sub-policies, namely a corner-guided exploration policy and a category-aware identification policy, perform simultaneously by utilizing online-fused 3D points as observation.
We take in a posed RGB-D image at time step $t$ and perform a point-based construction algorithm to online fuse a 3D scene representation $M_{3D}^{(t)}$, along with a 2D map $M_{2D}^{(t)}$ obtained by projecting semantics. Then, we simultaneously leverage two policies, a corner-guided exploration policy $\pi_e$ and a category-aware identification policy $\pi_f$, to predict a discrete corner goal $g_e^{(t)}$ and a target goal $g_f^{(t)}$ (if it exists), respectively. Finally, the local planning module drives the agent to the given target goal $g_f^{(t)}$ (top priority) or the corner goal $g_e^{(t)}$.
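The per-step loop described above can be summarized as the following minimal Python sketch. All function names here (fuse_points, project_semantics, exploration_policy, identification_policy, local_planner) are hypothetical placeholders for illustration and do not correspond to the released implementation.

# Minimal sketch of the per-step navigation loop described above.
def navigation_step(rgb, depth, pose, M_3d, agent_pose):
    # Online point-based fusion: add the new posed RGB-D frame to the
    # 3D scene representation M_3D^(t).
    M_3d = fuse_points(M_3d, rgb, depth, pose)

    # Project per-point semantics down to the 2D map M_2D^(t).
    M_2d = project_semantics(M_3d)

    # The two sub-policies run simultaneously on the fused observation.
    corner_goal = exploration_policy(M_3d, M_2d)      # g_e^(t)
    target_goal = identification_policy(M_3d, M_2d)   # g_f^(t), may be None

    # The target goal (if identified) takes priority over the corner goal.
    goal = target_goal if target_goal is not None else corner_goal
    action = local_planner(M_2d, agent_pose, goal)
    return action, M_3d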
Left: A robot takes multi-view observations during navigation. Right: The points $p$ are organized by dynamically allocated blocks $B$ and per-point octrees $O$, which can be used to query the neighboring points of any given point.
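To make the block-based organization above concrete, the following self-contained Python sketch hashes points into dynamically allocated blocks and answers a radius query by inspecting only the block containing the query point and its adjacent blocks. The names BlockPointMap and BLOCK_SIZE are illustrative assumptions, and the per-point octrees $O$ are simplified here to a brute-force search inside the candidate blocks.

from collections import defaultdict
import numpy as np

BLOCK_SIZE = 0.5  # block edge length in meters (assumed value)

def block_key(p):
    # Integer block coordinate for a 3D point p.
    return tuple(np.floor(np.asarray(p) / BLOCK_SIZE).astype(int))

class BlockPointMap:
    def __init__(self):
        # Blocks are allocated dynamically: a key appears only when a
        # point falls inside that block.
        self.blocks = defaultdict(list)

    def insert(self, points):
        for p in np.asarray(points, dtype=float):
            self.blocks[block_key(p)].append(p)

    def query_neighbors(self, q, radius):
        # Return all stored points within `radius` of the query point q.
        q = np.asarray(q, dtype=float)
        reach = int(np.ceil(radius / BLOCK_SIZE))
        cx, cy, cz = block_key(q)
        neighbors = []
        for dx in range(-reach, reach + 1):
            for dy in range(-reach, reach + 1):
                for dz in range(-reach, reach + 1):
                    for p in self.blocks.get((cx + dx, cy + dy, cz + dz), []):
                        if np.linalg.norm(p - q) <= radius:
                            neighbors.append(p)
        return neighbors

# Usage: fuse a batch of points and query a local neighborhood.
pmap = BlockPointMap()
pmap.insert(np.random.rand(1000, 3) * 5.0)
local_pts = pmap.query_neighbors([2.5, 2.5, 2.5], radius=0.3)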
@article{zhang20223d,
title={3D-Aware Object Goal Navigation via Simultaneous Exploration and Identification},
author={Zhang, Jiazhao and Dai, Liu and Meng, Fanpeng and Fan, Qingnan and Chen, Xuelin and Xu, Kai and Wang, He},
journal={arXiv preprint arXiv:2212.00338},
year={2022}
}
If you have any questions, please feel free to contact Jiazhao Zhang at zhngjizh_at_gmail_dot_com, Liu Dai at dailiu_dot_cndl_at_gmail_dot_com, and He Wang at hewang_at_pku_dot_edu_dot_cn