Embodied Navigation Foundation Model

Jiazhao Zhang1,2,*,   Anqi Li1,2,*,   Yunpeng Qi3,4,*,   Minghan Li2,*,   Jiahang Liu2,  
Shaoan Wang1,   Haoran Liu1,2,   Gengze Zhou5,   Yuze Wu6,   Xingxing Li6,   Yuxin Fan6,  
Wenjun Li6,   Zhibo Chen3,   Fei Gao6,7,   Qi Wu5,   Zhizheng Zhang2,4,†   He Wang1,2,4,†
1Peking University       2GalBot       3USTC       4BAAI      
5University of Adelaide       6Zhejiang University       7Differential Robotics      
*Joint First Author      †Equal Advising
Teaser image

NavFoM is a cross-embodiment, cross-task navigation model trained on 8 million samples covering quadrupeds, drones, wheeled robots, and vehicles, and spanning tasks including vision-and-language navigation, object searching, target tracking, and autonomous driving.

NavFoM Pipeline

Pipeline image

Our method provides a unified framework for handling multiple tasks, including Image QA, Video QA, and Navigation. We organize text tokens and visual tokens using temporal-viewpoint indicator tokens. For question answering, the model employs a conventional language-modeling head in an autoregressive manner; for navigation, it uses a planning head that directly predicts trajectories.
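To make the shared backbone and the two task heads concrete, here is a minimal Python sketch. It assumes a generic multimodal transformer backbone operating on pre-computed embeddings; the names UnifiedNavModel, lm_head, and planning_head are illustrative assumptions, not the released NavFoM implementation.

import torch
import torch.nn as nn

class UnifiedNavModel(nn.Module):
    """Minimal sketch: one shared backbone, two task-specific heads."""
    def __init__(self, backbone: nn.Module, hidden_dim: int, vocab_size: int, num_waypoints: int = 8):
        super().__init__()
        self.backbone = backbone                                        # shared multimodal transformer
        self.lm_head = nn.Linear(hidden_dim, vocab_size)                # autoregressive head for Image/Video QA
        self.planning_head = nn.Linear(hidden_dim, num_waypoints * 2)   # (x, y) waypoints for navigation
        self.num_waypoints = num_waypoints

    def forward(self, text_embeds, visual_embeds, tvi_embeds, task: str):
        # Place indicator embeddings with the visual embeddings, then append text,
        # mirroring the token organization described above (all inputs are embeddings).
        tokens = torch.cat([tvi_embeds, visual_embeds, text_embeds], dim=1)
        hidden = self.backbone(tokens)                                  # [batch, seq_len, hidden_dim]
        if task == "qa":
            return self.lm_head(hidden)                                 # next-token logits for autoregressive decoding
        # Navigation: decode a short trajectory from the final hidden state.
        return self.planning_head(hidden[:, -1]).view(-1, self.num_waypoints, 2)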

NavFoM Forwarding for Navigation Tasks

Pipeline image

For navigation, (a) our approach utilizes both coarse-grained and fine-grained visual tokens. (b) The navigation history is efficiently sampled under a fixed token budget using our Budget-Aware Temporal Sampling (BATS) method. (c) To distinguish historical information from different timesteps and viewpoints, we employ Temporal-Viewpoint Indicator (TVI) tokens, which encode both temporal and angular information.
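For readers who want a concrete picture of BATS and TVI tokens, the Python sketch below shows one plausible reading: frames are subsampled so that their visual tokens fit a fixed budget (always keeping the latest observation), and each retained frame receives an indicator vector encoding its timestep and camera yaw. The uniform sampling schedule and the sinusoidal/angular encoding are assumptions for illustration, not the exact NavFoM formulation.

import math
from typing import List

def bats_sample(num_frames: int, tokens_per_frame: int, token_budget: int) -> List[int]:
    """Pick frame indices whose visual tokens fit within a fixed token budget."""
    max_frames = max(1, token_budget // tokens_per_frame)
    if num_frames <= max_frames:
        return list(range(num_frames))              # everything fits; keep the full history
    if max_frames == 1:
        return [num_frames - 1]                     # budget only allows the current frame
    # Uniformly subsample the history; the last index is always the current frame.
    stride = (num_frames - 1) / (max_frames - 1)
    return [round(i * stride) for i in range(max_frames)]

def tvi_encoding(timestep: int, view_angle_rad: float, dim: int = 8) -> List[float]:
    """Encode 'when' (timestep) and 'where' (camera yaw) into one indicator vector."""
    half = dim // 2
    time_part = [math.sin(timestep / 10000 ** (2 * i / half)) for i in range(half)]
    angle_part = [math.sin(view_angle_rad + 2 * math.pi * i / half) for i in range(half)]
    return time_part + angle_part

# Example: a 120-step history with 4 cameras x 16 tokens per frame and a 1,024-token
# budget keeps only 16 frames: bats_sample(120, 64, 1024) returns 16 indices ending at 119.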

Standard Test Cases (VLN)

Standard Test Cases (Tracking)

Standard Test Cases (ObjNav)

VLN-CE RxR (Four Views)

VLN-CE RxR (Single Front View)

Tracking EVT-Bench Distracted Target (Four Views)

ObjNav HM3D-OVON (Four Views)

VLN OpenUAV (Four Views)

Autonomous Driving nuScenes (Six Views)

Autonomous Driving OpenScene (Eight Views)

Real-world Deployment System

Real-world deployment image

We deploy our model on a remote server equipped with a GeForce RTX 5090 GPU and use the Internet for communication between the server and the client (which comprises the controller and the embodiments). Given a user instruction, each robot compresses its current observations and transmits them to the server. The server processes the observations together with the instruction and outputs a trajectory. This trajectory is then consumed by the robot's local planner, which issues the appropriate commands (e.g., velocity or joint controls) to drive the robot.
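As a rough illustration of the client side of this loop, the Python sketch below assumes a hypothetical HTTP endpoint and hypothetical robot/local-planner interfaces (capture_jpeg_frames, track, apply_command); the actual communication protocol and interfaces of the deployed system are not specified here.

import base64
import time
import requests

SERVER_URL = "http://<remote-server>:8000/plan"  # hypothetical endpoint, not the real one

def run_client(robot, local_planner, instruction: str, period_s: float = 0.5):
    """Send compressed observations to the server and track the returned trajectory locally."""
    while not robot.goal_reached():
        # Compress the current multi-view observations (e.g., JPEG) before upload.
        frames = [base64.b64encode(jpg).decode("ascii") for jpg in robot.capture_jpeg_frames()]
        reply = requests.post(SERVER_URL,
                              json={"instruction": instruction, "frames": frames},
                              timeout=5.0)
        trajectory = reply.json()["trajectory"]      # list of (x, y) waypoints from the server
        # The local planner converts the trajectory into embodiment-specific commands
        # (velocity or joint controls), which the robot then executes.
        command = local_planner.track(trajectory)
        robot.apply_command(command)
        time.sleep(period_s)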

Acknowledgments

We sincerely thank Jianmin Wang and Wenhao Li for their support with the hardware setup. We also thank Chen Gao, Zhiyong Wang, Zhichao Hang, and Donglin Yang for their support with the experiments.