NaVid-4D: Unleashing Spatial Intelligence in Egocentric RGB-D Videos for Vision-and-Language Navigation

ICRA 2025


Haoran Liu1,2,*    Weikang Wan1,2,*    Xiqian Yu3,2,*    Minghan Li2,*    Jiazhao Zhang1,2    Bo Zhao4    Zhibo Chen3    Zhongyuan Wang5    Zhizheng Zhang2,5,†    He Wang1,2,5,†

1CFCS, School of Computer Science, Peking University    2Galbot    3University of Science and Technology of China    4Shanghai Jiao Tong University    5Beijing Academy of Artificial Intelligence   

* equal contributions    † corresponding authors

Abstract


Understanding and reasoning about 4D spacetime is crucial for Vision-and-Language Navigation (VLN). However, previous works lack in-depth exploration of this aspect, which bottlenecks the spatial perception and action precision of VLN agents. In this work, we introduce NaVid-4D, a Vision Language Model (VLM) based navigation agent that takes the lead in explicitly showcasing spatial intelligence in the real world. Given natural language instructions, NaVid-4D requires only egocentric RGB-D video streams as observations to perform spatial understanding and reasoning and to generate precise instruction-following robotic actions. NaVid-4D learns navigation policies from data collected in simulation environments and is endowed with precise spatial understanding and reasoning capabilities using web data. Without the need to pre-train an RGB-D foundation model, we propose a method that directly injects depth features into the visual encoder of a VLM. We further compare factually captured depth with monocularly estimated depth and find that NaVid-4D works well with both, while using estimated depth offers greater generalization capability and better mitigates the sim-to-real gap. Extensive experiments demonstrate that NaVid-4D achieves state-of-the-art performance in simulation environments and delivers impressive VLN performance with spatial intelligence in the real world.

4D spatio-temporal reasoning capabilities in vision-and-language navigation. NaVid-4D can comprehend and reason about 3D spatial and 1D temporal relationships across diverse tasks and directly predict actions to follow given instructions. (Images with blue, grey, and yellow borders represent the start, middle, and end states of each task, respectively.)


Video




Method


To explicitly take 3D input, we design a 3D-aware vision encoder that encodes RGB and depth images separately. The first 20 layers of the ViT are kept as separate RGB and depth branches that map the two modalities into an aligned feature space, while the last 20 layers are shared to process them further. We initialize the vision encoder with weights pretrained on large-scale RGB images and freeze only the RGB layers. In this way, we avoid the high cost of pretraining an RGB-D foundation model while still encoding 3D information explicitly.
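To make the layer-splitting scheme concrete, below is a minimal PyTorch sketch under our own assumptions: the 20/20 split follows the description above, but the class name, the idea of deep-copying the pretrained early blocks to initialize the depth branch, and the treatment of depth as a 3-channel image are illustrative choices, not taken from the released code.

    import copy
    import torch
    import torch.nn as nn

    class RGBD3DAwareEncoder(nn.Module):
        """Sketch: separate early ViT blocks per modality, shared late blocks."""

        def __init__(self, patch_embed: nn.Module, vit_blocks: nn.ModuleList, n_split: int = 20):
            super().__init__()
            # Early blocks: one copy per modality, so RGB and depth tokens are
            # mapped into an aligned feature space before joint processing.
            self.rgb_patch_embed = patch_embed
            self.depth_patch_embed = copy.deepcopy(patch_embed)
            self.rgb_blocks = vit_blocks[:n_split]
            self.depth_blocks = copy.deepcopy(vit_blocks[:n_split])
            # Late blocks are shared between the two modalities.
            self.shared_blocks = vit_blocks[n_split:]
            # Freeze only the RGB-specific layers, which keep the RGB-pretrained weights.
            for p in list(self.rgb_patch_embed.parameters()) + list(self.rgb_blocks.parameters()):
                p.requires_grad = False

        def _encode(self, img: torch.Tensor, patch_embed: nn.Module,
                    early_blocks: nn.ModuleList) -> torch.Tensor:
            tokens = patch_embed(img)           # (B, N, C) patch tokens
            for blk in early_blocks:            # modality-specific early layers
                tokens = blk(tokens)
            for blk in self.shared_blocks:      # shared late layers
                tokens = blk(tokens)
            return tokens

        def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
            # Depth is assumed to be replicated to 3 channels so it can reuse
            # the RGB patch-embedding layout (an illustrative choice).
            rgb_tokens = self._encode(rgb, self.rgb_patch_embed, self.rgb_blocks)
            depth_tokens = self._encode(depth, self.depth_patch_embed, self.depth_blocks)
            return rgb_tokens, depth_tokens

The patch embedding and block list could come from any 40-layer ViT implementation (for example, an RGB-pretrained ViT loaded via timm); the sketch only assumes the blocks accept and return token tensors.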


Experiment Results


With explicit depth input, our model achieves better performance compared to NaVid and other baselines. The results also show that using estimated depth for the navigation data performs comparably to using ground-truth captured depth in simulation. Since estimated depth has a smaller sim-to-real gap, we use it at both training and inference time.
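As a concrete illustration of the estimated-depth setting, the sketch below produces per-frame monocular depth with an off-the-shelf estimator via the Hugging Face depth-estimation pipeline; the specific model (Intel/dpt-large) and the frame path are assumptions for illustration and are not necessarily the estimator used in the paper.

    from PIL import Image
    from transformers import pipeline

    # Off-the-shelf monocular depth estimator (illustrative model choice).
    depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")

    frame = Image.open("egocentric_frame.png")   # hypothetical RGB frame from the video stream
    out = depth_estimator(frame)
    depth_map = out["predicted_depth"]           # torch.Tensor of per-pixel relative depth
    depth_image = out["depth"]                   # PIL image, convenient for visualization

The estimated depth map can then be fed to the depth branch of the encoder in place of factually captured depth, at both training and inference time.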



Citation



  @inproceedings{liu2025navid4d,
    title={NaVid-4D: Unleashing Spatial Intelligence in Egocentric RGB-D Videos for Vision-and-Language Navigation},
    author={Liu, Haoran and Wan, Weikang and Yu, Xiqian and Li, Minghan and Zhang, Jiazhao and Zhao, Bo and Chen, Zhibo and Wang, Zhongyuan and Zhang, Zhizheng and Wang, He},
    booktitle={2025 IEEE International Conference on Robotics and Automation (ICRA)},
    year={2025}
  }

License


This work and the dataset are licensed under CC BY-NC 4.0.


Contact


If you have any questions, please feel free to contact Haoran Liu at lhrrhl0419@stu.pku.edu.cn, Zhizheng Zhang at zhangzz@galbot.com, or He Wang at hewang@pku.edu.cn.