NaVid-4D: Unleashing Spatial Intelligence in Egocentric RGB-D Videos for Vision-and-Language Navigation

ICRA 2025


Haoran Liu1,2,*    Weikang Wan1,2,*    Xiqian Yu3,2,*    Minghan Li2,*    Jiazhao Zhang1,2    Bo Zhao4    Zhibo Chen3    Zhongyuan Wang5    Zhizheng Zhang2,5,†    He Wang1,2,5,†

1CFCS, School of Computer Science, Peking University    2Galbot    3University of Science and Technology of China    4Shanghai Jiao Tong University    5Beijing Academy of Artificial Intelligence   

* equal contributions    † corresponding authors

Abstract


Understanding and reasoning about 4D spacetime is crucial for Vision-and-Language Navigation (VLN). However, previous works lack in-depth exploration of this aspect, which bottlenecks the spatial perception and action precision of VLN agents. In this work, we introduce NaVid-4D, a Vision Language Model (VLM) based navigation agent that takes the lead in explicitly showcasing spatial intelligence in the real world. Given natural language instructions, NaVid-4D requires only egocentric RGB-D video streams as observations to perform spatial understanding and reasoning and to generate precise instruction-following robotic actions. NaVid-4D learns navigation policies from data collected in simulation environments and is endowed with precise spatial understanding and reasoning capabilities using web data. Without the need to pre-train an RGB-D foundation model, we propose a method that directly injects depth features into the visual encoder of a VLM. We further compare factually captured depth with monocularly estimated depth and find that NaVid-4D works well with both, while using estimated depth offers greater generalization capability and better mitigates the sim-to-real gap. Extensive experiments demonstrate that NaVid-4D achieves state-of-the-art performance in simulation environments and delivers impressive VLN performance with spatial intelligence in the real world.

4D spatio-temporal reasoning capabilities in vision-and-language navigation. NaVid-4D can comprehend and reason about 3D spatial and 1D temporal relationships across diverse tasks and directly predict actions to follow given instructions. (Images with blue, grey, and yellow borders represent the start, middle, and end states of each task, respectively.)


Video




Method


To explicitly take 3D input, we design a 3D-aware vision encoder that encodes RGB and depth images separately. The first 20 layers of the ViT are kept as separate RGB and depth branches that map the two modalities into an aligned feature space, while the last 20 layers are shared to process them further. We initialize the vision encoder with weights pretrained on large-scale RGB images and freeze only the RGB layers. In this way, we avoid the high cost of pretraining an RGB-D foundation model while still encoding 3D information explicitly.
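To make the layer-splitting scheme concrete, below is a minimal PyTorch sketch under our own assumptions: the 20/20 split follows the description above, but the class name, the idea of deep-copying the pretrained early blocks to initialize the depth branch, and the treatment of depth as a 3-channel image are illustrative choices, not taken from the released code.

    import copy
    import torch
    import torch.nn as nn

    class RGBD3DAwareEncoder(nn.Module):
        """Sketch: separate early ViT blocks per modality, shared late blocks."""

        def __init__(self, patch_embed: nn.Module, vit_blocks: nn.ModuleList, n_split: int = 20):
            super().__init__()
            # Early blocks: one copy per modality, so RGB and depth tokens are
            # mapped into an aligned feature space before joint processing.
            self.rgb_patch_embed = patch_embed
            self.depth_patch_embed = copy.deepcopy(patch_embed)
            self.rgb_blocks = vit_blocks[:n_split]
            self.depth_blocks = copy.deepcopy(vit_blocks[:n_split])
            # Late blocks are shared between the two modalities.
            self.shared_blocks = vit_blocks[n_split:]
            # Freeze only the RGB-specific layers, which keep the RGB-pretrained weights.
            for p in list(self.rgb_patch_embed.parameters()) + list(self.rgb_blocks.parameters()):
                p.requires_grad = False

        def _encode(self, img: torch.Tensor, patch_embed: nn.Module,
                    early_blocks: nn.ModuleList) -> torch.Tensor:
            tokens = patch_embed(img)           # (B, N, C) patch tokens
            for blk in early_blocks:            # modality-specific early layers
                tokens = blk(tokens)
            for blk in self.shared_blocks:      # shared late layers
                tokens = blk(tokens)
            return tokens

        def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
            # Depth is assumed to be replicated to 3 channels so it can reuse
            # the RGB patch-embedding layout (an illustrative choice).
            rgb_tokens = self._encode(rgb, self.rgb_patch_embed, self.rgb_blocks)
            depth_tokens = self._encode(depth, self.depth_patch_embed, self.depth_blocks)
            return rgb_tokens, depth_tokens

The patch embedding and block list could come from any 40-layer ViT implementation (for example, an RGB-pretrained ViT loaded via timm); the sketch only assumes the blocks accept and return token tensors.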


Experiment Results


With explicit depth input, our model achieves better performance compared to NaVid and other baselines. The results also show that using estimated depth for the navigation data performs comparably to using ground-truth captured depth in simulation. Since estimated depth has a smaller sim-to-real gap, we use it at both training and inference time.
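As a concrete illustration of the estimated-depth setting, the sketch below produces per-frame monocular depth with an off-the-shelf estimator via the Hugging Face depth-estimation pipeline; the specific model (Intel/dpt-large) and the frame path are assumptions for illustration and are not necessarily the estimator used in the paper.

    from PIL import Image
    from transformers import pipeline

    # Off-the-shelf monocular depth estimator (illustrative model choice).
    depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")

    frame = Image.open("egocentric_frame.png")   # hypothetical RGB frame from the video stream
    out = depth_estimator(frame)
    depth_map = out["predicted_depth"]           # torch.Tensor of per-pixel relative depth
    depth_image = out["depth"]                   # PIL image, convenient for visualization

The estimated depth map can then be fed to the depth branch of the encoder in place of factually captured depth, at both training and inference time.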



Citation



  @inproceedings{liu2025navid4d,
    title={NaVid-4D: Unleashing Spatial Intelligence in Egocentric RGB-D Videos for Vision-and-Language Navigation},
    author={Liu, Haoran and Wan, Weikang and Yu, Xiqian and Li, Minghan and Zhang, Jiazhao and Zhao, Bo and Chen, Zhibo and Wang, Zhongyuan and Zhang, Zhizheng and Wang, He},
    booktitle={2025 IEEE International Conference on Robotics and Automation (ICRA)},
    year={2025}
  }

License


This work and the dataset are licensed under CC BY-NC 4.0.


Contact


If you have any questions, please feel free to contact Haoran Liu at lhrrhl0419@stu.pku.edu.cn, Zhizheng Zhang at zhangzz@galbot.com, or He Wang at hewang@pku.edu.cn.