NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

Jiazhao Zhang1,2,*   Kunyu Wang2,*   Rongtao Xu2,3,*   Gengze Zhou4   Yicong Hong5   Xiaomeng Fang2   Qi Wu4   Zhizheng Zhang6,†   He Wang1,2,†
1Peking University   2Beijing Academy of Artificial Intelligence   3CASIA   4University of Adelaide   5Australian National University   6GalBot
*Indicates Equal Contribution, †Indicates Equal Advising.
Robotics: Science and Systems (RSS 2024)


Highlights

  • NaVid is the first video-based Vision-Language Model (VLM) for the task of vision-and-language navigation (VLN).
  • NaVid navigates in a human-like manner, requiring solely an on-the-fly video stream from a monocular camera as input, without the need for maps, odometers, or depth inputs.
  • NaVid is trained on 510k VLN video samples from simulation environments and 763k real-world caption samples to achieve cross-scene generalization.
  • NaVid achieves state-of-the-art (SOTA) performance in both simulated and real-world environments and exhibits strong generalization to unseen scenarios.

Sim-to-Real Demos: Simple Instruction VLN

In these demos, the agent follows relatively simple instructions, such as walking to a single landmark. NaVid accurately distinguishes subtle differences between similar instructions and completes the corresponding navigation behaviors precisely.

Sim-to-Real Demos: Complex Instruction VLN

In these demos, the agent navigates according to complex instructions composed of multiple simple instructions in sequence. NaVid can accurately execute them in the correct order.

Abstract

Vision-and-Language Navigation (VLN) stands as a key research problem of Embodied AI, aiming at enabling agents to navigate in unseen environments following linguistic instructions. In this field, generalization is a long-standing challenge, whether to out-of-distribution scenes or from simulation to the real world. In this paper, we propose NaVid, a video-based large vision-language model (VLM), to mitigate such a generalization gap. NaVid makes the first endeavour to showcase the capability of VLMs to achieve state-of-the-art navigation performance without any maps, odometers, or depth inputs. Following human instructions, NaVid requires only an on-the-fly video stream from a monocular RGB camera mounted on the robot to output the next-step action. Our formulation mimics how humans navigate and naturally avoids the problems introduced by odometry noise and the Sim2Real gaps caused by map or depth inputs. Moreover, our video-based approach effectively encodes the robot's historical observations as spatio-temporal context for decision making and instruction following. We train NaVid with 510k navigation samples collected from VLN-CE trajectories, including action-planning and instruction-reasoning samples, along with 763k large-scale web caption samples. Extensive experiments show that NaVid achieves SOTA performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer. We thus believe our proposed VLM approach plans the next step not only for navigation agents but also for this research field. We will release the code and data to benefit the community.
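To make this formulation concrete, below is a minimal sketch of the resulting navigation loop: at every step, only the monocular RGB stream accumulated so far and the instruction are consumed, and the model is queried for the next low-level action. The NaVidModel, camera, and robot interfaces and the action strings are illustrative assumptions, not the released API.

from typing import List, Protocol

class NaVidModel(Protocol):
    # Hypothetical interface: maps the accumulated RGB frames and the
    # instruction to the next-step action, expressed as a short text command.
    def predict_action(self, frames: List, instruction: str) -> str: ...

def navigate(model: NaVidModel, camera, robot, instruction: str, max_steps: int = 200) -> None:
    """Closed-loop VLN driven by nothing but an on-the-fly monocular RGB stream."""
    frames = []                                    # on-the-fly video history
    for _ in range(max_steps):
        frames.append(camera.read())               # monocular RGB frame only
        action = model.predict_action(frames, instruction)
        if action == "stop":                       # the model decides when to terminate
            break
        robot.execute(action)                      # e.g. "move forward 25cm", "turn left 30 degrees"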

Summary Video

Method Overview


The overview of NaVid. The inputs consist of RGB frames from the online video observation {x0, · · · , xt} along with the human instruction I. For each frame, an observation encoder extracts visual information conditioned on the instruction to obtain observation tokens, comprising instruction-queried tokens (orange blocks) and instruction-agnostic tokens (blue blocks). At the current step t, the history frames and the current frame xt are encoded as observation tokens, with 4 and 64 instruction-agnostic tokens for each history frame and the current frame, respectively. In addition, a text encoder produces the language tokens. Finally, separated by the special tokens [HIS], [OBS], and [NAV], the observation tokens and language tokens are concatenated and fed to Vicuna-7B, which outputs the next-step action.
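The token layout described above can be illustrated with a short sketch. This is a minimal illustration, not the released implementation: the placeholder encoders, the single instruction-queried token per frame, and the exact ordering around [HIS], [OBS], and [NAV] are assumptions based on the caption; only the 4/64 instruction-agnostic token counts and the Vicuna-7B backbone come from the text.

import torch

HIDDEN = 4096            # Vicuna-7B hidden size
HIST_AGNOSTIC = 4        # instruction-agnostic tokens kept per history frame
CURR_AGNOSTIC = 64       # instruction-agnostic tokens kept for the current frame

def encode_frame(frame: torch.Tensor, n_agnostic: int) -> torch.Tensor:
    """Placeholder observation encoder: one instruction-queried token plus
    n_agnostic instruction-agnostic tokens per frame (random tensors here;
    the frame itself is not used in this sketch)."""
    return torch.cat([torch.randn(1, HIDDEN), torch.randn(n_agnostic, HIDDEN)])

def special(name: str) -> torch.Tensor:
    """Placeholder embedding for a special token such as [HIS], [OBS], [NAV]."""
    return torch.randn(1, HIDDEN)

def build_sequence(history, current, language_tokens):
    parts = [special("[HIS]")]
    parts += [encode_frame(f, HIST_AGNOSTIC) for f in history]          # 5 tokens per history frame
    parts += [special("[OBS]"), encode_frame(current, CURR_AGNOSTIC)]   # 65 tokens for the current frame
    parts += [language_tokens, special("[NAV]")]
    return torch.cat(parts)          # sequence handed to Vicuna-7B

frames = [torch.zeros(3, 224, 224)] * 9                 # 8 history frames + 1 current frame
seq = build_sequence(frames[:-1], frames[-1], torch.randn(20, HIDDEN))
print(seq.shape)                                        # torch.Size([128, 4096]) for this example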

Data Collection


We co-train NaVid using real-world caption data (763k) and simulated VLN data (510k). The simulated VLN data consists of 500k action planning samples and 10k instruction reasoning samples.
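As a rough illustration of how such a co-training mixture could be assembled, here is a small sketch. The sample counts come from the text above; the field names and the flat-shuffling strategy are illustrative assumptions, not the released training recipe.

import random

def build_cotraining_mixture(action_planning, instruction_reasoning, web_captions, seed=0):
    """Tag every sample with its task type and merge into one shuffled list.

    Expected sizes (per the text): ~500k action-planning, ~10k instruction-reasoning,
    and ~763k real-world caption samples.
    """
    mixture = (
        [{**s, "task": "action_planning"} for s in action_planning]
        + [{**s, "task": "instruction_reasoning"} for s in instruction_reasoning]
        + [{**s, "task": "caption"} for s in web_captions]
    )
    random.Random(seed).shuffle(mixture)
    return mixture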

Results of Navigation Video Captioning

Given an egocentric RGB video, NaVid is asked to describe the navigation trajectory. Example outputs:

NaVid: Walk forward and turn left. Wait by the first door on the left.
NaVid: Walk forward and turn left. Wait near the first doorway.
NaVid: Turn around and walk to the other side of the room. Wait by the chair and table.
NaVid: Exit the bathroom. Turn left and enter the bedroom. Wait near the bed.
NaVid: Walk through the doorway and turn left. Walk past the pool and wait by the first chair.
NaVid: Turn around and go into the hallway. Go into the room with the stairs and turn right. Go into the room with the kitchen and wait there.

Cross-scene Generalization Results on R2R

(R2R training split -> R2R validation unseen split)

Cross-scene Generalization Results from R2R to RxR

(R2R training split -> RxR validation unseen split)

BibTeX

@article{zhang2024navid,
  title={NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation},
  author={Zhang, Jiazhao and Wang, Kunyu and Xu, Rongtao and Zhou, Gengze and Hong, Yicong and Fang, Xiaomeng and Wu, Qi and Zhang, Zhizheng and Wang, He},
  journal={arXiv preprint arXiv:2402.15852},
  year={2024}
}