A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

Jiazhao Zhang¹,² Kunyu Wang³ Shaoan Wang¹,² Minghan Li² Haoran Liu¹,² Songlin Wei¹,² Zhongyuan Wang³ Zhizheng Zhang²,³,† He Wang¹,²,³,†

¹Peking University ²GalBot ³Beijing Academy of Artificial Intelligence

†Equal Advising

Paper Code (soon)

Robotics: Science and Systems (RSS 2025)

Highlights

Embodied Navigation Task Unification: Uni-NaVid is a navigation generalist integrating various embodied navigation tasks into one model, including Vision-and-Language Navigation (VLN), Object Navigation (ObjectNav), Embodied Question Answering (EQA), and Human-following tasks.
Task Synergy Against Data Hunger: Uni-NaVid is trained on 3.6 million navigation samples spanning different tasks and achieves significant performance improvements across diverse navigation benchmarks through effective task synergy.
High Efficiency and Non-blocking Deployment: Uni-NaVid adopts an online token merging strategy to reach about 5 Hz model inference, and predicts actions for multiple future steps to enable non-blocking deployment in the real world.
Impressive Sim-to-Real Results: Uni-NaVid can generate both actions and language by combining simulated action data and Internet semantic data in unified co-training. The resulting model exhibits impressive sim-to-real generalization in real-world environments.
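
The non-blocking deployment idea in the highlights can be illustrated with a small simulation. This is a minimal sketch, not the authors' deployment code: the `ActionBuffer` class, the `simulate` function, the string-valued actions, the `"stop"` fallback, and the tick-based latency model are all assumptions made for illustration. It only shows why predicting several future steps lets the robot keep moving while the next inference is still running.

```python
from collections import deque
from typing import Callable, Deque, List, Optional

Action = str  # hypothetical action type; the page does not list the action set


class ActionBuffer:
    """Holds predicted future actions so the controller never waits on inference."""

    def __init__(self, horizon: int) -> None:
        self.horizon = horizon
        self._queue: Deque[Action] = deque()

    def refill(self, predicted: List[Action]) -> None:
        # A fresh prediction replaces any stale actions still in the buffer.
        self._queue.clear()
        self._queue.extend(predicted[: self.horizon])

    def next_action(self, fallback: Action = "stop") -> Action:
        # Pop the oldest pending action, or fall back if the buffer ran dry.
        return self._queue.popleft() if self._queue else fallback

    def __len__(self) -> int:
        return len(self._queue)


def simulate(
    policy: Callable[[int], List[Action]],
    horizon: int,
    latency_ticks: int,
    total_ticks: int,
) -> List[Action]:
    """Run a control loop where each model call takes `latency_ticks` ticks."""
    buffer = ActionBuffer(horizon)
    executed: List[Action] = []
    pending: Optional[List[Action]] = policy(0)  # first prediction, ready at once
    ready_at = 0
    for tick in range(total_ticks):
        # Deliver the in-flight prediction once its latency has elapsed.
        if pending is not None and tick >= ready_at:
            buffer.refill(pending)
            pending = None
        # Start the next inference early enough that the buffer will not run dry.
        if pending is None and len(buffer) <= latency_ticks:
            pending = policy(tick)
            ready_at = tick + latency_ticks
        executed.append(buffer.next_action())
    return executed


def toy_policy(tick: int) -> List[Action]:
    # Stand-in for the model: four labelled future actions per call.
    return [f"a{tick}.{i}" for i in range(4)]


trace = simulate(toy_policy, horizon=4, latency_ticks=2, total_ticks=6)
```

With a horizon of four actions and a two-tick inference latency, the controller in this toy setup never has to use the fallback action, because a new prediction is requested while buffered actions are still being executed.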

Performance on Compositional Navigation Tasks

We deploy Uni-NaVid in real-world environments to complete compositional instructions that combine multiple navigation tasks.

Performance on Individual Navigation Tasks

We deploy Uni-NaVid in real-world environments to complete instructions for individual navigation tasks.

Method Overview

Overview of Uni-NaVid. Our method takes only single-view RGB frames {x₁, …, x_T} and a natural language instruction as input. For each frame, we extract 64 visual tokens using the vision encoder and then use online token merging to accelerate the model while retaining compact visual information. The merged tokens and instruction tokens are sent to the large language model to obtain actions for navigation or answers for embodied question answering.
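
To make the effect of token merging on sequence length concrete, here is a minimal sketch in plain Python. It is not the authors' merging rule: the overview only states that each frame yields 64 visual tokens and that they are merged online. The age-based scheme below (newest frame kept in full, recent frames pooled to 4 tokens, older frames pooled to 1 token), the `recent_window` value, and the function names `average_pool` and `merge_history` are all illustrative assumptions. The sketch also recomputes the merge from stored frames rather than updating it incrementally.

```python
from typing import List

Token = List[float]  # one visual token (feature vector)
Frame = List[Token]  # 64 tokens per frame, as stated in the overview


def average_pool(tokens: Frame, out_count: int) -> Frame:
    """Merge tokens into `out_count` tokens by averaging contiguous groups.

    Assumes len(tokens) is divisible by out_count.
    """
    group = len(tokens) // out_count
    dim = len(tokens[0])
    merged: Frame = []
    for g in range(out_count):
        chunk = tokens[g * group:(g + 1) * group]
        merged.append([sum(t[d] for t in chunk) / len(chunk) for d in range(dim)])
    return merged


def merge_history(
    frames: List[Frame],
    recent_window: int = 2,   # assumed: how many past frames count as "recent"
    recent_tokens: int = 4,   # assumed: tokens kept per recent frame
    old_tokens: int = 1,      # assumed: tokens kept per older frame
) -> Frame:
    """Build the visual token sequence, compressing frames more as they age."""
    merged: Frame = []
    newest = len(frames) - 1
    for i, frame in enumerate(frames):
        age = newest - i
        if age == 0:
            merged.extend(frame)  # newest frame keeps all of its tokens
        elif age <= recent_window:
            merged.extend(average_pool(frame, recent_tokens))
        else:
            merged.extend(average_pool(frame, old_tokens))
    return merged


# Five toy frames; token j of frame f is the 2-D vector [f, j].
frames = [[[float(f), float(j)] for j in range(64)] for f in range(5)]
visual_tokens = merge_history(frames)
unmerged_count = sum(len(frame) for frame in frames)
```

In this toy setting the sequence passed to the language model shrinks from 320 tokens to 74, and the count grows by only one token per additional old frame, which is the kind of saving that makes higher inference rates plausible on long videos.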

Data Collection (Navigation 3.6M + VQA 2.3M)

Vision-and-Language Navigation: requires the agent to comprehend language instructions and follow the described path. We collect 2.4M samples based on both VLN-CE R2R and VLN-CE RxR.

Object Goal Navigation: requires the agent to navigate an environment and locate a specific object based on provided visual or linguistic cues. We gather 483k samples from the Habitat Matterport 3D dataset.

Embodied Question Answering: requires the agent to navigate to the relevant area and answer a question about it. We collect 240k video-action samples and 10k video-answering samples from the EQA dataset in Matterport 3D environments.

Human Following: requires the agent to track and follow a human target matching a specific description in dynamic and crowded environments. We collect 544k human-following navigation samples.
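
As a quick sanity check on the data mixture, the per-task counts above can be tallied as follows. The counts are the rounded figures quoted on this page; the dictionary keys are labels chosen for this sketch, and whether the 10k video-answering samples are counted toward the navigation headline figure is not stated here, so the tally includes them as an assumption.

```python
# Rounded per-task sample counts as quoted on this page.
navigation_samples = {
    "VLN (R2R + RxR)": 2_400_000,
    "ObjectNav": 483_000,
    "EQA video-action": 240_000,
    "EQA video-answering": 10_000,  # assumption: counted with navigation data
    "Human following": 544_000,
}

total = sum(navigation_samples.values())

# Fraction of the mixture contributed by each task.
shares = {task: count / total for task, count in navigation_samples.items()}

for task, count in navigation_samples.items():
    print(f"{task:22s} {count:>9,d}  {shares[task]:6.1%}")
print(f"{'Total':22s} {total:>9,d}")
```

The rounded counts sum to about 3.68M, in line with the roughly 3.6M navigation samples in the section title, and VLN makes up close to two thirds of the mixture.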