
Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

Jiazhao Zhang1,2       Kunyu Wang2       Shaoan Wang1,2       Minghan Li2       Haoran Liu1,2       Songlin Wei1,2       Zhongyuan Wang2       Zhizheng Zhang2,3†       He Wang1,2,3†
1Peking University       2GalBot       3Beijing Academy of Artificial Intelligence
†Equal Advising


Highlights

  • Embodied Navigation Task Unification: Uni-NaVid is a navigation generalist that integrates various embodied navigation tasks into one model, including Vision-and-Language Navigation (VLN), Object Goal Navigation (ObjectNav), Embodied Question Answering (EQA), and Human Following.
  • Task Synergy Against Data Hunger: Uni-NaVid incorporates 3.6 million navigation samples spanning different tasks and achieves significant performance improvements across diverse navigation benchmarks through effective task synergy.
  • High Efficiency and Non-blocking Deployment: Uni-NaVid adopts an online token merging strategy to reach roughly 5 Hz model inference and predicts actions for multiple future steps, enabling non-blocking deployment in the real world (see the deployment sketch after this list).
  • Impressive Sim-to-Real Results: Uni-NaVid can generate both actions and language by combining simulated action data and Internet semantics in a unified co-training scheme, which yields impressive sim-to-real generalizability in real-world environments.
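
Below is a minimal sketch of what non-blocking deployment can look like: one thread queries the model at roughly 5 Hz for a short burst of future actions while a second thread executes from a queue, so the robot never stalls waiting on inference. The model.predict_actions and robot interfaces here are hypothetical placeholders, not the released Uni-NaVid API.

# Minimal non-blocking deployment sketch (hypothetical `model` / `robot` interfaces).
import queue
import threading
import time


def inference_loop(model, robot, instruction, action_queue, stop_event):
    """Continuously predict the next few actions (~5 Hz) without blocking execution."""
    while not stop_event.is_set():
        frames = robot.get_rgb_history()                      # single-view RGB frames seen so far
        actions = model.predict_actions(frames, instruction)  # a handful of future steps
        while not action_queue.empty():                       # drop stale actions before refilling
            try:
                action_queue.get_nowait()
            except queue.Empty:
                break
        for a in actions:
            action_queue.put(a)


def execution_loop(robot, action_queue, stop_event, step_period=0.2):
    """Execute queued actions; the robot never waits for the model."""
    while not stop_event.is_set():
        try:
            action = action_queue.get(timeout=step_period)
        except queue.Empty:
            continue                                          # no fresh action yet; try again
        robot.execute(action)
        time.sleep(step_period)


def run(model, robot, instruction):
    action_queue, stop_event = queue.Queue(), threading.Event()
    workers = [
        threading.Thread(target=inference_loop, args=(model, robot, instruction, action_queue, stop_event)),
        threading.Thread(target=execution_loop, args=(robot, action_queue, stop_event)),
    ]
    for w in workers:
        w.start()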


Performance on Compositional Navigation Tasks

We deploy Uni-NaVid in real-world environments to complete compositional instructions spanning multiple navigation tasks.

Performance on Individual Navigation Tasks

We deploy Uni-NaVid in real-world environments to complete instructions for individual navigation tasks.

Abstract

A practical navigation agent must be capable of handling a wide range of interaction demands, such as following instructions, searching for objects, answering questions, tracking people, and more. Existing models for embodied navigation fall short of serving as practical generalists in the real world, as they are often constrained by specific task configurations or pre-defined maps with discretized waypoints. In this work, we present Uni-NaVid, the first video-based vision-language-action (VLA) model designed to unify diverse embodied navigation tasks and enable seamless navigation for mixed long-horizon tasks in unseen real-world environments. Uni-NaVid achieves this by harmonizing the input and output data configurations for all commonly used embodied navigation tasks, thereby integrating all tasks in one model. For training Uni-NaVid, we collect 3.6 million navigation data samples in total from four essential navigation sub-tasks and foster synergy in learning across them. Extensive experiments on comprehensive navigation benchmarks clearly demonstrate the advantages of unified modeling in Uni-NaVid and show that it achieves state-of-the-art performance. Additionally, real-world experiments confirm the model's effectiveness and efficiency, shedding light on its strong generalizability.

Method Overview


The overview of Uni-NaVid. Our method takes only single-view RGB frames {x1, ..., xT} and a natural language instruction as input. For each frame, we extract 64 visual tokens with the vision encoder and then apply online token merging to accelerate the model while retaining compact visual information. The merged tokens and instruction tokens are fed to the large language model to obtain actions for navigation or answers for embodied question answering.
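
To make the token-merging idea concrete, here is a minimal sketch of one way online token merging can be realized: the newest frame keeps all 64 visual tokens, while older frames are spatially pooled down to a handful of tokens (or a single token) before being concatenated and fed to the LLM. The pooling ratios and the recent/older split below are illustrative assumptions, not the paper's exact schedule.

# Illustrative online token merging sketch (ratios and schedule are assumptions).
import torch
import torch.nn.functional as F


def merge_frame_tokens(frame_tokens, keep):
    """Pool a frame's visual tokens (N, C) on their 8x8 spatial grid down to `keep` tokens."""
    n, c = frame_tokens.shape                       # e.g. 64 tokens per frame
    side = int(n ** 0.5)                            # 8x8 grid for 64 tokens
    grid = frame_tokens.view(side, side, c).permute(2, 0, 1).unsqueeze(0)   # (1, C, 8, 8)
    out_side = int(keep ** 0.5)
    pooled = F.adaptive_avg_pool2d(grid, out_side)                          # (1, C, s, s)
    return pooled.squeeze(0).permute(1, 2, 0).reshape(-1, c)                # (keep, C)


def online_merge(history_tokens):
    """Keep the newest frame at full resolution; compress older frames more aggressively."""
    merged = []
    t = len(history_tokens)
    for i, tokens in enumerate(history_tokens):
        if i == t - 1:
            merged.append(tokens)                                 # current frame: all 64 tokens
        elif i >= t - 1 - 8:
            merged.append(merge_frame_tokens(tokens, keep=4))     # recent frames: 4 tokens (assumed)
        else:
            merged.append(merge_frame_tokens(tokens, keep=1))     # older frames: 1 token (assumed)
    return torch.cat(merged, dim=0)                               # compact sequence for the LLM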

Data Collection (Navigation 3.6M + VQA 2.3M)


Vision-and-Language Navigation: requires the agent to comprehend language instructions in order to follow the described path. We collect 2.4M samples based on both VLN-CE R2R and VLN-CE RxR.
Object Goal Navigation: requires the agent to navigate an environment and locate a specific object based on provided visual or linguistic cues. We gather 483k samples from the Habitat-Matterport 3D (HM3D) dataset.
Embodied Question Answering: requires the agent to navigate to the relevant area before answering a question. We collect 240k video-action samples and 10k video-answering samples from the EQA dataset in Matterport3D environments.
Human Following: requires the agent to track and follow a human target matching a specific description in dynamic and crowded environments. We collect 544k human-following navigation samples. A sketch of a shared sample schema that covers all four sub-tasks is given below.
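
The sketch below illustrates how samples from all four sub-tasks could be harmonized into a single record of video frames, an instruction, and either future actions or a textual answer. The field names and the discrete action set are illustrative assumptions, not the released data format.

# Illustrative unified sample schema (field names and action set are assumptions).
from dataclasses import dataclass, field
from typing import List, Optional

ACTIONS = ("FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP")   # assumed discrete action set


@dataclass
class NavSample:
    task: str                          # "vln" | "objectnav" | "eqa" | "human_following"
    frames: List[str]                  # paths to single-view RGB frames observed so far
    instruction: str                   # language instruction or question
    actions: List[str] = field(default_factory=list)   # future action labels (navigation samples)
    answer: Optional[str] = None       # textual answer (EQA video-answering samples)


# Every task reduces to "video + instruction in, actions and/or text out",
# which is what lets one model be co-trained on all samples.
vln = NavSample("vln", ["ep01/000.jpg", "ep01/001.jpg"],
                "Walk past the sofa and stop at the door.",
                actions=["FORWARD", "TURN_LEFT", "STOP"])
eqa = NavSample("eqa", ["ep02/000.jpg"], "What color is the couch?", answer="gray")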

Vision-and-Language Navigation

VLN-CE RxR Val-Unseen

Object Goal Navigation

HM3D ObjectNav

Embodied Question Answering

MP3D-EQA

Human Following

BibTeX

@misc{zhang2024uninavid,
      title={Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks},
      author={Jiazhao Zhang and Kunyu Wang and Shaoan Wang and Minghan Li and Haoran Liu and Songlin Wei and Zhongyuan Wang and Zhizheng Zhang and He Wang},
      year={2024},
      journal={arXiv preprint arXiv:2412.06224}
}