MM-Nav: Multi-View VLA Model for Robust Visual Navigation via Multi-Expert Learning

Tianyu Xu1,*   Jiawei Chen1,*   Jiazhao Zhang1,2,*   Wenyao Zhang2, 3   Zekun Qi2, 4   Minghan Li2   Zhizheng Zhang2,5,†   He Wang1,2,5,†
1Peking University       2Galbot       3Shanghai Jiao Tong University       4Tsinghua University       5BAAI      
*Joint First Author      †Corresponding Author

Summary Video

We introduce MM-Nav, a multi-view VLA model (with 360° observation) built on pretrained large language models and visual foundation models. To obtain large-scale navigation data, we collect expert data from three reinforcement learning (RL) experts trained with privileged depth information in three tailor-made, challenging environments targeting different navigation capabilities: reaching, squeezing, and avoiding.

Method

Our pipeline comprises two stages: (1) training multiple RL experts with different capabilities and performing the initial VLA fine-tuning, and (2) iterative teacher-student online training between the RL experts and the VLA.

[Teaser figure: pipeline overview]

Stage 1

We first train RL experts in different environments and collect
successful trajectories for VLA fine-tuning.
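
A minimal sketch of this stage in PyTorch, assuming hypothetical expert and environment interfaces (expert.act, env.render_rgb, env.step) and treating action prediction as plain regression for brevity; the actual action head and training objective may differ.

```python
import torch
from torch.utils.data import DataLoader

def collect_successful_trajectories(expert, env, num_episodes):
    """Roll out one RL expert and keep (multi-view RGB, goal, action) tuples
    from episodes that reach the goal."""
    data = []
    for _ in range(num_episodes):
        obs, goal = env.reset()
        episode, done, success = [], False, False
        while not done:
            action = expert.act(obs, goal)                    # expert uses privileged depth
            episode.append((env.render_rgb(), goal, action))  # VLA is trained on RGB views only
            obs, done, success = env.step(action)
        if success:                                           # keep successful episodes only
            data.extend(episode)
    return data

def finetune_vla(vla, data, epochs=1, lr=1e-5):
    """Behavior cloning on expert actions; a regression loss stands in for the real objective."""
    optimizer = torch.optim.AdamW(vla.parameters(), lr=lr)
    loader = DataLoader(data, batch_size=32, shuffle=True)
    vla.train()
    for _ in range(epochs):
        for views, goal, expert_action in loader:
            loss = torch.nn.functional.mse_loss(vla(views, goal), expert_action)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return vla
```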

Stage 2

We then collect RL expert data online based on VLA observations
and dynamically balance the training data ratio.


The second stage is repeated iteratively, with the training data ratio rebalanced dynamically at each iteration.
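
A minimal sketch of one such iteration, reusing the hypothetical interfaces and finetune_vla from the Stage 1 sketch. The balancing rule shown (weighting each expert's data inversely to the VLA's current success rate) is an illustrative assumption, not necessarily the exact scheme used.

```python
import random

def stage2_iteration(vla, experts, envs, buffers, num_episodes=64):
    """One teacher-student iteration: the VLA drives the robot, the RL experts
    relabel the visited states, and the per-capability data mix is rebalanced."""
    success = {}
    for name in experts:
        expert, env, wins = experts[name], envs[name], 0
        for _ in range(num_episodes):
            obs, goal = env.reset()
            done, ok = False, False
            while not done:
                views = env.render_rgb()
                buffers[name].append((views, goal, expert.act(obs, goal)))  # teacher label
                obs, done, ok = env.step(vla.act(views, goal))              # student action
            wins += int(ok)
        success[name] = wins / num_episodes

    # Assumed balancing rule: sample more data from the capabilities
    # where the VLA currently succeeds less often.
    weights = {n: 1.0 - success[n] + 1e-3 for n in buffers}
    total = sum(weights.values())
    budget = sum(len(b) for b in buffers.values())
    batch = []
    for n, buf in buffers.items():
        k = int(budget * weights[n] / total)
        batch.extend(random.sample(buf, min(k, len(buf))))
    return finetune_vla(vla, batch), success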

Simulation Scenes

For each expert, we run 64 parallel robots in its respective simulation environment,
recording the observations, goals, and corresponding expert actions.
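
A minimal sketch of the parallel recording loop, assuming a hypothetical vectorized-environment API (vec_env.reset, vec_env.step, vec_env.render_rgb, vec_env.current_goals) with auto-resetting episodes; array shapes and camera counts are placeholders.

```python
import numpy as np

NUM_ROBOTS = 64  # parallel robots per expert environment

def record_expert_rollouts(expert, vec_env, num_steps):
    """Step 64 robots in lockstep and log (multi-view RGB, goal, expert action) per step."""
    log = {"views": [], "goals": [], "actions": []}
    obs, goals = vec_env.reset()                    # batched privileged depth obs and goals
    for _ in range(num_steps):
        actions = expert.act_batch(obs, goals)      # shape (64, action_dim)
        log["views"].append(vec_env.render_rgb())   # shape (64, num_cameras, H, W, 3)
        log["goals"].append(np.copy(goals))
        log["actions"].append(np.copy(actions))
        obs, dones, _ = vec_env.step(actions)       # environments auto-reset on done
        goals = vec_env.current_goals()             # goals refresh when an episode resets
    return {k: np.stack(v) for k, v in log.items()}
```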

Reaching Scene

Squeezing Scene

Avoiding Scene