MM-Nav: Multi-View VLA Model for Robust Visual Navigation via Multi-Expert Learning

Tianyu Xu1,*   Jiawei Chen1,*   Jiazhao Zhang1,2,*   Wenyao Zhang2, 3   Zekun Qi2, 4   Minghan Li2   Zhizheng Zhang2,5,†   He Wang1,2,5,†
1Peking University       2Galbot       3Shanghai Jiao Tong University       4Tsinghua University       5BAAI      
*Joint First Author      †Corresponding Author

Summary Video

We introduce MM-Nav, a multi-view VLA model (with 360° observation) built on pretrained large language models and visual foundation models. To obtain large-scale navigation data, we collect expert data from three reinforcement learning (RL) experts trained with privileged depth information in three tailor-made, challenging environments targeting different navigation capabilities: reaching, squeezing, and avoiding.

Method

Our pipeline comprises two stages: (1) training multiple RL experts with different capabilities and performing the initial VLA fine-tuning, and (2) iterative teacher-student online training between the RL experts and the VLA.

[Teaser figure: pipeline overview]

Stage 1

We first train RL experts in different environments and collect
successful trajectories for VLA fine-tuning.
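
A minimal sketch of this stage in PyTorch, assuming hypothetical expert and environment interfaces (expert.act, env.render_rgb, env.step) and treating action prediction as plain regression for brevity; the actual action head and training objective may differ.

```python
import torch
from torch.utils.data import DataLoader

def collect_successful_trajectories(expert, env, num_episodes):
    """Roll out one RL expert and keep (multi-view RGB, goal, action) tuples
    from episodes that reach the goal."""
    data = []
    for _ in range(num_episodes):
        obs, goal = env.reset()
        episode, done, success = [], False, False
        while not done:
            action = expert.act(obs, goal)                    # expert uses privileged depth
            episode.append((env.render_rgb(), goal, action))  # VLA is trained on RGB views only
            obs, done, success = env.step(action)
        if success:                                           # keep successful episodes only
            data.extend(episode)
    return data

def finetune_vla(vla, data, epochs=1, lr=1e-5):
    """Behavior cloning on expert actions; a regression loss stands in for the real objective."""
    optimizer = torch.optim.AdamW(vla.parameters(), lr=lr)
    loader = DataLoader(data, batch_size=32, shuffle=True)
    vla.train()
    for _ in range(epochs):
        for views, goal, expert_action in loader:
            loss = torch.nn.functional.mse_loss(vla(views, goal), expert_action)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return vla
```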

Stage 2

We then collect RL expert data online based on VLA observations
and dynamically balance the training data ratio.


The second stage is repeated iteratively, with the training data ratio rebalanced dynamically at each iteration.
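
A minimal sketch of one such iteration, reusing the hypothetical interfaces and finetune_vla from the Stage 1 sketch. The balancing rule shown (weighting each expert's data inversely to the VLA's current success rate) is an illustrative assumption, not necessarily the exact scheme used.

```python
import random

def stage2_iteration(vla, experts, envs, buffers, num_episodes=64):
    """One teacher-student iteration: the VLA drives the robot, the RL experts
    relabel the visited states, and the per-capability data mix is rebalanced."""
    success = {}
    for name in experts:
        expert, env, wins = experts[name], envs[name], 0
        for _ in range(num_episodes):
            obs, goal = env.reset()
            done, ok = False, False
            while not done:
                views = env.render_rgb()
                buffers[name].append((views, goal, expert.act(obs, goal)))  # teacher label
                obs, done, ok = env.step(vla.act(views, goal))              # student action
            wins += int(ok)
        success[name] = wins / num_episodes

    # Assumed balancing rule: sample more data from the capabilities
    # where the VLA currently succeeds less often.
    weights = {n: 1.0 - success[n] + 1e-3 for n in buffers}
    total = sum(weights.values())
    budget = sum(len(b) for b in buffers.values())
    batch = []
    for n, buf in buffers.items():
        k = int(budget * weights[n] / total)
        batch.extend(random.sample(buf, min(k, len(buf))))
    return finetune_vla(vla, batch), success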

Simulation Scenes

For each expert, we run 64 parallel robots in its respective simulation environment,
recording the observations, goals, and corresponding expert actions.
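
A minimal sketch of the parallel recording loop, assuming a hypothetical vectorized-environment API (vec_env.reset, vec_env.step, vec_env.render_rgb, vec_env.current_goals) with auto-resetting episodes; array shapes and camera counts are placeholders.

```python
import numpy as np

NUM_ROBOTS = 64  # parallel robots per expert environment

def record_expert_rollouts(expert, vec_env, num_steps):
    """Step 64 robots in lockstep and log (multi-view RGB, goal, expert action) per step."""
    log = {"views": [], "goals": [], "actions": []}
    obs, goals = vec_env.reset()                    # batched privileged depth obs and goals
    for _ in range(num_steps):
        actions = expert.act_batch(obs, goals)      # shape (64, action_dim)
        log["views"].append(vec_env.render_rgb())   # shape (64, num_cameras, H, W, 3)
        log["goals"].append(np.copy(goals))
        log["actions"].append(np.copy(actions))
        obs, dones, _ = vec_env.step(actions)       # environments auto-reset on done
        goals = vec_env.current_goals()             # goals refresh when an episode resets
    return {k: np.stack(v) for k, v in log.items()}
```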

Reaching Scene

Squeezing Scene

Avoiding Scene