We collect diverse VideoQA data and urban micromobility demonstrations to train the model via a two-stage pipeline. In the supervised fine-tuning (SFT) stage, UrbanVLA learns essential urban navigation capabilities such as goal-reaching, collision avoidance, and social compliance; in the reinforcement fine-tuning (RFT) stage, we refine the model on a sim-real aggregated dataset with Implicit Q-Learning (IQL), improving robustness in real-world scenarios.
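For readers unfamiliar with IQL, the sketch below shows the three losses it optimizes on offline data: expectile regression for the value function, a TD-trained Q-function, and advantage-weighted policy extraction. It is a minimal, self-contained PyTorch illustration; the function names, hyper-parameters, and tensor layout are assumptions for exposition, not UrbanVLA's actual training code.

```python
# Minimal IQL loss sketch (Implicit Q-Learning, Kostrikov et al., 2022).
# All names and hyper-parameters are illustrative assumptions, not the
# UrbanVLA RFT implementation.
import torch
import torch.nn.functional as F


def expectile_loss(diff: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """Asymmetric L2 that fits V toward an upper expectile of Q."""
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()


def iql_losses(q_sa, v_s, v_next, log_prob_a, reward, done,
               gamma: float = 0.99, beta: float = 3.0):
    """Compute the three IQL losses from batched 1-D tensors.

    q_sa       -- Q(s, a) for the dataset action
    v_s        -- V(s)
    v_next     -- V(s') from a frozen/target value network
    log_prob_a -- log pi(a | s) of the dataset action
    reward     -- immediate reward
    done       -- episode-termination flag (0 or 1)
    """
    # 1) Value loss: expectile regression of V toward Q (Q is not updated here).
    value_loss = expectile_loss(q_sa.detach() - v_s)

    # 2) Q loss: one-step TD target built from the value network.
    td_target = reward + gamma * (1.0 - done) * v_next.detach()
    q_loss = F.mse_loss(q_sa, td_target)

    # 3) Policy loss: advantage-weighted regression on dataset actions.
    adv = (q_sa - v_s).detach()
    weights = torch.clamp(torch.exp(beta * adv), max=100.0)
    policy_loss = -(weights * log_prob_a).mean()

    return value_loss, q_loss, policy_loss


if __name__ == "__main__":
    batch = 8
    v_l, q_l, pi_l = iql_losses(
        torch.randn(batch), torch.randn(batch), torch.randn(batch),
        -torch.rand(batch), torch.randn(batch), torch.zeros(batch))
    print(v_l.item(), q_l.item(), pi_l.item())
```

In an IQL-style RFT setup, the advantage-weighted policy term is what nudges the policy toward higher-value actions in the aggregated offline dataset while the Q- and V-functions are trained without querying out-of-distribution actions.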
Benchmark results on the PointNav and SocialNav tasks of the MetaUrban-12K dataset. We compare our method against seven strong baselines that use LiDAR observations. The best and second-best results are denoted in bold and underline, respectively.
Our method achieves a 94% success rate (SR) and 0.91 SPL (success weighted by path length) on PointNav, and 97% SR / 0.95 SPL in unseen environments, demonstrating strong generalization. In SocialNav, it reaches a Social Navigation Score of 0.87, avoiding collisions and adhering to social norms using only RGB inputs. These results demonstrate reliable and efficient navigation in complex urban scenarios.
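For reference, SPL is the standard efficiency-aware navigation metric of Anderson et al. (2018); the short snippet below computes it from per-episode outcomes. It is a generic reference implementation, not code from our benchmark.

```python
# Generic reference implementation of SPL (Success weighted by Path Length,
# Anderson et al., 2018); not taken from the benchmark code.
from typing import Sequence


def spl(successes: Sequence[bool],
        shortest_lengths: Sequence[float],
        agent_lengths: Sequence[float]) -> float:
    """SPL = (1/N) * sum_i  S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for s_i, l_i, p_i in zip(successes, shortest_lengths, agent_lengths):
        if s_i:
            total += l_i / max(p_i, l_i)
    return total / len(successes)


# Example: three episodes, two successful with near-optimal paths.
print(spl([True, True, False], [10.0, 12.0, 8.0], [11.0, 12.5, 20.0]))
```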
Our system consists of a quadruped robot equipped with GPS, Wi-Fi, a camera, and an onboard computing unit, together with a mobile-deployable console that supports real-time monitoring, dispatching navigation targets, visualizing maps and model predictions, and annotating teleoperation data used for reinforcement learning.
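To make the console-to-robot interface concrete, the snippet below sketches one possible way a navigation target could be packaged and sent to the robot over Wi-Fi. Every field name and the JSON-over-socket transport are hypothetical illustrations, not the actual protocol used in our system.

```python
# Hypothetical sketch of a console-to-robot navigation-target message; field
# names and the JSON transport are assumptions for illustration only.
import json
from dataclasses import dataclass, asdict


@dataclass
class NavigationTarget:
    latitude: float             # GPS goal latitude in degrees
    longitude: float            # GPS goal longitude in degrees
    instruction: str = ""       # optional natural-language hint for the model
    max_speed_mps: float = 1.0  # safety cap enforced on the robot side


def encode_target(target: NavigationTarget) -> bytes:
    """Serialize a target so the console can send it to the robot."""
    return json.dumps(asdict(target)).encode("utf-8")


# Example: a nearby goal with a short instruction.
print(encode_target(NavigationTarget(40.4432, -79.9428, "stop at the crosswalk")))
```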
@article{li2025urbanvla,
  title={UrbanVLA: A Vision-Language-Action Model for Urban Micromobility},
  author={Li, Anqi and Wang, Zhiyong and Zhang, Jiazhao and Li, Minghan and Qi, Yunpeng and Chen, Zhibo and Zhang, Zhizheng and Wang, He},
  journal={arXiv preprint arXiv:2510.23576},
  year={2025}
}