UrbanVLA: A Vision-Language-Action Model
for Urban Micromobility

Anqi Li1,2,*,   Zhiyong Wang2,*,   Jiazhao Zhang1,2,*,   Minghan Li2,  
Yunpeng Qi3,4,   Zhibo Chen3,   Zhizheng Zhang2,4,†,   He Wang1,2,4,†
1Peking University       2GalBot       3USTC       4BAAI      
*Joint First Author      †Equal Advising
Teaser Image

UrbanVLA is a route-conditioned Vision-Language-Action (VLA) model for urban micromobility. It aligns noisy routes from navigation tools with visual observations to enable scalable, long-horizon navigation. Trained via a two-stage pipeline of supervised fine-tuning (SFT) followed by reinforcement fine-tuning (RFT), UrbanVLA outperforms baselines by over 55% on MetaUrban and achieves robust real-world navigation across routes of over 500 m.
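As a rough illustration of what route conditioning means in practice, the sketch below pairs recent egocentric RGB frames with coarse, possibly noisy waypoints from a navigation tool and asks the model for a short horizon of velocity commands. The class, method names, and tensor shapes are hypothetical and not the released interface.

from dataclasses import dataclass
import numpy as np

# Hypothetical route-conditioned inference interface; names and shapes are illustrative only.
@dataclass
class RouteConditionedInput:
    rgb_frames: np.ndarray       # (T, H, W, 3) recent egocentric RGB observations
    route_waypoints: np.ndarray  # (K, 2) coarse waypoints from a navigation tool, robot frame, meters
    instruction: str             # optional language hint, e.g. "stay on the sidewalk"

def predict_actions(model, obs: RouteConditionedInput) -> np.ndarray:
    """Query a route-conditioned VLA for a short horizon of (v, omega) commands."""
    # The model is expected to reconcile the coarse route with what it actually sees,
    # so the executed trajectory can deviate from noisy waypoints to avoid obstacles
    # or stay on traversable surfaces.
    actions = model(obs.rgb_frames, obs.route_waypoints, obs.instruction)
    return np.asarray(actions)   # (N, 2): linear and angular velocity targets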

Summary Video

UrbanVLA Pipeline

Pipeline Image

We collect diverse VideoQA data and urban micromobility demonstrations to train the model via a two-stage pipeline. In the SFT stage, UrbanVLA learns essential urban navigation capabilities such as goal-reaching, collision avoidance, and social compliance; in the RFT stage, we refine the model with IQL on a sim-real aggregated dataset, enhancing robustness in real-world scenarios.
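To make the RFT stage concrete, and assuming IQL here refers to Implicit Q-Learning, a standard offline RL algorithm suited to a fixed sim-real aggregated dataset, the sketch below shows its three usual losses: expectile regression for the value function, a one-step TD loss for the Q-function, and advantage-weighted regression for the policy. The network interfaces, batch format, and hyperparameters are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn.functional as F

# Minimal Implicit Q-Learning (IQL) losses; an illustrative sketch, not the paper's code.
def iql_losses(q_net, q_target, v_net, policy, batch,
               expectile=0.7, beta=3.0, gamma=0.99):
    s, a, r, s_next, done = batch  # one offline batch of sim/real transitions

    # 1) Value loss: expectile regression of V(s) toward Q(s, a), so V tracks an
    #    upper expectile of Q without evaluating out-of-distribution actions.
    with torch.no_grad():
        q_sa = q_target(s, a)
    diff = q_sa - v_net(s)
    weight = torch.abs(expectile - (diff < 0).float())
    v_loss = (weight * diff.pow(2)).mean()

    # 2) Q loss: one-step TD backup that bootstraps from V(s').
    with torch.no_grad():
        td_target = r + gamma * (1.0 - done) * v_net(s_next)
    q_loss = F.mse_loss(q_net(s, a), td_target)

    # 3) Policy loss: advantage-weighted regression toward dataset actions.
    #    policy.log_prob(s, a) is an assumed interface returning log pi(a|s).
    with torch.no_grad():
        adv = q_target(s, a) - v_net(s)
        w = torch.exp(beta * adv).clamp(max=100.0)
    policy_loss = -(w * policy.log_prob(s, a)).mean()

    return v_loss, q_loss, policy_loss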

MetaUrban Benchmark Results

Benchmark Image

Benchmark results for the PointNav and SocialNav tasks on the MetaUrban-12K dataset. We compare our method with seven strong baselines that use LiDAR observations. The best and second-best results are denoted by bold and underline, respectively.

Our method achieves 94% SR (success rate) / 0.91 SPL (success weighted by path length) on PointNav and 97% SR / 0.95 SPL on unseen environments, demonstrating strong generalization. In SocialNav, it reaches a Social Navigation Score of 0.87, effectively avoiding collisions and adhering to social norms using only RGB inputs. Together, these results show reliable and efficient navigation in complex urban scenarios.
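For reference, SR is the fraction of episodes that reach the goal, and SPL follows the standard success-weighted-by-path-length definition from the embodied-navigation literature. The helper below is a small self-contained reference, not the MetaUrban evaluation code.

from typing import Sequence

# Reference implementation of SPL (Success weighted by Path Length); not the benchmark's code.
def spl(successes: Sequence[bool],
        shortest_lengths: Sequence[float],
        agent_lengths: Sequence[float]) -> float:
    """SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), with S_i success,
    l_i the shortest-path length, and p_i the agent's traveled path length."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, agent_lengths):
        if s:
            total += l / max(p, l)
    return total / len(successes)

# Example: two successes (one with a detour) and one failure give SPL of about 0.61.
print(spl([True, True, False], [10.0, 20.0, 15.0], [12.0, 20.0, 30.0]))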

Real-World Long-Horizon Navigation

MetaUrban Simulator Evaluation Visualizations

Real-World Deployment System

Real-World Setup Image

Our system consists of a quadruped robot equipped with GPS, Wi-Fi, a camera, and an onboard computing unit, together with a mobile-deployable console. The console supports real-time monitoring, sending navigation targets, visualizing maps and model predictions, and annotating teleoperation data used for reinforcement learning.
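As one way such a console might hand a navigation target to the robot, the snippet below sends a GPS goal as a JSON message over a plain TCP socket. The message schema, host, and port are hypothetical; the actual system's communication protocol is not specified here.

import json
import socket
import time

# Hypothetical console-to-robot message for dispatching a navigation target.
ROBOT_HOST = "192.168.1.42"   # assumed address of the onboard computing unit
ROBOT_PORT = 9000             # assumed listening port

def send_navigation_target(lat: float, lon: float, note: str = "") -> None:
    """Send a GPS goal to the robot as a newline-delimited JSON message."""
    msg = {
        "type": "nav_target",
        "latitude": lat,
        "longitude": lon,
        "note": note,
        "timestamp": time.time(),
    }
    with socket.create_connection((ROBOT_HOST, ROBOT_PORT), timeout=5.0) as sock:
        sock.sendall((json.dumps(msg) + "\n").encode("utf-8"))

# Example: ask the robot to head toward a goal a few hundred meters away.
# send_navigation_target(39.9912, 116.3064, note="pickup point near the east gate")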

BibTeX

@article{li2025urbanvla,
  title={UrbanVLA: A Vision-Language-Action Model for Urban Micromobility},
  author={Li, Anqi and Wang, Zhiyong and Zhang, Jiazhao and Li, Minghan and Qi, Yunpeng and Chen, Zhibo and Zhang, Zhizheng and Wang, He},
  journal={arXiv preprint arXiv:2510.23576},
  year={2025}
}