We collect diverse VideoQA data and urban micromobility demonstrations to train the model via a two-stage pipeline. In the supervised fine-tuning (SFT) stage, UrbanVLA learns essential urban navigation capabilities such as goal-reaching, collision avoidance, and social compliance; in the reinforcement fine-tuning (RFT) stage, we refine the model on a sim-real aggregated dataset with Implicit Q-Learning (IQL), improving robustness in real-world scenarios.
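For readers unfamiliar with IQL, the sketch below shows the three losses it optimizes on offline data: expectile regression for the value function, a TD-trained Q-function, and advantage-weighted policy extraction. It is a minimal, self-contained PyTorch illustration; the function names, hyper-parameters, and tensor layout are assumptions for exposition, not UrbanVLA's actual training code.

```python
# Minimal IQL loss sketch (Implicit Q-Learning, Kostrikov et al., 2022).
# All names and hyper-parameters are illustrative assumptions, not the
# UrbanVLA RFT implementation.
import torch
import torch.nn.functional as F


def expectile_loss(diff: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """Asymmetric L2 that fits V toward an upper expectile of Q."""
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()


def iql_losses(q_sa, v_s, v_next, log_prob_a, reward, done,
               gamma: float = 0.99, beta: float = 3.0):
    """Compute the three IQL losses from batched 1-D tensors.

    q_sa       -- Q(s, a) for the dataset action
    v_s        -- V(s)
    v_next     -- V(s') from a frozen/target value network
    log_prob_a -- log pi(a | s) of the dataset action
    reward     -- immediate reward
    done       -- episode-termination flag (0 or 1)
    """
    # 1) Value loss: expectile regression of V toward Q (Q is not updated here).
    value_loss = expectile_loss(q_sa.detach() - v_s)

    # 2) Q loss: one-step TD target built from the value network.
    td_target = reward + gamma * (1.0 - done) * v_next.detach()
    q_loss = F.mse_loss(q_sa, td_target)

    # 3) Policy loss: advantage-weighted regression on dataset actions.
    adv = (q_sa - v_s).detach()
    weights = torch.clamp(torch.exp(beta * adv), max=100.0)
    policy_loss = -(weights * log_prob_a).mean()

    return value_loss, q_loss, policy_loss


if __name__ == "__main__":
    batch = 8
    v_l, q_l, pi_l = iql_losses(
        torch.randn(batch), torch.randn(batch), torch.randn(batch),
        -torch.rand(batch), torch.randn(batch), torch.zeros(batch))
    print(v_l.item(), q_l.item(), pi_l.item())
```

In an IQL-style RFT setup, the advantage-weighted policy term is what nudges the policy toward higher-value actions in the aggregated offline dataset while the Q- and V-functions are trained without querying out-of-distribution actions.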
Benchmark results on the PointNav and SocialNav tasks of the MetaUrban-12K dataset. We compare our method against seven strong baselines that use LiDAR observations. The best and second-best results are denoted in bold and underline, respectively.
Our method achieves a 94% success rate (SR) and 0.91 SPL (success weighted by path length) on PointNav, and 97% SR / 0.95 SPL in unseen environments, demonstrating strong generalization. In SocialNav, it reaches a Social Navigation Score of 0.87, avoiding collisions and adhering to social norms using only RGB inputs. These results demonstrate reliable and efficient navigation in complex urban scenarios.
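For reference, SPL is the standard efficiency-aware navigation metric of Anderson et al. (2018); the short snippet below computes it from per-episode outcomes. It is a generic reference implementation, not code from our benchmark.

```python
# Generic reference implementation of SPL (Success weighted by Path Length,
# Anderson et al., 2018); not taken from the benchmark code.
from typing import Sequence


def spl(successes: Sequence[bool],
        shortest_lengths: Sequence[float],
        agent_lengths: Sequence[float]) -> float:
    """SPL = (1/N) * sum_i  S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for s_i, l_i, p_i in zip(successes, shortest_lengths, agent_lengths):
        if s_i:
            total += l_i / max(p_i, l_i)
    return total / len(successes)


# Example: three episodes, two successful with near-optimal paths.
print(spl([True, True, False], [10.0, 12.0, 8.0], [11.0, 12.5, 20.0]))
```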
Our system consists of a quadruped robot equipped with GPS, Wi-Fi, a camera, and an onboard computing unit, together with a mobile-deployable console that supports real-time monitoring, dispatching navigation targets, visualizing maps and model predictions, and annotating teleoperation data used for reinforcement learning.
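To make the console-to-robot interface concrete, the snippet below sketches one possible way a navigation target could be packaged and sent to the robot over Wi-Fi. Every field name and the JSON-over-socket transport are hypothetical illustrations, not the actual protocol used in our system.

```python
# Hypothetical sketch of a console-to-robot navigation-target message; field
# names and the JSON transport are assumptions for illustration only.
import json
from dataclasses import dataclass, asdict


@dataclass
class NavigationTarget:
    latitude: float             # GPS goal latitude in degrees
    longitude: float            # GPS goal longitude in degrees
    instruction: str = ""       # optional natural-language hint for the model
    max_speed_mps: float = 1.0  # safety cap enforced on the robot side


def encode_target(target: NavigationTarget) -> bytes:
    """Serialize a target so the console can send it to the robot."""
    return json.dumps(asdict(target)).encode("utf-8")


# Example: a nearby goal with a short instruction.
print(encode_target(NavigationTarget(40.4432, -79.9428, "stop at the crosswalk")))
```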
@article{li2025urbanvla,
  title={UrbanVLA: A Vision-Language-Action Model for Urban Micromobility},
  author={Li, Anqi and Wang, Zhiyong and Zhang, Jiazhao and Li, Minghan and Qi, Yunpeng and Chen, Zhibo and Zhang, Zhizheng and Wang, He},
  journal={arXiv preprint arXiv:2510.23576},
  year={2025}
}