Embodied visual tracking is a fundamental skill in Embodied AI, enabling an agent to follow a specific target in dynamic environments using only egocentric vision. The task is inherently challenging: it demands both accurate target recognition and effective trajectory planning under severe occlusion and high scene dynamics. Existing approaches typically address this challenge through a modular separation of recognition and planning. In this work, we propose TrackVLA, a Vision-Language-Action (VLA) model that learns the synergy between object recognition and trajectory planning. Leveraging a shared LLM backbone, we employ a language modeling head for recognition and an anchor-based diffusion model for trajectory planning. To train TrackVLA, we construct an Embodied Visual Tracking Benchmark (EVT-Bench) and collect recognition samples spanning diverse difficulty levels, resulting in a dataset of 1.7 million samples. Through extensive experiments in both synthetic and real-world environments, TrackVLA demonstrates state-of-the-art (SOTA) performance and strong generalizability. It significantly outperforms existing methods on public benchmarks in a zero-shot manner, while remaining robust to high dynamics and occlusion in real-world scenarios at a 10 FPS inference speed.
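To make the anchor-based diffusion planner concrete, the sketch below is a minimal illustration, not the released implementation: the module names, anchor set, tensor sizes, and the simplified denoising update are all assumptions. It only shows the idea of refining a small set of anchor trajectories into waypoints, conditioned on a tracking-token embedding produced by the LLM backbone.

import torch
import torch.nn as nn

class AnchorDiffusionHead(nn.Module):
    """Refines a fixed set of anchor trajectories into waypoints (illustrative only)."""
    def __init__(self, d_cond=1024, n_anchors=8, horizon=8, d_model=256, n_steps=10):
        super().__init__()
        self.n_steps = n_steps
        # placeholder anchors; a real system would use hand-designed motion primitives
        self.register_buffer("anchors", torch.randn(n_anchors, horizon, 2))
        self.embed = nn.Linear(horizon * 2, d_model)      # trajectory -> feature
        self.cond_proj = nn.Linear(d_cond, d_model)       # tracking-token embedding -> feature
        self.time_embed = nn.Embedding(n_steps, d_model)
        self.denoiser = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                      nn.Linear(d_model, horizon * 2))
        self.scorer = nn.Linear(d_model, 1)               # ranks the refined anchors

    def forward(self, track_token):                        # track_token: (B, d_cond) from the LLM
        B = track_token.shape[0]
        cond = self.cond_proj(track_token)                 # (B, d_model)
        traj = self.anchors.unsqueeze(0).expand(B, -1, -1, -1)  # (B, A, T, 2)
        traj = traj + torch.randn_like(traj)               # start from noised anchors
        for t in reversed(range(self.n_steps)):            # crude iterative denoising
            step = self.time_embed(torch.full((B,), t, device=traj.device, dtype=torch.long))
            h = self.embed(traj.flatten(2)) + cond.unsqueeze(1) + step.unsqueeze(1)
            eps = self.denoiser(h).view_as(traj)
            traj = traj - eps / self.n_steps
        scores = self.scorer(self.embed(traj.flatten(2)) + cond.unsqueeze(1)).squeeze(-1)
        best = scores.argmax(dim=1)                        # pick one refined anchor per sample
        return traj[torch.arange(B, device=traj.device), best]  # (B, horizon, 2) waypoints

waypoints = AnchorDiffusionHead()(torch.randn(2, 1024))    # -> (2, 8, 2)

A real implementation would use a proper diffusion noise schedule and learned noise targets; the point here is only the conditioning flow from the LLM's tracking token to anchored waypoints.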
TrackVLA extends video-based VLM/VLA approaches by introducing parallel prediction branches for trajectory planning and target recognition. For trajectory planning, TrackVLA aggregates online-captured video, combining historical and current observations, and concatenates the resulting visual tokens with the tracking instruction and a special tracking token. A diffusion transformer then decodes the LLM's output tokens into waypoints. For recognition tasks, all video frames are encoded identically and processed in a conventional autoregressive manner.
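A minimal sketch of this parallel-branch design is given below, under assumed interfaces for the vision encoder and LLM backbone; none of the names, sizes, or stand-in modules come from the released code.

import torch
import torch.nn as nn

class ParallelBranchVLA(nn.Module):
    """Shared LLM backbone with a language-modeling branch and a trajectory branch."""
    def __init__(self, llm, vision_encoder, traj_head, d_model=1024, vocab_size=32000):
        super().__init__()
        self.llm = llm                 # assumed: (B, N, d) embeddings -> (B, N, d) hidden states
        self.vision = vision_encoder   # assumed: (B, T, C, H, W) frames -> (B, Nv, d) tokens
        self.traj_head = traj_head     # e.g. the anchor-based diffusion head sketched above
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, frames, text_embeds, mode="track"):
        vis_tokens = self.vision(frames)                            # historical + current observations
        h = self.llm(torch.cat([vis_tokens, text_embeds], dim=1))   # single shared backbone pass
        if mode == "track":
            # assume the special tracking token is the final input token; its hidden
            # state conditions the trajectory head, which outputs waypoints
            return self.traj_head(h[:, -1])
        # recognition / video QA: conventional autoregressive next-token prediction
        return self.lm_head(h)

# toy wiring with stand-in modules, only to illustrate the tensor flow
d, vocab = 64, 1000
llm = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))             # stand-in backbone
vision = nn.Sequential(nn.Flatten(start_dim=2), nn.Linear(3 * 32 * 32, d))   # (B,T,C,H,W) -> (B,T,d)
traj_head = nn.Sequential(nn.Linear(d, 16), nn.Unflatten(1, (8, 2)))         # (B,d) -> (B,8,2)
model = ParallelBranchVLA(llm, vision, traj_head, d_model=d, vocab_size=vocab)
waypoints = model(torch.randn(2, 4, 3, 32, 32), torch.randn(2, 6, d), mode="track")

The key point is that both branches read hidden states from the same backbone pass, so recognition and planning share a single visual-language representation.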
To train our parallel-branch TrackVLA, we collect a total of 1.7M new samples, comprising embodied visual tracking data and video-based question-answering data.
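For concreteness, a mixed training batch could be drawn from the two data sources roughly as below; the 50/50 sampling ratio and directory names are placeholder assumptions, not the paper's actual recipe.

import random

DATA_MIX = {
    "embodied_tracking": {"weight": 0.5, "root": "data/evt_tracking"},  # assumed paths
    "video_qa":          {"weight": 0.5, "root": "data/video_qa"},
}

def sample_source(mix=DATA_MIX):
    # choose which data source the next training sample comes from
    names = list(mix)
    weights = [mix[n]["weight"] for n in names]
    return random.choices(names, weights=weights, k=1)[0]

sources_for_batch = [sample_source() for _ in range(32)]  # one source label per sample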
TrackVLA is capable of long-horizon tracking in diverse and dynamic environments. It can effectively track targets over long distances while remaining robust against distractors.
We conducted a series of experiments comparing the tracking performance of TrackVLA with that of a state-of-the-art commercial tracking UAV built on a modular pipeline. As shown in the video, TrackVLA performs better in challenging scenarios such as target occlusion and fast motion, thanks to its strong target reasoning capability.
TrackVLA is capable of reasoning about the environment, enabling it to autonomously recognize traversable areas, avoid obstacles, and generalize to fast-motion and low-illumination scenarios without requiring additional training data.
TrackVLA is capable of cross-domain generalization, enabling robust tracking across diverse scene styles, viewpoints, and camera parameters without additional adaptation.
@article{wang2025trackvla,
author = {Wang, Shaoan and Zhang, Jiazhao and Li, Minghan and Liu, Jiahang and Li, Anqi and Wu, Kui and Zhong, Fangwei and Yu, Junzhi and Zhang, Zhizheng and Wang, He},
title = {TrackVLA: Embodied Visual Tracking in the Wild},
journal = {arXiv preprint arXiv:2505.23189},
year = {2025},
url = {http://arxiv.org/abs/2505.23189}
}