Embodied Visual Tracking (EVT) is a fundamental capability underpinning practical applications such as companion robots, guidance robots, and service assistants, where continuously following moving targets is essential. Recent advances have enabled language-guided tracking in complex and unstructured scenes. However, existing approaches lack explicit spatial reasoning and effective temporal memory, causing failures under severe occlusions or in the presence of similar-looking distractors. To address these challenges, we present TrackVLA++, a novel Vision-Language-Action (VLA) model that enhances embodied visual tracking with two key modules: a spatial reasoning mechanism and a Target Identification Memory (TIM). The reasoning module introduces a Chain-of-Thought paradigm, termed Polar-CoT, which infers the target's relative position and encodes it as a compact polar-coordinate token for action prediction. Guided by these spatial priors, the TIM employs a gated update strategy to preserve long-horizon target memory, ensuring spatiotemporal consistency and mitigating target loss during extended occlusions. Extensive experiments show that TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings. On the challenging EVT-Bench DT split, TrackVLA++ surpasses the previous leading approach by 5.1% in the egocentric setting and by 12% in the multi-camera setting. Furthermore, TrackVLA++ exhibits strong zero-shot generalization, enabling robust real-world tracking in dynamic and occluded scenarios.
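To make the idea of a compact polar-coordinate token concrete, the minimal sketch below shows one way an inferred relative target position could be discretized into a single token id. The bin counts, maximum range, and function name are illustrative assumptions, not the actual Polar-CoT implementation.

```python
# Hypothetical sketch of a Polar-CoT-style position token (names and bin sizes are assumptions).
import math

NUM_ANGLE_BINS = 24   # assumed angular resolution (15 degrees per bin)
NUM_RANGE_BINS = 8    # assumed radial resolution

def polar_position_token(dx: float, dy: float, max_range: float = 8.0) -> int:
    """Map the target's relative position (dx, dy) in the robot frame
    to a single discrete polar-coordinate token id."""
    theta = math.atan2(dy, dx)                       # bearing in [-pi, pi]
    r = min(math.hypot(dx, dy), max_range)           # clamp distance to the sensing range
    angle_bin = int((theta + math.pi) / (2 * math.pi) * NUM_ANGLE_BINS) % NUM_ANGLE_BINS
    range_bin = min(int(r / max_range * NUM_RANGE_BINS), NUM_RANGE_BINS - 1)
    return angle_bin * NUM_RANGE_BINS + range_bin    # compact token consumed by action prediction

# Example: target roughly 2 m ahead and 1 m to the left of the robot.
token_id = polar_position_token(dx=2.0, dy=1.0)
```

In this sketch the token jointly encodes bearing and distance, so a single discrete symbol can condition downstream action prediction on where the target is relative to the robot.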
Given a video stream and a language instruction, TrackVLA++ predicts a tracking trajectory by utilizing Polar-CoT reasoning to infer the target's position and continuously updating the Target Identification Memory with CoT-based predictions for long-horizon tracking.
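The gated memory update can be illustrated with a short sketch, assuming a learned sigmoid gate conditioned on the stored memory, the current target feature, and an embedding of the Polar-CoT spatial prior. All module names, shapes, and dimensions here are hypothetical and not taken from the released model.

```python
# Minimal sketch of a gated Target Identification Memory update (illustrative only).
import torch
import torch.nn as nn

class TargetIdentificationMemory(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # The gate decides how much of the new target observation to write into memory.
        self.gate = nn.Sequential(nn.Linear(3 * dim, dim), nn.Sigmoid())
        self.register_buffer("memory", torch.zeros(1, dim))

    def forward(self, target_feat: torch.Tensor, spatial_prior: torch.Tensor) -> torch.Tensor:
        # target_feat, spatial_prior: (1, dim) embeddings of the predicted target and of the
        # Polar-CoT position token, respectively (both hypothetical inputs).
        g = self.gate(torch.cat([self.memory, target_feat, spatial_prior], dim=-1))
        # Gated update: retain long-horizon memory when the gate is low (e.g., during
        # occlusion) and overwrite it when the target is confidently re-identified.
        self.memory = g * target_feat + (1 - g) * self.memory
        return self.memory

tim = TargetIdentificationMemory(dim=256)
updated_memory = tim(torch.randn(1, 256), torch.randn(1, 256))
```

The design intent mirrors the description above: when the spatial prior indicates the target is occluded or uncertain, the gate suppresses writes so the long-horizon memory is preserved rather than contaminated by distractors.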
TrackVLA++ achieves sustained, long-horizon tracking in complex urban environments, maintaining continuity for over 30 minutes along multi-kilometer trajectories. It remains robust to dense distractors, multiple barriers, and pronounced illumination changes, demonstrating stable performance in diverse and dynamic settings.
TrackVLA++ maintains target identity across complex, unstructured environments, including narrow crosswalks, barrier-dense areas, and wooded scenes; it handles large heading changes, traverses tight corridors into buildings, and continues through constrained spaces such as elevators.
TrackVLA++ exhibits robust performance in crowded scenes with severe occlusions and target disappearance, demonstrating strong spatiotemporal reasoning, consistent re-identification, and long-term memory. It is robust to dynamic distractors common in urban walkways, such as pedestrian flows, crossing traffic, and heterogeneous motion patterns, and sustains stable tracking on sidewalks, at crosswalks, and in similar urban settings.
TrackVLA++ leverages multi-camera inputs to recognize the target from off-axis views (i.e., non-frontal, oblique perspectives) and reorient toward it, demonstrating strong spatial understanding and cross-view reasoning capability.
TrackVLA++ exhibits cross-domain generalization, enabling robust tracking across diverse scene styles, viewpoints, and camera parameters, while also generalizing beyond human subjects to a broad spectrum of objects, all without additional adaptation.