Our method provides a unified framework that handles multiple tasks, including Image QA, Video QA, and Navigation. Text tokens and visual tokens are organized with temporal-viewpoint indicator tokens. For question answering, the model employs a conventional language modeling head in an autoregressive manner, while for navigation, it uses a planning head that directly predicts trajectories.
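To make the task routing concrete, the following is a minimal sketch of how a single backbone could feed either an autoregressive language modeling head or a trajectory planning head. All module names, dimensions, and the trajectory format (a fixed number of 2-D waypoints) are illustrative assumptions, not the paper's actual implementation.

\begin{verbatim}
# Minimal sketch of the multi-task interface (hypothetical names and shapes).
import torch.nn as nn

class UnifiedMultiTaskModel(nn.Module):
    def __init__(self, hidden_dim=1024, vocab_size=32000, traj_steps=8):
        super().__init__()
        self.traj_steps = traj_steps
        self.backbone = nn.TransformerEncoder(     # stand-in for the LLM backbone
            nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(hidden_dim, vocab_size)            # QA head
        self.planning_head = nn.Linear(hidden_dim, traj_steps * 2)  # (x, y) waypoints

    def forward(self, tokens, task):
        """tokens: (B, T, D) interleaved text / visual / indicator embeddings."""
        h = self.backbone(tokens)
        if task in ("image_qa", "video_qa"):
            return self.lm_head(h)                  # next-token logits, decoded autoregressively
        elif task == "navigation":
            # Pool the sequence and regress a trajectory directly.
            return self.planning_head(h.mean(dim=1)).view(-1, self.traj_steps, 2)
        raise ValueError(f"unknown task: {task}")
\end{verbatim}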
For navigation, (a) our approach utilizes both coarse-grained and fine-grained visual tokens. (b) The navigation history is efficiently sampled under a fixed token budget using our Budget-Aware Temporal Sampling (BATS) method. (c) To distinguish historical information from different timesteps and viewpoints, we employ Temporal-Viewpoint Indicator (TVI) tokens, which encode both temporal and angular information.
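Below is a minimal sketch of the two navigation-specific components described above: budget-aware sampling of the history and the mapping from (timestep, viewpoint) to an indicator token. The exact BATS policy, the per-frame token costs, the number of viewpoint bins, and the indicator vocabulary offset are not specified in this excerpt, so the values and the "recent frames fine-grained, older frames coarse-grained" rule below are illustrative assumptions.

\begin{verbatim}
# Illustrative budget-aware temporal sampling and TVI indexing (assumed details).
import numpy as np

def bats_sample(num_frames, token_budget, coarse_cost=16, fine_cost=64, num_fine=2):
    """Choose which history frames to keep so the total visual tokens fit the budget.

    The most recent `num_fine` frames use fine-grained tokens; the remaining budget
    is filled with coarse-grained frames sampled uniformly over the older history.
    Returns (fine_ids, coarse_ids), two lists of frame indices.
    """
    fine_ids = list(range(max(0, num_frames - num_fine), num_frames))
    remaining = token_budget - len(fine_ids) * fine_cost
    max_coarse = max(0, remaining // coarse_cost)
    past = list(range(0, num_frames - len(fine_ids)))
    if not past or max_coarse == 0:
        return fine_ids, []
    keep = min(max_coarse, len(past))
    picks = np.linspace(0, len(past) - 1, keep).round().astype(int)
    return fine_ids, sorted({past[i] for i in picks})

def tvi_token_id(timestep, view_angle_deg, num_views=12, base_id=50000):
    """Map (timestep, discretized viewpoint angle) to a single indicator token id."""
    view_bin = int(view_angle_deg % 360) // (360 // num_views)
    return base_id + timestep * num_views + view_bin
\end{verbatim}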
We deploy our model on a remote server equipped with a GeForce RTX 5090 GPU and communicate with the client (the controller and the robot embodiments) over the Internet. Given a user instruction, each robot compresses its current observation and transmits it to the server. The server processes the observation together with the instruction and outputs a trajectory. This trajectory is then handled by the robot's local planner, which sends the appropriate commands (\textit{e.g.}, velocity or joint controls) to drive the robot.
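The sketch below illustrates a client-side loop for this deployment. The transport (HTTP), endpoint name, JPEG compression settings, message format, and the robot interface (\texttt{get\_rgb}, \texttt{local\_planner.follow}, \texttt{task\_done}) are all hypothetical; any reliable channel between the robot client and the GPU server would serve the same role.

\begin{verbatim}
# Illustrative client-side deployment loop (assumed transport and interfaces).
import base64
import time

import cv2        # JPEG compression of the current observation
import requests   # simple HTTP transport to the remote server

SERVER_URL = "http://remote-server:8000/plan"   # hypothetical endpoint

def query_server(rgb_image, instruction):
    """Compress the observation, send it with the instruction, and receive
    a trajectory (a list of [x, y] waypoints) from the server."""
    ok, jpeg = cv2.imencode(".jpg", rgb_image, [cv2.IMWRITE_JPEG_QUALITY, 80])
    assert ok, "JPEG encoding failed"
    payload = {
        "instruction": instruction,
        "image": base64.b64encode(jpeg.tobytes()).decode("ascii"),
    }
    resp = requests.post(SERVER_URL, json=payload, timeout=5.0)
    resp.raise_for_status()
    return resp.json()["trajectory"]

def control_loop(robot, instruction, hz=2):
    """Each robot queries the server, then lets its local planner turn the
    predicted waypoints into low-level velocity or joint commands."""
    while not robot.task_done():
        trajectory = query_server(robot.get_rgb(), instruction)
        robot.local_planner.follow(trajectory)
        time.sleep(1.0 / hz)
\end{verbatim}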
We sincerely thank Jianmin Wang and Wenhao Li for their support with the hardware setup. We also thank Chen Gao, Zhiyong Wang, Zhichao Hang, and Donglin Yang for their support with the experiments.