The model jointly denoises action chunks and future DINO feature sequences within a unified multi-modal diffusion transformer framework. Heterogeneous data sources play distinct yet complementary roles in learning visual forecasting, dynamics, and policy.
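The joint-denoising objective can be sketched minimally: concatenate the action chunk and future latent sequence into one vector, corrupt it with scheduled noise, and train a shared network to predict that noise. This is a hedged illustration, not the paper's implementation; the dimensions, the cosine schedule, and the linear stand-in for the diffusion transformer are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): a 16-step action chunk of dim 7
# and 4 future DINO latent tokens of dim 32, flattened into one vector.
ACT_DIM, CHUNK = 7, 16
LAT_DIM, HORIZON = 32, 4
D = CHUNK * ACT_DIM + HORIZON * LAT_DIM  # joint modality dimension

# Stand-in for the multi-modal diffusion transformer: a single linear map
# predicting the noise added to the concatenated [actions; latents] vector.
W = rng.normal(0, 0.01, (D, D))

def diffusion_loss(actions, latents, t_frac):
    """One DDPM-style epsilon-prediction step on the joint vector."""
    x0 = np.concatenate([actions.ravel(), latents.ravel()])
    alpha_bar = np.cos(t_frac * np.pi / 2) ** 2       # cosine noise schedule
    eps = rng.normal(size=D)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = W @ x_t                                 # network's noise estimate
    return float(np.mean((eps_hat - eps) ** 2))       # MSE on the noise

loss = diffusion_loss(rng.normal(size=(CHUNK, ACT_DIM)),
                      rng.normal(size=(HORIZON, LAT_DIM)), t_frac=0.5)
```

Because actions and future latents share one noise vector and one predictor, gradients from visual forecasting and action prediction flow through the same weights, which is the point of co-training the two modalities.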
We collect EI-30k, comprising more than 30k hours of diverse human and robot interaction data, which spans varying episode lengths and manipulation tasks.
Forward and inverse dynamics operate in a structured DINO latent space, which avoids redundant pixel-space appearance modeling and enables the model to focus on task-relevant dynamics features.
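To make the latent-dynamics idea concrete, here is a toy sketch: forward dynamics maps (latent, action) to the next latent, and inverse dynamics recovers the action from a latent pair. The linear dynamics, dimensions, and least-squares inverse are illustrative assumptions, not the model's actual parameterization.

```python
import numpy as np

rng = np.random.default_rng(1)
LAT, ACT = 32, 7  # hypothetical DINO-latent and action dimensions

# Toy linear dynamics in latent space: z_next = A @ z + B @ a.
A = np.eye(LAT) * 0.9
B = rng.normal(0, 0.1, (LAT, ACT))

def forward_dynamics(z, a):
    """Predict the next latent from the current latent and action."""
    return A @ z + B @ a

def inverse_dynamics(z, z_next):
    """Recover the action by solving B @ a = z_next - A @ z (least squares)."""
    a, *_ = np.linalg.lstsq(B, z_next - A @ z, rcond=None)
    return a

z, a = rng.normal(size=LAT), rng.normal(size=ACT)
z_next = forward_dynamics(z, a)
a_rec = inverse_dynamics(z, z_next)
# With exact linear dynamics, the recovered action matches the true one.
```

Operating on compact latents rather than pixels means both mappings only have to explain task-relevant state changes, not appearance.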
LDA consistently reduces action prediction error as data scales from 5k to 30k hours when co-training all four tasks.
LDA-1B scales stably thanks to latent dynamics and the mixed-frequency transformer.
(1) Galbot G1 equipped with a standard two-finger parallel gripper for basic grasping tasks; (2) Galbot G1 fitted with the SharpaWave dexterous hand (22 DoF) for fine manipulation; (3) Unitree G1 equipped with the BrainCo dexterous hand (10 DoF) and a Zed Mini camera.
@article{lyu2026lda,
title={LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion},
author={Lyu, Jiangran and Liu, Kai and Zhang, Xuheng and Liao, Haoran and Feng, Yusen and Zhu, Wenxuan and Shen, Tingrui and Chen, Jiayi and Zhang, Jiazhao and Dong, Yifei and others},
journal={arXiv preprint arXiv:2602.12215},
year={2026}
}