The model jointly denoises action chunks and predicts future visual features within a unified multi-modal diffusion transformer framework.
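To make the joint objective concrete, here is a minimal sketch, assuming a PyTorch implementation with illustrative names and dimensions (LatentDynamicsTransformer, d_model, chunk_len, and the separate action/vision output heads are assumptions, not the released architecture): a single transformer consumes noisy action tokens together with current visual-latent tokens and emits both a denoising prediction for the action chunk and predicted future visual latents.

```python
import torch
import torch.nn as nn


class LatentDynamicsTransformer(nn.Module):
    """Illustrative joint model: denoises an action chunk and predicts future visual latents."""

    def __init__(self, d_model=512, action_dim=22, chunk_len=16, n_layers=8, n_heads=8):
        super().__init__()
        self.chunk_len = chunk_len
        self.action_in = nn.Linear(action_dim, d_model)          # embed the noisy action chunk
        self.vis_in = nn.Linear(d_model, d_model)                # project current visual latents
        self.time_emb = nn.Sequential(                           # diffusion-timestep embedding
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        self.backbone = nn.TransformerEncoder(layer, n_layers)   # shared multi-modal backbone
        self.action_out = nn.Linear(d_model, action_dim)         # denoising head for actions
        self.vis_out = nn.Linear(d_model, d_model)               # future-latent prediction head

    def forward(self, noisy_actions, cur_vis_latents, t):
        # noisy_actions: (B, chunk_len, action_dim); cur_vis_latents: (B, N_vis, d_model); t: (B,)
        a = self.action_in(noisy_actions) + self.time_emb(t[:, None, None].float())
        v = self.vis_in(cur_vis_latents)
        h = self.backbone(torch.cat([a, v], dim=1))              # one token sequence, two modalities
        eps_hat = self.action_out(h[:, : self.chunk_len])        # denoising prediction for the chunk
        future_vis = self.vis_out(h[:, self.chunk_len :])        # predicted future visual latents
        return eps_hat, future_vis
```

In training, one would sample a diffusion timestep, noise the ground-truth action chunk, and supervise eps_hat against the injected noise while supervising future_vis against encoded future-frame latents (see the latent-space sketch below).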
The dataset contains more than 30k hours of diverse human and robot interaction data, spanning varying episode lengths and manipulation tasks.
LDA-1B operates in a structured DINO latent space, which avoids redundant pixel-space appearance modeling and enables the model to focus on task-relevant features.
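As a hedged illustration of this latent-space supervision (using a frozen DINOv2 ViT-B/14 from torch.hub as a stand-in for the paper's DINO encoder; encode_frames and latent_prediction_loss are hypothetical helper names), predicted future latents can be regressed onto frozen-encoder features of future frames instead of pixels:

```python
import torch
import torch.nn.functional as F

# Frozen DINOv2 ViT-B/14 (available via torch.hub) used here as a stand-in
# for the structured DINO latent space described above.
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
for p in dino.parameters():
    p.requires_grad_(False)


@torch.no_grad()
def encode_frames(frames):
    # frames: (B, 3, 224, 224) ImageNet-normalized RGB; returns (B, 256, 768) patch tokens.
    return dino.forward_features(frames)["x_norm_patchtokens"]


def latent_prediction_loss(pred_latents, future_frames):
    # Regress predicted latents onto frozen DINO features of the future frames:
    # no pixel reconstruction, so appearance details are never modeled directly.
    target = encode_frames(future_frames)
    return F.smooth_l1_loss(pred_latents, target)
```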
Co-training (blue) steadily reduces error with diverse data; Policy Only (grey) degrades when low-quality data is added. Our framework turns such noise into useful supervision.
LDA (blue) improves steadily; UWM (grey) saturates and degrades with more data. LDA-1B scales stably thanks to latent dynamics and the mixed-frequency transformer.
Real-world robot platforms used in our physical experiments. From left to right: (1) Galbot G1 equipped with a standard two-finger parallel gripper for basic grasping tasks; (2) Galbot G1 fitted with the SharpaWave dexterous hand (22 DoF) for fine manipulation; (3) Unitree G1 equipped with the BrainCo dexterous hand (10 DoF) and a ZED Mini camera. This multi-platform setup demonstrates the generalization capability of our LDA model across diverse robot morphologies and end-effectors.
@article{lyu2025lda1b,
title={LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion},
author={Lyu, Jiangran and Liu, Kai and Zhang, Xuheng and Liao, Haoran and Feng, Yusen and Zhu, Wenxuan and Shen, Tingrui and Chen, Jiayi and Zhang, Jiazhao and Dong, Yifei and Cui, Wenbo and Qi, Senmao and Wang, Shuo and Zheng, Yixin and Yan, Mi and Shi, Xuesong and Li, Haoran and Zhao, Dongbin and Liu, Ming-Yu and Zhang, Zhizheng and Yi, Li and Wang, Yizhou and Wang, He},
journal={arXiv preprint},
year={2025}
}