The model jointly denoises action chunks and future DINO feature sequences within a unified multi-modal diffusion transformer framework. Heterogeneous data sources play distinct yet complementary roles in learning visual forecasting, dynamics, and policy.
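The joint-denoising objective can be sketched minimally: concatenate the action chunk and future latent sequence into one vector, corrupt it with scheduled noise, and train a shared network to predict that noise. This is a hedged illustration, not the paper's implementation; the dimensions, the cosine schedule, and the linear stand-in for the diffusion transformer are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): a 16-step action chunk of dim 7
# and 4 future DINO latent tokens of dim 32, flattened into one vector.
ACT_DIM, CHUNK = 7, 16
LAT_DIM, HORIZON = 32, 4
D = CHUNK * ACT_DIM + HORIZON * LAT_DIM  # joint modality dimension

# Stand-in for the multi-modal diffusion transformer: a single linear map
# predicting the noise added to the concatenated [actions; latents] vector.
W = rng.normal(0, 0.01, (D, D))

def diffusion_loss(actions, latents, t_frac):
    """One DDPM-style epsilon-prediction step on the joint vector."""
    x0 = np.concatenate([actions.ravel(), latents.ravel()])
    alpha_bar = np.cos(t_frac * np.pi / 2) ** 2       # cosine noise schedule
    eps = rng.normal(size=D)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = W @ x_t                                 # network's noise estimate
    return float(np.mean((eps_hat - eps) ** 2))       # MSE on the noise

loss = diffusion_loss(rng.normal(size=(CHUNK, ACT_DIM)),
                      rng.normal(size=(HORIZON, LAT_DIM)), t_frac=0.5)
```

Because actions and future latents share one noise vector and one predictor, gradients from visual forecasting and action prediction flow through the same weights, which is the point of co-training the two modalities.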
We collect EI-30k, comprising more than 30k hours of diverse human and robot interaction data, which spans varying episode lengths and manipulation tasks.
Forward and inverse dynamics operate in a structured DINO latent space, which avoids redundant pixel-space appearance modeling and enables the model to focus on task-relevant dynamics features.
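To make the latent-dynamics idea concrete, here is a toy sketch: forward dynamics maps (latent, action) to the next latent, and inverse dynamics recovers the action from a latent pair. The linear dynamics, dimensions, and least-squares inverse are illustrative assumptions, not the model's actual parameterization.

```python
import numpy as np

rng = np.random.default_rng(1)
LAT, ACT = 32, 7  # hypothetical DINO-latent and action dimensions

# Toy linear dynamics in latent space: z_next = A @ z + B @ a.
A = np.eye(LAT) * 0.9
B = rng.normal(0, 0.1, (LAT, ACT))

def forward_dynamics(z, a):
    """Predict the next latent from the current latent and action."""
    return A @ z + B @ a

def inverse_dynamics(z, z_next):
    """Recover the action by solving B @ a = z_next - A @ z (least squares)."""
    a, *_ = np.linalg.lstsq(B, z_next - A @ z, rcond=None)
    return a

z, a = rng.normal(size=LAT), rng.normal(size=ACT)
z_next = forward_dynamics(z, a)
a_rec = inverse_dynamics(z, z_next)
# With exact linear dynamics, the recovered action matches the true one.
```

Operating on compact latents rather than pixels means both mappings only have to explain task-relevant state changes, not appearance.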
LDA consistently reduces action prediction error as data scales from 5k to 30k hours when co-training all four tasks.
LDA-1B scales stably thanks to latent dynamics and the mixed-frequency transformer.
(1) Galbot G1 equipped with a standard two-finger parallel gripper for basic grasping tasks; (2) Galbot G1 fitted with the SharpaWave dexterous hand (22 DoF) for fine manipulation; (3) Unitree G1 equipped with the BrainCo dexterous hand (10 DoF) and a Zed Mini camera.
@article{lyu2026lda,
title={LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion},
author={Lyu, Jiangran and Liu, Kai and Zhang, Xuheng and Liao, Haoran and Feng, Yusen and Zhu, Wenxuan and Shen, Tingrui and Chen, Jiayi and Zhang, Jiazhao and Dong, Yifei and others},
journal={arXiv preprint arXiv:2602.12215},
year={2026}
}