LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion

Jiangran Lyu*1,2 Kai Liu*2,3,4 Xuheng Zhang*1,2 Haoran Liao2,6 Yusen Feng1,2 Wenxuan Zhu1 Tingrui Shen1 Jiayi Chen1,2 Jiazhao Zhang1,2 Yifei Dong1 Wenbo Cui2,3,4 Senmao Qi2 Shuo Wang2 Yixin Zheng2,3,4 Mi Yan1,2 Xuesong Shi2 Haoran Li3 Dongbin Zhao3 Ming-Yu Liu7 Zhizheng Zhang2,† Li Yi5,† Yizhou Wang1,† He Wang1,2,†
* Equal contribution    † Corresponding author
1Peking University    2Galbot    3CASIA    4BAAI    5Tsinghua University    6Sun Yat-sen University    7NVIDIA
LDA-1B Model Overview
LDA-1B is a dynamics-centric robot foundation model trained on 30k+ hours of heterogeneous embodied data. It jointly learns dynamics, visual forecasting, and policy in a unified latent space, enabling scalable learning beyond behavior cloning (BC).

Universal Embodied Data Ingestion

Unified Latent Dynamics and Policy Learning

LDA-1B Architecture

The model jointly denoises action chunks and future DINO feature sequences within a unified multi-modal diffusion-transformer framework. Heterogeneous data sources play distinct yet complementary roles in learning visual forecasting, dynamics, and policy.
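To make the joint-denoising idea concrete, here is a minimal numpy sketch of one reverse-diffusion step in which action-chunk tokens and future-latent tokens share a single attention pass, so each modality conditions the other. All dimensions, weight shapes, and the Euler-style update are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Hypothetical sizes -- the paper does not specify these.
CHUNK = 8    # actions per denoised chunk
ACT_D = 14   # action dimension (e.g. arm + gripper joints)
FUT = 4      # number of future DINO frames forecast
LAT_D = 32   # toy DINO latent dimension (real DINO features are larger)
WIDTH = 64   # shared transformer width

rng = np.random.default_rng(0)
# Toy random weights standing in for the trained diffusion transformer.
W_in_a = rng.normal(0, 0.1, (ACT_D, WIDTH))
W_in_z = rng.normal(0, 0.1, (LAT_D, WIDTH))
Wq, Wk, Wv = (rng.normal(0, 0.1, (WIDTH, WIDTH)) for _ in range(3))
W_out_a = rng.normal(0, 0.1, (WIDTH, ACT_D))
W_out_z = rng.normal(0, 0.1, (WIDTH, LAT_D))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def denoise_step(noisy_actions, noisy_latents, t):
    """One joint denoising step: both modalities are embedded into a
    shared token stream and attend to each other before being split
    back into per-modality noise predictions."""
    tok = np.concatenate([noisy_actions @ W_in_a,
                          noisy_latents @ W_in_z], axis=0) + t  # timestep as a crude bias
    q, k, v = tok @ Wq, tok @ Wk, tok @ Wv
    attn = softmax(q @ k.T / np.sqrt(WIDTH)) @ v   # full cross-modal self-attention
    eps_a = attn[:CHUNK] @ W_out_a                  # predicted noise on actions
    eps_z = attn[CHUNK:] @ W_out_z                  # predicted noise on future latents
    return eps_a, eps_z

# Reverse diffusion: start from noise, jointly denoise both modalities.
actions = rng.normal(size=(CHUNK, ACT_D))
latents = rng.normal(size=(FUT, LAT_D))
for t in np.linspace(1.0, 0.0, 10):
    eps_a, eps_z = denoise_step(actions, latents, t)
    actions -= 0.1 * eps_a
    latents -= 0.1 * eps_z
```

The single shared attention pass is the key point: denoising the action chunk can exploit the model's own forecast of the future scene, and vice versa.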

Large-scale Heterogeneous Embodied Data

EI-30k Dataset

We collect EI-30k, comprising more than 30k hours of diverse human and robot interaction data, which spans varying episode lengths and manipulation tasks.

Latent Forward Dynamics Visualization

Forward and inverse dynamics operate in a structured DINO latent space, which avoids redundant pixel-space appearance modeling and enables the model to focus on task-relevant dynamics features.
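The forward/inverse dynamics pairing described above can be sketched as two small heads over the latent space. This is a toy numpy illustration under assumed dimensions, not the model's real parameterization; the point is the interface: forward dynamics maps (latent, action) to the next latent, while inverse dynamics maps a latent transition back to an action, which is what lets action-free data such as human videos supervise the model.

```python
import numpy as np

# Hypothetical sizes; the actual model operates on DINO features.
LAT_D, ACT_D, HID = 32, 14, 64
rng = np.random.default_rng(1)
W1f = rng.normal(0, 0.1, (LAT_D + ACT_D, HID))  # forward-dynamics weights
W2f = rng.normal(0, 0.1, (HID, LAT_D))
W1i = rng.normal(0, 0.1, (2 * LAT_D, HID))      # inverse-dynamics weights
W2i = rng.normal(0, 0.1, (HID, ACT_D))

def forward_dynamics(z_t, a_t):
    """Predict the next DINO latent from the current latent and action.
    Appearance detail stays in the (frozen) DINO encoder, so this head
    only has to capture task-relevant state changes."""
    h = np.tanh(np.concatenate([z_t, a_t]) @ W1f)
    return h @ W2f

def inverse_dynamics(z_t, z_next):
    """Recover the action that explains a latent transition, enabling
    supervision from demonstrations without recorded actions."""
    h = np.tanh(np.concatenate([z_t, z_next]) @ W1i)
    return h @ W2i

z_t = rng.normal(size=LAT_D)
a_t = rng.normal(size=ACT_D)
z_pred = forward_dynamics(z_t, a_t)   # latent-space rollout, no pixels
a_pred = inverse_dynamics(z_t, z_pred)
```

Because both heads live entirely in latent space, rollouts never decode back to pixels, which is what avoids the redundant appearance modeling the text mentions.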

Each visualization shows, left to right: the original RGB video, its latent representation, and the model prediction.
Dexterous Manipulation
Human Demonstration
Pillow Task
Rubbish Disposal

Scaling Analysis

Data Scaling: Universal Embodied Data Ingestion

Data Scaling Results

LDA's action prediction error decreases consistently as training data scales from 5k to 30k hours, with co-training across all four tasks.

Model Scaling: Stable Large-Scale Training

Model Scaling Results

LDA-1B scales stably thanks to latent dynamics and the mixed-frequency transformer.
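The page does not detail the mixed-frequency transformer, but a plausible reading is that heterogeneous sources arrive at different rates (camera latents at a few Hz, robot actions at a much higher control rate), so tokens carry explicit timestamps and are merged into one time-ordered stream. The sketch below is purely an assumed illustration of that layout; the rates and tagging scheme are hypothetical.

```python
# Hypothetical rates: vision latents at 5 Hz, actions at 30 Hz, 1 s horizon.
VIS_HZ, ACT_HZ, HORIZON = 5, 30, 1.0

vis_times = [i / VIS_HZ for i in range(int(VIS_HZ * HORIZON))]
act_times = [i / ACT_HZ for i in range(int(ACT_HZ * HORIZON))]

# One merged, time-sorted token stream; each token keeps a modality tag
# so attention (and any frequency-specific embedding) can tell them apart.
stream = sorted([(t, "vis") for t in vis_times] +
                [(t, "act") for t in act_times])
```

Keeping per-token timestamps rather than assuming a fixed grid is one way a single model can ingest robots and videos recorded at different control frequencies.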

Real-World Results

Gripper Manipulation
Success Rate Comparison on Real-World Gripper Manipulation Tasks
All models are adapted few-shot on Galbot. LDA-1B outperforms GR00T-N1.6 and π0.5.
Dexterous Manipulation
Success Rate Comparison on Real-World Dexterous Manipulation Tasks
Low- and high-DoF dexterous manipulation.

Real-World Demonstrations

Gripper Manipulation

Pick and Place across Objects/Positions/Backgrounds (6x speed)
Pick & Handover & Place (3x speed)
Clean the Rubbish (4x speed)
Flip Box (1x speed)
Water Flower (6x speed)
Sweep the Nails into the Shovel (6x speed)
Knock the Block with Hammer (2x speed)
Wipe the Board (4x speed)

Dexterous Manipulation with High-DoF Dexterous Hand

Use a Clamp to Place an Object (1x speed)
Flip the Bread Over with a Shovel (1.5x speed)
Use a Glue Gun to Stick the Lid (4x speed)
Pick and Place (1x speed)
Use a Spoon to Scoop (2x speed)

Dexterous Manipulation with Low-DoF Dexterous Hand

Pick and Place (1x speed)
Use a Hammer to Pull Out the Nail (1x speed)
Open the MacBook Lid (1x speed)
Open the Oven (1x speed)
Unscrew the Cap (10x speed)
Lift the Lid Bimanually (1x speed)

Robot Platforms

Real-world Robot Platforms

(1) Galbot G1 equipped with a standard two-finger parallel gripper for basic grasping tasks; (2) Galbot G1 fitted with the SharpaWave dexterous hand (22 DoF) for fine manipulation; (3) Unitree G1 mounted with the BrainCo dexterous hand (10 DoF) and a Zed Mini camera.

BibTeX

@article{lyu2026lda,
  title={LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion},
  author={Lyu, Jiangran and Liu, Kai and Zhang, Xuheng and Liao, Haoran and Feng, Yusen and Zhu, Wenxuan and Shen, Tingrui and Chen, Jiayi and Zhang, Jiazhao and Dong, Yifei and others},
  journal={arXiv preprint arXiv:2602.12215},
  year={2026}
}