LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion

Jiangran Lyu*1,2 Kai Liu*2,3,4 Xuheng Zhang*1,2 Haoran Liao2,6 Yusen Feng1,2 Wenxuan Zhu1 Tingrui Shen1 Jiayi Chen1,2 Jiazhao Zhang1,2 Yifei Dong1 Wenbo Cui2,3,4 Senmao Qi2 Shuo Wang2 Yixin Zheng2,3,4 Mi Yan1,2 Xuesong Shi2 Haoran Li3 Dongbin Zhao3 Ming-Yu Liu7 Zhizheng Zhang2,† Li Yi5,† Yizhou Wang1,† He Wang1,2,†
* Equal contribution    † Corresponding author
1Peking University    2Galbot    3CASIA    4BAAI    5Tsinghua University    6Sun Yat-sen University    7NVIDIA
LDA-1B Model Overview

Abstract

Recent robot foundation models largely rely on large-scale behavior cloning, which imitates expert actions but discards the transferable dynamics knowledge embedded in heterogeneous embodied data. While the Unified World Model (UWM) formulation has the potential to leverage such diverse data, existing instantiations struggle to scale to the foundation level due to coarse data usage and fragmented datasets. We introduce LDA-1B, a robot foundation model that scales through universal embodied data ingestion by jointly learning dynamics, policy, and visual forecasting, assigning distinct roles to data of varying quality. To support this regime at scale, we assemble and standardize EI-30k, an embodied interaction dataset comprising over 30k hours of human and robot trajectories in a unified format. Scalable dynamics learning over such heterogeneous data is enabled by operating in a structured DINO latent space, which avoids redundant pixel-space appearance modeling. Complementing this representation, LDA-1B employs a multi-modal diffusion transformer to handle asynchronous vision and action streams, enabling stable training at the 1B-parameter scale. Experiments in simulation and the real world show that LDA-1B outperforms prior methods (e.g., π0.5) by up to 21%, 48%, and 23% on contact-rich, dexterous, and long-horizon tasks, respectively. Notably, LDA-1B enables data-efficient fine-tuning, gaining 10% by leveraging 30% low-quality trajectories that would typically be harmful and discarded.

Universal Embodied Data Ingestion

Model Architecture

LDA-1B Architecture

The model jointly denoises action chunks and predicts future visual features within a unified multi-modal diffusion transformer framework.
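A minimal sketch of this joint objective, assuming a flow-matching-style parameterization (the page states only that action chunks and future visual features are denoised jointly); the function and model interface here are illustrative, not the released implementation:

import torch
import torch.nn.functional as F

def joint_diffusion_loss(model, obs_latents, action_chunk, future_latents):
    """obs_latents:    (B, N_obs, D)  current DINO patch features
       action_chunk:   (B, H, A)      H future actions to denoise
       future_latents: (B, N_fut, D)  DINO features of a future frame"""
    B = action_chunk.shape[0]
    t = torch.rand(B, device=action_chunk.device)       # shared diffusion time
    noise_a = torch.randn_like(action_chunk)
    noise_v = torch.randn_like(future_latents)
    # Linear interpolation toward noise (flow-matching; an assumption here).
    ta = t.view(B, 1, 1)
    noisy_a = (1 - ta) * action_chunk + ta * noise_a
    noisy_v = (1 - ta) * future_latents + ta * noise_v
    # One transformer denoises both modalities from a shared token stream.
    pred_a, pred_v = model(obs_latents, noisy_a, noisy_v, t)
    return F.mse_loss(pred_a, noise_a - action_chunk) + \
           F.mse_loss(pred_v, noise_v - future_latents)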

EI-30k Dataset

EI-30k Dataset

The dataset contains more than 30k hours of diverse human and robot interaction data, spanning a wide range of episode lengths and manipulation tasks.
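As an illustration of what "a unified format" for such ingestion could look like, here is a hypothetical per-episode record; all field names are our assumptions, not the released schema:

from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class EpisodeRecord:
    source: str                    # e.g. "robot/galbot_g1", "human/ego_video"
    embodiment: str                # morphology / end-effector tag
    quality: float                 # scalar quality score used for role assignment
    rgb: np.ndarray                # (T, H, W, 3) uint8 frames
    actions: Optional[np.ndarray]  # (T, A), or None for action-free human video
    proprio: Optional[np.ndarray]  # (T, P) joint states, if available
    language: str = ""             # task instruction, possibly empty
    meta: dict = field(default_factory=dict)

Normalizing heterogeneous sources into one record type is what lets a single dataloader serve human video, teleoperation logs, and robot rollouts alike.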

Latent Forward Dynamics Visualization

LDA-1B operates in a structured DINO latent space, which avoids redundant pixel-space appearance modeling and enables the model to focus on task-relevant features.
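A rough sketch of this design, assuming a frozen DINOv2 backbone loaded via torch.hub (the page says only "DINO latent space"):

import torch

# Frozen feature extractor; dynamics are learned over its patch tokens
# rather than over pixels.
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
encoder.eval().requires_grad_(False)

@torch.no_grad()
def encode_frames(frames):                    # (B, 3, 224, 224), ImageNet-normalized
    feats = encoder.forward_features(frames)
    return feats["x_norm_patchtokens"]        # (B, 256, 768) patch latents

Forecasting targets are future-frame latents produced the same way, so the model never spends capacity reconstructing appearance details in pixel space.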

Each visualization shows, side by side, the original RGB video, its latent representation, and the model's prediction, for four examples: Dexterous Manipulation, Human Demonstration, Pillow Task, and Rubbish Disposal.

Scaling Analysis

Data Scaling: Universal Embodied Data Ingestion

Data Scaling Results

Co-training (blue) steadily reduces error as diverse data is added; policy-only training (grey) degrades when low-quality data is included. Our framework turns such noise into useful supervision.
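One way to read "assigning distinct roles to data of varying quality" is as loss masking: every sample supervises visual forecasting, action-labeled samples also supervise dynamics, and only high-quality demonstrations supervise the policy. A hypothetical sketch, with the threshold and masks as assumptions:

import torch

def masked_objective(loss_policy, loss_dynamics, loss_forecast,
                     quality, has_actions, q_min=0.7):
    """All losses are per-sample tensors of shape (B,)."""
    policy_mask = (quality >= q_min) & has_actions   # expert-like data only
    dyn_mask = has_actions                           # any action-labeled data
    total = loss_forecast.mean()                     # video is always usable
    if dyn_mask.any():
        total = total + loss_dynamics[dyn_mask].mean()
    if policy_mask.any():
        total = total + loss_policy[policy_mask].mean()
    return total

Under this scheme a noisy trajectory never pushes the policy toward bad actions, yet still teaches the model how the world responds to them.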

Model Scaling: Stable Large-Scale Training

Model Scaling Results

LDA (blue) improves steadily; UWM (grey) saturates and then degrades as more data is added. LDA-1B scales stably thanks to its latent dynamics objective and mixed-frequency transformer.
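A sketch of how asynchronous streams might be tokenized, assuming per-token timestamp embeddings and time-ordered interleaving; module names are illustrative, not the released architecture:

import torch
import torch.nn as nn

class MixedFrequencyStream(nn.Module):
    def __init__(self, d_model=1024, vis_dim=768, act_dim=32):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.act_proj = nn.Linear(act_dim, d_model)
        self.time_mlp = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(),
                                      nn.Linear(d_model, d_model))

    def forward(self, vis_tokens, vis_t, act_tokens, act_t):
        """vis_t, act_t: per-token timestamps in seconds; the two streams
        may tick at different rates (e.g. 10 Hz vision vs. 50 Hz actions)."""
        v = self.vis_proj(vis_tokens) + self.time_mlp(vis_t.unsqueeze(-1))
        a = self.act_proj(act_tokens) + self.time_mlp(act_t.unsqueeze(-1))
        tokens = torch.cat([v, a], dim=1)
        # Sort the combined sequence by timestamp so attention sees one
        # coherent, asynchronous stream.
        order = torch.argsort(torch.cat([vis_t, act_t], dim=1), dim=1)
        return torch.gather(tokens, 1, order.unsqueeze(-1).expand_as(tokens))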

Real-World Results

Gripper (Galbot)
Success Rate Comparison on Real-World Gripper Manipulation Tasks
Eight tasks spanning pick-and-place, contact-rich, fine-grained, and long-horizon manipulation. All models are few-shot fine-tuned on the Galbot platform. LDA-1B outperforms GR00T-N1.6 and π0.5.
Dexterous
Success Rate Comparison on Real-World Dexterous Manipulation Tasks
Five tasks: three on the BrainCo hand (low-DoF) and two on the Sharpa hand (high-DoF). LDA-1B outperforms all baselines, with the largest gains on fine-grained and high-DoF tasks.

Real-World Demonstrations

Gripper Manipulation on a Humanoid Robot

Generic Pick and Place (8x speed)
Handover (4x speed)
Throw Rubbish (5x speed)
Flip Box (1x speed)
Water Flower (5x speed)
Sweep Table (8x speed)
Beat Block (1x speed)
Wipe Board (4x speed)

Dexterous Manipulation with SharpaWave Dexterous Hand

Clip Cake (1x speed)
Flip Bread (1.5x speed)
Glue Lid (4x speed)
Pick and Place Cake (1x speed)
Scoop Toy (2x speed)

Dexterous Manipulation with BrainCo Dexterous Hand

Pick and Place Bottle (1x speed)
Remove Nail (1x speed)
Open MacBook (1x speed)
Open Oven (1x speed)
Unscrew Cap (10x speed)
Lift Lid (1x speed)

Robot Platforms

Real-world Robot Platforms

Real-world robot platforms used in our physical experiments. From left to right: (1) Galbot G1 equipped with a standard two-finger parallel gripper for basic grasping tasks; (2) Galbot G1 fitted with the SharpaWave dexterous hand (22 DoF) for fine manipulation; (3) Unitree G1 equipped with the BrainCo dexterous hand (10 DoF) and a ZED Mini camera. This multi-platform setup demonstrates the generalization capability of our LDA model across diverse robot morphologies and end-effectors.

BibTeX

@article{lyu2025lda1b,
  title={LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion},
  author={Lyu, Jiangran and Liu, Kai and Zhang, Xuheng and Liao, Haoran and Feng, Yusen and Zhu, Wenxuan and Shen, Tingrui and Chen, Jiayi and Zhang, Jiazhao and Dong, Yifei and Cui, Wenbo and Qi, Senmao and Wang, Shuo and Zheng, Yixin and Yan, Mi and Shi, Xuesong and Li, Haoran and Zhao, Dongbin and Liu, Ming-Yu and Zhang, Zhizheng and Yi, Li and Wang, Yizhou and Wang, He},
  journal={arXiv preprint},
  year={2025}
}