DyWA: Dynamics-adaptive World Action Model
for Generalizable Non-prehensile Manipulation

¹CFCS, School of Computer Science, Peking University, ²Galbot

indicates Corresponding Authors


After being trained in simulation, our policy achieves zero-shot sim-to-real transfer and generalizes across diverse dynamic properties, including variations in object geometry, object physical properties (e.g., slipperiness and non-uniform mass distribution), and surface friction.


Abstract


Non-prehensile manipulation is crucial for handling objects that are too thin, large, or otherwise ungraspable in unstructured environments. While conventional planning-based approaches struggle with complex contact modeling, learning-based methods have recently emerged as a promising alternative. However, existing learning-based approaches face two major limitations: they rely heavily on multi-view cameras and precise pose tracking, and they fail to generalize across varying physical conditions, such as changes in object mass and table friction. To address these challenges, we propose the Dynamics-Adaptive World Action Model (DyWA), a novel framework that enhances action learning by jointly predicting future states while adapting to dynamics variations based on historical trajectories. By unifying the modeling of geometry, state, physics, and robot actions, DyWA enables more robust policy learning under partial observability. Compared to baselines, our method improves the success rate by 31.5% in simulation using only single-view point cloud observations. Furthermore, DyWA achieves an average success rate of 68% in real-world experiments, demonstrating its ability to generalize across diverse object geometries, adapt to varying table friction, and remain robust in challenging scenarios such as half-filled water bottles and slippery surfaces.



Experiments

Simulation and Real-World Setups

We argue that a generalizable non-prehensile manipulation policy in a realistic robotic setting should not only accommodate diverse object geometries but also adapt to varying physical properties, all while relying solely on a single-camera setup without the need for additional tracking modules. Left: Simulation Setups. Right: Real-World Setups and 10 unseen objects for evaluation.

Simulation Rollouts

Generalize across Diverse Geometries

Generalize across Different Table Frictions

Handling Objects with Non-uniform Mass Distribution

Combining with VLM and Grasping

Quantitative Results in the Real World

We evaluate our model's generalization ability by comparing it with CORN, which relies on an external tracking module for object pose estimation in real-world experiments. Our method achieves accurate manipulation across diverse objects without external pose tracking, significantly outperforming CORN with an average success rate of 68% versus 36%.


DyWA Methods


Our World Action Model processes the embeddings of the current observation (partial point cloud, end-effector pose, and joint state) and the goal point cloud (transformed from the initial partial observation) to predict the robot action and next state. Additionally, an adaptation module encodes historical observations and actions, decoding them into the dynamics embedding that conditions the model via FiLM. A pre-trained RL teacher policy (right) supervises both the action and adaptation embedding using privileged full point cloud and physics parameter embeddings.
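The FiLM conditioning mentioned above can be illustrated with a minimal sketch. FiLM (feature-wise linear modulation) predicts a per-channel scale and shift from a conditioning vector and applies them to a feature vector. The code below is not the DyWA implementation: the function names, the use of a single linear layer to produce the scale and shift, and the tiny dimensions are all assumptions chosen for illustration.

```python
import random


def linear(weights, bias, x):
    """Affine map y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features, dynamics_embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation.

    gamma (scale) and beta (shift) are predicted from the conditioning
    vector, then applied channel by channel: out = gamma * features + beta.
    """
    gamma = linear(w_gamma, b_gamma, dynamics_embedding)
    beta = linear(w_beta, b_beta, dynamics_embedding)
    return [g * f + b for g, f, b in zip(gamma, features, beta)]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes; the real embedding dimensions are not given here.
FEATURE_DIM, DYNAMICS_DIM = 4, 3
rng = random.Random(0)

features = [rng.uniform(-1.0, 1.0) for _ in range(FEATURE_DIM)]
dynamics_embedding = [rng.uniform(-1.0, 1.0) for _ in range(DYNAMICS_DIM)]

w_gamma = random_matrix(FEATURE_DIM, DYNAMICS_DIM, rng)
b_gamma = [1.0] * FEATURE_DIM  # scale starts near identity
w_beta = random_matrix(FEATURE_DIM, DYNAMICS_DIM, rng)
b_beta = [0.0] * FEATURE_DIM   # shift starts near zero

modulated = film(features, dynamics_embedding,
                 w_gamma, b_gamma, w_beta, b_beta)
print(modulated)
```

In this sketch, the dynamics embedding only changes how existing features are scaled and shifted, so the same backbone can be reused while its behavior adapts to the inferred dynamics. Initializing the scale bias to one and the shift bias to zero makes the layer start close to an identity map, which is a common choice but also an assumption here.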



Acknowledgments

We would like to give special thanks to Jiayi Chen for fruitful discussions and Junhao Yang for Blender rendering.



BibTeX

@article{lyu2025dywa,
  title={DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation},
  author={Lyu, Jiangran and Li, Ziming and Shi, Xuesong and Xu, Chaoyi and Wang, Yizhou and Wang, He},
  journal={arXiv preprint arXiv:2503.16806},
  year={2025}
}