PartManip: Learning Cross-Category Generalizable Part Manipulation Policy from Point Cloud Observations

CVPR 2023

Haoran Geng1, 2*    Ziming Li1,2*     Yiran Geng1, 2    Jiayi Chen1,3    Hao Dong1,2     He Wang1,2†   

1CFCS, Peking University    2School of EECS, Peking University    3Beijing Academy of Artificial Intelligence   

* equal contribution   † corresponding author


We introduce PartManip, a large-scale cross-category part manipulation benchmark with diverse object datasets, realistic settings, and rich annotations. We also propose a generalizable vision-based policy learning strategy that improves part-based object manipulation performance by a large margin and generalizes to unseen object categories and to novel objects in the real world.



Learning a generalizable object manipulation policy is vital for an embodied agent to work in complex real-world scenes. Parts, as the shared components in different object categories, have the potential to increase the generalization ability of the manipulation policy and achieve cross-category object manipulation. In this work, we build the first large-scale, part-based cross-category object manipulation benchmark, PartManip, which is composed of 11 object categories, 494 objects, and 1432 tasks in 6 task classes. Compared to previous work, our benchmark is also more diverse and realistic, i.e., having more objects and using sparse-view point clouds as input without oracle information like part segmentation. To tackle the difficulties of vision-based policy learning, we first train a state-based expert with our proposed part-based canonicalization and part-aware rewards, and then distill the knowledge to a vision-based student. We also find that an expressive backbone is essential to overcome the large diversity of different objects. For cross-category generalization, we introduce domain adversarial learning for domain-invariant feature extraction. Extensive experiments in simulation show that our learned policy can outperform other methods by a large margin, especially on unseen object categories. We also demonstrate that our method can successfully manipulate novel objects in the real world.
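The abstract does not say how the domain adversarial module is implemented. A common way to realize domain adversarial learning is a gradient reversal layer in the style of DANN: the layer is the identity in the forward pass and multiplies gradients by -λ in the backward pass, so a domain classifier learns to tell domains apart while the feature extractor is pushed to make them indistinguishable. The NumPy sketch below assumes that technique and uses a linear feature extractor `W` and a binary linear domain head `v` as hypothetical stand-ins for the vision backbone and a domain discriminator (with object categories as domains, the head would be multi-class). All names (`GradientReversal`, `grl_lambda`, `adversarial_step`) are illustrative, not the paper's code.

```python
import numpy as np


def grl_lambda(progress, gamma=10.0):
    """DANN-style weight schedule: 0 at the start of training, approaching 1 at the end.

    `progress` is the fraction of training completed, in [0, 1].
    """
    return 2.0 / (1.0 + np.exp(-gamma * progress)) - 1.0


class GradientReversal:
    """Identity in the forward pass; multiplies incoming gradients by -lam in the backward pass."""

    def __init__(self, lam):
        self.lam = lam

    def forward(self, features):
        return features

    def backward(self, grad_output):
        return -self.lam * grad_output


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def domain_loss(W, v, x, y):
    """Binary cross-entropy (from logits) of a linear domain head v on features W @ x."""
    z = v @ (W @ x)
    return np.logaddexp(0.0, z) - y * z


def adversarial_step(W, v, x, y, lam, lr=1e-2):
    """One manual gradient step on a linear feature extractor W and a linear domain head v.

    The domain head v descends the domain loss. The gradient reaching W passes
    through the reversal layer, so W ascends it, i.e. the features become less
    useful for telling domains apart.
    """
    grl = GradientReversal(lam)
    f = grl.forward(W @ x)
    dz = sigmoid(v @ f) - y            # dL/dz for sigmoid + binary cross-entropy
    grad_v = dz * f
    grad_f = grl.backward(dz * v)      # reversed gradient flowing into the extractor
    grad_W = np.outer(grad_f, x)
    return W - lr * grad_W, v - lr * grad_v


rng = np.random.default_rng(0)
W, v, x = rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=3)
W_new, v_new = adversarial_step(W, v, x, y=1.0, lam=grl_lambda(0.5))
print(domain_loss(W, v, x, 1.0), domain_loss(W_new, v, x, 1.0), domain_loss(W, v_new, x, 1.0))
```

In this toy, updating `W` raises the domain loss while updating `v` lowers it, which is the min-max game the reversal layer sets up with a single backward pass.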

PartManip Benchmark

By utilizing the definition of the generalizable and actionable part (GAPart) presented in GAPartNet, we build a benchmark for comprehensively evaluating cross-category generalizable policy learning approaches. GAParts are parts that share similar geometry and similar interaction strategies across different object categories. For example, the handles on tables are often similar to those on safes, so a handle can be regarded as a GAPart. Because a GAPart can be manipulated in the same general way regardless of the object category it belongs to, it makes cross-category generalization possible. We therefore expect a manipulation policy trained on some object categories to generalize to other, unseen object categories, and we build the first benchmark for cross-category generalizable part manipulation policy learning.
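One way to exploit the shared geometry of a GAPart is to express observations in the part's own coordinate frame, which the pipeline description refers to as canonicalization to the part coordinate frame. The NumPy sketch below is a minimal illustration under stated assumptions: the part pose (rotation `R`, translation `t`) is known, as it would be for a state-based expert in simulation, and the pose convention is p_world = R p_part + t. The function names and the handle geometry are hypothetical.

```python
import numpy as np


def rot_z(theta):
    """Rotation matrix about the z-axis (helper for the example)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def canonicalize_to_part_frame(points, part_rotation, part_translation):
    """Express world-frame points (N, 3) in a part's local frame.

    The part pose is assumed to satisfy p_world = R @ p_part + t, so
    p_part = R.T @ (p_world - t); in row-vector form that is (p - t) @ R.
    """
    return (points - part_translation) @ part_rotation


def place_part(local_points, part_rotation, part_translation):
    """Inverse operation: map part-frame points (N, 3) into the world frame."""
    return local_points @ part_rotation.T + part_translation


# The same handle geometry, mounted on two hypothetical objects at different poses.
handle_local = np.array([[0.0, -0.05, 0.0], [0.0, 0.05, 0.0], [0.02, 0.0, 0.0]])
table_pose = (rot_z(np.pi / 2), np.array([1.0, 2.0, 0.7]))
safe_pose = (np.eye(3), np.array([-0.5, 0.0, 0.3]))
on_table = place_part(handle_local, *table_pose)
on_safe = place_part(handle_local, *safe_pose)

# After canonicalization both observations coincide, so a policy acting in the
# part frame sees the "same" handle regardless of the object category.
canon_table = canonicalize_to_part_frame(on_table, *table_pose)
canon_safe = canonicalize_to_part_frame(on_safe, *safe_pose)
```

Direction vectors such as velocities would be rotated by R.T without subtracting t; the sketch only covers points.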


Furthermore, our benchmark is more diverse and realistic than previous robotic manipulation benchmarks. Diversity means that we have more object instances and categories. Realism means that our observation space contains less oracle information (e.g., no part segmentation masks).
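The page does not describe how the sparse-view point cloud observation is produced. A common approach, assumed here, is to back-project depth images from a few cameras with the pinhole model and merge the results. The sketch assumes an OpenCV-style camera (x right, y down, z forward) and a 4x4 camera-to-world matrix; simulators may use other conventions, and `depth_to_points` and `fuse_views` are hypothetical names.

```python
import numpy as np


def depth_to_points(depth, fx, fy, cx, cy, cam_to_world):
    """Back-project a depth image (H, W) into world-frame points (M, 3).

    Assumes an OpenCV-style pinhole camera (x right, y down, z forward) and a
    4x4 homogeneous camera-to-world transform. Pixels with depth <= 0 are dropped.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column index, v: row index
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    pts_cam = np.stack([x, y, z], axis=1)
    pts_hom = np.hstack([pts_cam, np.ones((len(pts_cam), 1))])
    return (pts_hom @ cam_to_world.T)[:, :3]


def fuse_views(views):
    """Merge the point clouds of a few (sparse) camera views into one observation.

    Each view is a dict with the keyword arguments of depth_to_points.
    """
    return np.concatenate([depth_to_points(**view) for view in views], axis=0)
```

Note that nothing here labels which points belong to which part: the student only receives raw geometry, which is what makes the setting harder than one with oracle segmentation masks.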


We have six classes of tasks: OpenDoor, OpenDrawer, CloseDoor, CloseDrawer, PressButton, and GraspHandle. Although OpenDoor and OpenDrawer also require grasping the handle first, GraspHandle differs from them because it additionally involves another GAPart, the lid, and covers more object categories.


Full pipeline


An overview of our pipeline. We first train a state-based expert policy using our proposed canonicalization to the part coordinate frame and the part-aware reward. We then use the learned expert to collect demonstrations for pre-training the vision-based policy by behavior cloning. After pre-training, we train the vision-based policy to imitate the state-based expert policy using DAgger. We also introduce several point cloud augmentation techniques to boost generalization. For the vision backbone, we adopt a 3D Sparse-UNet, which has high expressive capacity. Furthermore, we introduce an extra domain adversarial learning module for better cross-category generalization.
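The expert-to-student distillation described above (behavior cloning on expert demonstrations, then DAgger) can be sketched on a toy problem. Everything here is a hypothetical stand-in: the "state" is a 2-D vector the expert sees directly, the student sees only a fixed linear re-encoding of it (in place of point cloud features), and the student is a least-squares linear policy rather than a neural network. The sketch shows the data flow only, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setting: the expert sees the 2-D state directly, while the
# student only sees an "observation" (a fixed linear re-encoding of the state,
# standing in for features a vision backbone would extract from a point cloud).
OBS_MATRIX = np.array([[1.0, 0.5], [-0.3, 2.0]])


def expert_action(state):
    """Privileged state-based expert: push the state toward the origin."""
    return -state


def observe(state):
    return OBS_MATRIX @ state


def rollout(policy, n_steps=20, dt=0.1):
    """Roll out `policy` from a random start and return the visited states."""
    state = rng.normal(size=2)
    visited = []
    for _ in range(n_steps):
        visited.append(state)
        state = state + dt * policy(state)
    return visited


def fit_student(observations, actions):
    """Least-squares linear student: action is approximately observation @ K."""
    K, *_ = np.linalg.lstsq(np.array(observations), np.array(actions), rcond=None)
    return K


def distill(n_demos=5, n_dagger_iters=5, rollouts_per_iter=5):
    obs_buf, act_buf = [], []

    # 1) Behavior cloning: fit the student on expert demonstrations.
    for _ in range(n_demos):
        for s in rollout(expert_action):
            obs_buf.append(observe(s))
            act_buf.append(expert_action(s))
    K = fit_student(obs_buf, act_buf)

    # 2) DAgger: roll out the student, label the states it visits with the
    #    expert's actions, aggregate with all previous data, and refit.
    for _ in range(n_dagger_iters):
        def student(s, K=K):
            return observe(s) @ K

        for _ in range(rollouts_per_iter):
            for s in rollout(student):
                obs_buf.append(observe(s))
                act_buf.append(expert_action(s))
        K = fit_student(obs_buf, act_buf)
    return K


K = distill()
```

In this noise-free linear toy, behavior cloning alone already recovers the expert exactly; DAgger matters when the student's own mistakes lead it into states that the expert demonstrations never covered, which is why the student's rollouts are relabeled by the expert.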
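The pipeline mentions point cloud augmentation and a 3D Sparse-UNet backbone without further detail here. The sketch below assumes three common augmentations (random scaling, Gaussian jitter, random point dropout), which may differ from the ones actually used, and shows the voxelization step that sparse 3D convolution networks typically consume: occupied integer voxel coordinates plus per-voxel features (here, the mean point position). The Sparse-UNet itself is not shown, and the function names and default parameters are hypothetical.

```python
import numpy as np


def augment_point_cloud(points, rng, jitter_std=0.005, scale_range=(0.9, 1.1), keep_ratio=0.9):
    """Assumed generic augmentations: random scaling, Gaussian jitter, random point dropout."""
    pts = points * rng.uniform(*scale_range)
    pts = pts + rng.normal(scale=jitter_std, size=pts.shape)
    n_keep = max(1, int(len(pts) * keep_ratio))
    idx = rng.choice(len(pts), size=n_keep, replace=False)
    return pts[idx]


def voxelize(points, voxel_size):
    """Quantize points (N, 3) into sparse voxels.

    Returns the occupied integer voxel coordinates (V, 3), the mean point
    position per voxel (V, 3), and each point's voxel index (N,).
    """
    coords = np.floor(points / voxel_size).astype(np.int64)
    unique_coords, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # inverse shape differs across some NumPy versions
    counts = np.bincount(inverse, minlength=len(unique_coords))
    feats = np.zeros((len(unique_coords), 3))
    np.add.at(feats, inverse, points)  # sum the points falling into each voxel
    feats /= counts[:, None]
    return unique_coords, feats, inverse


rng = np.random.default_rng(0)
cloud = rng.uniform(-0.5, 0.5, size=(2048, 3))
coords, feats, point_to_voxel = voxelize(augment_point_cloud(cloud, rng), voxel_size=0.05)
```

Storing only occupied voxels keeps memory proportional to the surface that was actually observed rather than to the full 3D grid, which is what lets a deep sparse UNet stay affordable on sparse-view inputs.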


We use Isaac Gym as our simulator, and most experiments are conducted in simulation. During training, we can train on multiple different objects in parallel in the simulator.

Quantitative results


These are the quantitative results for the OpenDoor task. Our method outperforms other baselines by a large margin, especially on unseen categories.


The same holds for the OpenDrawer task: our method again achieves state-of-the-art performance, especially on unseen categories.

Qualitative results


These qualitative results show our trained policy in action. Compared to previous works, our policy is part-aware and therefore behaves more like a human. For example, our policy uses handles to open doors and drawers, as shown in our video.


We also apply the output actions of our policy to a robotic arm, both in the simulator and in the real world. Note that the test object is unseen during policy learning. Experiments show that our trained model can successfully transfer to the real world.


Geng, Haoran, Ziming Li, Yiran Geng, Jiayi Chen, Hao Dong, and He Wang. PartManip: Learning Cross-Category Generalizable Part Manipulation Policy from Point Cloud Observations. arXiv preprint arXiv:2303.16958.

Geng, Haoran, Helin Xu, Chengyang Zhao, Chao Xu, Li Yi, Siyuan Huang, and He Wang. GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts. arXiv preprint arXiv:2211.05272.


If you have any questions, please feel free to contact us:

  • Haoran Geng: ghr@…
  • He Wang: hewang@…