GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts

CVPR 2023 Highlight

Haoran Geng1, 2, 3*    Helin Xu4*     Chengyang Zhao1*    Chao Xu5    Li Yi4     Siyuan Huang3     He Wang1,2†   

1CFCS, Peking University    2School of EECS, Peking University    3Beijing Institute for General Artificial Intelligence    4Tsinghua University    5University of California, Los Angeles   

* equal contributions   † corresponding author


We propose to learn generalizable object perception and manipulation skills via Generalizable and Actionable Parts, and present GAPartNet, a large-scale interactive dataset with rich part annotations. We propose a domain generalization method for cross-category part segmentation and pose estimation. Our GAPart definition boosts cross-category object manipulation and transfers to the real world.



For years, researchers have been devoted to generalizable object perception and manipulation, where cross-category generalizability is highly desired yet underexplored. In this work, we propose to learn such cross-category skills via Generalizable and Actionable Parts (GAParts). By identifying and defining 9 GAPart classes (lids, handles, etc.) in 27 object categories, we construct a large-scale part-centric interactive dataset, GAPartNet, where we provide rich, part-level annotations (semantics, poses) for 8,489 part instances on 1,166 objects. Based on GAPartNet, we investigate three cross-category tasks: part segmentation, part pose estimation, and part-based object manipulation. Given the significant domain gaps between seen and unseen object categories, we propose a robust 3D segmentation method from the perspective of domain generalization by integrating adversarial learning techniques. Our method outperforms all existing methods by a large margin on both seen and unseen categories. Furthermore, with part segmentation and pose estimation results, we leverage the GAPart pose definition to design part-based manipulation heuristics that generalize well to unseen object categories in both the simulator and the real world.


GAPart Definition

We rigorously define the GAPart classes so that they are both Generalizable for visual recognition and similar in Actionability, corresponding to the G and A in GAPartNet. The main purpose of this definition is to bridge perception and manipulation, allowing joint learning of vision and interaction. Accordingly, we follow two principles: first, geometric similarity within part classes; second, actionability alignment within part classes.

GAPartNet Dataset

Following the GAPart definition, we construct a large-scale part-centric interactive dataset, GAPartNet, with rich, part-level annotations for both perception and interaction tasks. Our 3D object shapes come from two existing datasets, PartNet-Mobility and AKB-48, which are cleaned and provided with new uniform annotations based on our GAPart definition. The final GAPartNet has 9 GAPart classes, providing semantic labels and pose annotations for 8,489 GAPart instances on 1,166 objects from 27 object categories. On average, each object has 7.3 functional parts. Each GAPart class can be seen on objects from more than 3 object categories, and each GAPart class is found in 8.8 object categories on average, which lays the foundation for our benchmark on generalizable parts.
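As a quick consistency check on the figures quoted above, the average number of parts per object can be recomputed from the two totals. This is only arithmetic on the numbers stated in the text; the variable names are illustrative and nothing here reads the actual dataset.

```python
# Totals quoted in the dataset description above.
num_part_instances = 8489
num_objects = 1166

# Average number of annotated GAPart instances per object.
parts_per_object = num_part_instances / num_objects

print(f"{parts_per_object:.1f} parts per object on average")
```

The result rounds to the 7.3 parts per object reported in the text.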


Full pipeline


An Overview of Our Domain-generalizable Part Segmentation and Pose Estimation Method. We introduce a part-oriented domain adversarial training strategy that handles multi-resolution features and distribution imbalance in order to extract domain-invariant GAPart features. This training strategy addresses the challenges posed by our tasks and dataset, and significantly improves the generalizability of our method for part segmentation and pose estimation.
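Domain adversarial training is commonly implemented with a gradient reversal step: a domain classifier learns to tell domains apart, while the feature extractor receives the negated gradient and is therefore pushed toward features the classifier cannot separate. The scalar sketch below illustrates only that general idea. It is not the authors' implementation and does not model the part-oriented, multi-resolution, or imbalance-handling aspects described above; the function `adversarial_grads` and the parameters `w`, `v`, and `lam` are assumptions introduced for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def adversarial_grads(x: float, w: float, v: float,
                      domain_label: int, lam: float = 1.0):
    """Gradients for one scalar example of gradient-reversal training.

    feature  f = w * x   (toy feature extractor with parameter w)
    logit    z = v * f   (toy domain classifier with parameter v)
    loss     L = binary cross-entropy between sigmoid(z) and domain_label
    """
    f = w * x
    p = sigmoid(v * f)
    loss = -(domain_label * math.log(p)
             + (1 - domain_label) * math.log(1.0 - p))

    dloss_dz = p - domain_label        # BCE gradient w.r.t. the logit
    grad_v = dloss_dz * f              # classifier: ordinary gradient
    grad_w_plain = dloss_dz * v * x    # gradient that would reach the extractor
    grad_w = -lam * grad_w_plain       # reversed and scaled by lam
    return loss, grad_v, grad_w


loss, grad_v, grad_w = adversarial_grads(x=2.0, w=0.5, v=1.0, domain_label=1)
print(loss, grad_v, grad_w)
```

A gradient-descent update with `grad_v` makes the classifier better at predicting the domain, while the same update with the reversed `grad_w` moves the extractor in the direction that increases the domain loss; `lam` sets how strongly the extractor is pushed.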


Cross-Category Part Segmentation

We visualize the results of the different methods on seen and unseen categories, where the red blocks mark failure cases. Our method has fewer failure cases. In particular, on unseen categories our method maintains its performance, while the performance of the other methods drops significantly.


Cross-Category Part Pose Estimation

We visualize the results of part pose estimation. Our method achieves better results on both seen and unseen categories, indicating better generalization across categories.


The following shows the visualization results for these two tasks. The left two figures show the results of cross-category part segmentation and pose estimation on seen and unseen categories, while the right figure shows some of the failure cases. Here we only show the revolute joint estimation results.


Cross-Category Part-based Object Manipulation

The following is a visualization of our part-based object manipulation. Experiments in the simulator show that our approach enables more human-like interactions, whereas the RL algorithm from the ManiSkill benchmark often tries to open a door or drawer by prying and rubbing its edge rather than using the handle.
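To make the idea of a handle-based heuristic concrete, the sketch below turns an estimated handle pose into three gripper waypoints for pulling a drawer open. It is a minimal illustration under stated assumptions, not the heuristic used in our experiments: the `PartPose` fields, the convention that `outward_axis` is a unit vector pointing away from the object, and the `standoff` and `pull` distances are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class PartPose:
    """Hypothetical part pose: handle centre plus a unit axis pointing outward."""
    center: Vec3
    outward_axis: Vec3


def add_scaled(p: Vec3, d: Vec3, s: float) -> Vec3:
    """Return p + s * d, component-wise."""
    return (p[0] + s * d[0], p[1] + s * d[1], p[2] + s * d[2])


def drawer_pull_waypoints(handle: PartPose,
                          standoff: float = 0.10,
                          pull: float = 0.20) -> List[Vec3]:
    """Gripper positions: approach in front of the handle, grasp it, pull outward."""
    pre_grasp = add_scaled(handle.center, handle.outward_axis, standoff)
    grasp = handle.center
    pulled = add_scaled(handle.center, handle.outward_axis, pull)
    return [pre_grasp, grasp, pulled]


handle = PartPose(center=(0.5, 0.0, 0.3), outward_axis=(1.0, 0.0, 0.0))
waypoints = drawer_pull_waypoints(handle)
for name, point in zip(("pre-grasp", "grasp", "pulled"), waypoints):
    print(name, point)
```

Because the waypoints depend only on the part pose and not on the object category, the same routine applies to any object on which a handle is detected, which is the property that lets a part-based heuristic carry over to unseen categories.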


We further tested our method in real-world experiments, which show that it is robust to the domain gap between simulation and the real world. This allows us to produce reliable part segmentation and part pose estimation and, ultimately, to successfully manipulate parts on unseen objects.




  title={GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts},
  author={Geng, Haoran and Xu, Helin and Zhao, Chengyang and Xu, Chao and Yi, Li and Huang, Siyuan and Wang, He},
  journal={arXiv preprint arXiv:2211.05272},


If you have any questions, please feel free to contact us:

  • Haoran Geng: ghrPrevent spamming@Prevent
  • Helin Xu: xuhelin1911Prevent spamming@Prevent
  • Chengyang Zhao: zhaochengyangPrevent spamming@Prevent
  • He Wang: hewangPrevent spamming@Prevent