GAPartNet

GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts

CVPR 2023 Highlight

Haoran Geng¹,²,³, Helin Xu⁴, Chengyang Zhao¹*, Chao Xu⁵, Li Yi⁴, Siyuan Huang³, He Wang¹,²†

¹CFCS, Peking University ²School of EECS, Peking University ³Beijing Institute for General Artificial Intelligence ⁴Tsinghua University ⁵University of California, Los Angeles

* Equal contribution, with the order determined by rolling dice
† Corresponding author

Abstract

For years, researchers have been devoted to generalizable object perception and manipulation, where cross-category generalizability is highly desired yet underexplored. In this work, we propose to learn such cross-category skills via Generalizable and Actionable Parts (GAParts). By identifying and defining 9 GAPart classes (lids, handles, etc.) in 27 object categories, we construct a large-scale part-centric interactive dataset, GAPartNet, where we provide rich, part-level annotations (semantics, poses) for 8,489 part instances on 1,166 objects. Based on GAPartNet, we investigate three cross-category tasks: part segmentation, part pose estimation, and part-based object manipulation. Given the significant domain gaps between seen and unseen object categories, we propose a robust 3D segmentation method from the perspective of domain generalization by integrating adversarial learning techniques. Our method outperforms all existing methods by a large margin on both seen and unseen categories. Furthermore, with part segmentation and pose estimation results, we leverage the GAPart pose definition to design part-based manipulation heuristics that generalize well to unseen object categories in both the simulator and the real world.

Dataset

GAPart Definition

We give a rigorous definition of the GAPart classes: parts in the same class are not only Generalizable for visual recognition but also share similar Actionability, corresponding to the G and A in GAPartNet. The main purpose of this definition is to bridge perception and manipulation and to allow joint learning of vision and interaction. Accordingly, we propose two principles to follow: first, geometric similarity within part classes, and second, actionability alignment within part classes.

GAPartNet Dataset

Following the GAPart definition, we construct a large-scale part-centric interactive dataset, GAPartNet, with rich, part-level annotations for both perception and interaction tasks. Our 3D object shapes come from two existing datasets, PartNet-Mobility and AKB-48, which are cleaned and provided with new uniform annotations based on our GAPart definition. The final GAPartNet has 9 GAPart classes, providing semantic labels and pose annotations for 8,489 GAPart instances on 1,166 objects from 27 object categories. On average, each object has 7.3 functional parts. Each GAPart class can be seen on objects from more than 3 object categories, and each GAPart class is found in 8.8 object categories on average, which lays the foundation for our benchmark on generalizable parts.
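Summary statistics of this kind can be computed from a flat list of part records. The sketch below is illustrative only: the `PartRecord` fields, the toy object IDs and categories, and the helper names are assumptions made for this example, not the dataset's actual annotation format. The last line simply checks the reported average against the counts stated above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PartRecord:
    """Hypothetical flat record: one row per annotated part instance."""
    object_id: str
    object_category: str
    part_class: str


def average_parts_per_object(records):
    """Mean number of part instances per distinct object."""
    object_ids = {r.object_id for r in records}
    return len(records) / len(object_ids)


def categories_per_part_class(records):
    """Number of distinct object categories in which each part class appears."""
    seen = defaultdict(set)
    for r in records:
        seen[r.part_class].add(r.object_category)
    return {part_class: len(cats) for part_class, cats in seen.items()}


# Toy records, made up for illustration (not taken from GAPartNet).
toy_records = [
    PartRecord("obj_0", "Box", "lid"),
    PartRecord("obj_0", "Box", "handle"),
    PartRecord("obj_1", "Drawer", "handle"),
    PartRecord("obj_1", "Drawer", "handle"),
]

toy_average = average_parts_per_object(toy_records)    # 4 parts on 2 objects
toy_coverage = categories_per_part_class(toy_records)  # categories per class

# Check of the reported figure: 8,489 part instances on 1,166 objects.
reported_average = round(8489 / 1166, 1)
```

In the toy data, the "handle" class spans two object categories while "lid" spans one, which is the kind of per-class coverage the 8.8-category average summarizes.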

Methods

Full pipeline

An overview of our domain-generalizable part segmentation and pose estimation method. We introduce a part-oriented domain adversarial training strategy that handles multi-resolution features and distribution imbalance in order to extract domain-invariant GAPart features. This strategy addresses the challenges posed by our tasks and dataset and significantly improves the generalizability of our method for part segmentation and pose estimation.
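This page does not spell out how the adversarial objective is implemented. As a rough illustration of the general idea only, the sketch below applies gradient reversal, a common technique for domain adversarial training, to a toy scalar model. Everything in it is an assumption made for illustration: the one-weight "feature extractor" `w`, the one-weight "domain classifier" `v`, and the function names. It is not the part-oriented strategy described above and does not address multi-resolution features or distribution imbalance.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def domain_loss(w, v, x, domain_label):
    """Binary cross-entropy of a toy domain classifier sigmoid(v * feature)."""
    p = sigmoid(v * (w * x))
    return -(domain_label * math.log(p) + (1 - domain_label) * math.log(1 - p))


def domain_grads(w, v, x, domain_label):
    """Analytic gradients of the domain loss w.r.t. w (extractor) and v (classifier)."""
    feature = w * x
    d_logit = sigmoid(v * feature) - domain_label  # dL/d(logit) for sigmoid + BCE
    grad_v = d_logit * feature
    grad_w = d_logit * v * x
    return grad_w, grad_v


def adversarial_step(w, v, x, domain_label, lr=0.1, lam=1.0):
    """One update: the classifier descends, the extractor gets a reversed gradient."""
    grad_w, grad_v = domain_grads(w, v, x, domain_label)
    v_new = v - lr * grad_v           # classifier: ordinary gradient descent
    w_new = w - lr * (-lam * grad_w)  # extractor: gradient sign is flipped
    return w_new, v_new


w0, v0 = 1.0, 1.0
w1, v1 = adversarial_step(w0, v0, x=1.0, domain_label=1)
```

After one step, the updated classifier weight lowers the domain loss while the updated extractor weight raises it. That opposition is what pushes features toward being uninformative about the domain.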

Results

Cross-Category Part Segmentation

We visualize the results of different methods on seen and unseen categories, with red blocks marking failures. Our method has fewer failure cases. In particular, on unseen categories, our method still performs well while the performance of other methods drops significantly.

Cross-Category Part Pose Estimation

We visualize the part pose estimation results. Our method performs better on both seen and unseen categories and generalizes better across categories.

The following shows the visualization results for these two tasks. The left two figures show the results of cross-category part segmentation and pose estimation on seen and unseen categories, while the right figure shows some of the failure cases. Here we only show the revolute joint estimation results.

Cross-Category Part-based Object Manipulation

The following is a visualization of our part-based object manipulation. Experiments in the simulator show that our approach enables more human-like interactions, whereas the RL algorithm from the ManiSkill benchmark often tries to open a door or drawer by prying at and rubbing its edge rather than using the handle.
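The manipulation heuristics themselves are not described on this page. As a hypothetical sketch of how an estimated part pose could be turned into end-effector waypoints, the code below generates a straight-line pull for a sliding part and an arc for a hinged part. The function names, the pose representation (a handle position plus an axis), and the choice of Rodrigues' rotation formula are assumptions for illustration, not the authors' implementation.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def pull_waypoints(handle_pos, pull_dir, distance, steps):
    """Straight-line waypoints for a sliding (prismatic) part such as a drawer."""
    d = normalize(pull_dir)
    return [
        tuple(p + d_i * distance * k / steps for p, d_i in zip(handle_pos, d))
        for k in range(1, steps + 1)
    ]


def rotate_about_axis(point, pivot, axis, angle):
    """Rotate `point` about the line through `pivot` along `axis` (Rodrigues' formula)."""
    k = normalize(axis)
    v = tuple(p - q for p, q in zip(point, pivot))
    k_dot_v = sum(a * b for a, b in zip(k, v))
    k_cross_v = (
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    )
    c, s = math.cos(angle), math.sin(angle)
    rotated = tuple(
        v_i * c + kxv_i * s + k_i * k_dot_v * (1 - c)
        for v_i, kxv_i, k_i in zip(v, k_cross_v, k)
    )
    return tuple(r + q for r, q in zip(rotated, pivot))


def arc_waypoints(handle_pos, pivot, axis, angle, steps):
    """Arc waypoints for a hinged (revolute) part such as a door."""
    return [
        rotate_about_axis(handle_pos, pivot, axis, angle * k / steps)
        for k in range(1, steps + 1)
    ]


# Toy poses (made up): pull a drawer 0.2 m along +x, swing a door 90 degrees about z.
drawer_path = pull_waypoints((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), 0.2, 2)
door_path = arc_waypoints((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), math.pi / 2, 2)
```

Because the waypoints depend only on the part's pose and joint type, not on the object category, a heuristic of this shape can in principle be reused across categories that share the same part class.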

We further tested our method in real-world experiments, which show that it is robust to the domain gap. This robustness allows us to obtain reliable part segmentation and part pose estimation, and ultimately to successfully manipulate parts of unseen objects.

Note that this real-world experiment was conducted with a KINOVA arm.

Note that this real-world experiment was conducted with a Franka arm, using CONFIG MODE to control the robot.