UniDexGrasp++: Improving Dexterous Grasping Policy Learning via Geometry-aware Curriculum and Iterative Generalist-Specialist Learning

ICCV 2023 Best Paper Finalist


Weikang Wan*    Haoran Geng*    Yun Liu    Zikang Shan    Yaodong Yang    Li Yi    He Wang   

* equal contributions   corresponding author  



In this work, we present UniDexGrasp++, a novel dexterous grasping policy learning pipeline. Like UniDexGrasp, UniDexGrasp++ is trained on 3000+ different object instances with random object poses in a table-top setting. It significantly outperforms the previous SOTA, achieving 85.4% and 78.2% success rates on the train and test sets, respectively.


Abstract


We propose a novel, object-agnostic method for learning a universal policy for dexterous object grasping from realistic point cloud observations and proprioceptive information under a table-top setting, namely UniDexGrasp++. To address the challenge of learning the vision-based policy across thousands of object instances, we propose Geometry-aware Curriculum Learning (GeoCurriculum) and Geometry-aware iterative Generalist-Specialist Learning (GiGSL) which leverage the geometry feature of the task and significantly improve the generalizability. With our proposed techniques, our final policy shows universal dexterous grasping on thousands of object instances with 85.4% and 78.2% success rate on the train set and test set which outperforms the state-of-the-art baseline UniDexGrasp by 11.7% and 11.3%, respectively.


Methods


Overview


Overview. Our method, UniDexGrasp++, follows the convention of first learning a state-based policy and then distilling it into a vision-based policy; our proposed techniques significantly boost both the state-based and vision-based learning stages.

GeoCurriculum


Geometry-aware Curriculum Learning. The idea is to gradually enlarge the grasping task space: from a single object with a fixed pose, to similar objects with similar poses, and finally to thousands of objects with arbitrary poses. To do so, we pretrain a point cloud autoencoder and use its bottleneck feature as the metric on the task space. We can then gradually enlarge the task space from a single point to the whole space, as in the sketch below.
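To make the stage construction concrete, here is a minimal sketch assuming the autoencoder bottleneck codes are already computed; build_curriculum, the seed task, and the radius schedule are illustrative placeholders, not the paper's exact implementation.

# Sketch of GeoCurriculum: rank tasks by distance to a seed task in the
# autoencoder's bottleneck space, then widen the training set stage by stage.
# All names (codes, seed_idx, radii) are illustrative, not the paper's API.
import numpy as np

def build_curriculum(codes, seed_idx, radii):
    """codes: (N, D) bottleneck features of the N scene point clouds;
    seed_idx: index of the single fixed-pose task the curriculum starts from;
    radii: increasing distance thresholds, one per curriculum stage."""
    dists = np.linalg.norm(codes - codes[seed_idx], axis=1)
    # Stage k trains on every task whose code lies within radii[k] of the
    # seed, so the task space grows from one point toward the full set.
    return [np.where(dists <= r)[0] for r in radii]

# Example: 4 stages over 3000 tasks with 128-D codes (random stand-ins here).
codes = np.random.randn(3000, 128).astype(np.float32)
stages = build_curriculum(codes, seed_idx=0, radii=[0.5, 1.0, 2.0, np.inf])
for k, idx in enumerate(stages):
    print(f"stage {k}: {len(idx)} tasks")  # monotonically growing task set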

GeoClustering


Geometry-aware Clustering. After this curriculum training, we obtain our first generalist policy, SG1, which we then need to further improve. Inspired by divide-and-conquer, we use the task metric to partition the whole task space into many subspaces. We then duplicate SG1 and finetune each copy on a smaller task subspace, obtaining many specialists SS_i. Each SS_i outperforms SG1 on its own task subspace; a sketch of the partitioning follows.
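A minimal sketch of the partitioning step, assuming k-means over the same autoencoder codes; the number of clusters and the feature source are assumptions for illustration, not the paper's exact settings.

# Sketch of GeoClustering: partition the task space by k-means on the
# bottleneck codes, then assign each cluster to one specialist.
import numpy as np
from sklearn.cluster import KMeans

codes = np.random.randn(3000, 128).astype(np.float32)  # stand-in codes
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(codes)

# cluster_of[i] tells which specialist SS_i is finetuned on task i;
# at test time the same predict() routes a new scene to its specialist.
cluster_of = kmeans.labels_
subspaces = [np.where(cluster_of == c)[0] for c in range(20)]
print([len(s) for s in subspaces])  # tasks per specialist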

GiGSL


Geometry-aware iterative Generalist-Specialist Learning. All the specialists SS_i can now be distilled back into a new generalist, which attains higher overall grasping performance. We alternate generalist and specialist learning for several iterations until the policy performance is good enough; one such round is sketched below.
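Here is a minimal sketch of one GiGSL round under our reading of the pipeline; finetune_rl, rollout, and bc_update are hypothetical callables standing in for the RL finetuning, data collection, and behavior-cloning steps.

# Sketch of one GiGSL round: finetune each specialist with RL inside its
# cluster, then distill all specialists into a fresh generalist by
# behavior cloning on their rollouts.
import copy

def gigsl_round(generalist, subspaces, finetune_rl, rollout, bc_update):
    # Specialist learning: clone the generalist once per task subspace.
    specialists = []
    for tasks in subspaces:
        spec = copy.deepcopy(generalist)
        finetune_rl(spec, tasks)          # RL only on this cluster's tasks
        specialists.append(spec)

    # Generalist learning: supervise the new generalist with action
    # labels produced by whichever specialist owns each task.
    new_generalist = copy.deepcopy(generalist)
    for spec, tasks in zip(specialists, subspaces):
        demos = rollout(spec, tasks)      # (observation, teacher action) pairs
        bc_update(new_generalist, demos)
    return new_generalist, specialists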

Full pipeline


Method Overview. We propose to first adopt a state-based policy learning stage, followed by a vision-based policy learning stage. The state-based policy takes as input the robot state R_t, the object state S_t, and the geometric feature z of the scene point cloud from the first frame. We leverage a geometry-aware task curriculum (GeoCurriculum) to learn the first state-based generalist policy. This generalist policy is then further improved by iteratively performing specialist finetuning and distilling back into the generalist in our proposed geometry-aware iterative generalist-specialist learning (GiGSL), where the assignment of tasks to specialists is decided by our geometry-aware clustering (GeoClustering). For vision-based policy learning, we first distill the final state-based specialists into an initial vision-based generalist and then run GiGSL on the vision-based generalist, until we obtain the final vision-based generalist with the highest performance.
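The state-to-vision distillation step can be sketched as a simple regression of the student onto teacher actions. Below is a minimal PyTorch sketch under assumed input/output dimensions; VisionStudent and its stand-in point cloud encoder are illustrative, not the paper's architecture.

# Sketch of state-to-vision distillation: the state-based teacher sees
# privileged object state, while the vision student sees only the point
# cloud and proprioception; we regress the student onto teacher actions.
import torch
import torch.nn as nn

class VisionStudent(nn.Module):
    def __init__(self, pc_feat_dim=128, prop_dim=24, act_dim=22):
        super().__init__()
        self.pc_encoder = nn.Sequential(   # stand-in for a PointNet encoder
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, pc_feat_dim))
        self.head = nn.Sequential(
            nn.Linear(pc_feat_dim + prop_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim))

    def forward(self, points, proprio):
        feat = self.pc_encoder(points).max(dim=1).values  # (B, pc_feat_dim)
        return self.head(torch.cat([feat, proprio], dim=-1))

student = VisionStudent()
opt = torch.optim.Adam(student.parameters(), lr=3e-4)

points = torch.randn(8, 1024, 3)    # scene point clouds (stand-in batch)
proprio = torch.randn(8, 24)        # robot proprioceptive state
teacher_act = torch.randn(8, 22)    # actions from the state-based teacher

loss = nn.functional.mse_loss(student(points, proprio), teacher_act)
opt.zero_grad(); loss.backward(); opt.step()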




Results



Training Environment. We train our policies in Isaac Gym on more than 3000 training object instances with arbitrary initial poses.


Performance during Training. At the beginning of training, the geometry-aware task curriculum yields a state-based policy SG1 that achieves an 82.7% success rate. The geometry-aware iterative generalist-specialist learning then keeps increasing the success rate. Note that each distillation inevitably drops some performance, especially the state-to-vision distillation, due to the significant difference in inputs. Finally, our vision-based learning iterations bring the success rate back up to 85.4%.


Visualization of Clusters. Here we visualize the task clusters in the vision stage. Each cluster groups similar objects with similar grasping poses. Note that RL finetuning after distillation usually only hurts performance; within our task clusters, however, the high task similarity allows RL finetuning to further improve performance. This is the key to our specialist learning.


Visualization of Grasping. Here we show grasping processes for different objects in different poses.


Citation


            
@article{wan2023unidexgrasp++,
  title={UniDexGrasp++: Improving Dexterous Grasping Policy Learning via Geometry-aware Curriculum and Iterative Generalist-Specialist Learning},
  author={Wan, Weikang and Geng, Haoran and Liu, Yun and Shan, Zikang and Yang, Yaodong and Yi, Li and Wang, He},
  journal={arXiv preprint arXiv:2304.00464},
  year={2023} 
}
            

Contact


If you have any questions, please feel free to contact us:

  • Weikang Wan: wwk@pku.edu.cn
  • Haoran Geng: ghr@stu.pku.edu.cn
  • He Wang: hewang@pku.edu.cn