UniDexGrasp: Universal Robotic Dexterous Grasping via Learning Diverse Proposal Generation and Goal-Conditioned Policy

CVPR 2023


Yinzhen Xu*1, 2, 3    Weikang Wan*1, 2     Jialiang Zhang*1, 2    Haoran Liu*1, 2    Zikang Shan1     Hao Shen1     Ruicheng Wang1    Haoran Geng1, 2    Yijia Weng4    Jiayi Chen1    Tengyu Liu3    Li Yi5    He Wang†1, 2

1 Center on Frontiers of Computing Studies, Peking University     2 School of EECS, Peking University    3 Beijing Institute for General AI    4 Stanford University    5 Tsinghua University   

* Equal contributions    † Corresponding author



UniDexGrasp via grasp proposal generation and goal-conditioned execution. Left (grasp proposals): for each object, we generate diverse, high-quality grasp poses that vary greatly in rotation, translation, and articulation. Right (grasp execution): given the two different grasp goal poses illustrated in the bottom corners, our highly generalizable goal-conditioned grasping policy adaptively executes each goal, as shown in the green and blue trajectories, respectively.


Video




Abstract


In this work, we tackle the problem of learning universal robotic dexterous grasping from a point cloud observation in a table-top setting. The goal is to grasp and lift objects in high-quality and diverse ways, generalizing across hundreds of categories and even to unseen objects.

Inspired by successful pipelines used in parallel gripper grasping, we split the task into two stages:

  1. grasp proposal (pose) generation;
  2. goal-conditioned grasp execution.
For the first stage, we propose a novel probabilistic model of grasp poses conditioned on the point cloud observation that decouples rotation modeling from translation and articulation modeling. Trained on our synthesized large-scale dexterous grasp dataset, this model lets us sample diverse, high-quality dexterous grasp poses for the object in the point cloud. For the second stage, given the complexity of dexterous grasping execution, we propose to replace the motion planning used in parallel-gripper grasping with a goal-conditioned grasping policy. Note that learning such a highly generalizable policy from realistic inputs alone, without oracle states, is very challenging.
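The factorized sampling scheme can be sketched as follows. This is a minimal numpy illustration, not the actual models: `sample_rotation` and `sample_trans_joints` are hypothetical stand-ins for the learned GraspIPDF and GraspGlow networks, and the joint count of 22 is only an example.

```python
import numpy as np

def sample_rotation(point_cloud, rng):
    """Stand-in for GraspIPDF: sample a rotation matrix from p(R | X).
    Here we just draw a uniformly random rotation for illustration."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q *= np.sign(np.diag(r))       # make the QR decomposition unique
    if np.linalg.det(q) < 0:       # ensure a proper rotation (det = +1)
        q[:, 0] *= -1
    return q

def sample_trans_joints(canonical_cloud, rng, n_joints=22):
    """Stand-in for GraspGlow: sample translation t and joint angles q
    from p(t, q | R^{-1} X) in the rotation-canonicalized frame."""
    t = canonical_cloud.mean(axis=0) + 0.05 * rng.standard_normal(3)
    joints = rng.uniform(0.0, 1.0, n_joints)
    return t, joints

def sample_grasp(point_cloud, rng):
    """Factorized sampling: p(R, t, q | X) = p(R | X) * p(t, q | R^{-1} X)."""
    R = sample_rotation(point_cloud, rng)
    canonical = point_cloud @ R    # row x^T R = (R^{-1} x)^T for orthogonal R
    t_tilde, joints = sample_trans_joints(canonical, rng)
    t = R @ t_tilde                # map translation back to the observation frame
    return R, t, joints

rng = np.random.default_rng(0)
cloud = rng.standard_normal((1024, 3))
R, t, joints = sample_grasp(cloud, rng)
print(R.shape, t.shape, joints.shape)  # (3, 3) (3,) (22,)
```

The point of the factorization is that once a rotation is fixed, the remaining translation-and-articulation distribution is conditioned on a rotation-free (canonicalized) view of the object, which is an easier distribution to learn.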

We thus propose several key innovations: state canonicalization, an object curriculum, and teacher-student distillation. Integrating the two stages, our final pipeline, for the first time, demonstrates universal dexterous grasping on thousands of object instances with a success rate above 60%, significantly outperforming all baselines. Our experiments show a minimal generalization gap between seen and unseen instances, further demonstrating the universality of our method.


Methods


Full pipeline


Our main pipeline. The left part is the first stage, which generates a dexterous grasp proposal. The input is the object point cloud at time step 0, $X_0$, fused from depth images with ground-truth segmentation of the table and the object. A rotation $R$ is sampled from the distribution implied by GraspIPDF, and the point cloud is canonicalized by $R^{-1}$ to $\tilde{X}_0$. GraspGlow then samples the translation $\tilde{\bm{t}}$ and joint angles $\bm{q}$. Next, ContactNet takes $\tilde{X}_0$ and a point cloud $\tilde{X}_H$ sampled from the hand and predicts the ideal contact map $\bm{c}$ on the object. The predicted hand pose is then optimized based on this contact information, and the final goal pose is transformed by $R$ to align with the original visual observation. The right part is the second stage, the goal-conditioned dexterous grasping policy, which takes the goal $\bm{g}$, the point cloud $X_t$, and the robot proprioception $\bm{s}^r_t$ and outputs actions accordingly.
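The contact-based refinement step can be illustrated with a toy numpy sketch: a soft contact map is induced by hand-object distances, and the hand pose is adjusted by gradient descent so that its induced map matches the target map. This is an assumption-laden simplification (translation only, finite-difference gradients, an `exp(-d/tau)` contact model); the paper optimizes the full hand pose against ContactNet's prediction.

```python
import numpy as np

def contact_map(obj_pts, hand_pts, tau=0.2):
    """Soft contact value per object point: exp(-min hand distance / tau)."""
    d = np.linalg.norm(obj_pts[:, None, :] - hand_pts[None, :, :], axis=-1)
    return np.exp(-d.min(axis=1) / tau)

def refine_translation(obj_pts, hand_pts, target_c, steps=50, lr=0.05, eps=1e-4):
    """Descend on a hand translation so the induced contact map matches the
    target map (finite-difference gradients, for illustration only)."""
    t = np.zeros(3)
    for _ in range(steps):
        base = np.mean((contact_map(obj_pts, hand_pts + t) - target_c) ** 2)
        grad = np.zeros(3)
        for k in range(3):
            dt = np.zeros(3)
            dt[k] = eps
            shifted = np.mean((contact_map(obj_pts, hand_pts + t + dt) - target_c) ** 2)
            grad[k] = (shifted - base) / eps
        t -= lr * grad
    return t

rng = np.random.default_rng(1)
obj = rng.standard_normal((64, 3))
hand = 0.5 * rng.standard_normal((32, 3))
# Target map computed at a hidden "ideal" offset the optimizer must recover.
target = contact_map(obj, hand + np.array([0.10, -0.05, 0.02]))
t_hat = refine_translation(obj, hand, target)
before = np.mean((contact_map(obj, hand) - target) ** 2)
after = np.mean((contact_map(obj, hand + t_hat) - target) ** 2)
```

After refinement, the induced contact map fits the target map more closely than the initial pose did, which is the role this step plays in the pipeline.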


Grasping Policy


The goal-conditioned dexterous grasping policy pipeline. $\widetilde{{\mathcal{S}}^{\mathcal{E}}}=(\widetilde{\bm{s}_r},\widetilde{\bm{s}_o},X_O,\widetilde{g})$ and $\widetilde{{\mathcal{S}}^{\mathcal{S}}}=(\widetilde{\bm{s}_r},\widetilde{X_S},\widetilde{g})$ denote the input state of the teacher policy and student policy after state canonicalization, respectively; $\oplus$ denotes concatenation.
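State canonicalization expresses the observations in a frame attached to the hand root, so the policy sees the same input regardless of where the grasp happens in the world. A minimal numpy sketch, assuming a root pose $(p, R)$ and canonicalizing only an object point cloud and a goal translation (the full method also canonicalizes the remaining state components):

```python
import numpy as np

def canonicalize(root_pos, root_rot, obj_pts, goal_pos):
    """Express object points and goal translation in the hand-root frame:
    x_local = R^T (x - p) for root pose (p, R)."""
    obj_local = (obj_pts - root_pos) @ root_rot      # rows: (x - p)^T R = (R^T (x - p))^T
    goal_local = root_rot.T @ (goal_pos - root_pos)
    return obj_local, goal_local

rng = np.random.default_rng(2)
pts = rng.standard_normal((16, 3))
goal = np.array([0.2, 0.1, 0.3])
p, R = np.array([0.0, 0.0, 0.1]), np.eye(3)

# An arbitrary rigid transform (G, s) applied to the whole scene.
G, r = np.linalg.qr(rng.standard_normal((3, 3)))
G *= np.sign(np.diag(r))
if np.linalg.det(G) < 0:
    G[:, 0] *= -1
s = np.array([1.0, -2.0, 0.5])

a_pts, a_goal = canonicalize(p, R, pts, goal)
b_pts, b_goal = canonicalize(G @ p + s, G @ R, pts @ G.T + s, G @ goal + s)
```

Moving the entire scene by any rigid transform leaves the canonicalized inputs unchanged, which is exactly the invariance that makes the policy generalize across grasp goal poses.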




Language-guided Dexterous Grasping



Qualitative results of language-guided grasp proposal selection. CLIP can select proposals that comply with the language instruction, allowing the goal-conditioned policy to execute potentially functional grasps.
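The selection step amounts to ranking proposals by cosine similarity between the instruction's text embedding and each proposal's image embedding. The sketch below uses hand-made stub vectors in place of real CLIP outputs (which would require rendering each proposal and running the CLIP encoders); the instruction text and the three proposal labels are purely hypothetical.

```python
import numpy as np

def select_proposal(text_emb, proposal_embs):
    """Rank grasp proposals by cosine similarity between the text embedding
    of the instruction and the image embedding of each rendered proposal;
    return the index of the best proposal and all scores."""
    t = text_emb / np.linalg.norm(text_emb)
    p = proposal_embs / np.linalg.norm(proposal_embs, axis=1, keepdims=True)
    scores = p @ t
    return int(np.argmax(scores)), scores

# Stub embeddings standing in for CLIP outputs (illustrative values only).
text = np.array([1.0, 0.0, 0.0])            # e.g. "grasp the mug by the handle"
proposals = np.array([[0.9, 0.1, 0.0],      # handle grasp (closest to the text)
                      [0.0, 1.0, 0.0],      # rim grasp
                      [-1.0, 0.0, 0.0]])    # bottom grasp
best, scores = select_proposal(text, proposals)
print(best)  # 0
```

The chosen proposal is then handed to the goal-conditioned policy as the grasp goal, so language only influences which goal is executed, not how it is executed.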




Qualitative results




Citation



@article{xu2023unidexgrasp,
  title={UniDexGrasp: Universal Robotic Dexterous Grasping via Learning Diverse Proposal Generation and Goal-Conditioned Policy},
  author={Xu, Yinzhen and Wan, Weikang and Zhang, Jialiang and Liu, Haoran and Shan, Zikang and Shen, Hao and Wang, Ruicheng and Geng, Haoran and Weng, Yijia and Chen, Jiayi and others},
  journal={arXiv preprint arXiv:2303.00938},
  year={2023}
}


Contact


If you have any questions, please feel free to contact us:

  • Yinzhen Xu: xuyinzhen.hi@gmail.com
  • Weikang Wan: wwk@pku.edu.cn
  • Jialiang Zhang: jackzhang0906@126.com
  • Haoran Liu: lhrrhl0419@163.com
  • He Wang: hewang@pku.edu.cn