Tracking and Reconstructing Hand Object Interactions from Point Cloud Sequences in the Wild

AAAI 2023 (Oral Presentation)


Jiayi Chen1,2*    Mi Yan1*    Jiazhao Zhang1    Yinzhen Xu1,2    Xiaolong Li3    Yijia Weng4    Li Yi5    Shuran Song6    He Wang1†

1CFCS, Peking University    2Beijing Institute for General AI    3Virginia Tech    4Stanford University    5Tsinghua University    6Columbia University

* equal contributions    † corresponding author



Figure 1. Left: We generate a large-scale hand-object interaction dataset, named SimGrasp, using a simulated structured-light depth sensor. Right: Trained only on SimGrasp, our method can be directly transferred to challenging real-world datasets, i.e., HO3D and DexYCB, to track and reconstruct hand-object interactions.


Abstract


In this work, we tackle the challenging task of jointly tracking hand and object poses and reconstructing their shapes from depth point cloud sequences in the wild, given the initial poses at frame 0. We propose, for the first time, a point-cloud-based hand joint tracking network, HandTrackNet, to estimate the inter-frame hand joint motion. HandTrackNet features a novel hand pose canonicalization module that eases the tracking task, yielding accurate and robust hand joint tracking. Our pipeline then reconstructs the full hand by converting the predicted hand joints into the template-based parametric hand model MANO. For object tracking, we devise a simple yet effective module that estimates the object SDF from the first frame and performs optimization-based tracking. Finally, a joint optimization step performs joint hand-object reasoning, which alleviates occlusion-induced ambiguity and further refines the hand pose. During training, the whole pipeline sees only synthetic data, synthesized with sufficient variations and with depth simulation to ease generalization. The whole pipeline is therefore robust to the generalization gap and directly transferable to real in-the-wild data. We evaluate our method on two real hand-object interaction datasets, namely HO3D and DexYCB, without any finetuning. Our experiments demonstrate that the proposed method significantly outperforms previous state-of-the-art depth-based hand and object pose estimation and tracking methods, while running at 9 FPS.
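To illustrate the hand reconstruction step, below is a minimal sketch of how the tracked joints could be converted into a full MANO hand by optimizing the MANO parameters to match them. The mano_layer callable, its parameterization, and the optimization settings are our own assumptions for illustration (any differentiable MANO implementation, e.g. manopth, could play this role); this is not the exact procedure used in the paper.

import torch

def fit_mano_to_joints(mano_layer, joints_target, betas,
                       pose_init=None, iters=100, lr=1e-2):
    """Fit MANO pose and global translation to tracked hand joints.

    mano_layer:    hypothetical differentiable callable (pose, betas) -> (verts, joints)
    joints_target: (B, 21, 3) joints predicted by the tracker at the current frame
    betas:         (B, 10) hand shape code estimated at frame 0, kept fixed here
    """
    B = joints_target.shape[0]
    joints_target = joints_target.detach()
    # 3 global-rotation + 45 articulation parameters (full axis-angle MANO, assumed).
    pose = (pose_init.detach().clone() if pose_init is not None
            else torch.zeros(B, 48)).requires_grad_(True)
    trans = joints_target[:, 0, :].clone().requires_grad_(True)  # init at the wrist joint
    opt = torch.optim.Adam([pose, trans], lr=lr)

    for _ in range(iters):
        opt.zero_grad()
        _, joints = mano_layer(pose, betas)          # joints in the MANO model frame
        loss = ((joints + trans[:, None, :] - joints_target) ** 2).sum(-1).mean()
        loss.backward()
        opt.step()
    return pose.detach(), trans.detach()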


Methods


Full pipeline


Figure 2. At frame 0, we initialize the object shape Mobj, represented as a signed distance function (SDF), and the hand shape code β of the parametric model MANO, as shown by the dotted lines. In the subsequent tracking phase, at each frame t, we first separately estimate the object pose {Rtobj, Ttobj} and the hand pose {Rthand, Tthand, θt}, and then refine the hand pose by taking hand-object interaction into consideration. We can also update the object and hand shapes every 10 frames, as shown in our supplementary material.
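As a concrete illustration of the optimization-based object tracking step, the sketch below refines the object pose at frame t by pushing observed object points onto the zero level set of the object SDF. The object_sdf callable (assumed differentiable), the incremental axis-angle parameterization, and the optimizer settings are our assumptions, not the paper's exact implementation.

import torch

def axis_angle_to_matrix(r):
    # Rodrigues' formula for a single axis-angle vector r of shape (3,).
    theta = torch.sqrt((r * r).sum() + 1e-12)
    k = r / theta
    zero = torch.zeros((), dtype=r.dtype)
    K = torch.stack([torch.stack([zero, -k[2], k[1]]),
                     torch.stack([k[2], zero, -k[0]]),
                     torch.stack([-k[1], k[0], zero])])
    return torch.eye(3, dtype=r.dtype) + torch.sin(theta) * K \
        + (1.0 - torch.cos(theta)) * (K @ K)

def track_object_pose(object_sdf, points_t, R_init, t_init, iters=50, lr=1e-2):
    """Refine the object pose at frame t by optimization against the object SDF.

    object_sdf: assumed differentiable callable, (N, 3) canonical points -> (N,) SDF values
    points_t:   (N, 3) segmented object points at frame t (camera frame)
    R_init, t_init: pose from frame t-1, used as initialization
    """
    r = torch.zeros(3, requires_grad=True)    # incremental rotation (axis-angle)
    dt = torch.zeros(3, requires_grad=True)   # incremental translation
    opt = torch.optim.Adam([r, dt], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        R = axis_angle_to_matrix(r) @ R_init
        t = t_init + dt
        # Map observed points into the object's canonical frame (row-vector form of
        # R^T (p - t)) and penalize their distance to the zero level set.
        p_canon = (points_t - t) @ R
        loss = object_sdf(p_canon).abs().mean()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return axis_angle_to_matrix(r) @ R_init, t_init + dt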


HandTrackNet


Figure 3. HandTrackNet takes as input the hand points Pt from the current frame t and the hand joints Jt-1 estimated at the last frame, and performs global pose canonicalization on both. It then leverages PointNet++ to extract features from the canonicalized hand points Pt', uses each coarse joint Jtcoarse' to query and aggregate features, and finally regresses joint position updates with an MLP.
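The following sketch summarizes the data flow described above. The PointNet++-style backbone, the wrist-centric canonicalization, the k-nearest-neighbor feature query, and all dimensions are illustrative assumptions; the actual network is defined in our released code.

import torch
import torch.nn as nn

class HandTrackNetSketch(nn.Module):
    """Illustrative forward pass: canonicalize, encode, query per-joint features, regress updates."""

    def __init__(self, backbone, feat_dim=128, k=16):
        super().__init__()
        self.backbone = backbone   # hypothetical PointNet++-style encoder: (B, N, 3) -> (B, N, C)
        self.k = k                 # number of neighbours used to query per-joint features
        self.mlp = nn.Sequential(  # regresses a residual position update per joint
            nn.Linear(feat_dim + 3, 128), nn.ReLU(),
            nn.Linear(128, 3))

    @staticmethod
    def canonicalize(points, joints_prev):
        # Global pose canonicalization (simplified to wrist-centering here; the paper
        # additionally normalizes the global hand orientation).
        wrist = joints_prev[:, :1, :]               # (B, 1, 3), assuming joint 0 is the wrist
        return points - wrist, joints_prev - wrist, wrist

    def forward(self, points_t, joints_prev):
        # points_t:    (B, N, 3) hand points at frame t
        # joints_prev: (B, J, 3) joints estimated at frame t-1, used as a coarse init
        pts_c, joints_c, wrist = self.canonicalize(points_t, joints_prev)
        feats = self.backbone(pts_c)                # (B, N, C) per-point features

        # For each coarse joint, gather and max-pool features of its k nearest points.
        idx = torch.cdist(joints_c, pts_c).topk(self.k, largest=False).indices  # (B, J, k)
        J, C = joints_c.shape[1], feats.shape[-1]
        gathered = torch.gather(feats.unsqueeze(1).expand(-1, J, -1, -1), 2,
                                idx.unsqueeze(-1).expand(-1, -1, -1, C))        # (B, J, k, C)
        joint_feat = gathered.max(dim=2).values     # (B, J, C)

        # The MLP regresses per-joint updates in the canonical frame; undo the canonicalization.
        delta = self.mlp(torch.cat([joint_feat, joints_c], dim=-1))
        return joints_c + delta + wrist             # refined joints Jt in the camera frame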




SimGrasp Dataset



Figure 4. We synthesize SimGrasp, a large-scale dynamic dataset with sufficient variations and realistic sensor noise.




Video




Qualitative results


HO3D

From left to right: input point cloud sequence, output overlaid on RGB, output, and output from another view. The video plays at real-time speed.

DexYCB

From left to right: input point cloud sequence, output overlaid on RGB, output, and output from another view. The video plays at real-time speed.



Citation


@article{chen2022tracking,
  title={Tracking and Reconstructing Hand Object Interactions from Point Cloud Sequences in the Wild},
  author={Chen, Jiayi and Yan, Mi and Zhang, Jiazhao and Xu, Yinzhen and Li, Xiaolong and Weng, Yijia and Yi, Li and Song, Shuran and Wang, He},
  journal={arXiv preprint arXiv:2209.12009},
  year={2022}
}

Contact


If you have any questions, please feel free to contact Jiayi Chen at jiayichen_at_pku_edu_cn, Mi Yan at dorisyan_at_pku_edu_cn, or He Wang at hewang_at_pku_edu_cn.