1CFCS, Peking University
2Beijing Institute for General AI
3Virginia Tech
4Stanford University
5Tsinghua University
6Columbia University
* equal contributions
† corresponding author
Figure 1. Left: We generate a large-scale hand-object interaction dataset, named SimGrasp, using a simulated structured-light depth sensor. Right: Trained only on SimGrasp, our method can be directly transferred to challenging real-world datasets, i.e., HO3D and DexYCB, to track and reconstruct hand-object interactions.
In this work, we tackle the challenging task of jointly tracking hand and object poses and reconstructing their shapes from depth point cloud sequences in the wild, given the initial poses at frame 0. We propose the first point cloud based hand joint tracking network, HandTrackNet, to estimate the inter-frame hand joint motion. HandTrackNet features a novel hand pose canonicalization module that eases the tracking task, yielding accurate and robust hand joint tracking. Our pipeline then reconstructs the full hand by converting the predicted hand joints into a template-based parametric hand model, MANO. For object tracking, we devise a simple yet effective module that estimates the object SDF from the first frame and performs optimization-based tracking. Finally, a joint optimization step performs joint hand-object reasoning, which alleviates occlusion-induced ambiguity and further refines the hand pose. During training, the whole pipeline sees only purely synthetic data, which are synthesized with sufficient variations and with depth simulation for ease of generalization. The whole pipeline is robust to the generalization gaps and thus directly transferable to real in-the-wild data. We evaluate our method on two real hand-object interaction datasets, i.e., HO3D and DexYCB, without any finetuning. Our experiments demonstrate that the proposed method significantly outperforms previous state-of-the-art depth-based hand and object pose estimation and tracking methods, running at 9 FPS.
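The object branch above estimates an SDF at frame 0 and then tracks the object pose by optimization. As a rough illustration (not the released implementation), the following PyTorch sketch refines an incremental pose so that the observed object points fall on the SDF zero level set; sdf_query is a hypothetical, differentiable stand-in for the frame-0 SDF, and the hyperparameters are placeholders.

import torch


def track_object_pose(sdf_query, obj_points, R_prev, T_prev, steps=50, lr=1e-2):
    """Refine the object pose so the observed points lie on the SDF zero level set.

    sdf_query:  differentiable callable, (N, 3) points in the object frame -> (N,) signed distances
    obj_points: (N, 3) segmented object points in the camera frame at the current frame
    R_prev, T_prev: (3, 3) rotation and (3,) translation from the previous frame (initialization)
    """
    w = torch.zeros(3, requires_grad=True)    # incremental rotation (axis-angle)
    dT = torch.zeros(3, requires_grad=True)   # incremental translation
    opt = torch.optim.Adam([w, dT], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        # Rodrigues' formula for the incremental rotation dR = exp([w]_x).
        theta = w.norm() + 1e-8
        k = w / theta
        zero = torch.zeros((), dtype=w.dtype)
        K = torch.stack([
            torch.stack([zero, -k[2], k[1]]),
            torch.stack([k[2], zero, -k[0]]),
            torch.stack([-k[1], k[0], zero]),
        ])
        dR = torch.eye(3) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)
        R, T = dR @ R_prev, T_prev + dT
        # Bring observed points into the object's canonical frame: R^T (p - T).
        canonical = (obj_points - T) @ R
        loss = sdf_query(canonical).abs().mean()  # distance of observed points to the surface
        loss.backward()
        opt.step()
    return R.detach(), T.detach()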
Figure 2. At frame 0, we initialize the object shape, represented as a signed distance function (SDF), and the hand shape code β of the parametric MANO model, as indicated by the dotted lines. In the subsequent tracking phase, at each frame t we first separately estimate the object pose {R^t_obj, T^t_obj} and the hand pose {R^t_hand, T^t_hand, θ^t}, and then further refine the hand pose by taking the hand-object interaction into account. The object and hand shapes can also be updated every 10 frames, as shown in our supplementary materials.
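As a rough illustration of the per-frame data flow in Figure 2, the following Python sketch strings the stages together. The callables hand_net, mano_fit, object_tracker, and joint_refine are hypothetical stand-ins for HandTrackNet, MANO fitting, the SDF-based object tracker, and the joint hand-object refinement; they are not the API of the released code.

def track_sequence(frames, hand_net, mano_fit, object_tracker, joint_refine,
                   init_obj_pose, init_joints, beta):
    """frames: iterable of dicts with segmented 'hand_points' and 'object_points'."""
    obj_pose = init_obj_pose       # frame-0 object pose {R, T}
    joints_prev = init_joints      # frame-0 hand joints
    results = []
    for frame in frames:
        # 1) Object: optimization-based pose tracking against the frame-0 SDF.
        obj_pose = object_tracker(frame['object_points'], obj_pose)
        # 2) Hand joints: HandTrackNet predicts the current joints from the hand
        #    points, canonicalized with the previous frame's joint estimate.
        joints = hand_net(frame['hand_points'], joints_prev)
        # 3) Full hand: convert predicted joints into MANO parameters
        #    {R_hand, T_hand, theta} given the fixed shape code beta.
        hand_pose = mano_fit(joints, beta)
        # 4) Joint hand-object reasoning to resolve occlusion-induced ambiguity.
        hand_pose = joint_refine(hand_pose, obj_pose, frame)
        results.append({'obj_pose': obj_pose, 'hand_pose': hand_pose})
        joints_prev = joints
    return results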
Figure 3. HandTrackNet takes as input the hand points P^t from the current frame t and the estimated hand joints J^{t-1} from the last frame, and performs global pose canonicalization on both. It then leverages PointNet++ to extract features from the canonicalized hand points P^t' and uses each coarse joint J^t_coarse' to query and aggregate features, followed by an MLP that regresses and updates the joint positions.
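Below is a minimal PyTorch sketch of the forward pass described in the caption. The arguments backbone and query_fn are hypothetical stand-ins for a PointNet++ encoder and the per-joint feature query, and the canonicalization shown here only centers the inputs by the previous joints' centroid (the actual global pose canonicalization may also normalize the orientation).

import torch
import torch.nn as nn


class HandTrackNetSketch(nn.Module):
    """Illustrative forward pass: canonicalize, extract features, query per joint, regress updates."""

    def __init__(self, backbone, query_fn, feat_dim=128):
        super().__init__()
        self.backbone = backbone      # PointNet++-style encoder: (B, N, 3) -> (B, N, feat_dim)
        self.query_fn = query_fn      # gathers point features around each joint: -> (B, J, feat_dim)
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 3),        # per-joint position update
        )

    def forward(self, points_t, joints_prev):
        # Canonicalization: center the current hand points and last-frame joints
        # using the previous joints' centroid.
        center = joints_prev.mean(dim=1, keepdim=True)        # (B, 1, 3)
        pts_c = points_t - center                              # canonicalized points P^t'
        joints_c = joints_prev - center                        # coarse joints J^t_coarse'
        feats = self.backbone(pts_c)                           # per-point features (B, N, feat_dim)
        joint_feats = self.query_fn(joints_c, pts_c, feats)    # per-joint features (B, J, feat_dim)
        joints_c = joints_c + self.mlp(joint_feats)            # regress and update joint positions
        return joints_c + center                               # back to the camera frame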
Figure 4. We synthesize a large-scale dynamic dataset, SimGrasp, with sufficient variations and realistic sensor noise.
From left to right: input point cloud sequence, output overlaid with RGB, output, and output from another view. The video plays at real-time speed.
@article{chen2022tracking,
title={Tracking and Reconstructing Hand Object Interactions from Point Cloud Sequences in the Wild},
author={Chen, Jiayi and Yan, Mi and Zhang, Jiazhao and Xu, Yinzhen and Li, Xiaolong and Weng, Yijia and Yi, Li and Song, Shuran and Wang, He},
journal={arXiv preprint arXiv:2209.12009},
year={2022}
}
If you have any questions, please feel free to contact Jiayi Chen at jiayichen_at_pku_edu_cn, Mi Yan at dorisyan_at_pku_edu_cn, or He Wang at hewang_at_pku_edu_cn.