1 Peking University
2 Xi’an Jiaotong University
3 Alibaba XR Lab
4 Simon Fraser University
* equal contributions
† corresponding author
Figure 1. Framework overview. From left to right: we leverage domain randomization-enhanced depth simulation to generate paired data, on which we train our depth restoration network, SwinDRNet; the restored depths are then fed to downstream tasks, improving category-level pose estimation and grasping for specular and transparent objects.
Commercial depth sensors usually produce noisy and incomplete depths, especially on specular and transparent objects, which poses critical issues to downstream depth- or point cloud-based tasks. To mitigate this problem, we propose a powerful RGBD fusion network, SwinDRNet, for depth restoration. We further propose the Domain Randomization-Enhanced Depth Simulation (DREDS) approach, which simulates an active stereo depth system using physics-based rendering and generates a large-scale synthetic dataset containing 130K photorealistic RGB images along with simulated depths carrying realistic sensor noise. To evaluate depth restoration methods, we also curate a real-world dataset, namely STD, that captures 30 cluttered scenes composed of 50 objects with materials ranging from specular and transparent to diffuse. Experiments demonstrate that the proposed DREDS dataset bridges the sim-to-real domain gap: trained on DREDS, our SwinDRNet seamlessly generalizes to other real depth datasets, e.g., ClearGrasp, and outperforms competing depth restoration methods at real-time speed. We further show that our depth restoration effectively boosts the performance of downstream tasks, including category-level pose estimation and grasping.
Figure 2. Qualitative comparison to state-of-the-art methods. We compare the proposed SwinDRNet with the state-of-the-art methods LIDF and NLSPN. The figure shows the qualitative comparison on the STD dataset, demonstrating that our method predicts more accurate depths in areas with missing or incorrect values while preserving the depth values in the correct areas of the raw depth map.
Figure 3. Qualitative results of pose estimation on the DREDS and STD datasets. Here we show the qualitative results of different experiments on the DREDS and STD datasets. The ground truths are shown in green, while the estimations are shown in red. We observe that the quality of our predictions is generally better than that of the others. The figure also shows that NOCS, SGPA, and our method all perform better with the help of the restored depth, especially for specular and transparent objects like the mug, bottle, and bowl, which indicates that depth restoration indeed helps the category-level pose estimation task.
Making use of domain randomization and depth simulation, we construct the large-scale simulated dataset DREDS. In total, the DREDS dataset consists of two subsets: 1) DREDS-CatKnown: 100,200 training and 19,380 testing RGBD images of 1,801 objects spanning 7 categories from ShapeNetCore, with randomized specular, transparent, and diffuse materials; 2) DREDS-CatNovel: 11,520 images of 60 category-novel objects, constructed by taking objects from GraspNet-1Billion, which provides CAD models and pose annotations, and changing their materials to specular or transparent, to verify the ability of our method to generalize to novel object categories.
To further examine the proposed method in real scenes, we curate a real-world dataset, STD, composed of Specular, Transparent, and Diffuse objects. Similar to the DREDS dataset, the STD dataset contains: 1) STD-CatKnown: 27,000 RGBD images of 42 objects spanning 7 categories, captured from 25 different scenes with various backgrounds and illuminations; 2) STD-CatNovel: 11,000 RGBD images of 8 category-novel objects from 5 scenes, for evaluating the generalization ability of the proposed method.
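As a minimal sketch, paired data of this kind could be consumed for restoration training as follows. The directory layout (rgb/, sim_depth/, gt_depth/) and the 16-bit millimeter PNG depth encoding are hypothetical assumptions for illustration, not the official release format of DREDS or STD.

import glob
import os

import cv2
import numpy as np
import torch
from torch.utils.data import Dataset

class PairedDepthDataset(Dataset):
    """Loads (RGB, simulated noisy depth, ground-truth depth) triplets."""

    def __init__(self, root):
        # Hypothetical layout: root/rgb/*.png, root/sim_depth/*.png, root/gt_depth/*.png
        self.root = root
        self.rgb_paths = sorted(glob.glob(os.path.join(root, "rgb", "*.png")))

    def __len__(self):
        return len(self.rgb_paths)

    def __getitem__(self, i):
        stem = os.path.splitext(os.path.basename(self.rgb_paths[i]))[0]
        rgb = cv2.cvtColor(cv2.imread(self.rgb_paths[i]), cv2.COLOR_BGR2RGB)
        # Assumed encoding: 16-bit PNG depths in millimeters, converted to meters.
        sim = cv2.imread(os.path.join(self.root, "sim_depth", stem + ".png"),
                         cv2.IMREAD_UNCHANGED).astype(np.float32) / 1000.0
        gt = cv2.imread(os.path.join(self.root, "gt_depth", stem + ".png"),
                        cv2.IMREAD_UNCHANGED).astype(np.float32) / 1000.0
        return {
            "rgb": torch.from_numpy(rgb.astype(np.float32) / 255.0).permute(2, 0, 1),
            "sim_depth": torch.from_numpy(sim)[None],  # noisy network input, (1, H, W)
            "gt_depth": torch.from_numpy(gt)[None],    # restoration target, (1, H, W)
        }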
We propose to synthesize depths with realistic sensor noise patterns by simulating an active stereo depth camera. Our simulator is built on Blender and leverages ray tracing to mimic the IR stereo patterns and compute depths from them. To facilitate generalization, we further adopt domain randomization techniques that randomize object textures, object materials (specular, transparent, and diffuse), object layouts, floor textures, illuminations, and camera poses.
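To illustrate the randomization step, the sketch below samples a scene configuration before each render. All parameter names, ranges, and identifiers are hypothetical placeholders, not the exact values or API used by our simulator.

import random

def sample_scene_config(num_objects=10):
    """Draws one randomized scene configuration (illustrative only)."""
    materials = ["specular", "transparent", "diffuse"]
    return {
        # Per-object material, texture, and layout randomization.
        "objects": [
            {
                "material": random.choice(materials),
                "texture_id": random.randrange(1000),
                # Position on the table plane and yaw angle.
                "position": (random.uniform(-0.3, 0.3),
                             random.uniform(-0.3, 0.3)),
                "yaw_deg": random.uniform(0.0, 360.0),
            }
            for _ in range(num_objects)
        ],
        "floor_texture_id": random.randrange(100),
        # Illumination: light intensity and color temperature.
        "light_intensity": random.uniform(0.5, 2.0),
        "color_temperature_k": random.uniform(4000.0, 6500.0),
        # Camera pose sampled on a hemisphere around the scene.
        "camera_elevation_deg": random.uniform(20.0, 60.0),
        "camera_azimuth_deg": random.uniform(0.0, 360.0),
    }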
Figure 4. Overview of SwinDRNet. To restore noisy and incomplete depths, we propose a Swin Transformer-based depth restoration network, namely SwinDRNet. SwinDRNet takes as input an RGB image Ic along with its aligned depth image Id and outputs a refined depth Îd that restores erroneous areas of the depth image and completes invalid ones. We, for the first time, devise a homogeneous and mirrored architecture that leverages only SwinT backbones to extract and hierarchically fuse the RGB and depth features. In phase 1, we extract multi-scale features from the RGB and depth images, respectively. In phase 2, the network fuses the features of the two modalities. Finally, two decoders generate an initial depth map and confidence maps, and the raw depth and initial depth are fused using the predicted confidence maps.
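The final fusion step can be written compactly: per pixel, the predicted confidence maps softly select between the raw sensor depth and the network's initial prediction. The sketch below is an illustrative PyTorch formulation under these assumptions, not the exact SwinDRNet implementation; tensor names are made up.

import torch

def fuse_depth(raw_depth, init_depth, conf_raw, conf_init):
    """raw_depth, init_depth: (B, 1, H, W); conf_*: unnormalized confidence logits."""
    # Normalize the two confidence maps so they sum to 1 at every pixel.
    conf = torch.softmax(torch.cat([conf_raw, conf_init], dim=1), dim=1)
    # Per-pixel convex combination of the raw depth and the initial prediction:
    # high raw confidence preserves trustworthy sensor values, high initial
    # confidence replaces missing or erroneous regions with the prediction.
    return conf[:, :1] * raw_depth + conf[:, 1:] * init_depth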
@inproceedings{dai2022dreds,
title={Domain Randomization-Enhanced Depth Simulation and Restoration for Perceiving and Grasping Specular and Transparent Objects},
author={Dai, Qiyu and Zhang, Jiyao and Li, Qiwei and Wu, Tianhao and Dong, Hao and Liu, Ziyuan and Tan, Ping and Wang, He},
booktitle={European Conference on Computer Vision (ECCV)},
year={2022}
}
If you have any questions, please feel free to contact Qiyu Dai at qiyudai_at_pku_edu_cn, Jiyao Zhang at zhangjiyao_at_stu_xjtu_edu_cn, or He Wang at hewang_at_pku_edu_cn.