GraspNeRF: Multiview-based 6-DoF Grasp Detection for Transparent and Specular Objects Using Generalizable NeRF

ICRA 2023

Qiyu Dai1,2*, Yan Zhu1*, Yiran Geng1, Ciyu Ruan3, Jiazhao Zhang1,2, He Wang1†

1 Peking University   2 Beijing Academy of Artificial Intelligence   3 National University of Defense Technology  

* equal contributions   † corresponding author  


Figure 1. Overview of the proposed GraspNeRF and the dataset. Our method takes sparse multiview RGB images as input, constructs a neural radiance field, and executes material-agnostic grasp detection within 90ms. We train the model on the proposed large-scale synthetic multiview grasping dataset generated by photorealistic rendering and domain randomization.


In this work, we tackle 6-DoF grasp detection for transparent and specular objects, an important yet challenging problem in vision-based robotic systems because depth cameras fail to sense the geometry of such objects. We propose, for the first time, a multiview RGB-based 6-DoF grasp detection network, GraspNeRF, that leverages the generalizable neural radiance field (NeRF) to achieve material-agnostic object grasping in clutter. Compared to existing NeRF-based 3-DoF grasp detection methods, which rely on densely captured input images and time-consuming per-scene optimization, our system performs zero-shot NeRF construction from sparse RGB inputs and reliably detects 6-DoF grasps, both in real time. The proposed framework jointly learns generalizable NeRF and grasp detection in an end-to-end manner, optimizing the scene representation construction for grasping. For training data, we generate a large-scale, photorealistic, domain-randomized synthetic dataset of grasping in cluttered tabletop scenes that enables direct transfer to the real world. Our extensive experiments in synthetic and real-world environments demonstrate that our method significantly outperforms all the baselines in all the experiments while still running in real time.


Tasks and Results

Simulation Grasping Experiments


Figure 2. Success rates of single object retrieval in simulation. Our method achieves an average success rate of 86.1% over 36 trials, outperforming all the baselines.


Figure 3. Results of sequential clutter removal in simulation. Each cell: transparent and specular (mixed). Our method performs best on sequential clutter removal, evaluated on objects with transparent and specular materials as well as mixed materials. In particular, it significantly improves transparent and specular object grasping compared to the other methods.

Real Robot Experiments


Figure 4. Success rates of real-world single object retrieval. GraspNeRF achieves a success rate of 88.9% over 18 trials, outperforming the competing baselines.


Figure 5. Results of real-world sequential clutter removal. Each cell: transparent and specular (mixed). GraspNeRF achieves the highest grasping success rate and declutter rate across materials in both pile and packed settings. GraspNeRF requires only 6 views, whereas NeRF-VGN uses 49. Moreover, our method runs inference at 11 FPS without any per-scene optimization, which is significantly faster than NeRF-VGN with per-scene optimization. Importantly, our method generalizes to new categories, as evidenced by the results for the two transparent objects (glass gourd and small wine glass).



Figure 6. Overview of GraspNeRF. Our proposed framework is composed of a scene representation construction module and a volumetric grasp detection module. In the scene representation construction module, we first extract and aggregate the multiview image features for two purposes: the extracted geometric features form a feature grid that is passed to our TSDF prediction network to construct the TSDF of the scene, which encodes the scene geometry as well as the density of the underlying NeRF; at the same time, the features are used to predict color, which together with the density outputs enables NeRF rendering. Taking the predicted TSDF as input, our volumetric grasp detection module then learns to predict 6-DoF grasps at each voxel of the TSDF.
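
To illustrate how density and color outputs combine during NeRF rendering, here is a minimal sketch of the standard NeRF volume-rendering quadrature along a single ray. It is not the GraspNeRF implementation: the function name `composite_ray` and its arguments are hypothetical, and the per-sample densities are taken as given, since how they are derived from the predicted TSDF is not described on this page.

```python
import math


def composite_ray(densities, colors, deltas):
    """Composite per-sample densities and RGB colors along one ray.

    Standard NeRF quadrature (illustrative only, names are hypothetical):
      alpha_i  = 1 - exp(-sigma_i * delta_i)
      weight_i = T_i * alpha_i, with T_i the transmittance accumulated
                 before sample i.
    """
    transmittance = 1.0          # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]      # accumulated RGB value
    weights = []                 # per-sample contribution to the pixel
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        weights.append(weight)
        for channel in range(3):
            pixel[channel] += weight * rgb[channel]
        transmittance *= 1.0 - alpha
    return pixel, weights


if __name__ == "__main__":
    # Empty space, then an opaque green surface, then an occluded blue sample.
    pixel, weights = composite_ray(
        densities=[0.0, 1000.0, 5.0],
        colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
        deltas=[1.0, 1.0, 1.0],
    )
    print(pixel, weights)
```

In this toy ray the first sample is empty, the second is effectively opaque, and the third is hidden behind it, so the rendered pixel takes the color of the second sample. Because every step is differentiable, a rendering loss on the pixel can supervise the density field, which is what allows the scene representation to be trained from RGB images.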

Multiview 6-DoF Grasping Dataset


Figure 7. Examples of our dataset. Our data generation pipeline is motivated by DREDS. To construct the scenes, we transform an existing grasping dataset from GIGA, which contains CAD models and annotated object poses, by changing the object materials to diffuse, specular, or transparent. We then render photorealistic multiview RGB images in Blender. To bridge the sim2real gap, we leverage domain randomization, which randomizes object materials and textures, backgrounds, illumination, and camera poses. After training on the synthetic dataset with sufficient variation, the network treats real data as just another variation of the training data at test time, and so generalizes to real scenes.
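
The domain randomization described above can be sketched as sampling one random configuration per scene before rendering. The snippet below is only an illustration under stated assumptions: the function `sample_scene_config`, the field names, and every numeric range are hypothetical and are not taken from the actual data generation pipeline. Only the three material classes and the randomized factors (materials, backgrounds, illumination, camera poses) come from the caption.

```python
import random

# Material classes named in the caption.
MATERIALS = ("diffuse", "specular", "transparent")

# Hypothetical size of a background/texture library.
NUM_BACKGROUNDS = 100


def sample_scene_config(num_objects, num_views, rng):
    """Sample one randomized scene configuration (illustrative ranges only)."""
    return {
        # One material class per object in the scene.
        "materials": [rng.choice(MATERIALS) for _ in range(num_objects)],
        # Index into a hypothetical library of table/background textures.
        "background_id": rng.randrange(NUM_BACKGROUNDS),
        # Hypothetical relative light intensity.
        "light_intensity": rng.uniform(0.5, 2.0),
        # One camera pose per view, on a hypothetical spherical shell.
        "cameras": [
            {
                "azimuth_deg": rng.uniform(0.0, 360.0),
                "elevation_deg": rng.uniform(30.0, 60.0),
                "distance_m": rng.uniform(0.4, 0.6),
            }
            for _ in range(num_views)
        ],
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a scene can be regenerated
    config = sample_scene_config(num_objects=5, num_views=6, rng=rng)
    print(config["materials"], len(config["cameras"]))
```

Passing an explicit `random.Random` instance keeps each scene reproducible from its seed, which is convenient when a rendered scene needs to be regenerated or inspected later.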


  title={GraspNeRF: Multiview-based 6-DoF Grasp Detection for Transparent and Specular Objects Using Generalizable NeRF},
  author={Qiyu Dai and Yan Zhu and Yiran Geng and Ciyu Ruan and Jiazhao Zhang and He Wang},
  booktitle={IEEE International Conference on Robotics and Automation (ICRA)},


If you have any questions, please feel free to contact Qiyu Dai at qiyudai at pku dot edu dot cn, Yan Zhu at zhuyan_ at stu dot pku dot edu dot cn, or He Wang at hewang at pku dot edu dot cn.