FetchBot is a sim-to-real framework achieving generalizable object fetching in cluttered scenes. Its systematic design enables policy generalization and sim-to-real transferability across diverse objects, varying layouts, and multiple end-effectors.
FetchBot Pipeline. In the (A) data generation stage, we use UniVoxGen to generate a diverse set of cluttered scenes and employ an RL-based oracle policy to collect representative demonstrations. In the (B) 3D vision encoder pretraining stage, we first use a foundation model's predicted depth as an intermediate representation to mitigate the sim-to-real gap, then introduce an occupancy prediction task to learn a complete scene representation that can infer occluded regions. In the (C) vision policy training stage, we distill the expert demonstrations into a vision-based policy through imitation learning, enabling (D) zero-shot sim-to-real transfer.
UniVoxGen efficiently generates realistic cluttered scenes by voxelizing objects and applying lightweight voxel operations—Union, Intersection, Difference, and Transformation. This enables the creation of a large-scale dataset of one million scenes, supporting oracle policy training and providing dense ground truth for occupancy prediction.
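As a rough illustration of how such voxel operations stay lightweight, the sketch below implements them on boolean occupancy grids. UniVoxGen's actual implementation is not shown here, so the function names, grid shapes, and the integer-shift transformation are assumptions; the point is that Union, Intersection, and Difference reduce to element-wise logic.

```python
import numpy as np

# Illustrative sketch only: voxel grids are boolean occupancy arrays, and the
# composition operations reduce to element-wise logic plus integer shifts.

def union(a, b):
    """Occupied where either grid is occupied (e.g. placing an object in a scene)."""
    return a | b

def intersection(a, b):
    """Occupied where both grids are occupied (e.g. overlap/collision checks)."""
    return a & b

def difference(a, b):
    """Carve grid b out of grid a (e.g. hollowing a shelf interior)."""
    return a & ~b

def translate(a, shift):
    """Transformation: move a grid by an integer voxel offset along each axis."""
    out = np.zeros_like(a)
    src = tuple(slice(max(0, -s), a.shape[i] - max(0, s)) for i, s in enumerate(shift))
    dst = tuple(slice(max(0, s), a.shape[i] + min(0, s)) for i, s in enumerate(shift))
    out[dst] = a[src]
    return out
```

Because every operation is vectorized over the whole grid, composing a cluttered scene from pre-voxelized object assets costs only a handful of array passes, which is what makes generating a million scenes tractable.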
Generated cluttered scenes by UniVoxGen, including shelf, tabletop, drawer, and storage rack environments.
Scene Encoder Network.
The oracle policy leverages a hierarchical scene encoder to capture both local object details and global scene context, enabling robust representation of cluttered environments. Combined with a reward design that balances task success and minimal disturbance, it facilitates precise and efficient demonstration generation.
Vision-based imitation learning employs depth-based intermediate representations for robust sim-to-real transfer, ensuring consistent action predictions. To address occlusion, it integrates semantic occupancy prediction with multi-view depth features via deformable cross-attention, enabling efficient and generalizable scene understanding. The resulting 3D scene representation is then decoded into executable actions by a diffusion head.
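To make the 2D-to-3D feature aggregation concrete, the sketch below shows the core lifting step: each 3D voxel query is projected into every camera view and a feature is sampled at the projected pixel. This single-point bilinear sampling is a simplified stand-in for full deformable cross-attention (which would also learn per-query sampling offsets and attention weights); tensor shapes and the projection-matrix convention are assumptions.

```python
import torch
import torch.nn.functional as F

# Simplified 2D-to-3D lifting: project voxel centers into each view and
# bilinearly sample per-view features, then average across views.
# Shapes and conventions are illustrative assumptions.

def lift_features(feats, voxels, proj):
    """feats:  (V, C, H, W) per-view depth feature maps
       voxels: (N, 3) voxel centers in world coordinates
       proj:   (V, 3, 4) camera projection matrices
       returns (N, C) view-averaged voxel features."""
    V, C, H, W = feats.shape
    homo = torch.cat([voxels, torch.ones(len(voxels), 1)], dim=1)  # (N, 4)
    out = torch.zeros(len(voxels), C)
    for v in range(V):
        uvw = homo @ proj[v].T                             # project to image plane
        uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)       # pixel coordinates
        grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,    # normalize to [-1, 1]
                            uv[:, 1] / (H - 1) * 2 - 1], dim=-1)
        sampled = F.grid_sample(feats[v:v + 1], grid.view(1, -1, 1, 2),
                                align_corners=True)        # (1, C, N, 1)
        out += sampled[0, :, :, 0].T / V
    return out
```

The per-voxel features produced this way can then supervise an occupancy head during pretraining and feed the diffusion action head during policy training.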
(A) The RGB-D voxel method misses crucial geometric details due to occlusion, leading to collisions during fetching. (B) Our method can infer the occluded region, enabling successful collision avoidance.
Our approach achieves strong sim-to-real performance, outperforming methods that rely on RGB (DP) or point cloud (DP3) representations.
Storage Rack-1
Storage Rack-2
Cabinet-1
Cabinet-2
Dynamic Scenario-1
Dynamic Scenario-2
Tabletop Scenario
Drawer Scenario
Our method also extends to other tasks: real-robot experiments show successful fetching in cluttered tabletop (suction gripper) and drawer (parallel gripper) settings.
Real-world occupancy reconstruction results, capable of handling diverse scenarios: varying object shapes, different materials, and complex layouts.
Comparison in the real world. Direct RGB-D voxelization often yields incomplete scenes in real-world settings, while our occupancy method (Occ) generates more complete representations.
Note: we do not feed the reconstructed voxels (the final occupancy map) to the downstream policy. Instead, the policy receives intermediate latent features from the pretrained encoder; although it never observes the final occupancy prediction, these auxiliary-task-enriched features allow it to implicitly capture the scene's complete 3D geometry.
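This split can be sketched with a minimal module: the occupancy head is attached only during pretraining, and at policy time the encoder returns latent features alone. Module names, layer choices, and dimensions are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch of the note above: the encoder is pretrained with an
# occupancy-prediction head, but the downstream policy consumes only the
# latent features; the occupancy output itself is never passed along.

class SceneEncoder(nn.Module):
    def __init__(self, in_dim=64, feat_dim=128, occ_classes=2):
        super().__init__()
        self.backbone = nn.Linear(in_dim, feat_dim)       # stands in for the 3D encoder
        self.occ_head = nn.Linear(feat_dim, occ_classes)  # auxiliary pretraining head

    def forward(self, x, return_occ=False):
        feat = torch.relu(self.backbone(x))
        if return_occ:               # pretraining: supervise the occupancy task
            return feat, self.occ_head(feat)
        return feat                  # policy training: latent features only
```

Because the auxiliary loss shapes `feat` during pretraining, the policy inherits occlusion-aware geometry without ever paying the cost of decoding a full occupancy map at inference time.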
@article{liu2025fetchbot,
title={FetchBot: Object Fetching in Cluttered Shelves via Zero-Shot Sim2Real},
author={Liu, Weiheng and Wan, Yuxuan and Wang, Jilong and Kuang, Yuxuan and Shi, Xuesong and Li, Haoran and Zhao, Dongbin and Zhang, Zhizheng and Wang, He},
journal={arXiv preprint arXiv:2502.17894},
year={2025}
}
If you have any questions, please contact Weiheng Liu and Yuxuan Wan.