FetchBot: Learning Generalizable Object Fetching in Cluttered Scenes via Zero-Shot Sim2Real

CoRL 2025 (Oral Presentation)

Overview

FetchBot is a sim-to-real framework achieving generalizable object fetching in cluttered scenes. Its systematic design enables policy generalization and sim-to-real transferability across diverse objects, varying layouts, and multiple end-effectors.

Summary Video

Method Pipeline

FetchBot Pipeline. In the (A) data generation stage, we use UniVoxGen to generate a diverse set of cluttered scenes and employ an RL-based Oracle policy to collect representative demonstrations. In the (B) 3D vision encoder pretraining stage, we first use the foundation model's predicted depth as an intermediate representation to mitigate the sim-to-real gap, then introduce an occupancy prediction task to learn a complete scene representation that can infer occluded regions. In the (C) vision policy training stage, we distill these expert demonstrations into a vision-based policy through imitation learning, which can achieve (D) zero-shot sim-to-real.

Voxel-based Cluttered Scene Generator

UniVoxGen efficiently generates realistic cluttered scenes by voxelizing objects and applying lightweight voxel operations: Union, Intersection, Difference, and Transformation. This enables the creation of a large-scale dataset of one million scenes, supporting oracle policy training and providing dense ground truth for occupancy prediction.
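To make the four voxel operations concrete, here is a minimal sketch of how they could combine into a collision-free placement test. It is not the UniVoxGen implementation: the sparse set-of-coordinates representation, the restriction of Transformation to integer translation, and the names `box`, `translate`, and `try_place` are all assumptions made for illustration.

```python
from typing import FrozenSet, Optional, Tuple

Voxel = Tuple[int, int, int]
VoxelSet = FrozenSet[Voxel]


def box(origin: Voxel, size: Voxel) -> VoxelSet:
    """Occupied voxels of an axis-aligned box (hypothetical stand-in for a voxelized object)."""
    ox, oy, oz = origin
    sx, sy, sz = size
    return frozenset(
        (ox + i, oy + j, oz + k)
        for i in range(sx)
        for j in range(sy)
        for k in range(sz)
    )


def translate(obj: VoxelSet, offset: Voxel) -> VoxelSet:
    """Transformation, limited here to an integer translation (an assumption)."""
    dx, dy, dz = offset
    return frozenset((x + dx, y + dy, z + dz) for x, y, z in obj)


def try_place(scene: VoxelSet, obj: VoxelSet, offset: Voxel,
              bounds: VoxelSet) -> Optional[VoxelSet]:
    """Return the scene with obj added at offset, or None if the placement is invalid."""
    moved = translate(obj, offset)
    if moved & scene:       # Intersection: overlaps an object already in the scene
        return None
    if moved - bounds:      # Difference: some voxels fall outside the container
        return None
    return scene | moved    # Union: commit the object to the scene


# Usage: fill a 10x10x10 container with 2x2x2 cubes.
bounds = box((0, 0, 0), (10, 10, 10))
cube = box((0, 0, 0), (2, 2, 2))
scene = try_place(frozenset(), cube, (0, 0, 0), bounds)
rejected_overlap = try_place(scene, cube, (1, 0, 0), bounds)
rejected_outside = try_place(scene, cube, (9, 0, 0), bounds)
scene_two = try_place(scene, cube, (2, 0, 0), bounds)
```

Because every check is a set operation on integer coordinates, a placement can be accepted or rejected without running a physics engine, which is one plausible reason voxel operations scale to very large scene counts.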

Cluttered scenes generated by UniVoxGen, including shelf, tabletop, drawer, and storage rack environments.

Dynamics-Aware Oracle Policy

Scene Encoder Network.

The oracle policy leverages a hierarchical scene encoder to capture both local object details and global scene context, enabling robust representation of cluttered environments. Combined with a reward design that balances task success and minimal disturbance, it facilitates precise and efficient demonstration generation.
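The trade-off between task success and disturbance can be illustrated with a toy reward. This is a sketch under stated assumptions, not the reward used by the oracle policy: the page does not give the formula, so the displacement-based disturbance measure, the function names, and the weights `success_bonus` and `disturbance_weight` are all hypothetical.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def disturbance(before: Sequence[Vec3], after: Sequence[Vec3]) -> float:
    """Total distance moved by the obstacles (assumed disturbance measure)."""
    return sum(math.dist(b, a) for b, a in zip(before, after))


def fetch_reward(success: bool,
                 before: Sequence[Vec3],
                 after: Sequence[Vec3],
                 success_bonus: float = 1.0,
                 disturbance_weight: float = 0.5) -> float:
    """Reward task success, penalize obstacle motion (weights are illustrative)."""
    bonus = success_bonus if success else 0.0
    return bonus - disturbance_weight * disturbance(before, after)


# Usage: the target is fetched, but one obstacle is pushed 0.5 m.
obstacles_before = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
obstacles_after = [(0.3, 0.4, 0.0), (1.0, 0.0, 0.0)]
reward = fetch_reward(True, obstacles_before, obstacles_after)
```

With these illustrative weights, a successful fetch that leaves the obstacles untouched scores higher than one that succeeds by pushing them aside, which is the behavior the reward design is meant to encourage.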

Vision-based Imitation Learning

Vision-based imitation learning employs depth-based intermediate representations for robust sim-to-real transfer, ensuring consistent action predictions. To address occlusion, it integrates semantic occupancy prediction with multi-view depth features via deformable cross-attention, enabling efficient and generalizable scene understanding. The resulting 3D scene representation is transformed into executable actions through diffusion.

Simulation Results

Occupancy Reconstruction (Top-Down View)

(A) The RGB-D voxel method misses crucial geometric details due to occlusion, leading to a collision during fetching. (B) Our method infers the occluded region, enabling successful collision avoidance.

Real World Results

Comparison With Baselines

Methods compared: FetchBot (Ours), Heuristic, DP, and DP3.

Our approach achieves strong sim-to-real performance, outperforming methods that rely on RGB (DP) or PointCloud (DP3) representations.

Generalize to Diverse Obstacles

Generalize to Diverse Environments

Storage Rack-1

Storage Rack-2

Cabinet-1

Cabinet-2

Dynamic Cases

Dynamic Scenario-1

Dynamic Scenario-2

Broader Application

Tabletop Scenario

Drawer Scenario

Our method can also be extended to other tasks. Real-robot extension experiments show successful fetching from a cluttered tabletop (with a suction cup) and from a drawer (with a parallel gripper).

Occ Prediction Results

Real-world occupancy reconstruction results. The method handles diverse scenarios: varying object shapes, different materials, and complex layouts.

Occupancy vs. RGB-D

Comparison in the real world. Direct RGB-D voxelization often yields incomplete scenes in real-world settings, while our occupancy method (Occ) generates more complete representations.
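A short sketch shows why direct depth voxelization is incomplete by construction. It assumes a standard pinhole camera model; the function name `voxelize_depth`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the voxel size are illustrative and are not taken from the baseline used in the comparison.

```python
import math
from typing import List, Set, Tuple

Voxel = Tuple[int, int, int]


def voxelize_depth(depth: List[List[float]],
                   fx: float, fy: float, cx: float, cy: float,
                   voxel_size: float) -> Set[Voxel]:
    """Back-project each valid depth pixel and mark the voxel that contains it."""
    occupied: Set[Voxel] = set()
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:  # missing depth reading, e.g. on reflective or transparent surfaces
                continue
            x = (u - cx) * z / fx  # pinhole back-projection to camera coordinates
            y = (v - cy) * z / fy
            occupied.add((
                math.floor(x / voxel_size),
                math.floor(y / voxel_size),
                math.floor(z / voxel_size),
            ))
    return occupied


# Usage: a 2x2 depth image with one invalid pixel.
depth_image = [[1.0, 1.0],
               [0.0, 2.0]]
voxels = voxelize_depth(depth_image, fx=1.0, fy=1.0, cx=0.0, cy=0.0, voxel_size=0.5)
```

Each pixel contributes at most one voxel, located on the first surface its ray hits, so anything behind that surface and any pixel with an invalid reading stays empty. A learned occupancy predictor can instead fill in those regions from context.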

Note: we do not feed the reconstructed voxels (the final occupancy map) to the downstream policy. Instead, the policy receives intermediate latent features from the pre-trained encoder. Although it never observes the final occupancy prediction, these features, enriched by the auxiliary occupancy task, allow it to implicitly capture the scene's complete 3D geometry.

Additional Demos with Corresponding Observations

Try out FetchBot!

We will open-source FetchBot, which integrates UniVoxGen (an efficient scene generator), the Oracle Policy (for data collection), the Vision-based Policy (deployable on real robots), and a dataset that includes cluttered scenes, expert demonstrations, and voxel data.

Our Team

1Institute of Automation, Chinese Academy of Sciences,
2CFCS, School of Computer Science, Peking University,
3Galbot, 4Beijing Academy of Artificial Intelligence
* Equal contributions Corresponding authors
@article{liu2025fetchbot,
    title={FetchBot: Object Fetching in Cluttered Shelves via Zero-Shot Sim2Real},
    author={Liu, Weiheng and Wan, Yuxuan and Wang, Jilong and Kuang, Yuxuan and Shi, Xuesong and Li, Haoran and Zhao, Dongbin and Zhang, Zhizheng and Wang, He},
    journal={arXiv preprint arXiv:2502.17894},
    year={2025}
}

If you have any questions, please contact Weiheng Liu and Yuxuan Wan.