FetchBot

Learning Generalizable Object Fetching
in Cluttered Scenes via Zero-Shot Sim2Real

Conference on Robot Learning 2025 (Oral Presentation)

Paper

arXiv

Video

Code (coming soon)


Weiheng Liu* Yuxuan Wan* Jilong Wang Yuxuan Kuang Wenbo Cui Xuesong Shi
Haoran Li Dongbin Zhao Zhizheng Zhang† He Wang†

OpenReview (coming soon)

Overview

FetchBot is a sim-to-real framework achieving generalizable object fetching in cluttered scenes. Its systematic design enables policy generalization and sim-to-real transferability across diverse objects, varying layouts, and multiple end-effectors.

Summary Video

FetchBot

Method Pipeline

FetchBot Pipeline. (A) Data generation: we use UniVoxGen to generate a diverse set of cluttered scenes and employ an RL-based Oracle policy to collect representative demonstrations. (B) 3D vision encoder pretraining: we first use depth predicted by a foundation model as an intermediate representation to mitigate the sim-to-real gap, then introduce an occupancy prediction task so the encoder learns a complete scene representation that can infer occluded regions. (C) Vision policy training: we distill the expert demonstrations into a vision-based policy through imitation learning. (D) The resulting policy transfers zero-shot from simulation to the real world.

Voxel-based Cluttered Scene Generator

Dynamics-Aware Oracle Policy

Scene Encoder Network.

The oracle policy leverages a hierarchical scene encoder to capture both local object details and global scene context, enabling robust representation of cluttered environments. Combined with a reward design that balances task success and minimal disturbance, it facilitates precise and efficient demonstration generation.
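The reward trade-off described above can be illustrated with a minimal sketch. It assumes a sparse success bonus and a penalty proportional to how far obstacles were moved; the names `RewardWeights` and `fetch_reward`, the weights, and the displacement measure are all hypothetical and are not taken from the FetchBot implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical weights; the actual values used by FetchBot are not given here."""
    success: float = 10.0      # bonus when the target object is fetched
    disturbance: float = 1.0   # penalty per metre of total obstacle displacement


def fetch_reward(target_fetched, obstacle_displacements, weights=RewardWeights()):
    """Reward = success bonus minus a penalty for disturbing surrounding objects.

    obstacle_displacements: distance (in metres) each obstacle moved during the step.
    """
    success_term = weights.success if target_fetched else 0.0
    disturbance_term = weights.disturbance * sum(obstacle_displacements)
    return success_term - disturbance_term


if __name__ == "__main__":
    # A clean fetch scores higher than one that shoves a neighbouring object.
    print(fetch_reward(True, [0.0, 0.0]))
    print(fetch_reward(True, [0.5, 0.0]))
```

Under this assumed form, raising the `disturbance` weight pushes the oracle toward more cautious motions at the cost of slower or less frequent successes.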

Vision-based Imitation Learning

Vision-based imitation learning employs depth-based intermediate representations for robust sim-to-real transfer, ensuring consistent action predictions. To address occlusion, it integrates semantic occupancy prediction with multi-view depth features via deformable cross-attention, enabling efficient and generalizable scene understanding. The resulting 3D scene representation is transformed into executable actions through diffusion.

Simulation Results

Parallel Simulation

Diverse Scenes

Occupancy Reconstruction (Top-Down View)

(A) The RGB-D voxel method misses crucial geometric details due to occlusion, leading to collisions during fetching. (B) Our method infers the occluded region, enabling successful collision avoidance.

Real World Results

Comparison With Baselines

FetchBot (Ours)

Heuristic

DP3

Our approach achieves strong sim-to-real performance, outperforming methods that rely on RGB (DP) or point-cloud (DP3) representations.

Generalize to Diverse Obstacles

Generalize to Diverse Environments

Storage Rack-1

Storage Rack-2

Cabinet-1

Cabinet-2

Dynamic Cases

Dynamic Scenario-1

Dynamic Scenario-2

Broader Application

Tabletop Scenario

Drawer Scenario

Our method can also be extended to other tasks. Real-robot extension experiments show successful fetching from a cluttered tabletop (suction) and from a drawer (parallel gripper).

Occ Prediction Results

Real-world occupancy reconstruction results. The reconstruction handles diverse scenarios: varying object shapes, different materials, and complex layouts.

Occupancy vs. RGB-D

Comparison in the real world. Direct RGB-D voxelization often yields incomplete scenes in real-world settings, while our occupancy method (Occ) generates more complete representations.
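To make the baseline concrete, here is a minimal sketch of direct depth voxelization, assuming a pinhole camera model and an axis-aligned grid in the camera frame. The function name `depth_to_voxels` and all parameters are illustrative assumptions, not part of the FetchBot code. Each valid depth pixel marks exactly one voxel, so anything hidden behind a visible surface stays empty; that gap is what the occupancy prediction is meant to fill.

```python
def depth_to_voxels(depth, fx, fy, cx, cy, voxel_size, grid_min, grid_dims):
    """Back-project a depth image and mark the voxels hit by visible surfaces.

    depth:      list of rows, depth[v][u] in metres (<= 0 means invalid)
    fx, fy:     focal lengths in pixels; cx, cy: principal point in pixels
    grid_min:   (x, y, z) of the grid corner in the camera frame, metres
    grid_dims:  number of voxels along (x, y, z)
    Returns the set of occupied voxel indices (i, j, k).
    """
    occupied = set()
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:
                continue  # sensor dropout, common on shiny or transparent objects
            # Pinhole back-projection from pixel (u, v) to a 3D point.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            idx = tuple(
                int((p - m) // voxel_size) for p, m in zip((x, y, z), grid_min)
            )
            if all(0 <= i < d for i, d in zip(idx, grid_dims)):
                occupied.add(idx)
    return occupied


if __name__ == "__main__":
    depth = [[1.0, 0.0],
             [1.0, 1.0]]
    voxels = depth_to_voxels(depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0,
                             voxel_size=0.5, grid_min=(0.0, 0.0, 0.0),
                             grid_dims=(4, 4, 4))
    print(sorted(voxels))
```

In this toy example the pixel with zero depth contributes nothing, and no voxel deeper than the observed surface is ever marked.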

Note: We do not feed the reconstructed voxels (the final occupancy map) to the downstream policy. Instead, the policy receives intermediate latent features from the pre-trained encoder. Although it never observes the final occupancy prediction, these features, enriched by the auxiliary occupancy task, allow it to implicitly capture the scene's complete 3D geometry.
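The data flow in this note can be sketched with a toy model. Everything here is a hypothetical stand-in: random linear maps replace the real networks, and the class and method names (`ToyPipeline`, `encode`, `predict_occupancy`, `act`) are not from the FetchBot code. The only point is structural: the occupancy head and the policy head both read the same latent, and the policy never reads the occupancy output.

```python
import random


def make_linear(n_in, n_out, rng):
    """Random weight matrix (list of rows) standing in for a trained network."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def apply_linear(weights, x):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class ToyPipeline:
    def __init__(self, n_obs=8, n_latent=4, n_voxels=16, n_action=3, seed=0):
        rng = random.Random(seed)
        self.encoder = make_linear(n_obs, n_latent, rng)
        # Auxiliary head: used only to shape the latent during pretraining.
        self.occ_head = make_linear(n_latent, n_voxels, rng)
        self.policy_head = make_linear(n_latent, n_action, rng)

    def encode(self, obs):
        return apply_linear(self.encoder, obs)

    def predict_occupancy(self, obs):
        """Pretraining path: latent -> occupancy logits."""
        return apply_linear(self.occ_head, self.encode(obs))

    def act(self, obs):
        """Deployment path: latent -> action. The occupancy head is not called."""
        return apply_linear(self.policy_head, self.encode(obs))


if __name__ == "__main__":
    pipeline = ToyPipeline()
    obs = [0.1] * 8
    print(len(pipeline.predict_occupancy(obs)), len(pipeline.act(obs)))
```

Because `act` depends only on the encoder and the policy head, replacing or discarding the occupancy head after pretraining leaves the deployed policy's outputs unchanged.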

Additional Demos with Corresponding Observations

Try out FetchBot!

We will open-source FetchBot, which integrates UniVoxGen (an efficient scene generator), the Oracle Policy (for data collection), the Vision-based Policy (deployable on real robots), and a dataset that includes cluttered scenes, expert demonstrations, and voxel data.

Our Team

Weiheng Liu¹,³,⁴,*

Yuxuan Wan²,³,*

Jilong Wang²,³

Yuxuan Kuang²

Wenbo Cui¹,³,⁴

Xuesong Shi⁴

Haoran Li¹

Dongbin Zhao¹

Zhizheng Zhang³,⁴,†

He Wang²,³,⁴,†

¹Institute of Automation, Chinese Academy of Sciences,
²CFCS, School of Computer Science, Peking University,
³Galbot, ⁴Beijing Academy of Artificial Intelligence
* Equal contribution † Corresponding authors

@misc{liu2025fetchbotlearninggeneralizableobject,
      title={FetchBot: Learning Generalizable Object Fetching in Cluttered Scenes via Zero-Shot Sim2Real},
      author={Weiheng Liu and Yuxuan Wan and Jilong Wang and Yuxuan Kuang and Wenbo Cui and Xuesong Shi and Haoran Li and Dongbin Zhao and Zhizheng Zhang and He Wang},
      year={2025},
      eprint={2502.17894},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2502.17894},
}

If you have any questions, please contact Weiheng Liu and Yuxuan Wan.