FetchBot: Object Fetching in Cluttered Shelves via Zero-Shot Sim2Real

*Indicates Equal Contribution   Indicates Corresponding Authors

FetchBot is capable of handling a variety of real-world shelf scenarios, including obstacles on one side, obstacles on both sides, transparent obstacles, and dynamic environments.

Abstract

Object fetching from cluttered shelves is an important capability for robots that assist humans in real-world scenarios. The task demands robotic behaviors that prioritize safety by minimizing disturbances to surrounding objects, an essential but highly challenging requirement given the restricted motion space, limited fields of view, and complex object dynamics. In this paper, we introduce FetchBot, a sim-to-real framework designed to enable zero-shot generalizable and safety-aware object fetching from cluttered shelves in real-world settings. To address data scarcity, we propose an efficient voxel-based method for generating diverse simulated cluttered shelf scenes at scale, and we train a dynamics-aware reinforcement learning (RL) policy to generate object-fetching trajectories within these scenes. This RL policy, which leverages oracle information, is subsequently distilled into a vision-based policy for real-world deployment. Because sim-to-real discrepancies stem mostly from texture variations and only rarely from geometric dimensions, we use depth information estimated by full-fledged depth foundation models as the input to the vision-based policy, which mitigates the sim-to-real gap. To tackle the challenge of limited views, we design a novel architecture for learning multi-view representations, allowing comprehensive encoding of cluttered shelf scenes. This enables FetchBot to minimize collisions while fetching objects from varying positions and depths, ensuring robust and safety-aware operation. Both simulation and real-robot experiments demonstrate FetchBot's superior generalization ability, particularly in handling a broad range of real-world scenarios, including diverse scene layouts and objects with varying geometries and dimensions.

Method Pipeline

Overview of FetchBot. In stage (a), Sim Data Generation, we use UniVoxGen to generate a diverse set of scenes and employ a dynamics-aware RL policy to collect expert trajectories. In stage (b), Pre-training, we first leverage a foundation model to mitigate the sim-to-real gap, then introduce an occupancy prediction task to learn a voxel-based representation. This task encourages the network to preserve essential geometric information and develop a comprehensive understanding of the scene. In stage (c), Policy Training, we distill the expert trajectories into a vision-based policy through imitation learning. After training, the vision-based policy transfers zero-shot to (d) the Real World.
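
The voxel-based scene generation in stage (a) can be illustrated with a small sketch. This is not UniVoxGen itself, whose operations are not described on this page; it only assumes that each object is represented as a set of occupied voxels, so that a candidate placement can be rejected whenever it overlaps voxels that are already taken. The names `box_voxels` and `generate_scene`, the box-shaped objects, and the rejection-sampling loop are all hypothetical.

```python
import random


def box_voxels(origin, size):
    """Return the set of integer voxel coordinates covered by an axis-aligned box."""
    ox, oy, oz = origin
    sx, sy, sz = size
    return {(ox + i, oy + j, oz + k)
            for i in range(sx) for j in range(sy) for k in range(sz)}


def generate_scene(shelf_size, object_sizes, rng, max_tries=100):
    """Place boxes on the shelf floor by rejection sampling (an assumed strategy).

    Returns the accepted placements and the set of occupied voxels.
    """
    width, depth, height = shelf_size
    occupied = set()
    placements = []
    for size in object_sizes:
        sx, sy, sz = size
        if sx > width or sy > depth or sz > height:
            continue  # the object cannot fit on this shelf at all
        for _ in range(max_tries):
            # Sample a position on the shelf floor (z = 0) that stays in bounds.
            origin = (rng.randint(0, width - sx), rng.randint(0, depth - sy), 0)
            voxels = box_voxels(origin, size)
            # Voxel-level collision check: accept only if no voxel is shared.
            if occupied.isdisjoint(voxels):
                occupied |= voxels
                placements.append((origin, size))
                break
    return placements, occupied
```

Because the collision check is a set intersection over voxels, it does not depend on object shape, which is one plausible reason a voxel representation makes large-scale scene generation cheap.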

State-Based Policy

Scene Encoder Network in State-Based Policy. The network follows a hierarchical design. First, a local network is used to extract features specific to the individual objects within the scene. Then, a global network processes these features to capture the overall structure and context of the entire scene.
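
The local-then-global structure described above can be sketched in a few lines. The layer sizes, the random weights, and the use of max-pooling to aggregate per-object features are assumptions made for illustration; the page does not specify how the actual scene encoder combines object features. All function names here are hypothetical.

```python
import random


def linear(x, weights, bias):
    """Apply a dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Create a randomly initialised (weights, bias) pair for the sketch."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def encode_scene(object_states, local_layer, global_layer):
    """Encode a scene from per-object state vectors.

    Local step: the same layer is applied to every object independently.
    Global step: features are max-pooled across objects (an assumption),
    then passed through a second layer that sees the whole scene.
    """
    local_feats = [relu(linear(state, *local_layer)) for state in object_states]
    pooled = [max(column) for column in zip(*local_feats)]
    return relu(linear(pooled, *global_layer))
```

With a symmetric pooling step such as max, the scene embedding does not depend on the order in which objects are listed, which is convenient when the number and arrangement of obstacles vary from scene to scene.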

Vision-Based Policy

3D Vision Policy. The Perception module efficiently integrates features from multiple perspectives into a unified voxel-based representation, focusing solely on the region around the robotic gripper. The Decision module processes the features output by the Perception module using a transformer and employs a high-capacity Diffusion Policy as the core component for action generation.
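
As a geometric analogue of fusing several views into one voxel grid restricted to the region around the gripper, the sketch below back-projects depth images from multiple cameras and marks which voxels in a gripper-centred cube are hit. The actual Perception module integrates learned image features rather than raw depth points, so this is only an illustration of the coordinate bookkeeping; the pinhole camera model, the function names, and all parameters are assumptions.

```python
import math


def backproject(depth, intrinsics, cam_to_world):
    """Lift a depth image (list of rows, in metres) to world-frame 3D points.

    intrinsics = (fx, fy, cx, cy) of an assumed pinhole camera;
    cam_to_world = (R, t) with R a 3x3 rotation and t a translation.
    """
    fx, fy, cx, cy = intrinsics
    rotation, translation = cam_to_world
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d <= 0:
                continue  # skip invalid depth readings
            cam_pt = ((u - cx) * d / fx, (v - cy) * d / fy, d)
            world_pt = tuple(
                sum(rotation[i][j] * cam_pt[j] for j in range(3)) + translation[i]
                for i in range(3)
            )
            points.append(world_pt)
    return points


def fuse_views(views, center, half_extent, resolution):
    """Return indices of voxels, in a cube around `center`, hit by any view."""
    n = int(round(2 * half_extent / resolution))  # voxels per side
    occupied = set()
    for depth, intrinsics, cam_to_world in views:
        for p in backproject(depth, intrinsics, cam_to_world):
            idx = tuple(
                math.floor((p[i] - center[i] + half_extent) / resolution)
                for i in range(3)
            )
            # Points outside the gripper-centred cube are simply dropped.
            if all(0 <= k < n for k in idx):
                occupied.add(idx)
    return occupied
```

Restricting the grid to a cube around the gripper keeps the representation small and fixed in size regardless of how large the shelf is.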

Real-Robot Experiments (All videos are played at 1x speed)

Comparison with the heuristic baseline in the real world

Scenario: 1
Scenario: 2
Scenario: 3

Generalization to obstacle positions

The obstacle is on the left
The obstacle is on the right
The obstacle is on both sides

Generalization to target dimension

The target has a small dimension
The target has a medium dimension
The target has a large dimension

Generalization to the geometry of obstacles

Geometry type: 1
Geometry type: 2
Geometry type: 3

Generalization to obstacles and targets

Varying obstacle and target: 1
Varying obstacle and target: 2
Varying obstacle and target: 3

Dynamic environments

Dynamic environment: 1
Dynamic environment: 2

Additional real-robot experiments with more details

Generalization to obstacle positions

Generalization to target dimension

Generalization to the geometry of obstacles

Generalization to obstacles and targets

Dynamic environments

BibTeX

@misc{liu2025fetchbotobjectfetchingcluttered,
        title={FetchBot: Object Fetching in Cluttered Shelves via Zero-Shot Sim2Real}, 
        author={Weiheng Liu and Yuxuan Wan and Jilong Wang and Yuxuan Kuang and Xuesong Shi and Haoran Li and Dongbin Zhao and Zhizheng Zhang and He Wang},
        year={2025},
        eprint={2502.17894},
        archivePrefix={arXiv},
        primaryClass={cs.RO},
        url={https://arxiv.org/abs/2502.17894}, 
}