Object fetching from cluttered shelves is an important capability for robots to assist
humans in
real-world scenarios.
Achieving this task demands robotic behaviors that prioritize safety by minimizing disturbances to surrounding objects, an essential but highly challenging requirement due to restricted motion space, limited fields of view, and complex object dynamics.
In this paper, we introduce FetchBot, a sim-to-real framework designed to enable zero-shot
generalizable
and safety-aware
object fetching from cluttered shelves in real-world settings. To address data scarcity, we propose an efficient voxel-based method for generating diverse simulated cluttered shelf scenes at scale and train a dynamics-aware reinforcement learning (RL) policy to generate object fetching trajectories within these scenes.
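To make the voxel-based generation step concrete, the sketch below rasterizes a shelf interior into an occupancy grid and rejection-samples object footprints into free cells. It is a minimal illustration under stated assumptions; the function name, dimensions, and the box-footprint abstraction are hypothetical, not FetchBot's actual implementation.

```python
import numpy as np

# Illustrative sketch of voxel-based shelf-scene generation (not FetchBot's actual API).
# The shelf interior is discretized into a coarse occupancy grid; candidate objects are
# placed one by one wherever their voxel footprint fits, yielding cluttered layouts at scale.
def generate_shelf_scene(shelf_dims=(0.8, 0.4, 0.3), voxel=0.02,
                         num_objects=10, rng=None):
    rng = rng or np.random.default_rng()
    grid = np.zeros(tuple(int(d / voxel) for d in shelf_dims), dtype=bool)
    placements = []
    for _ in range(num_objects):
        # Sample a random axis-aligned footprint (a stand-in for a real object mesh).
        size = rng.integers(2, 6, size=3)  # footprint measured in voxels
        for _ in range(100):  # rejection sampling over free space
            pos = np.array([rng.integers(0, max(1, g - s))
                            for g, s in zip(grid.shape, size)])
            region = tuple(slice(p, p + s) for p, s in zip(pos, size))
            if not grid[region].any():   # footprint is entirely free
                grid[region] = True      # mark these voxels as occupied
                placements.append((pos * voxel, size * voxel))
                break
    return placements  # (position, extent) pairs, in meters

scene = generate_shelf_scene(num_objects=12)
print(f"placed {len(scene)} objects")
```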
This RL policy, which leverages oracle information, is subsequently distilled into a vision-based policy for real-world deployment.
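The distillation step can be pictured as teacher-student behavior cloning: the privileged (oracle-state) policy labels actions that the vision-based student regresses onto. A minimal sketch, assuming simple MLP policies and an L2 action loss; FetchBot's actual networks and training objective may differ.

```python
import torch
import torch.nn as nn

# Hedged sketch of privileged-to-vision policy distillation via behavior cloning.
# Network shapes and the MSE action loss are assumptions, not FetchBot's spec.
teacher = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 7))   # oracle state -> action
student = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 7))  # visual features -> action
opt = torch.optim.Adam(student.parameters(), lr=3e-4)

for step in range(1000):
    oracle_state = torch.randn(32, 64)    # privileged sim state (poses, contacts, ...)
    visual_feats = torch.randn(32, 256)   # depth-based observation encoding
    with torch.no_grad():
        target = teacher(oracle_state)    # teacher labels the same scenes
    loss = nn.functional.mse_loss(student(visual_feats), target)
    opt.zero_grad(); loss.backward(); opt.step()
```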
Since sim-to-real discrepancies stem mostly from texture variations and rarely from geometric dimensions, we propose to adopt depth information estimated by off-the-shelf depth foundation models as the input to the vision-based policy, mitigating the sim-to-real gap.
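In practice this means the policy consumes depth predicted from RGB rather than raw sensor depth. A hedged sketch using the Hugging Face depth-estimation pipeline; the checkpoint and image path shown are examples of a depth foundation model setup, not necessarily what FetchBot uses.

```python
from PIL import Image
from transformers import pipeline

# Sketch: feed monocular depth from a depth foundation model to the policy instead
# of raw RGB. The specific checkpoint is one example; FetchBot's choice may differ.
depth_estimator = pipeline("depth-estimation",
                           model="LiheYoung/depth-anything-small-hf")

rgb = Image.open("shelf_view.png")               # example camera frame (hypothetical path)
pred = depth_estimator(rgb)["predicted_depth"]   # torch tensor of per-pixel relative depth

# Normalize to [0, 1] so simulated and real estimated depth share one input range.
depth = (pred - pred.min()) / (pred.max() - pred.min() + 1e-8)
obs = depth.squeeze()[None, None]                # (1, 1, H, W) observation for the policy
```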
To tackle the challenge of limited views, we design a novel architecture for learning multi-view representations, allowing for comprehensive encoding of cluttered shelf scenes.
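One plausible shape for such a multi-view architecture: encode each view with a shared backbone, then fuse the per-view tokens with attention so regions occluded in one view can be recovered from another. The module below is purely illustrative, assuming depth-image inputs and attention-based fusion; it is not FetchBot's published design.

```python
import torch
import torch.nn as nn

# Illustrative multi-view fusion module (an assumption about the architecture):
# per-view CNN features are fused by self-attention into one scene embedding.
class MultiViewEncoder(nn.Module):
    def __init__(self, feat_dim=256, num_heads=4):
        super().__init__()
        self.backbone = nn.Sequential(           # shared per-view depth encoder
            nn.Conv2d(1, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.fuse = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, views):                    # views: (B, V, 1, H, W) depth images
        b, v = views.shape[:2]
        tokens = self.backbone(views.flatten(0, 1)).view(b, v, -1)
        fused, _ = self.fuse(tokens, tokens, tokens)   # cross-view attention
        return fused.mean(dim=1)                 # one embedding per scene

enc = MultiViewEncoder()
scene_emb = enc(torch.randn(2, 3, 1, 96, 96))   # two scenes, three views each
print(scene_emb.shape)                           # torch.Size([2, 256])
```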
This enables FetchBot to effectively minimize collisions while fetching objects from varying positions and depths, ensuring robust and safety-aware operation. Both
simulation and
real-robot experiments demonstrate
FetchBot's superior generalization ability, particularly in handling a broad range of
real-world
scenarios, including diverse scene layouts
and objects with varying geometries and dimensions.