Emerging Extrinsic Dexterity in Cluttered Scenes via Dynamics-aware Policy Learning

Yixin Zheng^*1,2,3, Jiangran Lyu^*3,4,†, Yifan Zhang³, Jiayi Chen^3,4, Mi Yan^3,4, Yuntian Deng⁵,
Xuesong Shi³, Xiaoguang Zhao¹, Yizhou Wang⁴, Zhizheng Zhang^2,3,✉, He Wang^2,3,4,✉

¹CASIA ²BAAI ³Galbot ⁴Peking University ⁵Shanghai Jiaotong University

^*Equal contribution ^†Project Lead ✉ Corresponding authors

arXiv Code (In preparation) Benchmark (In preparation)

DAPL teaser: **Extrinsic dexterity** leverages environmental contact to overcome the limitations of prehensile manipulation. However, achieving such dexterity in cluttered scenes remains challenging and underexplored, as it requires selectively exploiting contact among multiple interacting objects with coupled dynamics.

Dynamics-aware Policy Learning (DAPL)

DAPL uses a two-stage learning framework. Stage 1 (world model pretraining): the model takes as input a cropped point cloud augmented with physical attributes (e.g., mass and velocity). A Transformer-based encoder maps these inputs into a dynamics representation $f_{dy}$, and a conditional MLP decoder predicts future per-point positions and velocities conditioned on robot actions (e.g., $a_{1:H_a}$). Stage 2 (policy learning): the pretrained dynamics features $f_{dy}$ are fed into an actor-critic policy together with proprioceptive observations and task goals, enabling efficient policy learning in a physics simulator.
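The data flow between the two stages can be sketched in a few lines of plain Python. This is only an interface illustration, not the released implementation: the function names, the 7-D per-point token layout, the 7-D proprioception vector, and the 3-D goal are all assumptions, and mean pooling stands in for the Transformer encoder. The action-conditioned MLP decoder is omitted.

```python
from typing import List, Sequence

Vec = List[float]


def point_tokens(xyz: Sequence[Vec], vel: Sequence[Vec], mass: Sequence[float]) -> List[Vec]:
    """Stage 1 input (assumed layout): one 7-D token per point = position + velocity + mass."""
    if not (len(xyz) == len(vel) == len(mass)):
        raise ValueError("xyz, vel, and mass must describe the same points")
    return [list(p) + list(v) + [m] for p, v, m in zip(xyz, vel, mass)]


def encode_placeholder(tokens: Sequence[Vec]) -> Vec:
    """Placeholder for the Transformer encoder: mean-pool the tokens into f_dy."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def policy_observation(f_dy: Vec, proprio: Vec, goal: Vec) -> Vec:
    """Stage 2 input: dynamics feature, proprioception, and task goal, concatenated."""
    return list(f_dy) + list(proprio) + list(goal)


# Two points on one 0.5 kg object moving along +x (hypothetical values).
tokens = point_tokens(
    xyz=[[0.0, 0.0, 0.0], [0.2, 0.0, 0.0]],
    vel=[[0.1, 0.0, 0.0], [0.1, 0.0, 0.0]],
    mass=[0.5, 0.5],
)
f_dy = encode_placeholder(tokens)
obs = policy_observation(f_dy, proprio=[0.0] * 7, goal=[0.3, 0.0, 0.0])
print(len(tokens[0]), len(f_dy), len(obs))
```

The point of the sketch is the separation of concerns: the physical attributes enter only through the per-point tokens, and the policy sees them only through $f_{dy}$.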

Clutter6D Benchmark

Clutter6D defines three difficulty tracks based on clutter density: Sparse (4 objects), Moderate (8 objects), and Dense (12 objects). The Sparse track contains 1,024 training scenes, and each track includes 128 held-out scenes for evaluation.
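For reference, the track definitions above can be written down as a small configuration. This is a hypothetical sketch rather than the benchmark's actual API (the code is still in preparation): the `Track` class and `TRACKS` dictionary are illustrative names, and training-scene counts that the text does not state are left as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Track:
    name: str
    num_objects: int
    eval_scenes: int = 128              # held-out scenes per track, as stated above
    train_scenes: Optional[int] = None  # None where the text gives no number


TRACKS = {
    "sparse": Track("Sparse", 4, train_scenes=1024),
    "moderate": Track("Moderate", 8),
    "dense": Track("Dense", 12),
}

# Total number of held-out evaluation scenes across the three tracks.
total_eval = sum(track.eval_scenes for track in TRACKS.values())
print(total_eval)  # 384
```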

Sparse 🍃

Moderate 🌿

Dense 🌳

Experiments

Training Efficiency and Convergence

Our method achieves higher sample efficiency than geometric baselines, reaching $\approx 70\%$ success within the first $10^4$ iterations.

Benchmark Results

On Clutter6D, our method consistently outperforms all baselines, achieving 71.88% success in sparse scenes (vs. 46.63% for CORN) and 44.56% in dense scenes, roughly double the success rate of the strongest baseline (22.22%).
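As a quick check of the comparison, the short script below computes the absolute and relative gaps from the success rates quoted above. The variable names are illustrative, and only numbers stated in the text are used.

```python
# Success rates (%) quoted in the text above.
ours = {"sparse": 71.88, "dense": 44.56}
baseline = {"sparse": 46.63, "dense": 22.22}  # CORN (sparse); strongest baseline (dense)

# For each track: (absolute gain in percentage points, ratio to the baseline).
summary = {
    track: (ours[track] - baseline[track], ours[track] / baseline[track])
    for track in ours
}

for track, (gain, ratio) in summary.items():
    print(f"{track}: +{gain:.2f} points, {ratio:.2f}x")
# sparse: +25.25 points, 1.54x
# dense: +22.34 points, 2.01x
```

The absolute gap is similar on both tracks (about 25 and 22 points), while the relative gap grows from about 1.5x to about 2x as clutter density increases.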

Performance Demonstration

We evaluate our approach extensively in simulation on the Clutter6D benchmark and additionally test our model in real-world cluttered scenes.

Simulation

Real World

Visualization of World Model Predictions

We further visualize world model predictions and observe close agreement with ground-truth object motion, which supports the quality of the learned dynamics.

Adaptive Behavior

When only object mass is varied under otherwise identical configurations, our policy adapts its manipulation strategy, achieving successful interaction while minimizing environmental disturbance.

Combining Other Strategies

When integrated with a VLM planner, our policy provides Galbot with critical pre-grasping skills. By leveraging extrinsic dexterity to reorient objects on cluttered shelves, it turns ungraspable states into grasp-friendly configurations and overcomes reachability constraints.