Open6DOR: Benchmarking Open-instruction 6-DoF
Object Rearrangement and A VLM-based Approach

Under Review


Yufei Ding1,2*    Haoran Geng1,2,4*    Chaoyi Xu2     Xiaomeng Fang3     Jiazhao Zhang3,4    Songlin Wei2,4     Qiyu Dai4     Zhizheng Zhang2     He Wang2,3,4†   

1School of Electrical Engineering and Computer Science, Peking University    2Galbot    3Beijing Academy of Artificial Intelligence    4CFCS, School of Computer Science, Peking University   

* equal contributions   corresponding author  


input

Open6DOR Benchmark and Real-world Experiments. We introduce a challenging and comprehensive benchmark for open-instruction 6-DoF object rearrangement tasks, termed Open6DOR. Following this, we propose a zero-shot and robust method, Open6DORGPT, which proves effective in demanding simulation environments and real-world scenarios.


Video




Abstract


The integration of large-scale Vision-Language Models (VLMs) with embodied AI can greatly enhance the generalizability and the capacity to follow open instructions for robots. However, existing studies on object rearrangement are not up to full consideration of the 6-DoF requirements, let alone establishing a comprehensive benchmark. In this paper, we propel the pioneer construction of the benchmark and approach for table-top Open-instruction 6-DoF Object Rearrangement (Open6DOR). Specifically, we collect a synthetic dataset of 200+ objects and carefully design 2400+ Open6DOR tasks. These tasks are divided into the Position-track, Rotation-track, and 6-DoF-track for evaluating different embodied agents in predicting the positions and rotations of target objects. Besides, we also propose a VLM-based approach for Open6DOR, named Open6DOR-GPT, which empowers GPT-4V with 3Dawareness and simulation-assistance and exploits its strengths in generalizability and instruction-following for this task. We compare the existing embodied agents with our Open6DORGPT on the proposed Open6DOR benchmark and find that Open6DOR-GPT achieves the state-of-the-art performance. We further show the impressive performance of Open6DORGPT in diverse real-world experiments. Our constructed benchmark and method will be released upon paper acceptance.


Methods


Full pipeline

input

Method Overview. Open6DOR-GPT takes the RGB-D image and instruction as input and outputs the corresponding robot motion trajectory. Firstly, the preprocessing module extracts the object names and masks. Then, two modules simultaneously predict the position and rotation of the target object in a decoupled way. Finally, the planning module generates a trajectory for execution.



Rotation pipeline

input

Simulation-assisted Rotation Module. Firstly, a textured mesh is reconstructed from the single-view image of the target object. Then, we employ large-scale sampling to obtain multiple rotation samples. This sample set is then narrowed down through a simulationassisted filtering process to derive several stable pose categories. Finally, we generate rendered images of the pose candidates, from which GPT-4V selects the optimal goal rotation.




Results


input

input

Real-world Experiments. We ground Open6DOR-GPT in real-world settings and conduct various tasks as well as long-horizon highlighting its exceptional zero-shot generalization capability across challenging tasks.

Contact


If you have any questions, please feel free to contact us:

  • Yufei Ding: selinaPrevent spamming@Prevent spammingstu.pku.edu.cn
  • Haoran Geng: ghrPrevent spamming@Prevent spammingstu.pku.edu.cn
  • He Wang: hewangPrevent spamming@Prevent spammingpku.edu.cn