XSkill: Cross Embodiment Skill Discovery

7th Conference on Robot Learning (CoRL 2023)
Columbia University · JP Morgan AI Research · Carnegie Mellon University

Abstract

Human demonstration videos are a widely available data source for robot learning and an intuitive user interface for expressing desired behavior. However, directly extracting reusable robot manipulation skills from unstructured human videos is challenging due to the large embodiment difference and unobserved action parameters. To bridge this embodiment gap, this paper introduces XSkill, an imitation learning framework that 1) discovers a cross-embodiment skill representation called skill prototypes purely from unlabeled human and robot manipulation videos, 2) transfers the skill representation to robot actions using a conditional diffusion policy, and 3) composes the learned skills to accomplish unseen tasks specified by a human prompt video. Our experiments in simulation and real-world environments show that the discovered skill prototypes facilitate both skill transfer and composition for unseen tasks, resulting in a more general and scalable imitation learning framework.
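To make the prototype-discovery step concrete, below is a minimal sketch of the projection idea, assuming a SwAV-style setup in PyTorch: frame embeddings from either embodiment are projected onto a shared set of learnable skill prototypes via temperature-scaled cosine similarity. The module name, dimensions, and temperature are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch only: K learnable prototypes shared across human and
# robot embeddings; the softmax projection reads as "probability of skill k".
class SkillPrototypes(nn.Module):
    def __init__(self, embed_dim=128, num_prototypes=32, temperature=0.1):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, embed_dim))
        self.temperature = temperature

    def forward(self, z):
        # L2-normalize so the dot product is cosine similarity.
        z = F.normalize(z, dim=-1)
        protos = F.normalize(self.prototypes, dim=-1)
        return (z @ protos.T / self.temperature).softmax(dim=-1)

# Usage: project a batch of frame embeddings; each row sums to 1 over 32 prototypes.
proj = SkillPrototypes()(torch.randn(4, 128))  # shape (4, 32)

Because the prototypes are shared, a human frame and a robot frame performing the same skill can land on the same prototype even though their pixels differ.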

XSkill Overview

Teaser Image

One-shot learning from a human prompt video

We display the human prompt video (left) and the robot execution results (right) during inference. On the human prompt video, we visualize the projection of the skills encoded by the temporal skill encoder. On the robot execution results, we visualize the projection of the skills predicted by the Skill Alignment Transformer (SAT). SAT makes XSkill robust to the demonstration speed and enables adaptive adjustment of skills based on the robot's current state and the task progress.
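As a rough illustration of the alignment step, here is a hedged sketch of what a Skill Alignment Transformer could look like: attend over the prompt video's skill sequence conditioned on a token for the robot's current state, and read off the skill to execute next. The class name mirrors the paper, but the architecture, dimensions, and token scheme are assumptions made for illustration.

import torch
import torch.nn as nn

class SkillAlignmentTransformer(nn.Module):
    def __init__(self, skill_dim=128, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=skill_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.state_proj = nn.Linear(skill_dim, skill_dim)  # assumes a pre-embedded state

    def forward(self, prompt_skills, state_emb):
        # prompt_skills: (B, T, D) skill sequence from the human prompt video.
        # state_emb:     (B, D)    embedding of the robot's current observation.
        tokens = torch.cat([self.state_proj(state_emb).unsqueeze(1),
                            prompt_skills], dim=1)
        out = self.encoder(tokens)
        # Read the prediction off the state token: which skill should run now?
        return out[:, 0]

# Usage: align a 50-step prompt skill sequence with the current robot state.
pred_skill = SkillAlignmentTransformer()(torch.randn(1, 50, 128), torch.randn(1, 128))

Because the prediction depends on the robot's current state rather than on a fixed timestamp in the prompt, a design of this kind is what makes execution robust to demonstration speed.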

Drawer-Light-Cloth-Oven

Skill Identification: The plot visualizes the projection values of the skill representations extracted from a human prompt video onto the set of skill prototypes. The skill representations are obtained through the temporal skill encoder, and the projection values can be interpreted as probabilities associated with specific skills.
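For readers who want to reproduce plots like this, the traces can be computed as below. This is an illustrative, self-contained sketch with stand-in tensors for the encoder output and the prototypes; only the projection math follows the description above.

import torch
import torch.nn.functional as F

T, D, K = 100, 128, 32                    # timesteps, embedding dim, prototypes
frame_embeddings = torch.randn(T, D)      # stand-in for temporal-encoder output
prototypes = F.normalize(torch.randn(K, D), dim=-1)

z = F.normalize(frame_embeddings, dim=-1)
traces = (z @ prototypes.T / 0.1).softmax(dim=-1)  # (T, K)
# traces[t, k] is the probability mass on prototype k at time t:
# plotting one line per prototype k yields the figure shown here.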



Skill Alignment and Skill Execution: The plot visualizes the projection values of the skill representation predicted by the Skill Alignment Transformer (SAT). The skill-conditioned imitation learning policy conditions on the predicted skill representation and the current state to execute robot actions. In this visualization, the robot reattempts to grasp the cloth several times during execution to complete the sub-task. Notice that SAT keeps predicting the "grasp the cloth" skill (pink/proto 29) throughout the reattempt period.
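The conditioning itself can be pictured with a small sketch. Below is a hedged, minimal noise-prediction network for a skill-conditioned diffusion policy: it takes the noisy action, the diffusion timestep, and a conditioning vector built from the predicted skill and the state embedding. All names and dimensions are illustrative; the paper's actual policy architecture and noise schedule are not reproduced here.

import torch
import torch.nn as nn

class SkillConditionedDenoiser(nn.Module):
    def __init__(self, action_dim=7, skill_dim=128, state_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + skill_dim + state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, noisy_action, t, skill, state):
        # t enters as one extra (normalized) timestep feature.
        return self.net(torch.cat([noisy_action, skill, state, t], dim=-1))

# One epsilon prediction; a real policy would run this inside a DDPM/DDIM
# sampling loop over many timesteps to draw an action.
denoiser = SkillConditionedDenoiser()
eps = denoiser(torch.randn(1, 7),        # noisy action
               torch.full((1, 1), 0.5),  # timestep feature
               torch.randn(1, 128),      # predicted skill from SAT
               torch.randn(1, 64))       # current state embedding

Conditioning the denoiser on the SAT output is what lets one policy network execute whichever skill the prompt video calls for next.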


Drawer-Cloth-Light-Oven

Skill Identification.

Skill Alignment and Skill Execution.


Oven-Light-Cloth-Drawer

Skill Identification.

Skill Alignment and Skill Execution.


Drawer-Cloth-Light

Skill Identification.

Skill Alignment and Skill Execution.


Light-Cloth-Oven

Skill Identification.

Skill Alignment and Skill Execution.


Drawer-Cloth-Oven

Skill Identification.

Skill Alignment and Skill Execution.


Drawer-Light-Cloth

Skill Identification.

Skill Alignment and Skill Execution.

BibTeX


@inproceedings{xu2023xskill,
  title={{XS}kill: Cross Embodiment Skill Discovery},
  author={Mengda Xu and Zhenjia Xu and Cheng Chi and Manuela Veloso and Shuran Song},
  booktitle={7th Annual Conference on Robot Learning},
  year={2023},
  url={https://openreview.net/forum?id=8L6pHd9aS6w}
}