From a remarkably young age, people are capable of recognizing their favorite objects and picking them up, despite never being explicitly taught how to do so. According to cognitive developmental research, the ability to interact with objects in the world plays a crucial role in the emergence of object perception and manipulation capabilities, such as targeted grasping. By interacting with the world around them, people are able to learn with self-supervision: we know what actions we took, and we learn from the outcome. In robotics, this type of self-supervised learning is actively researched because it enables robotic systems to learn without the need for large amounts of training data or manual supervision.
Inspired by the concept of object permanence, we propose Grasp2Vec, a simple yet highly effective algorithm for acquiring object representations. Grasp2Vec is based on the intuition that an attempt to pick up anything provides several pieces of information: if a robot grasps an object and holds it up, the object had to be in the scene before the grasp. Furthermore, the robot knows that the object it grasped is currently in its gripper, and has therefore been removed from the scene. By using this form of self-supervision, the robot can learn to recognize the object by the visual change in the scene after the grasp.
Building on our prior collaboration with X Robotics, where a series of robots learn in parallel to grasp household objects using only monocular camera inputs, we use a robotic arm to grasp objects “unintentionally”, and that experience enables the learning of a rich representation of objects. These representations can then be used to acquire “intentional grasping” capabilities, where the robot arm picks up user-commanded objects.
Constructing a Perceptual Reward Function
In the framework of reinforcement learning (RL), task success is measured via a “reward function”. By maximizing that reward, robots can teach themselves diverse grasping skills from scratch. Engineering a reward function is straightforward when success can be read directly from simple sensor measurements. A simple example of this is a button that supplies rewards directly to a robot when it is pushed.
However, engineering a reward function is much more difficult when our success criterion depends on a perceptual understanding of the task at hand. Consider the task of instance grasping, where a robot is presented with a picture of a desired object being held in the gripper. After the robot attempts to grasp that object, it inspects the contents of the gripper. The reward function for this task comes down to answering a question of object recognition: do these objects match?
Unsupervised representation learning typically relies on structural assumptions about the data, for example that images can be compressed into a low-dimensional space, and that frames in a video can be predicted from previous frames. However, without further assumptions on the content of the data, these are usually insufficient for learning disentangled object representations.
What if we used a robot to physically disentangle objects from each other during data collection? The field of robotics presents an exciting opportunity for representation learning because robots can manipulate objects, thus providing the factors of variation needed in data. Our method relies on the insight that grasping an object removes it from the scene. This yields 1) an image of the scene before grasping, 2) an image of the scene after grasping and 3) an isolated view of the grasped object itself.
Left: Objects before the grasp. Center: Objects after the grasp. Right: The grasped object.
objects_before_grasp - objects_after_grasp = grasped_object
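A toy sketch can make this subtractive relationship concrete. The code below is not the Grasp2Vec model: it assumes a hypothetical additive embedding in which each object contributes a fixed vector to a spatial feature map, and the scene embedding is obtained by summing over spatial positions. The names `OBJECT_CODES`, `render_feature_map`, and `embed_scene` are invented for illustration.

```python
import numpy as np

# Hypothetical per-object feature vectors (assumption: a trained model would
# produce these; here they are fixed by hand for illustration).
OBJECT_CODES = {
    "cup": np.array([1.0, 0.0, 0.0]),
    "ball": np.array([0.0, 1.0, 0.0]),
    "block": np.array([0.0, 0.0, 1.0]),
}


def render_feature_map(placements, height=4, width=4):
    """Build a toy (height, width, depth) feature map from {name: (row, col)}."""
    depth = len(next(iter(OBJECT_CODES.values())))
    feature_map = np.zeros((height, width, depth))
    for name, (row, col) in placements.items():
        feature_map[row, col] += OBJECT_CODES[name]
    return feature_map


def embed_scene(feature_map):
    """Pool a spatial feature map into one vector (sum pooling is an assumption)."""
    return feature_map.sum(axis=(0, 1))


# The scene before the grasp contains all three objects.
pre_grasp = render_feature_map({"cup": (0, 1), "ball": (2, 2), "block": (3, 0)})
# After the robot lifts the cup, only two objects remain.
post_grasp = render_feature_map({"ball": (2, 2), "block": (3, 0)})

# The difference of the scene embeddings should match the grasped object's embedding.
grasped_estimate = embed_scene(pre_grasp) - embed_scene(post_grasp)
print(grasped_estimate)
```

In this toy setting the equality holds exactly by construction. With learned embeddings it would only be encouraged by the training objective, so the match should be expected to be approximate.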
1. Object Similarity
The first property is that the cosine distance between vector embeddings lets us compare objects and determine whether they are identical. This can be used to implement reward functions for reinforcement learning, allowing robots to learn instance grasping without human-provided labels.
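As a minimal sketch of how such a comparison could serve as a reward, the snippet below scores a grasp by the cosine similarity (one minus the cosine distance) between the embedding of the goal object and the embedding of whatever ended up in the gripper. The function names, the binary reward, and the `threshold` of 0.9 are illustrative assumptions, not values taken from the paper.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Cosine of the angle between two vectors; eps guards against zero norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def instance_grasp_reward(goal_embedding, outcome_embedding, threshold=0.9):
    """Return 1.0 if the held object appears to match the goal object, else 0.0.

    The threshold is a hypothetical hyperparameter chosen for this example.
    """
    similarity = cosine_similarity(goal_embedding, outcome_embedding)
    return 1.0 if similarity >= threshold else 0.0


# Hand-written stand-ins for embeddings that a trained model would produce.
goal = [0.9, 0.1, 0.0]
same_object = [0.8, 0.2, 0.05]
other_object = [0.0, 0.1, 0.95]

reward_match = instance_grasp_reward(goal, same_object)
reward_mismatch = instance_grasp_reward(goal, other_object)
```

A hard threshold keeps the example simple. The raw similarity could also be used as a dense reward, which is a design choice this sketch does not evaluate.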
2. Object Localization
The second property is that we can combine scene spatial maps and object embeddings to localize a “query object” in image space. By taking the element-wise product of spatial feature maps and the vector corresponding to the query object, we can find all the pixels in the spatial map that “match” the query object.
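The element-wise product described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the paper's implementation: the products are summed over the channel axis to get one score per spatial cell, and the function name `localize_query` and the toy two-channel scene are hypothetical.

```python
import numpy as np


def localize_query(spatial_map, query_embedding):
    """Return a (height, width) heatmap and the (row, col) of its peak."""
    spatial_map = np.asarray(spatial_map, dtype=float)          # (H, W, D)
    query_embedding = np.asarray(query_embedding, dtype=float)  # (D,)
    # Element-wise product with the query, broadcast over every spatial cell.
    products = spatial_map * query_embedding
    # Assumption: summing over channels gives a single match score per cell.
    heatmap = products.sum(axis=-1)
    peak = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return heatmap, (int(peak[0]), int(peak[1]))


# Toy 3x3 scene map containing two hypothetical objects.
scene = np.zeros((3, 3, 2))
scene[0, 2] = [1.0, 0.0]  # object A
scene[2, 1] = [0.0, 1.0]  # object B

# Query with object B's embedding; the peak should land on object B's cell.
heatmap, location = localize_query(scene, query_embedding=[0.0, 1.0])
```

Cells whose features point in the same direction as the query receive high scores, so the heatmap peak gives a candidate location for the query object.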
In our paper, we show how robotic grasping skills can generate the data used for learning object-centric representations. We can then use representation learning to “bootstrap” more complex skills like instance grasping, all while retaining the self-supervised learning properties of our autonomous grasping system.
Besides our own work, a number of recent papers have also studied how self-supervised interaction can be used to acquire representations, by grasping, pushing, and otherwise manipulating objects in the environment. Going forward, we are excited not only for what machine learning can bring to robotics by way of better perception and control, but also what robotics can bring to machine learning in new paradigms of self-supervision.
This research was conducted by Eric Jang, Coline Devin, Vincent Vanhoucke, and Sergey Levine. We’d like to thank Adrian Li, Alex Irpan, Anthony Brohan, Chelsea Finn, Christian Howard, Corey Lynch, Dmitry Kalashnikov, Ian Wilkes, Ivonne Fajardo, Julian Ibarz, Ming Zhao, Peter Pastor, Pierre Sermanet, Stephen James, Tsung-Yi Lin, Yunfei Bai, and many others at Google, X, and the broader robotics community who contributed to improving this work.