One of the most discussed features of the new iPhone X is the new unlocking method, the successor of TouchID: FaceID.
Having created a bezel-less phone, Apple had to develop a new method to unlock the phone in an easy and fast way. While some competitors continued using a fingerprint sensor, placed in a different position, Apple decided to innovate and revolutionize the way we unlock a phone: by simply looking at it. Thanks to an advanced (and remarkably small) front-facing depth camera, iPhone X is able to create a 3D map of the user’s face. In addition, a picture of the user’s face is captured using an infrared camera, which is more robust to changes in the light and color of the environment. Using deep learning, the smartphone is able to learn the user’s face in great detail, and thus recognize him or her every time the owner picks up the phone. Surprisingly, Apple has stated that this method is even safer than TouchID, with an outstanding error rate of 1:1,000,000.
I was very intrigued by the techniques used by Apple to realize FaceID, especially by the fact that it all runs on-device, with a little initial training on the user’s face, and then runs smoothly every time the phone is picked up. I focused on how to make this process work using deep learning, and how to optimize each step. In this post, I will show how I implemented a FaceID-like algorithm using Keras. I will explain the various architectural decisions that I took, and show some final experiments done using a Kinect, a very popular RGB and depth camera, which has a very similar output to the iPhone X front-facing cameras (but on a much bigger device). Sit comfortably, take a cup of coffee, and let’s start reverse engineering Apple’s new game-changing feature.
“…the neural networks powering FaceID are not simply performing classification.”
The first step is carefully analyzing how FaceID works on the iPhone X. Apple’s white paper can help us understand the basic mechanisms of FaceID. With TouchID, the user had to initially register his or her fingerprints by pressing the sensor several times. After around 15–20 different touches, the smartphone completed the registration, and TouchID was ready to go. Similarly, with FaceID the user has to register his or her face. The process is very simple: the user just looks at the phone as he or she would normally do, and then slowly rotates the head in a circle, thus registering the face from different poses. And that’s it: the process is complete and the phone is ready to be unlocked. This blazingly fast registration procedure can tell us a lot about the underlying learning algorithms. For instance, the neural networks powering FaceID are not just performing classification, and I’ll explain why.
Performing classification, for a neural network, means learning to predict whether the face it has seen is the user’s or not. So, basically, it should use some training data to predict “true” or “false”. Unlike many other deep learning use cases, though, this approach would not work here. First, the network would have to be re-trained from scratch using the newly obtained data from the user’s face. This would require a lot of time and energy, as well as training data of different faces to serve as negative examples, which is impractical to have available on the device (little would change in the case of transfer learning and fine-tuning of an already trained network). Furthermore, this method would not exploit the possibility, for Apple, to train a much more complex network “offline”, i.e. in their laboratories, and then ship it already trained and ready to use in their phones. Instead, I believe FaceID is powered by a siamese-like convolutional neural network that is trained “offline” by Apple to map faces into a low-dimensional latent space shaped to maximize distances between faces of different people, using a contrastive loss. What you get is an architecture capable of doing one-shot learning, as they very briefly mentioned at their Keynote. I know, there are some names that may not be familiar to many readers: keep reading, and I will explain step by step what I mean.
A siamese neural network is basically composed of two identical neural networks that also share all their weights. This architecture can learn to compute distances between particular kinds of data, such as images. The idea is that you pass pairs of data points through the siamese networks (or simply pass the data in two different steps through the same network), the network maps each one into a low-dimensional feature space, like an n-dimensional array, and then you train the network to make this mapping so that data points from different classes are as far apart as possible, while data points from the same class are as close as possible. In the long run, the network will learn to extract the most meaningful features from the data and compress them into an array, creating a meaningful mapping. To have an intuitive understanding of this, imagine how you would describe dog breeds using a small vector, so that similar dogs have closer vectors. You would probably use a number to encode the fur color of the dog, another one to denote the size of the dog, another one for the length of the fur, and so on. In this way, dogs that are similar to each other will have vectors that are similar to each other. Quite smart, right? Well, a siamese neural network can learn to do this for you, similarly to what an autoencoder does.
With this technique, one can use a great number of faces to train such an architecture to recognize which faces are most similar. Having the right budget and computing power (as Apple does), one can also use harder and harder examples to make the network robust to things such as twins, adversarial attacks (masks) and so on. And what’s the final advantage of using this approach? You finally have a plug-and-play model that can recognize different users without any further training, simply by computing where the user’s face is located in the latent map of faces after taking some pictures during the initial setup. (Imagine, as said before, writing down the vector of dog breeds for a new dog, and then storing it somewhere.) In addition, FaceID is able to adapt to changes in your appearance: both sudden changes (e.g., glasses, hats, makeup) and slow changes (facial hair). This is done by basically adding reference face-vectors to this map, computed based on your new appearance.
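To make the dog-breed intuition concrete, here is a minimal sketch in plain Python. The three features and all the numbers are made up by hand purely for illustration (a siamese network would learn its own features instead), and the breed names and the euclidean_distance helper are just illustrative choices.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical hand-written features, each scaled to 0..1:
# [fur darkness, body size, fur length]
labrador = [0.2, 0.7, 0.3]
golden_retriever = [0.3, 0.7, 0.6]
chihuahua = [0.5, 0.1, 0.2]

# Similar breeds should end up closer together than dissimilar ones.
d_similar = euclidean_distance(labrador, golden_retriever)
d_different = euclidean_distance(labrador, chihuahua)
print(d_similar < d_different)
```

With these made-up values the two retrievers sit closer to each other than the Labrador does to the Chihuahua, which is exactly the property we want the learned face embeddings to have.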
Now, let’s finally see how to implement it in Python using Keras.
As for all machine learning projects, the first thing we need is data. Creating our own dataset would require time and the collaboration of many people, and this can be quite challenging. Thus, I browsed the web for an RGB-D face dataset, and I found one that looked like a perfect fit. It’s composed of a series of RGB-D pictures of people facing different directions and making different facial expressions, as would happen in the iPhone X use case.
I created a convolutional network based on the SqueezeNet architecture. The network takes as input pairs of RGB-D pictures of faces, so each picture has 4 channels, and outputs the distance between the two embeddings. The network is trained with a contrastive loss, which minimizes the distance between pictures of the same person and maximizes the distance between pictures of different people.
After some training, the network is able to map faces into 128-dimensional arrays, such that pictures of the same person are grouped together, while being far from pictures of other people. This means that, to unlock your device, the network just needs to compute the distance between the picture it takes during unlocking and the pictures stored during the registration phase. If the distance is under a certain threshold (the smaller the threshold, the more secure the device), the device unlocks.
I used the t-SNE algorithm to visualize the 128-dimensional embedding space in two dimensions. Every color corresponds to a different person: as you can see, the network has learned to group those pictures quite tightly. (The distances between clusters are meaningless when using the t-SNE algorithm.) An interesting plot also arises when using the PCA dimensionality reduction algorithm.
We can now try to see if this model works by simulating a usual FaceID cycle: first, the registration of the user’s face; then, the unlocking phase, both by the user (which should succeed) and by other people, who shouldn’t be able to unlock the device. As previously mentioned, what matters is the distance that the network computes between the face that is trying to unlock the phone and the registered faces, and whether it is under a certain threshold or not.
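For readers who have never seen a contrastive loss, here is a minimal per-pair sketch in plain Python. It follows the common textbook formulation, which may differ in details from the loss used in my training code; in particular, the margin value of 1.0 is an assumption chosen only for illustration.

```python
def contrastive_loss(distance, same_person, margin=1.0):
    """Contrastive loss for a single pair of embeddings.

    distance:    distance between the two embeddings (>= 0)
    same_person: True if both pictures show the same person
    margin:      assumed hyperparameter; pairs of different people
                 that are already farther apart than this cost nothing
    """
    if same_person:
        # Pull pictures of the same person together.
        return distance ** 2
    # Push pictures of different people apart, but only up to the margin.
    return max(margin - distance, 0.0) ** 2

# Same person, far apart: penalized. Different people, far apart: no penalty.
loss_same_far = contrastive_loss(0.9, same_person=True)
loss_diff_far = contrastive_loss(1.5, same_person=False)
loss_diff_close = contrastive_loss(0.2, same_person=False)
```

During training, this quantity would be averaged over a batch of pairs and minimized, so the network is rewarded for small distances on matching pairs and for distances of at least the margin on non-matching pairs.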
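The unlock decision itself can be sketched in a few lines of plain Python. This is a simplified illustration, not the exact code of the project: the can_unlock helper is hypothetical, comparing against the closest registered embedding is an assumption (an average would also be possible), and the toy 3-dimensional vectors stand in for the real 128-dimensional embeddings. The 0.4 threshold is the value the experiments in this post arrive at.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def can_unlock(probe, enrolled, threshold=0.4):
    """Return True if the probe embedding is within `threshold` of
    at least one embedding stored during registration."""
    if not enrolled:
        return False  # nothing registered, never unlock
    closest = min(euclidean_distance(probe, e) for e in enrolled)
    return closest <= threshold

# Toy embeddings (hypothetical values, 3 dimensions instead of 128).
enrolled = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
owner_probe = [0.2, 0.1, 0.0]
stranger_probe = [1.0, 0.5, 0.0]

owner_ok = can_unlock(owner_probe, enrolled)
stranger_ok = can_unlock(stranger_probe, enrolled)
print(owner_ok, stranger_ok)
```

Lowering the threshold makes it harder for a stranger to get in, at the cost of rejecting the real owner more often when the pose or expression differs from the registered pictures.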
Let’s start with the registration: I took a series of pictures of the same person from the dataset and simulated a registration phase. The device is now computing the embeddings for each of those poses, and storing them locally.
Let’s see now what happens if the same user tries to unlock the device. Different poses and facial expressions of the same user achieve a low distance, of around 0.30 on average.
On the other hand, RGBD pictures from different people get an average distance of 1.1.
So, using a threshold of around 0.4 should be sufficient to prevent strangers from unlocking your device.
In this post I showed how to implement a proof-of-concept of the FaceID unlocking mechanics, based on face embeddings and siamese convolutional networks. I hope you found it helpful. If you have any questions, you can get in touch with me. You can find here all the related Python code.
Follow me on Twitter for updates on my work and more: https://twitter.com/normandipalo