Real-Time AR Self-Expression with Machine Learning




Augmented reality (AR) helps you do more with what you see by overlaying digital content and information on top of the physical world. For example, AR features coming to Google Maps will let you find your way with directions overlaid on top of the real world. With Playground, a creative mode in the Pixel camera, you can use AR to see the world differently. And with the latest release of YouTube Stories and ARCore's new Augmented Faces API, you can add objects like animated masks, glasses, 3D hats and more to your own selfies!

One of the key challenges in making these AR features possible is properly anchoring the virtual content to the real world, a process that requires a unique set of perceptive technologies able to track the highly dynamic surface geometry across every smile, frown or smirk.
Our 3D mesh and some of the effects it enables
To make all this possible, we employ machine learning (ML) to infer approximate 3D surface geometry for visual effects, requiring only a single camera input without the need for a dedicated depth sensor. This approach enables AR effects at real-time speeds, using TensorFlow Lite for mobile CPU inference or its new mobile GPU functionality where available. The same technology powers YouTube Stories' new creator effects, and is also available to the broader developer community via the latest ARCore SDK release and the ML Kit Face Contour Detection API.

An ML Pipeline for Selfie AR
Our ML pipeline consists of two real-time deep neural network models that work together: a detector that operates on the full image and computes face locations, and a generic 3D mesh model that operates on those locations and predicts the approximate surface geometry via regression. Having the face accurately cropped drastically reduces the need for common data augmentations like affine transformations consisting of rotations, translations and scale changes. Instead, it allows the network to dedicate most of its capacity to coordinate prediction accuracy, which is critical for proper anchoring of the virtual content.
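
The two-stage structure described above can be sketched as follows. This is a minimal illustration and not the production pipeline: the `detect_faces` and `predict_mesh` callables stand in for the two networks, and the box layout, the crop-normalized output coordinates and the depth scaling are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]            # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]      # (top, left, height, width), assumed layout
Point3D = Tuple[float, float, float]


def crop(image: Image, box: Box) -> Image:
    """Cut the detected face rectangle out of the full frame."""
    top, left, height, width = box
    return [row[left:left + width] for row in image[top:top + height]]


def run_pipeline(
    image: Image,
    detect_faces: Callable[[Image], Sequence[Box]],
    predict_mesh: Callable[[Image], Sequence[Point3D]],
) -> List[List[Point3D]]:
    """Stage 1 finds faces in the full frame; stage 2 regresses a mesh per crop."""
    meshes = []
    for box in detect_faces(image):
        top, left, height, width = box
        local = predict_mesh(crop(image, box))
        # Map crop-normalized (x, y) back into full-frame pixel coordinates.
        # Scaling z by the crop width is a convention chosen for this sketch.
        meshes.append(
            [(left + x * width, top + y * height, z * width) for x, y, z in local]
        )
    return meshes


# Usage with stand-in models (hypothetical outputs, for illustration only).
frame = [[0.0] * 8 for _ in range(6)]
fake_detector = lambda img: [(1, 2, 4, 4)]
fake_mesh_model = lambda face: [(0.5, 0.5, 0.0), (0.0, 1.0, 0.25)]
meshes = run_pipeline(frame, fake_detector, fake_mesh_model)
```

Because the mesh model only ever sees a tight crop, it does not have to learn where the face is in the frame, which is the capacity argument made above.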

Once the location of interest is cropped, the mesh network is applied to only a single frame at a time. A windowed smoothing is then used to reduce noise when the face is static while avoiding lag during significant movement.
Our 3D mesh in action
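
One way to get the smoothing behavior described above is to average over a short window of recent frames and discard that window whenever the face moves quickly. The sketch below assumes this particular scheme; the class name, window length and motion threshold are hypothetical and are not taken from the actual implementation.

```python
from collections import deque
from typing import Deque, List, Tuple

Point3D = Tuple[float, float, float]


class WindowedSmoother:
    """Averages recent per-frame meshes, dropping history when the face moves fast."""

    def __init__(self, window: int = 5, motion_threshold: float = 2.0) -> None:
        self.motion_threshold = motion_threshold  # hypothetical value, in pixels
        self.history: Deque[List[Point3D]] = deque(maxlen=window)

    def update(self, mesh: List[Point3D]) -> List[Point3D]:
        if self.history:
            previous = self.history[-1]
            # Mean per-vertex displacement since the previous frame.
            motion = sum(
                sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
                for p, q in zip(mesh, previous)
            ) / max(len(mesh), 1)
            if motion > self.motion_threshold:
                # Large movement: forget old frames so the output does not lag.
                self.history.clear()
        self.history.append(mesh)
        n = len(self.history)
        # Per-vertex, per-coordinate mean over the frames still in the window.
        return [
            tuple(sum(frame[i][k] for frame in self.history) / n for k in range(3))
            for i in range(len(mesh))
        ]
```

With a static face the window fills up and jitter is averaged out; after a fast movement the window restarts from the current frame, so the output follows the face immediately.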
For our 3D mesh we employed transfer learning and trained a network with several objectives: the network simultaneously predicts 3D mesh coordinates on synthetic, rendered data and 2D semantic contours on annotated, real-world data similar to those ML Kit provides. The resulting network provided us with reasonable 3D mesh predictions not just on synthetic but also on real-world data. All models are trained on data sourced from a geographically diverse dataset and subsequently tested on a balanced, diverse test set for qualitative and quantitative performance.
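
The multi-objective idea can be illustrated with a loss that supervises whichever labels a sample happens to have. This is only a sketch under assumptions: the dictionary keys, the use of mean squared error and the `contour_weight` parameter are hypothetical and are not described in the text above.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def mse(pred: Sequence[Vector], target: Sequence[Vector]) -> float:
    """Mean squared error over all coordinates of all points."""
    total = sum((p - t) ** 2 for pv, tv in zip(pred, target) for p, t in zip(pv, tv))
    count = sum(len(tv) for tv in target)
    return total / count


def batch_loss(batch: List[Dict], contour_weight: float = 1.0) -> float:
    """Synthetic samples carry 3D mesh labels; real samples carry 2D contour labels."""
    loss = 0.0
    for sample in batch:
        if sample.get("mesh_3d") is not None:
            loss += mse(sample["pred_mesh_3d"], sample["mesh_3d"])
        if sample.get("contours_2d") is not None:
            loss += contour_weight * mse(sample["pred_contours_2d"], sample["contours_2d"])
    return loss / len(batch)


# One synthetic and one real sample (made-up numbers).
batch = [
    {"mesh_3d": [(0.0, 0.0, 0.0)], "pred_mesh_3d": [(1.0, 0.0, 0.0)]},
    {"contours_2d": [(0.0, 0.0)], "pred_contours_2d": [(2.0, 0.0)]},
]
loss = batch_loss(batch, contour_weight=0.5)
```

Since both terms update the same shared network, the 2D contour supervision on real images can pull the 3D predictions toward real-world appearance even though no real 3D labels are used in this sketch.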

The 3D mesh network receives as input a cropped video frame. It doesn't rely on additional depth input, so it can also be applied to pre-recorded videos. The model outputs the positions of the 3D points, as well as the probability of a face being present and reasonably aligned in the input. A common alternative approach is to predict a 2D heatmap for each landmark, but it is not amenable to depth prediction and has high computational costs for so many points.
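
To make the output format and the cost argument concrete, the sketch below decodes a flat output vector and compares output sizes for direct regression and per-landmark heatmaps. The flat layout (coordinates followed by one probability), the 0.5 threshold, the point count of 400 and the 64x64 heatmap size are all assumptions chosen for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]


def decode(flat: Sequence[float], min_face_prob: float = 0.5) -> Optional[List[Point3D]]:
    """Split an assumed [x0, y0, z0, x1, ..., face_prob] vector into 3D points."""
    *coords, face_prob = flat
    if face_prob < min_face_prob:
        return None  # no face, or face not reasonably aligned
    return [tuple(coords[i:i + 3]) for i in range(0, len(coords), 3)]


def regression_outputs(num_points: int) -> int:
    """Three coordinates per point plus one face-presence probability."""
    return num_points * 3 + 1


def heatmap_outputs(num_points: int, heatmap_size: int) -> int:
    """One 2D heatmap per point, with no depth channel."""
    return num_points * heatmap_size * heatmap_size


num_points = 400  # hypothetical; the text above does not state the mesh size
print(regression_outputs(num_points))   # 1201
print(heatmap_outputs(num_points, 64))  # 1638400
```

Under these assumed numbers the heatmap head would produce over a thousand times more output values than direct regression, and it would still lack a depth coordinate.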

We further improve the accuracy and robustness of our model by iteratively bootstrapping and refining predictions. That way we can grow our dataset to cover increasingly challenging cases, such as grimaces, oblique angles and occlusions. Dataset augmentation techniques also expanded the available ground truth data, making the model more resilient to artifacts like camera imperfections or extreme lighting conditions.
Dataset expansion and improvement pipeline
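
As an illustration of the kind of augmentation mentioned above, the sketch below applies a random brightness gain and additive sensor-style noise to an image. The function name, the parameter ranges and the choice of these two perturbations are assumptions; the actual augmentations used for training are not specified here.

```python
import random
from typing import List

Image = List[List[float]]  # grayscale image with pixel values in [0, 1]


def augment_photometric(
    image: Image,
    rng: random.Random,
    max_gain_change: float = 0.5,   # hypothetical range for lighting changes
    noise_std: float = 0.02,        # hypothetical strength of camera noise
) -> Image:
    """Return a copy with a random global brightness gain and per-pixel noise."""
    gain = 1.0 + rng.uniform(-max_gain_change, max_gain_change)
    return [
        [min(1.0, max(0.0, px * gain + rng.gauss(0.0, noise_std))) for px in row]
        for row in image
    ]


image = [[0.2, 0.8], [0.5, 1.0]]
augmented = augment_photometric(image, random.Random(0))
```

Because these perturbations only change pixel values and not geometry, the ground truth landmark coordinates of the original image can be reused unchanged for the augmented copy.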

Hardware-tailored Inference
We use TensorFlow Lite for on-device neural network inference. The newly introduced GPU back-end acceleration boosts performance where available, and significantly lowers power consumption. Furthermore, to cover a wide range of consumer hardware, we designed a variety of model architectures with different performance and efficiency characteristics. The most important differences between the lighter and heavier networks are the residual block layout and the accepted input resolution (128x128 pixels in the lightest model vs. 256x256 in the most complex). We also vary the number of layers and the subsampling rate (how fast the input resolution decreases with network depth).
Inference time per frame: CPU vs. GPU
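
The effect of input resolution and subsampling rate on the amount of work can be seen with some simple arithmetic. In the sketch below only the 128 and 256 input sizes come from the text above; the number of stages and the stride of 2 are hypothetical.

```python
from typing import List


def feature_map_sizes(input_size: int, num_stages: int, stride: int = 2) -> List[int]:
    """Spatial side length after each subsampling stage, starting from the input."""
    sizes = [input_size]
    for _ in range(num_stages):
        sizes.append(max(1, sizes[-1] // stride))
    return sizes


def relative_pixel_count(light_input: int = 128, full_input: int = 256) -> float:
    """Fraction of input pixels the lighter model processes compared to the full one."""
    return (light_input ** 2) / (full_input ** 2)


print(feature_map_sizes(256, 4))  # [256, 128, 64, 32, 16]
print(feature_map_sizes(128, 4))  # [128, 64, 32, 16, 8]
print(relative_pixel_count())     # 0.25
```

Halving the input side length quarters the number of pixels at every stage, and a faster subsampling rate shrinks the feature maps earlier, so both choices reduce the work done in the layers that follow.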
The result of these optimizations is a substantial speedup from using lighter models, with minimal degradation in AR effect quality.
Comparison of the most complex (left) and the lightest models (right). Temporal consistency as well as lip and eye tracking is slightly degraded on light models.
The end result of these efforts is a user experience with convincing, realistic selfie AR effects in YouTube, ARCore, and other clients, achieved by:
  • Simulating light reflections via environment mapping for realistic rendering of glasses
  • Creating natural lighting by casting virtual object shadows onto the face mesh
  • Modelling face occlusions to hide virtual object parts behind a face, e.g. virtual glasses, as shown below.
YouTube Stories includes Creator Effects like realistic virtual glasses, based on our 3D mesh
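
The occlusion effect in the list above can be approximated with a per-pixel depth test between the face mesh and the virtual object. The sketch below is a standard depth comparison and is not taken from the actual renderer; the function name and the convention that smaller depth means closer to the camera are assumptions.

```python
from typing import List, Optional

Pixel = str  # stand-in for a color value, kept simple for the example


def composite_with_occlusion(
    camera: List[List[Pixel]],
    object_color: List[List[Pixel]],
    object_depth: List[List[Optional[float]]],  # None where the object is absent
    face_depth: List[List[Optional[float]]],    # None where the face mesh is absent
) -> List[List[Pixel]]:
    """Draw the virtual object only where it is closer to the camera than the face."""
    out = []
    for cam_row, col_row, od_row, fd_row in zip(camera, object_color, object_depth, face_depth):
        out_row = []
        for cam_px, col_px, od, fd in zip(cam_row, col_row, od_row, fd_row):
            visible = od is not None and (fd is None or od < fd)
            out_row.append(col_px if visible else cam_px)
        out.append(out_row)
    return out


# Object in front of the face, behind the face, and absent (made-up depths).
result = composite_with_occlusion(
    camera=[["cam", "cam", "cam"]],
    object_color=[["obj", "obj", "obj"]],
    object_depth=[[1.0, 3.0, None]],
    face_depth=[[2.0, 2.0, 2.0]],
)
```

In this example only the first pixel shows the object; the second is hidden because the face surface is closer, which is how the temple of a pair of virtual glasses would disappear behind the head.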
In addition, we achieve highly realistic makeup effects by:
  • Modelling specular reflections on the lips
  • Painting the face with a luminance-aware material
Case study comparing real make-up against our AR make-up on 5 subjects under different lighting conditions.
We are excited to share this new technology with creators, users and developers alike, who can start using it immediately by downloading the latest ARCore SDK. In the future we plan to bring this technology to more Google products.

Acknowledgements
We would like to thank Yury Kartynnik, Valentin Bazarevsky, Andrey Vakunov, Siargey Pisarchyk, Andrei Tkachenka, and Matthias Grundmann for collaboration on developing the current mesh technology; Nick Dufour, Avneesh Sud and Chris Bregler for an earlier version of the technology based on parametric models; Kanstantsin Sokal, Matsvei Zhdanovich, Gregory Karpiak, Alexander Kanaukou, Suril Shah, Buck Bourdon, Camillo Lugaresi, Siarhei Kazakou and Igor Kibalchich for building the ML pipeline to drive impressive effects; Aleksandra Volf and the annotation team for their diligence and dedication to perfection; Andrei Kulik, Juhyun Lee, Raman Sarokin, Ekaterina Ignasheva, Nikolay Chirkov, and Yury Pisarchyk for careful benchmarking and insights on mobile GPU-centric network architecture optimizations.