Learning to Predict Depth on the Pixel 3 Phones

By Google



Portrait Mode on the Pixel smartphones lets you take professional-looking images that draw attention to a subject by blurring the background behind it. Last year, we described, among other things, how we compute depth with a single camera using its Phase-Detection Autofocus (PDAF) pixels (also known as dual-pixel autofocus) using a traditional non-learned stereo algorithm. This year, on the Pixel 3, we turn to machine learning to improve depth estimation to produce even better Portrait Mode results.
Left: The original HDR+ image. Right: A comparison of Portrait Mode results using depth from traditional stereo and depth from machine learning. The learned depth result has fewer errors. Notably, in the traditional stereo result, many of the horizontal lines behind the man are incorrectly estimated to be at the same depth as the man and are kept sharp.
(Mike Milne)
A Short Recap
As described in last year’s blog post, Portrait Mode uses a neural network to determine what pixels correspond to people versus the background, and augments this two layer person segmentation mask with depth information derived from the PDAF pixels. This is meant to enable a depth-dependent blur, which is closer to what a professional camera does.

PDAF pixels work by capturing two slightly different views of a scene, shown below. Flipping between the two views, we see that the person is stationary, while the background moves horizontally, an effect referred to as parallax. Because parallax is a function of the point’s distance from the camera and the distance between the two viewpoints, we can estimate depth by matching each point in one view with its corresponding point in the other view.
The two PDAF images on the left and center look very similar, but in the crop on the right you can see the parallax between them. It is most noticeable on the circular structure in the middle of the crop.
However, finding these correspondences in PDAF images (a method called depth from stereo) is extremely challenging because scene points barely move between the views. Furthermore, all stereo techniques suffer from the aperture problem. That is, if you look at the scene through a small aperture, it is impossible to find correspondence for lines parallel to the stereo baseline, i.e., the line connecting the two cameras. In other words, when looking at the horizontal lines in the figure above (or vertical lines in portrait orientation shots), any proposed shift of these lines in one view with respect to the other view looks about the same. In last year’s Portrait Mode, all these factors could result in errors in depth estimation and cause unpleasant artifacts.

Improving Depth Estimation
With Portrait Mode on the Pixel 3, we fix these errors by utilizing the fact that the parallax used by depth from stereo algorithms is only one of many depth cues present in images. For example, points that are far away from the in-focus plane appear less sharp than ones that are closer, giving us a defocus depth cue. In addition, even when viewing an image on a flat screen, we can accurately tell how far things are because we know the rough size of everyday objects (e.g. one can use the number of pixels in a photograph of a person’s face to estimate how far away it is). This is called a semantic cue.

Designing a hand-crafted algorithm to combine these different cues is extremely difficult, but by using machine learning, we can do so while also better exploiting the PDAF parallax cue. Specifically, we train a convolutional neural network, written in TensorFlow, that takes as input the PDAF pixels and learns to predict depth. This new and improved ML-based method of depth estimation is what powers Portrait Mode on the Pixel 3.
Our convolutional neural network takes as input the PDAF images and outputs a depth map. The network uses an encoder-decoder style architecture with skip connections and residual blocks.
Training the Neural Network
In order to train the network, we need lots of PDAF images and corresponding high-quality depth maps. And since we want our predicted depth to be useful for Portrait Mode, we also need the training data to be similar to pictures that users take with their smartphones.

To accomplish this, we built our own custom “Frankenphone” rig that contains five Pixel 3 phones, along with a Wi-Fi-based solution that allowed us to simultaneously capture pictures from all of the phones (within a tolerance of ~2 milliseconds). With this rig, we computed high-quality depth from photos by using structure from motion and multi-view stereo.
Left: Custom rig used to collect training data. Middle: An example capture flipping between the five images. Synchronization between the cameras ensures that we can calculate depth for dynamic scenes, such as this one. Right: Ground truth depth. Low confidence points, i.e., points where stereo matches are not reliable due to weak texture, are colored in black and are not used during training. (Sam Ansari and Mike Milne)
The data captured by this rig is ideal for training a network for the following main reasons:
  • Five viewpoints ensure that there is parallax in multiple directions and hence no aperture problem.
  • The arrangement of the cameras ensures that a point in an image is usually visible in at least one other image resulting in fewer points with no correspondences.
  • The baseline, i.e., the distance between the cameras is much larger than our PDAF baseline resulting in more accurate depth estimation.
  • Synchronization between the cameras ensure that we can calculate depth for dynamic scenes like the one above.
  • Portability of the rig ensures that we can capture photos in the wild simulating the photos users take with their smartphones.
However, even though the data captured from this rig is ideal, it is still extremely challenging to predict the absolute depth of objects in a scene — a given PDAF pair can correspond to a range of different depth maps (depending on lens characteristics, focus distance, etc). To account for this, we instead predict the relative depths of objects in the scene, which is sufficient for producing pleasing Portrait Mode results.

Putting it All Together
This ML-based depth estimation needs to run fast on the Pixel 3, so that users don’t have to wait too long for their Portrait Mode shots. However, to get good depth estimates that makes use of subtle defocus and parallax cues, we have to feed full resolution, multi-megapixel PDAF images into the network. To ensure fast results, we use TensorFlow Lite, a cross-platform solution for running machine learning models on mobile and embedded devices and the Pixel 3’s powerful GPU to compute depth quickly despite our abnormally large inputs. We then combine the resulting depth estimates with masks from our person segmentation neural network to produce beautiful Portrait Mode results.

Try it Yourself
In Google Camera App version 6.1 and later, our depth maps are embedded in Portrait Mode images. This means you can use the Google Photos depth editor to change the amount of blur and the focus point after capture. You can also use third-party depth extractors to extract the depth map from a jpeg and take a look at it yourself. Also, here is an album showing the relative depth maps and the corresponding Portrait Mode images for traditional stereo and the learning-based approaches.

Acknowledgments
This work wouldn’t have been possible without Sam Ansari, Yael Pritch Knaan, David Jacobs, Jiawen Chen, Juhyun Lee and Andrei Kulik. Special thanks to Mike Milne and Andy Radin who captured data with the five-camera rig.