When Conventional Wisdom Fails: Revisiting Data Augmentation for Self-Driving Cars

By Matt Cooper

DeepScale is constantly looking for ways to boost the performance of our object detection models. In this post, I’ll discuss one project we launched towards that end and the unintuitive discovery we made along the way.

At the start of my internship with DeepScale, I was tasked with implementing new data augmentors to improve our object detection models. One that stood out was a simple technique called cutout regularization. In short, cutout blacks out a randomly located square in the input image.

Cutout applied to images from the CIFAR 10 dataset.
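The core of cutout fits in a few lines. Here is a minimal NumPy sketch (the function name and signature are my own, not DeepScale's implementation): pick a random center, then zero out a square clipped to the image bounds.

```python
import numpy as np

def cutout(image: np.ndarray, size: int, rng: np.random.Generator) -> np.ndarray:
    """Black out a roughly size x size square at a random location in an HWC image."""
    h, w = image.shape[:2]
    # Pick the square's center anywhere in the image; clip the box at the edges,
    # so squares near the border are partially cut off (as in the original paper).
    cy = int(rng.integers(0, h))
    cx = int(rng.integers(0, w))
    y0, y1 = max(0, cy - size // 2), min(h, cy + size // 2)
    x0, x1 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = image.copy()
    out[y0:y1, x0:x1] = 0
    return out
```

Because the box is clipped rather than rejected, the augmentor sometimes removes less than a full square near image edges, which matches the behavior described in the cutout paper.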

The original paper showed that cutout can significantly improve accuracy for vision applications. Because of this, I was surprised that when I applied it to our data, our detection mmAP decreased. I searched our data pipeline for the problem and found something even more surprising: all of the augmentors we were already using were hurting performance immensely.

At the beginning of this exploration, we were using flip, crop, and weight decay regularization — a standard scheme for object detection tasks. Through an ablation study, I found that each of these hurt detection performance on our internal dataset. Removing our default augmentors resulted in a 13% mmAP boost relative to the network’s initial performance.

Generally, we would expect adding weight decay, flip and crop to improve performance by a few points each, as shown in the dashed bars. In our case, however, these augmentors hurt mmAP by a relative 8.4%, 0.1% and 4.5%, respectively. Removing all augmentors led to a total performance boost of 13%.
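An ablation study like the one above can be organized as a loop over subsets of augmentors. This is a hypothetical sketch, not DeepScale's pipeline: `train_and_eval` stands in for whatever function trains a model with the chosen augmentors and returns a validation mmAP.

```python
from itertools import combinations

def ablate(augmentors, train_and_eval):
    """Score every subset of augmentors to isolate each one's effect on mmAP."""
    results = {}
    for k in range(len(augmentors) + 1):
        for subset in combinations(augmentors, k):
            # Each subset gets a fresh training run; the empty tuple is the baseline.
            results[subset] = train_and_eval(subset)
    return results
```

For n augmentors this requires 2^n training runs, so in practice you might only compare the full set against leave-one-out subsets.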

So why did these standard augmentors hurt our performance? To explain our unintuitive results, we had to revisit the idea of image augmentation from first principles.

(This section is an introduction to the intuition behind data augmentation. If you are already familiar with augmentation, feel free to skip to “Why self-driving car data is different.”)

Overfitting is a common problem for deep neural networks. Neural networks are extremely flexible; however, they are often overparameterized given the sizes of common datasets. This results in a model that learns the “noise” within the dataset instead of the “signal.” In other words, they can memorize unintended properties of the dataset instead of learning meaningful, general information about the world. As a result, overfit networks fail to yield useful results when given new, real-world data.

In order to address overfitting, we often “augment” our training data. Common methods for augmenting visual data include randomly flipping images horizontally (flip), shifting their hues (hue jitter) or cropping random sections (crop).

A picture of a giraffe (top left) shown with several common image augmentors: flip (top right), hue jitter (bottom left) and crop (bottom right). Despite these transformations, it is clear that each image is of a giraffe.
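Flip and crop are simple enough to sketch directly in NumPy (function names here are my own, for illustration): flip mirrors the image horizontally with probability 0.5, and crop cuts a random fixed-size window out of the image.

```python
import numpy as np

def random_flip(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Mirror an HWC image horizontally with probability 0.5."""
    return image[:, ::-1] if rng.random() < 0.5 else image

def random_crop(image: np.ndarray, crop_h: int, crop_w: int,
                rng: np.random.Generator) -> np.ndarray:
    """Cut a random crop_h x crop_w window out of an HWC image."""
    h, w = image.shape[:2]
    y = int(rng.integers(0, h - crop_h + 1))
    x = int(rng.integers(0, w - crop_w + 1))
    return image[y:y + crop_h, x:x + crop_w]
```

Note that for object detection (as opposed to classification), both augmentors must also transform the bounding-box labels to match; that bookkeeping is omitted here.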

Augmentors like flip, hue jitter and crop help to combat overfitting because they improve a network’s ability to generalize. If you train a network to recognize giraffes facing right and on flipped images of giraffes facing left, the network will learn that a giraffe is a giraffe, regardless of orientation. This also forces the network to learn more meaningful and general information about what makes something a giraffe — for example, the presence of brown spotted fur.

Public datasets like the COCO object detection challenge show the need for generalization. Because these datasets contain images aggregated from many sources, taken from different cameras in various conditions, networks need to generalize over many factors to perform well. Some of the variables that nets need to contend with are: lighting, scale, camera intrinsics (such as focal length, principal point offset and axis skew), and camera extrinsics (such as position, angle and rotation). By using many data augmentors, we can train networks to generalize over all of these variables, much like we were able to generalize over giraffe orientation in the previous example.

These examples from the COCO dataset were taken with different cameras, from different angles, scales and poses. It is necessary to learn invariance to these properties to perform well on COCO object detection.

Unlike data from COCO and other public datasets, the data collected by a self-driving car is incredibly consistent. Cars generally have consistent pose with respect to other vehicles and road objects. Additionally, all images come from the same cameras, mounted at the same positions and angles. That means that all data collected by the same system has consistent camera properties, like the extrinsics and intrinsics mentioned above. We can collect training data with the same sensor system as will be used in production, so a neural net in a self-driving car doesn’t have to worry about generalizing over these properties. Because of this, it can actually be beneficial to overfit to the specific camera properties of a system.

These examples from a single car in the Berkeley Deep Drive dataset were all taken from the same camera, at the same angle and pose. They also have consistent artifacts, such as the windshield reflection and the object in the bottom right of each frame.

Self-driving car data can be so consistent that standard data augmentors, such as flip and crop, hurt performance more than they help. The intuition is simple: flipping training images doesn’t make sense because the cameras will always be at the same angle, and the car will always be on the right side of the road (assuming US driving laws). The car will almost never be on the left side of the road, and the cameras will never flip angles, so training on flipped data forces the network to overgeneralize to situations it will never see. Similarly, cropping has the effect of shifting and scaling the original image. Since the car’s cameras will always be in the same location with the same field of view, this shifting and scaling forces overgeneralization. Overgeneralization hurts performance because the network wastes its predictive capacity learning about irrelevant scenarios.

A front-view of the sensor array on DeepScale’s data collection car. All sensors are permanently mounted, so all data will have consistent extrinsics — position, angle and rotation. Because we use the same sensors at test-time, all data also has consistent intrinsics — focal length, principal point offset and axis skew. By harnessing the properties of a specific car’s sensors, we can boost vision performance when deploying the same sensor system.

The realization that self-driving car data is uniquely consistent explained our surprising augmentation results. Next, I wanted to see if we could leverage this consistency to further boost performance. Before introducing any new augmentors, I inspected our dataset to see if we could make any improvements at the data level. Our training set originally included images from two wide-angle cameras and a camera with a zoom lens. The zoom lens produces a scaling and shifting effect similar to crop augmentation. At test time, we only use the wide-angle cameras, so training on the zoom images forces the network to overgeneralize. I found that removing the zoom images from our training set gave us another large boost in mmAP. This confirmed our hypothesis that consistency between the train and test sets is important for performance.

After removing the original image augmentors, I trained and tested on a new, more consistent dataset. This improved mmAP by an additional 10.5% relative to our original scheme.

Following this, I considered augmentors that could vary our training data without changing the camera properties. Cutout, the augmentor I implemented at the start of this project, seemed like a good option. Unlike flip and crop, cutout doesn’t change the input in a way that drastically impacts camera properties (i.e., by flipping, shifting or scaling). Instead, cutout simulates obstructions. Obstructions are common in real-world driving data, and invariance to obstructions can help a network detect partially occluded objects.

Obstructions are common in real-world driving data. In this image, two pedestrians block our view of the police car, while large bags block our view of the pedestrians.

Hue jitter augmentation can also help generalization without affecting camera properties. Hue jitter simply shifts the hue of the input by a random amount. This helps the network generalize over colors (i.e., a red car and a blue car should both be detected equally well). As expected, cutout and hue jitter both improved performance on our new test set.
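A pure hue shift leaves saturation and brightness untouched, which is why it doesn’t disturb camera properties. A minimal sketch using Python’s standard-library `colorsys` (operating on a list of float RGB pixels for clarity; a real pipeline would vectorize this over image arrays):

```python
import colorsys
import random

def hue_jitter(pixels, max_shift: float, rng: random.Random):
    """Shift the hue of every (r, g, b) pixel (floats in [0, 1]) by one random amount."""
    shift = rng.uniform(-max_shift, max_shift)  # same shift for the whole image
    out = []
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        # Hue is circular, so wrap around with modulo 1.0.
        out.append(colorsys.hsv_to_rgb((h + shift) % 1.0, s, v))
    return out
```

Drawing a single shift per image (rather than per pixel) keeps the image internally consistent while still varying colors across the dataset.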

Adding cutout and hue jitter augmentation to the new dataset increased relative mmAP by 1% and 0.2%, respectively. This gives us a total 24.7% boost over our original data scheme (flip, crop and weight decay on the old dataset). Note that the y axis is scaled to better show the difference of small improvements.

It’s worth noting that these augmentation tricks won’t work on datasets that include images from different camera types, at different angles and scales. To demonstrate this, I created a test set with varied camera properties by introducing random flips and crops to our original test set. As expected, our new, specialized augmentation scheme performs worse than our original, standard augmentors on the more general dataset.

When applied to consistent self-driving car data, our specialized augmentation scheme (cutout and hue jitter) provides an 11.7% boost in mmAP over the standard augmentation scheme (flip, crop and weight decay); however, when applied to more varied data, our specialized scheme results in a drop of 24.3% vs the standard scheme.

It’s always important to make sure that your test data covers the range of examples your model will see in the real world. Using specialized data augmentation makes this sanity-check even more essential. It’s easy to fool yourself into thinking that you’ve boosted your model’s performance, when you’ve really just overfit to a dataset that’s too easy (e.g. driving data with only clear, daytime images).

If your dataset really is robust and consistent, these tricks can be a powerful toolkit to improve performance. As shown, we were able to dramatically improve our object detection performance by enabling our network to learn the camera properties of our vehicle. This can be applied to any domain where training data is collected on the same sensor system as will be used in deployment.

Networks that perform well on satellite images (left) or cellular data (center) might require fundamentally different approaches than those built for common research datasets like ImageNet (right).

In hindsight, these augmentation changes might seem obvious. The reality is that we were blinded by conventional wisdom. Augmentors like flip and crop have been so broadly successful on research problems that we never thought to question their applicability to our specific problem. When we revisited the concept of augmentation from first principles, it became clear that we could do better. The field of machine learning has many similar “generic best practices,” such as how to set the learning rate, which optimizer to use, and how to initialize models. It’s important for ML practitioners to continually revisit our assumptions about how to train models, especially when building for specific applications. How does the vision problem change when working with satellite mapping data, or cellular imaging, as opposed to ImageNet? We believe that questions like these are underexplored in academia. By looking at them with fresh eyes, we have the potential to dramatically improve industrial applications of machine learning.

Matt Cooper is a deep learning software engineer (and former intern) at DeepScale. For more from DeepScale, check out our Medium page.