Data scientists, machine learning (ML) engineers, and ML researchers can’t be maximally productive if they have to wait days or weeks for their model training runs to complete.
The MLPerf benchmarking effort measures the time it takes for common ML problems to train to a given accuracy, and we contributed both GPU and TPU submissions to MLPerf 0.5, the initial version of this industry-wide benchmark contest. However, real-world ML challenges are much larger than the current MLPerf training tasks, so MLPerf does not yet reflect the systems' performance on very large problems.
As a case study that we believe is more representative of ML training performance in larger-scale settings, we train ResNet-50 v1.5, one of the MLPerf models, on the well-known ImageNet image classification dataset, but we focus on the steady-state training performance observed when processing a much larger collection of images. We have seen broadly similar scaling performance in other ML application domains, including machine translation, speech recognition, language modeling, GAN training, reinforcement learning, etc.
GCP provides a full spectrum of machine learning accelerators, and here we focus on Cloud TPU Pods and on Google Cloud VMs with NVIDIA Tesla V100 GPUs attached. Cloud TPU Pods, now available in alpha, are tightly-coupled supercomputers built with hundreds of Google’s custom Tensor Processing Unit (TPU) chips and dozens of host machines, all linked via an ultrafast custom interconnect.
To ensure that anyone can fully reproduce the results described below today, we use well-optimized, open-source TensorFlow 1.12 implementations of ResNet-50 v1.5 (GPU version, TPU version). To simulate a larger-scale ML training scenario (imagine eBay training on tens or hundreds of millions of product images), we run a warm-up epoch on both the GPU and TPU systems before starting our measurements; this excludes one-time setup costs and evaluation costs and ensures that caches are fully filled. All of the systems shown train ResNet-50 to the same quality score of 76%
top-1 accuracy. Further details are available on our methodology page.
Our results demonstrate that Cloud TPU Pods deliver near-linear speedups for this large-scale training task; the largest Cloud TPU Pod configuration tested (256 chips) delivers a 200X speedup over an individual V100 GPU (see chart below). So instead of waiting for more than 26 hours if you used a single state-of-the-art GPU, a full Cloud TPU v2 pod delivers the same result in 7.9 minutes of training time.