Hardware for Deep Learning. Part 3: GPU

By Grigory Sapunov

GPUs, Graphics Processing Units, are specialized processors originally created for computer graphics tasks. Modern GPUs contain a lot of simple processors (cores) and are highly parallel, which makes them very effective in running some algorithms. Matrix multiplications, the core of DL right now, are among these.

Most modern DL systems are a mix of CPU and GPU, where the GPU does the heavy lifting and the CPU is responsible for moving data to and from the memory of the graphics card and orchestrating the calculations.

Training is a much more computation-intensive process than inference, so GPUs are especially important for training. They are good for inference as well, but there other factors may come into play (size, power consumption, price, etc.) depending on the target system you are developing a neural network (NN) for.

Among GPUs, the NVIDIA ones are beyond comparison, because almost every DL framework supports NVIDIA GPUs while having no support for AMD GPUs. So AMD has almost lost this battle. There is some activity, though, and I’ll return to AMD at the end of the post.

Let’s discuss important aspects of GPUs.

GPU performance still grows. If you draw a chart with peak performance in GFLOPS it will look like this (added some reference non-GPU points):

Important: This is FP32, a single-precision float, performance. This is not the only option to measure, you’ll learn about FP16/FP64/INT8 soon. So, you may see other charts with larger numbers. But anyway, FP32 is a good common ground, because you’ll see that there are many caveats with others.

Important: Peak performance can be very far from the performance on real tasks. More precisely, real performance can fall far behind peak performance (you’ll see this below). To achieve peak performance you have to heavily optimize your calculations, keeping all parts of the processing pipeline optimally loaded, avoiding bottlenecks and so on. Maybe it is achievable, but I have not seen any DL developer wanting to spend time on such hardcore optimizations instead of working with the neural networks themselves. Moreover, it requires a completely different skill set and expertise, with a low-level understanding of GPU architecture (or several architectures). So there is a niche for special-purpose software that optimizes your DL-related calculations. NVIDIA TensorRT is one example of this class of software, dedicated specifically to inference (though I think it generally works at a higher level than I described); others could be built into DL frameworks (like the optimization options we have in compilers) and special libraries. Maybe one day we’ll even have a special AI to solve these optimization problems (as Google did in its papers). But anyway, peak performance is only a proxy for real-world performance, so treat it wisely. You’ll see examples of real performance compared to peak performance soon.

You can find the tables with the data and comparisons in my Google Doc here.

For comparison, the new 18-core Intel Core i9 Extreme Edition (i9–7980XE) with 160W TDP and a $1999 recommended price is called the ‘First teraflop-speed’ consumer PC chip (but I’m not sure exactly which TFLOPS are meant, I suppose FP64). The GTX 1080 Ti, at half the price, delivers 10x more TFLOPS. Here is a more popular version with a bit of history.

Calculating FLOPS for modern processors is complicated due to features such as vectorization, fused multiply-add, hyperthreading, “turbo” mode and so on. We can make rough estimates. Intel Haswell/Broadwell/Skylake performs 32 SP FLOPs/cycle, Skylake-X performs 64 SP FLOPs/cycle (thanks to AVX-512, see the CPU post of the series for more details on AVX-512).

So, for a single 18-core 7980XE (Skylake-X) working at a base frequency of 2.60 GHz (in Turbo mode it can be up to 4.20 GHz) the peak performance in GFLOPS is 18*2.6*64 = 2995, so nearly 3 TFLOPS FP32. That is 3x larger than “teraflop-speed”. Maybe the reason is that the frequency behaviour is complex, especially in the AVX modes. The base frequency applies only to non-AVX workloads; for workloads heavy in AVX-512 the CPU reduces the clock frequency. If the article meant FP64 performance (half the FP32 rate, so about 1.5 TFLOPS at 2.60 GHz), and the AVX base frequency is lower than 2.60 GHz, then 1 TFLOPS FP64 would be understandable.

For a 6-core i7-6850K (Broadwell) with no AVX-512, working at a 3.60 GHz base frequency, the estimate is 6*3.6*32 = 691 GFLOPS FP32.
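The rough estimate above is just a product of three numbers, so it is easy to reproduce. Here is a minimal sketch; the function name `peak_gflops` is mine, and the inputs are the base clocks and FLOPs/cycle figures quoted in the text, not measured values:

```python
def peak_gflops(cores: int, freq_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in GFLOPS: cores x clock (GHz) x FLOPs per cycle per core."""
    return cores * freq_ghz * flops_per_cycle


# Inputs come from the text above; real clocks under AVX load are lower than base.
i9_7980xe = peak_gflops(cores=18, freq_ghz=2.6, flops_per_cycle=64)  # Skylake-X
i7_6850k = peak_gflops(cores=6, freq_ghz=3.6, flops_per_cycle=32)    # Broadwell

print(f"i9-7980XE: {i9_7980xe:.0f} GFLOPS FP32")
print(f"i7-6850K:  {i7_6850k:.0f} GFLOPS FP32")
```

Treat the result as an upper bound: it assumes every core issues fused multiply-adds on full-width vectors every cycle.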

Correct me if I made mistakes somewhere, pls.

BTW, if you know a reliable source of Intel/AMD peak/real performance metrics (in FLOPS, not special scores), let me know. It seems that Intel does not like to participate in these comparisons.

There is a trend towards using FP16 (half precision) instead of FP32 (single precision), because high-precision calculations do not seem to be critical for neural networks. This also makes double precision (FP64) not useful: the additional precision gives nothing, while being slower.

There is a Mixed-precision training mode which uses both single- and half-precision representations. NVIDIA expanded the set of tools available for mixed-precision computing since Pascal architecture and CUDA 8. Intel likes the topic as well, see the January 19, 2018 whitepaper called “Lower Numerical Precision Deep Learning Inference and Training”.

“Storing FP16 (half precision) data compared to higher precision FP32 or FP64 reduces memory usage of the neural network, allowing training and deployment of larger networks, and FP16 data transfers take less time than FP32 or FP64 transfers. Moreover, for many networks deep learning inference can be performed using 8-bit integer (INT8) computations without significant impact on accuracy.” [source]

In addition to making it possible to train and store larger models, switching to FP16 typically gives a 2x speed improvement (2x more TFLOPS).
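To make the storage side concrete, here is a small sketch using only Python’s standard `struct` module, which can pack IEEE 754 half-precision values with the `e` format. The helper names and the 100M-parameter model are my own illustration, not anything from the quoted sources:

```python
import struct

# Bytes per value for each floating point format (struct codes: d, f, e).
BYTES_PER_VALUE = {
    "FP64": struct.calcsize("<d"),
    "FP32": struct.calcsize("<f"),
    "FP16": struct.calcsize("<e"),
}


def model_size_mb(num_params: int, fmt: str) -> float:
    """Memory needed just to store the weights, in megabytes (10^6 bytes)."""
    return num_params * BYTES_PER_VALUE[fmt] / 1e6


def to_fp16(x: float) -> float:
    """Round-trip a Python float through half precision to see what survives."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


params = 100_000_000  # hypothetical model size
for fmt in ("FP64", "FP32", "FP16"):
    print(fmt, model_size_mb(params, fmt), "MB")

print(to_fp16(0.1))   # close to 0.1, but with only about 3 decimal digits
print(to_fp16(1e-8))  # too small for FP16, rounds to zero
```

The last line hints at why mixed-precision training keeps some values in FP32: very small numbers simply vanish in half precision.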

FP16 has been natively supported since the Tegra X1 and the Pascal architecture.

“Prior to these parts, any use of FP16 data would require that it be promoted to FP32 for both computational and storage purposes, which meant that using FP16 did not offer any meaningful improvement in performance or storage needs. In practice this meant that if a developer only needed the precision offered by FP16 compute (and deep learning is quickly becoming the textbook example here), that at an architectural level power was being wasted computing that extra precision.” [source]

But there are caveats. NVIDIA has severely limited FP16 and FP64 CUDA performance on gaming cards (including 1080 Ti, Titan X/Xp). FP16 performance is 1/64th and FP64 is 1/32nd of FP32 performance.

(Sep 2018) The situation has now changed with the Turing architecture and the new series of RTX gaming cards (RTX 2070/2080/2080 Ti). Turing chips support unrestricted FP16 calculations.

More on comparing Tesla and Geforce.

AMD Radeon RX Vega has no restrictions for FP16, giving 2x performance compared to FP32, while FP64 is slower (1/16th).

INT8 is useful to make inference faster. INT8 leads to t̶h̶e̶ ̶g̶o̶o̶d̶ ̶o̶l̶d̶ ̶8̶-̶b̶i̶t̶ ̶w̶o̶r̶l̶d̶ significantly narrower dynamic range and lower precision, and it could be a challenge to completely move to integer arithmetic for neural networks, but converting existing networks (originally trained using FP32) does work. Here are some reflections on it. As I understand, there are no good cases on using INT8 for training on GPUs.
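To give a feel for what converting FP32 weights to INT8 means, here is a minimal sketch of symmetric linear quantization with one scale per tensor. This is my simplification for illustration; it is not the calibration procedure TensorRT or any particular framework uses, and the function names are made up:

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.01, 1.0]  # toy FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q, scale)
print([abs(a - b) for a, b in zip(weights, restored)])  # rounding error per weight
```

The narrow dynamic range is visible right away: everything is a multiple of `scale`, so the largest weight decides how coarse the grid is for all the others.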

INT8 requires sm_61+ (Pascal TitanX, GTX 1080, Tesla P4, P40 and others). You can find compute capability supported for all NVIDIA chips here.

INT1 doesn’t exist as an option right now, but there are many works on binarization of neural networks (replacing float weights with binary ones): “Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1”, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks” and XNOR.AI, “How to Train a Compact Binary Neural Network with High Accuracy?”, “FINN: A Framework for Fast, Scalable Binarized Neural Network Inference”, and so on. Here is a short overview and more references on binary deep learning. Binarization leads to an even more radical reduction of storage requirements, and to faster calculations as well. This kind of calculation is a particularly good fit for FPGAs, but CPUs and other options could benefit from it as well. I won’t be surprised if NVIDIA or Intel starts talking about it some day. (Sep 2018 update: They did it. See below)
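Why are binary networks so cheap to compute? If weights and activations are constrained to +1/-1, a dot product reduces to a bitwise operation and a bit count. Here is a toy sketch in plain Python; the bit-packing scheme and function names are my own choices for illustration, not code from the papers above:

```python
def binarize(values):
    """Replace every value by its sign: +1 for non-negative, -1 otherwise."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Store a +1/-1 vector in one integer: bit i is 1 where signs[i] == +1."""
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n.

    Matching positions contribute +1 and mismatches -1, so the result is
    n - 2 * (number of mismatching bits), i.e. an XOR plus a popcount.
    """
    mismatches = bin(a_bits ^ b_bits).count("1")
    return n - 2 * mismatches


w = binarize([0.3, -0.2, 0.7, -0.9])
x = binarize([1.0, 0.5, -0.4, -0.1])

reference = sum(a * b for a, b in zip(w, x))
fast = binary_dot(pack_bits(w), pack_bits(x), len(w))
print(reference, fast)
```

Each weight takes one bit instead of 32, and the multiply-accumulate loop disappears, which is exactly what makes this attractive for FPGAs.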

Both FP16 and INT8 save memory and can give a significant speedup compared to FP32 (keep in mind that TensorRT, mentioned in the picture, makes many other optimizations as well):

(Sep 2018) NVIDIA recently launched the TESLA T4 inference accelerator with INT4 support, which is twice as fast as INT8. And there is some talk about INT1: “We have some researchers who have published work that even with only four bits they can maintain high accuracy with extreme small, efficient, and fast models. You could even go to INT1, but that is pretty advanced stuff and still a research topic”.

TESLA T4 performance on FP16/INT8/INT4

INT4 is supported by all the chips of the Turing architecture.

Here is a table summarizing FP16/INT8/INT4/FP64 speedups/slowdowns for many popular GPUs:

FP16/FP64/INT8/INT4 native performance relative to FP32

Let me know if you find errors here, or leave a comment in the source file.

Knowing this, we can make a chart for FP16 comparisons (I left mostly the recent GPUs here). RTX gaming cards look very interesting.

And the same chart for INT8 (TOPS instead of GFLOPS because we now count integer operations, not floating point ones).

The numbers for 2017 and older cards are calculated based on their FP32 performance, while the numbers for 2018 cards are from NVIDIA documentation, and it seems they calculate INT8 performance based on Tensor Core FP16 (see next section). Therefore there are probably some errors in this chart, mostly regarding Tesla V100/TitanV which do have tensor cores as well (so their numbers should be higher). I’ll fix it soon.

NVIDIA’s Volta architecture has additional Tensor Cores and promises to deliver nearly 120 TFLOPS (FP16) for V100 (a bit more for Tesla V100, a bit less for Titan V). Those are huge numbers, far above the graph with its maximum value near 15 TFLOPS (FP32). Be careful, I switched here to TFLOPS (=1000 GFLOPS).

(Sep 2018) Turing architecture and the new series of RTX gaming cards (RTX 2070/2080/2080 Ti) do have tensor cores on board. So, for example, RTX 2080 Ti with more than 100 TFLOPS FP16 looks very promising.

We can draw a graph with FP16 performance for tensor cores:

cuDNN supports Tensor Cores since version 7. Kernels using Tensor Core Operations are available for both CNNs and RNNs for forward and backward computations. More details are here. cuDNN may prefer not to use Tensor Core Operations (for instance, when the problem size is not suited to Tensor Core acceleration), and instead use an alternative implementation based on regular floating point operations. You have to manually turn on using Tensor Core operations by setting the math mode to CUDNN_TENSOR_OP_MATH, because the default in cuDNN is CUDNN_DEFAULT_MATH, and it indicates that the Tensor Core operations will be avoided by the library.

NVIDIA states that “Tensor Cores are already supported for Deep Learning training either in a main release or via pull requests in many Deep Learning frameworks (including Tensorflow, PyTorch, MXNet, and Caffe2). For more information about enabling Tensor Cores when using these frameworks, check out the Mixed-Precision Training Guide. For Deep Learning inference the recent TensorRT 3 release also supports Tensor Cores.”

Tensor cores look cool, and NVIDIA benchmarks are impressive:

Performance comparison of convolution on Tesla V100 (Volta) with Tensor Cores versus Tesla P100 (Pascal). The comparison is between the geometric means of run times of the convolution layers from each neural network. Both V100 and P100 use FP16 input/output data and FP32 computation; V100 uses Tensor Cores, while P100 uses FP32 fused-multiply add (FMA). https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/

The problem is that it’s totally unclear how to approach the peak performance of 120 TFLOPS, and as far as I know, no one has achieved such a significant speedup on real tasks. Let me know if you are aware of good cases.

  • An RNN/LSTM benchmark achieved only up to a 2x speedup comparing Tesla V100 to P100.
  • The same 2x speedup comparing Tesla V100 to P100 on CNNs.
  • A CNN benchmark comparing Titan V with Titan Xp gives nearly a 2x speedup (mostly due to switching to FP16 from FP32).
  • A Baidu benchmark provides similar results, obtaining less than 20 TFLOPS on convolutions with Tesla V100 Mixed Precision. That’s far from 120 TFLOPS.
  • Another set of computer vision tasks shows a similar picture, with slightly better V100 FP32 performance compared to 1080 Ti, and most of the gain (nearly 2x) obtained from switching to FP16 (which is totally understandable). It would be interesting to test P100 FP16 here.

A 2x speedup is cool, but not as cool as the theoretical 12x difference between the peak performance of P100 and V100.

There are two important issues regarding memory: memory size and memory bandwidth.

Here the rule is simple: the more, the better.

I think 11–12 GB on current top gaming cards is the standard now. There exist many NN models that do not fit into 6 GB of memory, and you’ll suddenly find that 8 GB is not enough either. If you like to play with NNs like Lego (and they are actually a kind of Lego), then you’ll find yourself combining different models, and you need memory for it.

Many TESLAs have 16 GB, and there are new Turing architecture Quadro models (RTX 6000, RTX 8000) with 24 and 48 GB GDDR6 respectively. Quadro GV100 (Volta) has 32 GB HBM2.

There are some tricks like reducing batch size, converting models to FP16 and even INT8 (in the case of inference), pruning, binarization, and so on (more on approaches to memory saving). Some of them are easy to implement (or they are already supported in your framework of choice), some are not. Anyway, you’d like to have more memory available.

The tricky part here is that some NN-related calculations are bandwidth-limited, not computation-limited! This basically means that your cool hot GPU works significantly below its peak performance, and either you could possibly achieve the same results with cheaper hardware, or you could make it more efficient with other hardware (not necessarily with higher performance, but with larger bandwidth).

Here is a post on computation and memory bandwidth by Eugenio Culurciello et al.

The paper states that in some worst-case conditions the efficiency of the GPU is in the range of 15–20% of peak theoretical performance! So, your brand new Tesla V100 turns into GTX 1050 Ti — GTX 1060.

In such bad cases looking at bandwidth can be helpful. For example, GTX Titan Xp (10790 FP32 GFLOPS, 547 GB/s bandwidth) can be faster than GTX 1080 Ti (10609 FP32 GFLOPS, 484 GB/s) by up to 13% on such bandwidth-limited tasks. And GTX Titan V with 652.8 GB/s looks even better. Tesla V100 has 900 GB/s, P100 has 720 GB/s.

From the paper “Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs”:

“Convolutional layers are typically most time consuming in a whole network. Therefore, achieving high arithmetic throughputs has been the main optimization objective. … Counter-intuitively, we observe that the convolutional layers are not necessarily only compute bound. Specifically, for convolutional layers with small C and N dimensions, the performance is actually memory bound similar to 2D convolution.

Pooling layers are usually paired with convolutional layers in CNNs. Compared to convolutional layers, pooling layers have low arithmetic complexity, O(N*C*H*W). Its performance is mainly bounded by memory efficiency (i.e., bandwidth and latency)

Each step in the softmax layer involves element-wise matrix or matrix-vector computation. The low arithmetic intensity in these matrix vector operations, and the intermediate data communication across different steps also make them memory bound.”

In typical RNN implementations the recurrent weight matrix has to be reloaded from memory on each timestep, making RNN calculations bandwidth-bound. Due to their inherently sequential nature it’s hard to make RNNs use all the resources of a GPU.

There exist some solutions (like Persistent RNN) that turn a bandwidth-bound problem into a compute-bound one. According to the article on Persistent RNNs:

“The peak floating point throughput of a Titan X is 6.144 TFLOP/s. A straightforward implementation of a RNN using GEMM operations achieves 0.099 TFLOP/s at a layer size of 1152 using Nervana Systems GEMM kernels at a mini-batch size of 4. Our initial Persistent RNN implementation with the same layer and mini-batch size achieves over 2.8 TFLOP/s resulting in a 30x speedup.”
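The numbers in this quote are easy to put into perspective with a little arithmetic. A quick sketch using only the three figures quoted above (the variable names are mine, and “over 2.8” is taken as exactly 2.8):

```python
peak_tflops = 6.144        # Titan X peak, from the quote
gemm_tflops = 0.099        # straightforward GEMM-based RNN
persistent_tflops = 2.8    # Persistent RNN ("over 2.8" in the quote)

speedup = persistent_tflops / gemm_tflops
gemm_utilization = gemm_tflops / peak_tflops
persistent_utilization = persistent_tflops / peak_tflops

print(f"speedup: {speedup:.1f}x")
print(f"GEMM version uses {gemm_utilization:.1%} of peak")
print(f"Persistent version uses {persistent_utilization:.1%} of peak")
```

So the straightforward implementation uses less than 2% of the card, and even the optimized one stays below half of the peak.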

There are many other ways of not exploiting the full potential of your GPU and making your neural network slow, starting from incorrect batch sizes and going to more intricate issues. From the previous paper on optimizing memory efficiency:

“Moreover, even for the layers that have been always considered to be compute-bound, i.e., convolutional layers, we found that choosing the suitable data layout could lead up to 2.3x performance improvement.”

There exists a Roofline Performance Model proposed by Samuel Williams, Andrew Waterman, and David Patterson, U.C. Berkeley, which gives a better understanding of the actual performance of calculations:

Source: http://cadlab.cs.ucla.edu/~cong/slides/HALO15_keynote.pdf

Here is one example of an NN benchmark from the paper on Google TPU (more on it in later parts) done on a K80 GPU:

You see that most NNs work below the peak performance, sometimes significantly below. And some of them are actually in the area bounded by bandwidth constraints. So, keep this in mind. More on the Roofline.
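The core of the Roofline model fits in one line: attainable performance is the minimum of the compute peak and the memory bandwidth multiplied by the arithmetic intensity (FLOPs performed per byte moved). Here is a minimal sketch using the Titan Xp and GTX 1080 Ti figures mentioned earlier; the intensity values are arbitrary examples I picked, not measurements of any real network:

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline: limited either by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (FLOPs/byte) above which a kernel is compute-bound."""
    return peak_gflops / bandwidth_gbs


# (peak FP32 GFLOPS, memory bandwidth GB/s), as quoted in the text
GPUS = {"Titan Xp": (10790.0, 547.0), "GTX 1080 Ti": (10609.0, 484.0)}

for intensity in (1.0, 4.0, 100.0):  # hypothetical kernels
    for name, (peak, bw) in GPUS.items():
        print(intensity, name, attainable_gflops(peak, bw, intensity))

low = {n: attainable_gflops(p, b, 4.0) for n, (p, b) in GPUS.items()}
ratio = low["Titan Xp"] / low["GTX 1080 Ti"]
print(f"bandwidth-bound advantage of Titan Xp: {ratio:.2f}x")
```

For low-intensity kernels the ratio between the two cards follows their bandwidth (about 13%), while for high-intensity kernels it follows their nearly identical compute peaks.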

Here is a recent talk from Intel AI product group at CogX 2018 mentioning the Roofline model with examples highlighting that a proper optimization of communications can significantly improve performance (e.g. 50x).

There is a trend towards using single system multi-GPU configurations and even distributed multi-system multi-GPU configurations. It partially solves problems with limited performance and memory.

First of all, it’s not about SLI, which you may have heard about in the good old days of computer gaming. SLI is a technology to link 2–4 GPUs to share the work of rendering an image. It’s only about rendering graphics.

In CUDA (which is about computations, not graphics) you can directly access any available GPU in your system, so just add several GPUs and use any of them. You can write your program to do anything you want, loading data into any GPU and running computations on a GPU of your choice.

Usually deep learning engineers do not write CUDA code, they just use frameworks they like (TensorFlow, PyTorch, Caffe, …). In any of these frameworks you can tell the system which GPU to use.
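Besides framework-specific options, a simple framework-independent way to pin a process to a GPU is the `CUDA_VISIBLE_DEVICES` environment variable, which CUDA reads at startup. Below is a sketch that prepares one job per learning rate and spreads the jobs over the available GPUs. The script name `train.py` and its `--lr` flag are hypothetical placeholders, and the jobs are only planned here, not launched:

```python
import os


def env_for_gpu(gpu_index, base_env=None):
    """Copy the environment and expose only one GPU to the child process."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def plan_jobs(learning_rates, num_gpus):
    """Assign jobs to GPUs round-robin; returns (command, environment) pairs."""
    jobs = []
    for i, lr in enumerate(learning_rates):
        cmd = ["python", "train.py", f"--lr={lr}"]  # hypothetical training script
        jobs.append((cmd, env_for_gpu(i % num_gpus)))
    return jobs


for cmd, env in plan_jobs([0.1, 0.01, 0.001], num_gpus=2):
    print("GPU", env["CUDA_VISIBLE_DEVICES"], "->", " ".join(cmd))
    # subprocess.Popen(cmd, env=env) would actually start the job
```

Inside each process the selected card then appears as device 0, so the training code itself does not need to know which physical GPU it got.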

But choosing a specific device to train your neural network is not the whole story. That works well if you have to train several models at once (maybe trying different parameters, so we can call this mode Hyper-Parameter Parallel). What if you want to train larger models faster?

So you need to split the work across multiple GPUs in your system (and between several systems too). Distributed training is the answer.

There are two main approaches to parallelizing neural network training: model parallelism and data parallelism.

In model parallelism, different machines in the distributed system are responsible for the computations in different parts of a single network — for example, each layer in the neural network may be assigned to a different machine.

In data parallelism, different machines have a complete copy of the model; each machine simply gets a different portion of the data, and results from each are somehow combined.

From SkyMind docs
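The “somehow combined” part of data parallelism is usually an average of gradients. Here is a toy sketch in plain Python with a one-parameter model y = w * x. The model, the data and the function names are all made up for illustration, and the “workers” run sequentially here instead of on separate GPUs:

```python
def gradient(w, shard):
    """Gradient of the mean squared error of y = w * x on one shard of data."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, data, num_workers, lr):
    """One SGD step: each worker sees its own shard, gradients are averaged.

    Assumes len(data) is a multiple of num_workers, so that shards have equal
    size and the average equals the gradient over the full batch.
    """
    shards = [data[i::num_workers] for i in range(num_workers)]
    grads = [gradient(w, shard) for shard in shards]  # parallel on real hardware
    avg_grad = sum(grads) / num_workers               # the "all-reduce" step
    return w - lr * avg_grad


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # samples of y = 3x

w = 0.0
for _ in range(50):
    w = data_parallel_step(w, data, num_workers=2, lr=0.01)
print(w)  # approaches 3.0
```

Every worker holds the same `w` and applies the same averaged update, so all model copies stay identical after each step.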

Data parallelism is more popular, but these approaches are not mutually exclusive. And in some cases you have to split your model across several machines just because the model is too large for a single machine. See, for example, Google NMT:

The model architecture of GNMT, Google’s Neural Machine Translation system. The model is partitioned into multiple GPUs to speed up training. In our setup, we have 8 encoder LSTM layers (1 bi-directional layer and 7 uni-directional layers), and 8 decoder layers. With this setting, one model replica is partitioned 8-ways and is placed on 8 different GPUs typically belonging to one host machine. The softmax layer is also partitioned and placed on multiple GPUs. Depending on the output vocabulary size we either have them run on the same GPUs as the encoder and decoder networks, or have them run on a separate set of dedicated GPUs.

NVIDIA has a Collective Communications Library (NCCL) that implements multi-GPU and multi-node collective communication primitives that are performance optimized for NVIDIA GPUs. Developers of deep learning frameworks and HPC applications can rely on NCCL’s highly optimized, MPI compatible and topology aware routines, to take full advantage of all available GPUs within and across multiple nodes.

NCCL 1.x was limited to intra-node, NCCL 2.x supports multi-node configurations:

Intel has a similar library called Machine Learning Scaling Library (MLSL), AMD has ROCm Communication Collectives Library (RCCL).

Many DL frameworks support distributed training: Distributed TensorFlow, Horovod for TensorFlow and Keras, PyTorch, Caffe2, CNTK, Deeplearning4j (using Apache Spark), MXNet/Gluon, PaddlePaddle (Baidu’s framework, whose name is an acronym of PArallel Distributed Deep LEarning), and there is even Apache SINGA (which doesn’t seem to be actively developed).

These solutions offer speed-ups that are close to linear in the number of cards. Here is a benchmark based on CNTK using NCCL 2:

Source

And here is a benchmark from Uber’s Horovod:

The performance of training a model using 128 GPUs is quite impressive and is not far from the ideal case. Similar results were shared by Baidu (Horovod is actually built upon their work), and Baidu recently published another interesting paper on deep learning scaling.

In short, distributed training on GPUs is now a commodity.

NVIDIA NVLink is an energy-efficient, high-bandwidth path between the GPU and the CPU.

Remember that in the part on CPUs we talked about x8/x16 configurations for PCIe. PCI Express 3.0 (PCIe v.3) allows for 985 MB/s per lane, so 15.75 GB/s for x16 links (and half of that for an x8 configuration). That’s the speed at which your CPU exchanges data with your GPU.

PCIe v.4 (released in the Fall of 2017; we’ll probably see Intel support it in 2018 and AMD in 2020, and BTW it is already available in IBM’s POWER9 chip, but this chip has something even better!) allows for communication twice as fast (31.51 GB/s for x16), and PCIe v.5 (to be released in 2019) doubles the speed again (63 GB/s for x16).

NVLink 1.0/2.0 allows for 80/150 GB/s. So it could make sense even for a single GPU. And there are CPUs supporting it. IBM POWER8+ and later allow connecting up to four NVLink devices directly to the chip. POWER9 supports NVLink 2.0. It’s a pity such desktops are not available.
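To see what these link speeds mean in practice, here is a small sketch that estimates how long it takes to move a chunk of data over each link at its peak rate. The bandwidth figures are the ones quoted above; the 1 GB payload is an arbitrary example, and real transfers will be slower than this best case:

```python
PCIE3_LANE_GBS = 0.985  # GB/s per PCIe 3.0 lane, from the text

# Peak bandwidth in GB/s for each link, as quoted in the text
LINKS_GBS = {
    "PCIe 3.0 x8": 8 * PCIE3_LANE_GBS,
    "PCIe 3.0 x16": 16 * PCIE3_LANE_GBS,
    "PCIe 4.0 x16": 31.51,
    "PCIe 5.0 x16": 63.0,
    "NVLink 1.0": 80.0,
    "NVLink 2.0": 150.0,
}


def transfer_ms(size_gb, bandwidth_gbs):
    """Best-case time in milliseconds to move size_gb over the link."""
    return size_gb / bandwidth_gbs * 1000.0


payload_gb = 1.0  # hypothetical amount of data to copy to the GPU
for name, bandwidth in LINKS_GBS.items():
    print(f"{name:13s} {bandwidth:7.2f} GB/s  {transfer_ms(payload_gb, bandwidth):6.1f} ms")
```

If a training step itself takes only a few tens of milliseconds, the difference between roughly 63 ms over PCIe 3.0 x16 and under 7 ms over NVLink 2.0 is far from negligible.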

Here is a benchmark of the Host-GPU bandwidth in different configurations.

In addition to speeding CPU-to-GPU communications for systems with an NVLink CPU connection, NVLink can have significant performance benefit for GPU-to-GPU (peer-to-peer) communications as well. That’s even cooler! Two or more GPUs can communicate with each other directly without the need to transfer data through the central hub, the CPU. Distributed training should be faster with NVLink.

Source article

Here is a paper which focuses on these peer-to-peer benefits from NVLink.

Data transfer is not the only work of the GPU, so the real benefit will be less than the ratio between peak data transfer rates, but it helps in real tasks. Here is an example of calculating a 3D FFT (I think it’s the closest to neural networks among the cases in the paper):

Source article

Over 2x speedup for NVLink-connected GPUs compared to PCIe 3.0-connected GPUs. That’s cool.

AMD has a similar technology called Infinity Fabric.

One more benefit from NVLink. With CUDA 6, NVIDIA introduced “one of the most dramatic programming model improvements in the history of the CUDA platform”, the Unified Memory.

Unified Memory creates a pool of managed memory that is shared between the CPU and GPU, bridging the CPU-GPU divide. Managed memory is accessible to both the CPU and GPU using a single pointer. The key is that the system automatically migrates data allocated in Unified Memory between host and device so that it looks like CPU memory to code running on the CPU, and like GPU memory to code running on the GPU.

https://devblogs.nvidia.com/unified-memory-in-cuda-6/

With 80 GB/s or higher bandwidth on machines with NVLink-connected CPUs and GPUs, that means GPU kernels will be able to access data in host system memory at the same bandwidth the CPU has to that memory (for quad-channel DDR4–3200 that should be 4*25600 MB/s = near 100 GB/s, it’s lower than NVLink 2.0 bandwidth) — much faster than PCIe. Host and device portions of applications will be able to share data much more efficiently and cooperatively operate on shared data structure, and supporting larger problem sizes will be easier than ever.

https://devblogs.nvidia.com/how-nvlink-will-enable-faster-easier-multi-gpu-computing/

How could it help for Deep Learning applications? Obviously, you are not limited now by the GPU memory size. Many applications can benefit from it.

If we only had a POWER9 system… (see below)

It seems that the current DL frameworks do not support Unified Memory, and I haven’t heard of any plans to add it yet. You probably have to write it from scratch using CUDA/cuDNN. But there is a similar movement dedicated to supporting large models.

At NIPS’17 Alibaba presented a paper “Training Deeper Models by GPU Memory Optimization on TensorFlow” with a similar approach without using Unified Memory. They state that “our tests show that it [Unified Memory] can bring a severe performance loss (maximum ten times degradation)” and proposed a general approach called “swap-out/in”, which is targeted at any kind of neural network. Additionally they designed a memory-efficient attention algorithm for Seq2Seq models. A similar approach from IBM is described below.

I’d expect to see these ideas move into practice (being integrated into DL frameworks) soon, because limited GPU memory is a severe constraint.

On systems with x86 CPUs (such as Intel Xeon), the connectivity to the GPU is only through PCI-Express (although the GPUs can connect to each other through NVLink). On systems with POWER8/9 CPUs, the connectivity to the GPU is through NVLink (in addition to the NVLink between GPUs).

NVLink is not about gaming cards at all. It is available for professional NVIDIA cards like Quadro and Tesla.

(Sep 2018) Turing gaming cards now have NVLink!

The TU102 and TU104 GPUs (RTX 2080/2080 Ti, but _NOT_ 2070) include the second generation of NVIDIA’s NVLink high-speed interconnect, originally designed into the Volta GV100 GPU, providing high-speed multi-GPU connectivity for SLI and other multi-GPU use cases. NVLink permits each GPU to directly access memory of other connected GPUs, providing much faster GPU-to-GPU communications, and allows combining memory from multiple GPUs to support much larger datasets and faster in-memory computations.

The Turing TU102 GPU (RTX 2080 Ti) includes two x8 second-generation NVLink links, and Turing TU104 (RTX 2080) includes one x8 second-generation NVLink link. Each link provides 25 GB/sec peak bandwidth per direction between two GPUs (50 GB/sec bidirectional bandwidth). Two links in TU102 provides 50 GB/sec in each direction, or 100 GB/sec bidirectionally.

So, installing two 2080 Ti cards and connecting them using NVLink seems to be useful. You’ll get a 22 GB memory GPU with a 100 GB/sec interconnect this way.

At GTC’18 NVIDIA announced NVSwitch able to support 16 fully-connected GPUs in a single server node and drive simultaneous communication between all eight GPU pairs at 300 GB/s each. These 16 GPUs can be used as a single large-scale accelerator with 0.5 Terabytes of unified memory space and 2 PFLOPS FP16 performance.

https://www.nvidia.com/en-us/data-center/nvlink/

AI enters the era of democratization. It happens at different levels. One of the levels is the hardware level.

The processing power is not only becoming larger, it is becoming more available. Google’s 2011 work on large-scale unsupervised learning (remember the work where an NN processed 10 million images and found a cat face) was replicated in 2013 with much smaller resources. Here’s a Wired story about it. The price comparisons differ between the Wired article ($1M → $20K) and NVIDIA slides ($5M → $33K), but the trend is obvious. The GPU was a game-changer, and Deep Learning (with good results) became much more affordable.

There is a TOP500 list of supercomputers measured by the LINPACK benchmark on double precision (FP64) floating point operations. The benchmark solves a dense system of linear equations. Remember, for neural networks FP32/FP16 are usually used, and NVIDIA likes to report their performance in terms of FP16 calculations (which is understandable, because it gives 4x larger numbers).

Let’s take for example NVIDIA GTX Titan V, with its FP64 peak performance of 6900 GFLOPS = 6.9 TFLOPS. It corresponds to the best supercomputer in the world of 2001–2002 (IBM ASCI White with 7.226 TFLOPS peak speed) and to a supercomputer in 500th place (still a cool supercomputer) of the TOP500 list in November 2007 (the entry level to the list was 5.9 TFlop/s).

It’s incorrect to compare FP16/FP32 with FP64 performance metrics, but for tasks that can tolerate lower precision (neural networks can) and in terms of the number of equations solved per unit of time, the modern gaming card NVIDIA GTX 1080 Ti (with more than 10 TFLOPS FP32 peak performance) is a desktop supercomputer of the recent past.

Here is a news article from 2002 about an 11-teraflops supercomputer:

“The 1- to 10-teraflops processing range is opening up a revolutionary capability for scientific applications. It’s qualitatively different from what we have been able to do before,” said Seager. The difference lies in the number of variables the computer can process at the same time and in the resolution of the simulation. Not only are simulations much closer to realistic physical experiments, he said, but it takes far less time to converge to a reasonable approximation. This type of capability elevates computer simulation to the same level as physical experiment and theory, so it is going to allow us to do groundbreaking scientific work”

Now many of us have comparable processing power (and storage has become much cheaper too) and can in principle be a scientific lab of the not-so-distant past doing world-class research. But who cares?.. Everyone is counting hashes…

Rapid growth of supercomputers performance, based on data from top500.org site. The logarithmic y-axis shows performance in GFLOPS. [Source]

So, the trend continues, and the modern 500th place supercomputer in the world will be at your desktop 10 years from now (or maybe in a phone, watch, jacket, toothbrush, under the skin? remember, the size of the “10 TFLOPS” bundle reduced dramatically too).

For comparison, Huawei’s Kirin 970 mobile processor (already in smartphones, more on mobile AI in later posts) with a Neural Network Processing Unit (NPU) on board is said to deliver 1.92 TFLOPS FP16. That matches the top performing supercomputer of 1997 (but remember again the difference between FP16/FP64).

NVIDIA rides this wave too. There are two Deep Learning Supercomputers: the DGX-1 server and the DGX Station.

DGX-1 (a 3U rackmount solution) started with 8xTESLA P100 (DGX-1P) and is now upgraded to 8xTESLA V100 (DGX-1V), providing nearly 1000 TFLOPS, or 1 PFLOPS, FP16. FP64 is only 62.4 TFLOPS: that is not 1/4 of the FP16 figure, because the FP16 performance is measured for Tensor Cores, which work only with FP16, so it’s just 8*7.8 TFLOPS FP64 for TESLA V100. In terms of FP16 performance that’s close to the FP64 (yes, again, that’s not the right way to compare) performance of the #1 supercomputer of 2008–2009 (Roadrunner with 1.105 PFLOPS), and it could be in the list of November 2017.
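The DGX-1V totals are plain per-GPU multiplication. Here is a minimal sketch, assuming per-GPU peaks of 7.8 TFLOPS FP64 and 125 TFLOPS FP16 on Tensor Cores for the Tesla V100 SXM2. Both per-GPU numbers are assumptions taken from NVIDIA's public specs; they are chosen to be consistent with the 62.4 TFLOPS and ~1 PFLOPS totals quoted here.

```python
# Per-GPU peak figures for Tesla V100 SXM2.
# Assumptions: taken from NVIDIA's public specs, not stated per GPU in this post.
V100_FP64_TFLOPS = 7.8
V100_TENSOR_FP16_TFLOPS = 125.0


def system_peak(per_gpu_tflops, n_gpus):
    """Aggregate peak of a multi-GPU box: a plain sum, ignoring scaling losses."""
    return per_gpu_tflops * n_gpus


dgx1v_fp64 = system_peak(V100_FP64_TFLOPS, 8)
dgx1v_fp16 = system_peak(V100_TENSOR_FP16_TFLOPS, 8)

print(f"DGX-1V FP64: {dgx1v_fp64:.1f} TFLOPS")
print(f"DGX-1V FP16 (Tensor Cores): {dgx1v_fp16:.0f} TFLOPS")
print(f"FP16/FP64 ratio: {dgx1v_fp16 / dgx1v_fp64:.1f}x")
```

The ratio comes out far above 4x, which is why dividing the Tensor Core FP16 number by four gives a wrong FP64 estimate.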

BTW there is #149 DGX SaturnV Volta36 (1.8 PFLOPS, 97 kW, some V100, maybe 36 DGX-1V?) and #36 DGX Saturn V (4.9 PFLOPS, 349.5 kW, comprised of 124 DGX-1P with P100), ready to be upgraded to a new one comprised of 660 nodes with 8xV100 each, resulting in a total of 5280 Volta GPU accelerators yielding 40 PFLOPS FP64 (and theoretically 660 PFLOPS FP16 on Tensor Cores), which in theory would place it among the top ten systems in the world even at double precision floating point. That’s impressive.

BTW, what neural network would you train if you had this 660 PFLOPS supercomputer at your disposal?

DGX-1 with P100 is priced at $129,000, DGX-1 with V100 is priced at $149,000.

At GTC’18 NVIDIA announced DGX-2, a machine with 16 TESLA V100 32GB (twice as many GPUs, with twice as much memory per GPU as the previous V100 has), resulting in 512GB total HBM2 GPU memory, 1.5TB system memory, and 2 PFLOPS FP16 performance.

DGX-2

With DGX-2 you have 4x more memory and 2x more performance compared to DGX-1.

DGX-2 is priced at $399,000.

Here is a more detailed price-performance comparison of DGX-1 and DGX-2.

DGX Station (desktop/office solution) represents an economical midway point to acquire performance-optimized accelerated compute workstations at half the price and half the performance of its server form factor sibling (the DGX-1). It is coined as “the world’s first personal supercomputer for leading-edge AI development.” IDC expects this trend to continue.

It contains 4xTESLA V100 and delivers up to 500 TFLOPS FP16 (peak performance). Right now it is sold for $49,900 (usual price is $69,000).
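To compare the DGX family on price per unit of peak performance, here is a minimal sketch that uses only the prices and peak FP16 figures quoted in this post. Peak Tensor Core TFLOPS is a rough proxy, and the calculation ignores memory size, interconnect, and support.

```python
# (name, price in USD, peak FP16 TFLOPS), all as quoted in this post.
systems = [
    ("DGX-1 (V100)", 149_000, 1000),
    ("DGX-2", 399_000, 2000),
    ("DGX Station (list)", 69_000, 500),
    ("DGX Station (promo)", 49_900, 500),
]


def usd_per_tflops(price_usd, tflops):
    """Dollars paid per peak FP16 TFLOPS."""
    return price_usd / tflops


cost = {name: usd_per_tflops(price, tflops) for name, price, tflops in systems}

# Cheapest per peak TFLOPS first.
for name, value in sorted(cost.items(), key=lambda kv: kv[1]):
    print(f"{name:22s} ${value:7.1f} per peak FP16 TFLOPS")
```

By this crude measure DGX-2 is the most expensive per peak TFLOPS, so its premium has to be justified by the larger total GPU memory and the single-box scale rather than by raw FLOPS.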

It’s not just about putting several TESLA V100s into a single machine: there are also NVLink (which we know is cool!) connections between GPUs, 4x 100 Gb InfiniBand network interface cards, and maybe some other optimizations. Like many traditional workstations, the DGX Station is designed to operate with limited noise. It uses a standard 115–240 VAC outlet and can draw up to 1500W.

Many have heard about DGX, but few people know that IBM has a similar project called Minsky.

Power Systems S822LC (“Minsky”)

In September 2016, IBM presented the Power Systems S822LC for High Performance Computing (code-named “Minsky”), containing two Power8 CPUs (8–10 cores each) and four Nvidia Tesla P100 GPUs. “Minsky” was the second system on the market to use Nvidia’s P100 GPU (the first one was DGX-1).

Source

The important difference from Nvidia’s DGX-1 machine is that DGX-1 uses NVLink ports to interconnect its 8 Tesla P100 SXM2 cards, but the cards sit on a motherboard with two “Haswell” Xeon E5 v3 processors from Intel, and the GPU-CPU link uses regular PCI-Express links through a quad of PCI switches:

https://www.nextplatform.com/2016/09/08/refreshed-ibm-power-linux-systems-add-nvlink/

With the Minsky machine, IBM is using NVLink ports on the Power8 CPU for GPU-CPU communication as well. The two NVLink connections between the POWER8 CPU and the Tesla P100 GPUs enable data transfer over 2.5 times faster than the traditional Intel x86 based servers that use PCIe x16 Gen3:

https://www.ibm.com/blogs/systems/ibm-nvidia-present-nvlink-server-youve-waiting/

So, the DGX-1 machine, which Nvidia has tuned up specifically for deep learning, has more GPUs but they are less tightly coupled to the Intel Xeon CPUs and they have less bandwidth between the GPUs as well.

The IBM system has fewer GPUs and more bandwidth between the compute elements. IBM is aiming this Minsky box at HPC workloads, but there is no reason it cannot be used for deep learning.

According to The Next Platform: “With two of the ten-core Power8 chips running at 2.86 GHz, 128 GB of main memory, and four of the Tesla P100 accelerators, Boday says IBM will charge under $50,000. Nvidia is charging $129,000 for a DGX-1 system with eight of the Tesla cards plus its deep learning software stack and support for it. In other words, IBM’s Minsky pricing is consistent with Nvidia’s DGX-1 pricing.”

There is pricing for a product called “IBM Power System S822LC for Commercial Computing”, but it seems that these machines come without GPUs.

There is another product called “IBM Power System S822LC for High Performance Computing” with up to 4 TESLA P100, but the pricing is “Contact Us”.

In April 2017 IBM announced that these servers were planned to come to the Bluemix compute infrastructure, but I cannot find them. Right now there are only 4 POWER8 bare metal servers, none of which has GPUs, and all GPU servers are Intel Xeon servers.

Power Systems AC922 (“Newell”)

On December 5, 2017, IBM unveiled its POWER9-based servers targeted at Enterprise AI. Power Systems AC922 is a more recent and much more interesting option. The AC922, known variously by the code names “Witherspoon” and “Newell,” is the building block of the CORAL systems being deployed by the US Department of Energy: “Summit” at Oak Ridge National Laboratory and “Sierra” at Lawrence Livermore National Laboratory.

IBM POWER GPU Intensive Roadmap / Source

It contains 2xPOWER9 CPUs (available in configurations with anywhere from 16 to 44 cores) and 2–6 NVIDIA Tesla V100 GPUs with NVLink. The AC922 extends many of the design elements introduced in the Power8 “Minsky” boxes with a focus on enabling connectivity to a range of accelerators (Nvidia GPUs, ASICs, FPGAs, and PCIe-connected devices) using an array of interfaces. In addition to being the first servers to incorporate PCIe Gen4, the new systems support the NVLink 2.0 and OpenCAPI protocols, which offer nearly 10x the maximum bandwidth of PCI-E 3.0 based x86 systems, according to IBM.

POWER AC922 Design — 6 GPU / Source

More on AC922 here.

Intel up to now has had a virtual monopoly in server chips, with well over 90 percent of the market. But with Power9, IBM hopes to capture 20 percent of the market by 2020. [source]

IBM provides the PowerAI platform, which includes the most popular deep learning frameworks and their dependencies and contains a distributed deep learning (DDL) library (IBM Research was able to scale deep learning frameworks across up to 256 GPUs with up to 95 percent efficiency):

IBM PowerAI Platform / Source

PowerAI requires installation on IBM Power Systems S822LC for HPC or AC922 server infrastructure (more about PowerAI releases).

Among the interesting features IBM developed is Large Model Support (LMS). LMS uses system memory in conjunction with GPU memory to overcome GPU memory limitations in deep learning training. That makes perfect sense on a system such as the AC922 (remember the Unified Memory section in the discussion of NVLink).

Here are the results of running 1000 iterations of an enlarged GoogLeNet model (mini-batch size=5) on an enlarged ImageNet dataset (crop size of 2240x2240, so 100x larger images than in ordinary ImageNet) on two platforms (the key difference between the two platforms is NVLink 2.0):

The corresponding IBM paper (from the SysML’18 conference, held on February 15–16, 2018) additionally says: “We also observed that LMS can improve the training performance by maximizing GPU utilization. For Resnet-152 on Caffe, the maximum batch size without LMS was 32 and the corresponding throughput was 91.2 images/sec. With LMS, we were able to increase the batch size to 48 and improved the throughput to 121.2 images/sec in spite of the CPU-GPU communication overhead.”

There are additional LMS benchmarks available.

POWER9 with NVLink looks cool. More on AI/HPC tests here.

Microway AC922 servers are priced from $55,000 to $75,000. There are also other options. For example, Microway sells 2*Xeon + 2–4 V100 servers for $24,000 to $75,000 and OpenPOWER servers with 2*POWER8 + 2–4 P100 for $35,000 to $75,000.

More analysis on AC922.

Maybe we’ll see some interesting announcements at the IBM Think 2018 conference during March 19–22.

I wish I had an AC922 available, but DGX and IBM solutions are in the enterprise price range, not for AI enthusiasts/SMB. So, other solutions are needed.

I’d expect the story of Hadoop and commodity hardware to repeat itself here.

Among the most powerful solutions of this type is the reference build called DeepLearning11 by STH.

It’s a single-root design with 10x NVIDIA GeForce GTX 1080 Ti in a 4.5U form factor. It gives you a theoretical peak performance of nearly 100 TFLOPS FP32 and 110 GB of memory. For comparison, 8 Tesla V100 SXM2 have a total of 125 TFLOPS FP32 (attention: that’s not the FP16 figure NVIDIA usually reports) and 128 GB of memory. See the HN discussion on it.

The total cost is about $16,500 (I suppose the current cost will be higher because prices have grown since the summer of 2017). Compare this to DGX-1. Yes, there is no NVLink here, there are maybe no other optimizations, and there are other issues (e.g. it lacks normal FP16 support), but in terms of the performance/price ratio this solution is cool.

If you need such power but cannot afford the DGX-1, the only option right now is to use the cloud. But the cloud is expensive. AWS EC2 p3.16xlarge instances (8*Tesla V100 with NVLink) at the current on-demand price of $24.48 per hour will cost you $17,625 for a 30-day run.

The whole DeepLearning11 is cheaper. Yes, you have to add the electricity and other costs of running your own server. But it’s still cool. The choice is mostly obvious if you need to use GPUs extensively. The cloud is too expensive.
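The cloud-versus-own-server arithmetic can be sketched as follows. The hourly price and the build cost are the figures quoted in this post; the `own_hourly_usd` parameter (electricity, hosting, and so on) and the $0.50 value used for it are purely hypothetical. Keep in mind that the two machines are not equivalent (8 V100 with NVLink versus 10 GTX 1080 Ti), so this compares budgets, not performance.

```python
HOURS_PER_DAY = 24

P3_16XLARGE_USD_PER_HOUR = 24.48   # on-demand price quoted in the post
DEEPLEARNING11_USD = 16_500        # build cost quoted in the post


def cloud_cost(hourly_usd, days):
    """Cost of running a cloud instance 24/7 for the given number of days."""
    return hourly_usd * HOURS_PER_DAY * days


def break_even_days(build_usd, hourly_usd, own_hourly_usd=0.0):
    """Days of 24/7 use after which owning the box is cheaper than renting.

    own_hourly_usd is the running cost of your own server per hour
    (a hypothetical parameter; the post gives no figure for it).
    """
    saving_per_hour = hourly_usd - own_hourly_usd
    if saving_per_hour <= 0:
        raise ValueError("owning never pays off at these rates")
    return build_usd / saving_per_hour / HOURS_PER_DAY


month = cloud_cost(P3_16XLARGE_USD_PER_HOUR, 30)
days = break_even_days(DEEPLEARNING11_USD, P3_16XLARGE_USD_PER_HOUR)
days_with_power = break_even_days(
    DEEPLEARNING11_USD, P3_16XLARGE_USD_PER_HOUR, own_hourly_usd=0.50
)

print(f"30 days of p3.16xlarge: ${month:,.2f}")
print(f"Break-even, no running costs: {days:.1f} days")
print(f"Break-even, $0.50/h running costs (hypothetical): {days_with_power:.1f} days")
```

At full utilization the build pays for itself in roughly a month; the less you actually use the GPUs, the longer the break-even stretches, which is where the cloud starts to make sense.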

There are other solutions on the market as well (e.g. this one 8*1080 Ti server, and you can easily find more).

From the practical point of view, the good thing is that the entry barriers are getting lower.

Right now, if you are targeting FP32 performance, the most price-efficient video card seems to be the GTX 1080 Ti (its price has grown from around $700 to above $1100).

Titan Xp (~$1600) delivers almost the same performance with a bit more memory (12 GB vs 11 GB) and bandwidth, but the difference in price isn’t necessarily worth it. BTW, do not confuse the Titan Xp with the Titan X (Pascal) and the Titan X (Maxwell). These are three different cards with different chips and performance. You can see the differences in my GPU Comparison Table.

Titan V ($3000) is 30% faster (again, right now I’m talking about FP32), but too expensive.
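A rough performance/price check can be sketched like this. It uses the 10,609 GFLOPS peak FP32 figure this post quotes for an ordinary GTX 1080 Ti, the prices mentioned above, and the "30% faster" estimate for the Titan V. The Titan Xp is left out because no exact FP32 figure for it is given here.

```python
GTX_1080_TI_GFLOPS = 10_609                  # peak FP32, as quoted in the post
TITAN_V_GFLOPS = GTX_1080_TI_GFLOPS * 1.30   # "30% faster", per the text

# name -> (peak FP32 GFLOPS, price in USD)
offers = {
    "GTX 1080 Ti @ $700": (GTX_1080_TI_GFLOPS, 700),
    "GTX 1080 Ti @ $1100": (GTX_1080_TI_GFLOPS, 1100),
    "Titan V @ $3000": (TITAN_V_GFLOPS, 3000),
}


def gflops_per_usd(gflops, price_usd):
    """Peak FP32 GFLOPS bought per dollar."""
    return gflops / price_usd


ratios = {name: gflops_per_usd(*spec) for name, spec in offers.items()}

# Best value first.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {ratio:6.2f} GFLOPS per dollar")
```

Even at the inflated price, the 1080 Ti delivers about twice the peak FP32 per dollar of the Titan V.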

Tesla and Quadro cards are extremely expensive, and most researchers and practitioners use gaming cards instead. The only important practical niche for Teslas could be if you are building your own datacenter. NVIDIA recently prohibited the use of GeForce cards in datacenters and does not like server suppliers who use gaming GeForce cards instead of Quadro and Tesla.

If you care not about pure performance but more about performance/price and/or performance/power ratios (which is generally wise), the situation is a bit more complex.

For performance/power (GFLOPS per Watt) the Volta architecture is the best (even without taking Tensor Cores/FP16 into account). Within the Pascal architecture, the more modest cards like the ordinary 1080 (not Ti) and the 1070 Ti look better (but not significantly). And remember that they have less memory and lower bandwidth.

For performance/price comparisons you have to constantly recalculate these ratios, because prices change. Moreover, there are different manufacturers, so the same 1080 Ti is available in dozens of variants. They are not completely the same even in terms of performance. There are cards with an increased clock frequency, and there are cards specially designed for overclocking. This can give an additional boost in performance, but do not expect it to be huge: e.g. the Asus ROG STRIX GTX 1080 Ti OC (overclockers edition) has 11247 FP32 GFLOPS versus 10609 GFLOPS for an ordinary 1080 Ti (so about a 6% increase).
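Where do such peak numbers come from? Peak FP32 is commonly estimated as 2 FLOPs (one fused multiply-add) per CUDA core per clock cycle. The sketch below reproduces the two figures from the text under stated assumptions: the core count (3584) and both clock values are not given in this post; they are public spec values picked because they match the quoted GFLOPS.

```python
def peak_fp32_gflops(cuda_cores, clock_mhz):
    """Peak FP32: 2 FLOPs (one fused multiply-add) per core per cycle."""
    return 2 * cuda_cores * clock_mhz / 1000.0


# Assumptions (not from the post): GTX 1080 Ti core count and clocks.
CORES_1080_TI = 3584
REFERENCE_CLOCK_MHZ = 1480   # reference card clock that yields ~10609 GFLOPS
STRIX_OC_CLOCK_MHZ = 1569    # factory-overclocked clock that yields ~11247 GFLOPS

reference = peak_fp32_gflops(CORES_1080_TI, REFERENCE_CLOCK_MHZ)
overclocked = peak_fp32_gflops(CORES_1080_TI, STRIX_OC_CLOCK_MHZ)
gain_pct = (overclocked / reference - 1) * 100

print(f"Reference 1080 Ti: {reference:.0f} GFLOPS")
print(f"Factory OC 1080 Ti: {overclocked:.0f} GFLOPS")
print(f"Gain: {gain_pct:.1f}%")
```

Since the core count is fixed, the peak scales linearly with the clock, so a factory overclock of a few percent can only ever buy a few percent of peak performance.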

For myself I use the already mentioned GPU Comparison Table, and I invite the community to pick up the baton and expand it (or create a separate service/github repo/whatever) to include different manufacturers, add performance numbers (or at least estimates), and maybe implement regularly updated prices. It could be a useful service for the community.

FP16

It is trickier with FP16 performance. Gaming cards are not an option here, because their FP16 is significantly limited. The only exception seems to be the Titan V. It can potentially achieve 27.6 TFLOPS FP16. In addition, it has Tensor Cores with a theoretically possible peak performance of 110 TFLOPS. If only it were possible to get such a speedup…

Tesla V100 is extremely expensive, like the other Teslas. If Titan V is really unrestricted in FP16, then the difference between Titan and Tesla diminishes significantly. But it still exists (NVLink and so on).

See the performance charts above.

INT8

If you are planning to run inference on a GPU, then the choice is similar to FP32: a good gaming card looks pretty good.

Again, see the performance charts above.

I promised to talk about AMD.

What about AMD GPUs (I mean Radeon)? They seem to be very good (and crypto miners can confirm it), especially keeping in mind their unrestricted FP16 performance (I mean 2x of FP32). The Radeon RX Vega 64 promises to deliver up to 23 TFLOPS FP16, which is very good.

If only the frameworks supported it…

Originally it started from OpenCL.

There are some attempts like OpenCL Caffe, but there are unclear plans for Caffe2, no official support in Tensorflow (though there are some unofficial efforts), no support for PyTorch, and so on.

PyTorch team said:

“We officially are not planning any OpenCL work because:

  • AMD itself seems to be moving towards HIP / GPUOpen which has a CUDA transpiler (and they’ve done some work on transpiling Torch’s backend)
  • Intel is moving it’s speed and optimization value into MKLDNN
  • Generic OpenCL support has strictly worse performance than using CUDA/HIP/MKLDNN where appropriate.”

Radeon Open Compute Platform (ROCm) is an open-source HPC/Hyperscale-class platform for GPU computing. The current version is 1.7.

Heterogeneous-compute Interface for Portability (HIP), is a C++ runtime API and kernel language that allows developers to create portable applications that can run on AMD and other GPU’s. It allows developers to write applications to a common C++ syntax and API. The resulting C++ code can be compiled with AMD’s HCC and Nvidia’s NVCC. HIP code provides the same performance as native CUDA code, plus the benefits of running on AMD platforms.

Because both Cuda and HIP are C++ languages, porting from Cuda to HIP is much easier than porting from Cuda to OpenCL. To further reduce the learning curve when moving from Cuda to HIP, AMD developed the hipify tool to automate your application’s core conversion.
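To give a feel for why the port is mostly mechanical, here is a toy illustration in Python of the kind of renaming involved. This is not the real hipify tool and does not reflect how it is implemented; it only maps a handful of CUDA runtime names to their HIP counterparts by text substitution. The helper name `toy_hipify` and the sample snippet are hypothetical.

```python
import re

# A few CUDA runtime names and their HIP counterparts (a tiny subset).
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}


def toy_hipify(source):
    """Rename known CUDA identifiers to HIP ones (illustration only)."""
    # Longest names first, so cudaMemcpyHostToDevice is not cut at cudaMemcpy.
    names = sorted(CUDA_TO_HIP, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(lambda match: CUDA_TO_HIP[match.group(0)], source)


cuda_snippet = """#include <cuda_runtime.h>
cudaMalloc(&d_x, n * sizeof(float));
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
cudaFree(d_x);
"""

hip_snippet = toy_hipify(cuda_snippet)
print(hip_snippet)
```

A real port also has to deal with kernel launch syntax, library calls (cuBLAS, cuDNN), and anything without a direct counterpart, which is where the manual work remains.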

Compared to OpenCL, HIP offers several benefits. Here is a syntax comparison between Cuda/HIP/OpenCL.

MIOpen is AMD’s Machine Intelligence Library, a GPU-accelerated library for machine learning algorithms that is similar to cuDNN. Here is a porting guide from cuDNN to MIOpen.

https://rocm.github.io/dl.html as of 2018/03/14

ROCm framework support is mostly in progress. Right now only Caffe looks ready (more), and there is active work on Tensorflow (more).

I think the lack of framework support is the main blocker right now. The GPUs are good and interesting, but it’s hard to use them with the majority of frameworks.

There are some other interesting framework solutions like PlaidML by Vertex.AI. It supports AMD R9 Nano, RX 480, and Vega 10. You can run Keras CNNs on top of PlaidML, but the solution has some significant limitations (e.g. not supporting RNNs). Look at a discussion on HN.

Maybe projects like NNVM/TVM will help. A ROCm backend for AMD GPUs is supported in TVM. It seems the solution is suitable for deploying models already trained in different frameworks (thanks to ONNX/CoreML and MXNet/Keras support) to different hardware, but I am not sure you can use it for training right now.

Source

Infinity Fabric (IF) is a coherent connection (cache coherency is maintained across multiple processors) for use on a chip, across a multi-chip module (MCM), and for inter-socket connectivity.

There is a difference between IF and NVLink. While NVLink provides fast data transfer between CPUs/GPUs, IF can also be used inside a chip (CPU or GPU).

https://www.nextplatform.com/2017/07/12/heart-amds-epyc-comeback-infinity-fabric/

Using the Infinity Fabric on Vega 10 (the core of Vega 64/56 cards) is part of AMD’s efforts to develop a solid fabric and then use it across the company. Vega 10 is the first AMD graphics processor built using the Infinity Fabric interconnect. In Vega 10, Infinity Fabric links the graphics core and the other main logic blocks on the chip, including the memory controller, the PCI Express controller, the display engine, and the video acceleration blocks. [source]

It seems that right now, as an end user of a GPU, you cannot benefit from IF. Its usage in Vega 10 is related to the internal architecture of the chip, not to the communication between CPU and GPU. Correct me if I’m wrong. It is unclear to me whether it gives anything in a configuration with Threadripper/EPYC + Radeon.

Here is an interesting remark by Raja Koduri, the [ex-] Senior VP and Chief Architect of AMD’s Radeon Technologies Group (now senior vice president of the Core and Visual Computing Group, general manager of edge computing solutions and chief architect at Intel Corporation): “We haven’t mentioned any multi GPU designs on a single ASIC, like Epyc, but the capability is possible with Infinity Fabric,” he said. And there is speculation that the next AMD GPU architecture, Navi, will use multi-chip modules. The same is said about NVIDIA.

For the socket-to-socket interconnect IF provides 37.9 GB/s per link, which totals 152 GB/s between sockets. That’s comparable to NVLink 2.0, which gives 150 GB/s.

At CES 2018 AMD announced a new generation of Vega targeted at Deep Learning apps, not gamers:

It’s hard to say what exactly the “New DL Ops” will be, or whether the “New High Speed I/O” will be different from Infinity Fabric. It’s even unclear whether we will see these cards on the market in 2018. Waiting for details.

Now the current 2018 GPU roadmap looks like this:

And again, without framework support these GPUs, cool as they are, will be useless.

Talking about AMD we have to mention Intel as well (especially keeping in mind that AMD Radeon Group chief moved to Intel).

Intel also has GPUs. They are called HD Graphics.

Actually, a significant part of consumer chips like the Core i7 is dedicated to it. For example, in the 4-core i7–7700 it occupies nearly half of the chip surface:

i7–7700 annotated die shot (source)

In the i7–7700 there is HD Graphics 630 with a peak performance of up to 883.2 GFLOPS FP16 (in Boost 1,150 MHz mode).

The current top solution of the Gen9.5 GPU microarchitecture, the Iris Plus Graphics 650, has a peak performance of up to 1.7664 TFLOPS FP16, which is by the way significantly higher than the NVIDIA Titan Xp has (0.1686 TFLOPS FP16; remember, NVIDIA limits FP16 performance on gaming cards), and it’s present even in some Core i3 chips. Its FP32 performance (883.2 GFLOPS) is not as cool, but it is comparable to the i7–6850K (we estimated it to be near 690 GFLOPS). So the potential combined (CPU + HD Graphics) performance could be roughly twice as high.

Remember, Intel’s Inference Engine supports Intel HD Graphics as well.

It seems that Intel chips, while being far behind NVIDIA/AMD GPUs on typical DL tasks, are a bit underestimated here. And keeping in mind that there are DL tasks which run far from the peak performance (e.g. RNN training), these opportunities may be interesting.

Even more interesting is the recent Intel + AMD step to incorporate a Vega GPU into new Intel CPUs!

More on it here.

It’s time to finish the post, because it has already become a long read.

Stay in touch, it’s not the end of the story. There could be more interesting alternatives to GPUs soon.

2018/03/14: Published

2018/03/26: Fixed DGX-1 FP64 performance calculation and comparison with supercomputers. Originally FP64 performance was calculated as 1/4 of FP16 performance (1/4 of 1000 TFLOPS = 250 TFLOPS), which is not correct (because that’s Tensor Cores performance and they do not work with FP64). The correct FP64 performance is 62.4 TFLOPS.

2018/04/03: Fixed error in DGX-1 description (in one place it was mistakenly stated that it has 10, not 8 V100s). Added info on DGX-2 and NVSwitch.

2018/06/29: Added link to a CogX talk by Intel AI on the Roofline model.

2018/09/25: Added some notes on Quadro RTX 6000, 8000 (24 and 48 GB memory cards) and Quadro GV100 (32 GB)

2018/09/25: Added info on TESLA T4 and INT4 format.

2018/09/26: Additions regarding new Turing architecture: NVLink support, unrestricted FP16 performance, Tensor cores on board, INT4 support, new performance graphs. See “Sep 2018” comments.