TensorFlow Performance by GPU and TPU (w/ screen recording)

Everybody knows that the performance of deep learning workloads depends heavily on devices such as GPUs. But how fast are they, really? And does a TPU actually outperform the latest GPUs?
In this post I give some insight into these questions through simple experimentation.

For example, the following demonstration runs the same TensorFlow training task (a ResNet network on the CIFAR-10 dataset) on both a CPU (left side) and an NVIDIA Tesla K80 (right side). Here I used an “n1-highmem-8” VM instance on Google Compute Engine.
As you can easily see, training on the CPU takes far longer. (It’s obvious even without running a detailed benchmark.)

In this post, I simply show the performance impact of a single device using the same example code, to give you an intuition. (These are very simple benchmarks.) I don’t discuss detailed tuning techniques such as memory layouts or parallelism for model optimization, but I hope this gives you a sense of the performance impact of each device.

Here I use CIFAR-10 (50,000 training images, each 32 x 32 pixels with 3 RGB channels) as the dataset and train a ResNet-32 convolutional network (without bottleneck blocks). The learning rate is scheduled as follows: 0.1 (< 40,000 steps), 0.01 (< 60,000 steps), 0.001 (< 80,000 steps), 0.0001 (>= 80,000 steps).
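The schedule above is a simple piecewise-constant function. Here is a minimal sketch in plain Python (in TensorFlow 1.x you would typically pass the same boundaries and values to `tf.train.piecewise_constant` instead):

```python
def learning_rate(step):
    """Piecewise-constant learning rate schedule from this post:
    0.1 below 40k steps, then 0.01, 0.001, and finally 0.0001."""
    if step < 40000:
        return 0.1
    elif step < 60000:
        return 0.01
    elif step < 80000:
        return 0.001
    else:
        return 0.0001
```

For example, `learning_rate(10000)` returns 0.1 and `learning_rate(90000)` returns 0.0001.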

This network is not very deep and the dataset is not as large as ImageNet, but it is enough to understand how the device affects performance.

Performance (Steps per Second)

For the performance comparison, I simply show the progression of {number of training steps}/{second} over the first few thousand steps, using TensorBoard screen captures. (It’s not difficult, and you can easily reproduce the same procedure.)
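If you want to measure steps/second yourself without TensorBoard, a minimal timing harness looks like the following sketch (`train_step` here is a hypothetical placeholder for one training step of your model, not a function from this post’s code):

```python
import time

def measure_steps_per_second(train_step, num_steps=100):
    """Time `num_steps` invocations of `train_step` and return
    the average number of steps executed per second."""
    start = time.perf_counter()
    for _ in range(num_steps):
        train_step()
    elapsed = time.perf_counter() - start
    return num_steps / elapsed
```

Note that the first steps of a TensorFlow session are usually slower (graph construction, data pipeline warm-up), so it is better to discard a warm-up period before timing.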

For example, the previous case (CPU vs. K80) achieves the following performance. As you can see, the NVIDIA Tesla K80 is about 8x – 9x faster than the CPU.

The following is the result for the latest NVIDIA Tesla V100 (Volta architecture).
Currently Google doesn’t offer a V100 instance on Compute Engine, so I used an Azure Virtual Machines NC6s_v3 instance for this benchmark.

As you can see, it is now about 50x faster than the general-purpose CPU device (the previous n1-highmem-8 machine).

To get an intuitive feel for the speed, see the following recording:

Finally, I show the result using a Google Cloud TPU (TPUv2). Note that I specified 1 replica here, although 8 replicas is the expected configuration for Cloud TPUs. (See “Cloud TPU – Troubleshooting and FAQ” for details.)
As you can see, it is about 2x faster than the latest Volta GPU (V100) for a single, non-distributed device.
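For reference, the single-replica setting mentioned above corresponds to the number of shards in the TPU configuration. A sketch using the TensorFlow 1.x contrib API available at the time of writing (this is a configuration fragment, and `my_model_fn` and the TPU address are placeholders for your own setup):

```python
import tensorflow as tf  # TensorFlow 1.x (tf.contrib.tpu API, current as of early 2018)

# my_model_fn and the TPU endpoint are placeholders, not values from this post.
run_config = tf.contrib.tpu.RunConfig(
    master="grpc://my-tpu:8470",
    tpu_config=tf.contrib.tpu.TPUConfig(num_shards=1),  # 1 replica, as used here
)
estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=my_model_fn,
    config=run_config,
    train_batch_size=128,
    use_tpu=True,
)
```

With the default `num_shards=8`, the reported global steps/second would cover all 8 replicas of the TPU.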

In Conclusion

So which is the best choice?

As we saw above, the TPU is the fastest choice for TensorFlow, but it’s important to remember that TPUs come with some caveats:

  • TPUs support only TensorFlow (including Keras on TensorFlow); you cannot run Theano, CNTK, Caffe, or other major frameworks on TPU devices.
  • Currently the programming code depends on the TPU, so you cannot debug your code on other devices.

For instance, it’s well known that Cognitive Toolkit (CNTK) is 2x – 5x faster than TensorFlow for RNNs (including LSTMs). Depending on the actual workload, running CNTK on a V100 might therefore match or even beat TensorFlow on a TPU, because the TPU’s advantage over the latest GPU architecture is not that large (the two are still close).
I don’t go into it here, but you can also use the latest multi-GPU communication technologies in NVIDIA’s software stack for further speed-ups.

Moreover, I showed only training performance here. For inference (real-time or online predictions), you have other choices such as TensorRT, the Tensor Core architecture on the V100 (NVIDIA says it’s faster than a TPU), or Microsoft’s FPGA technologies (Microsoft also says these are faster than a TPU).

In conclusion, the latest P100, V100, and TPU offer remarkable performance compared with older architectures, and they are helpful for AI workloads with deep neural networks. But when choosing between the latest V100 and TPUs, you must carefully consider which is better for your actual usage: the shape of the generated graph, the framework you choose, human workloads in the AI lifecycle, the required number of distributed devices, cost effectiveness, and so on.

Note that currently (April 2018) AWS EC2 p3.2xlarge is $3.06/hour (computing only), Azure VM NCv3 is $3.06/hour (computing only) and TPU is $6.50/hour (device consumption only).
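Price per hour alone doesn’t settle cost effectiveness; what matters is the price per unit of training. A minimal sketch using the hourly prices above (the steps/sec figures in the example are hypothetical placeholders, not measured values from this post):

```python
def training_cost(total_steps, steps_per_sec, price_per_hour):
    """Cost (in the price's currency) to run `total_steps` training
    steps at a given speed and hourly device price."""
    hours = total_steps / steps_per_sec / 3600.0
    return hours * price_per_hour

# Hypothetical example: if a TPU ran 2x the steps/sec of a V100,
# its roughly 2x hourly price ($6.50 vs. $3.06) would nearly cancel out:
#   training_cost(80000, 10.0, 3.06)   # V100 at 10 steps/sec
#   training_cost(80000, 20.0, 6.50)   # TPU at 20 steps/sec
```

So a 2x faster device at a 2x higher price ends up costing about the same per training run; the real deciding factors are the other aspects listed above.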