Performance Impressions of GPUs and TPUs (with screen recordings)

Everybody knows that the performance of deep learning workloads depends heavily on the underlying device, such as a GPU. But how much faster is it, really? And does a TPU truly outperform GPUs?
In this post I explore these questions through a simple experiment.

For example, the following is a demonstration of running the same TensorFlow training task (a ResNet network on the CIFAR-10 dataset) on both a CPU (left side) and an NVIDIA Tesla K80 (right side). Here I used an "n1-highmem-8" VM instance on Google Compute Engine.
As you can easily see, training on the CPU clearly takes far longer. (It's obvious even without running a detailed benchmark.)

In this post, I simply show the performance impact of a single device, using the same example code throughout, to give you an intuition. (These are very simple benchmarks.) I don't discuss detailed tuning techniques such as memory layouts or parallelism for model optimization, but I hope this gives you a sense of how much the device matters.

Here I use CIFAR-10 (50,000 training images, each 32 x 32 pixels with 3 RGB channels) as the dataset and train a ResNet-32 convolutional network (without bottleneck blocks) with a scheduled learning rate: 0.1 (steps < 40,000), 0.01 (< 60,000), 0.001 (< 80,000), and 0.0001 (>= 80,000).
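The schedule above can be sketched as a plain Python function (a minimal sketch of the piecewise schedule described here, not tied to any particular TensorFlow API; in TF 1.x you would typically express the same thing with `tf.train.piecewise_constant`):

```python
def learning_rate(step):
    """Piecewise-constant learning rate for ResNet-32 on CIFAR-10,
    following the step boundaries given in the text."""
    if step < 40000:
        return 0.1
    elif step < 60000:
        return 0.01
    elif step < 80000:
        return 0.001
    else:
        return 0.0001
```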

This network is not very deep and the dataset is not as large as ImageNet, but it is enough to see how the device affects performance.

Performance Comparison (Steps per Second)

For the performance comparison, I simply show the transition of {number of training steps}/{second} over the first few thousand steps, captured from TensorBoard. (It's not difficult, and you can easily reproduce the same procedure.)
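If you prefer to compute the same metric outside TensorBoard, steps per second is just the number of completed steps divided by elapsed wall-clock time. A minimal sketch (the function name is my own, not part of any framework):

```python
def steps_per_second(timestamps):
    """Average training steps per second, given an ordered list of
    per-step wall-clock timestamps in seconds (one entry per step)."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    elapsed = timestamps[-1] - timestamps[0]
    # N timestamps mark N-1 completed step intervals.
    return (len(timestamps) - 1) / elapsed
```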

For example, the previous case (CPU vs. K80) achieves the following performance. As you can see, the NVIDIA Tesla K80 device is about 8x – 9x faster than the CPU.

The following is the result for the latest NVIDIA Tesla V100 (Volta architecture).
Google doesn't currently offer V100 instances on Compute Engine, so I used an Azure NC6s_v3 virtual machine for this benchmark.

As you can see, it is now about 50x faster than the general-purpose CPU device (the previous n1-highmem-8 machine).

Tesla V100 (Volta architecture)

To give you a feel for the speed:

Finally, I show the results using Google Cloud TPU (TPU v2 and TPU v3). Note that I specified 1 replica here, but 8 replicas is the expected configuration for Cloud TPUs. (One Cloud TPU device consists of 4 chips, and each chip contains 2 cores, for 4 x 2 = 8 cores in total. See "Cloud TPU – Troubleshooting and FAQ" for details. You can also use TPU Pods for optimized large-scale device communication.)
As you can see, a single TPU v2 device (not distributed) is about 2x faster than the latest Volta GPU (V100), and TPU v3 is faster still.

TPU v2 (1 replica)

TPU v3 (1 replica)

In Conclusion

So which one is the best choice?

As we saw above, the TPU is the fastest choice for training with TensorFlow.

But it's important to remember that TPUs also have some caveats, such as:

  • TPUs support only TensorFlow (including Keras on TensorFlow). You cannot run Theano, CNTK, Caffe, or other major frameworks on TPU devices.
  • Currently the code is TPU-specific, so you cannot debug it on other devices.

For instance, it's well known that Cognitive Toolkit (CNTK) is 2x – 5x faster than TensorFlow for RNNs (including LSTMs), so CNTK on a V100 might sometimes match or exceed the speed of TensorFlow on a TPU. PyTorch is also a popular framework, but TPUs don't fully support it yet. (An alpha release of PyTorch on TPU has just started; see Google's blog post.) Alternatively, you can scale out linearly toward your goal using Horovod together with TensorFlow on the devices you are already familiar with.

Moreover, remember that these results cover only training workloads. TPUs seem focused on training rather than inference (though the documentation says they can be used for inference too). For real-time or online predictions (a hot topic nowadays), you have other choices, such as TensorRT with the Tensor Core architecture on the V100 (NVIDIA says it's faster than a TPU) or Microsoft's FPGA technologies (Microsoft also says these are faster than a TPU), and so forth.

In conclusion, the latest P100, V100, and TPU all offer remarkable performance compared with older architectures and are certainly helpful for workloads with deep neural networks, but you must carefully consider which of these devices is better for you, depending on various aspects of real usage, such as: the shape of the generated graph, which framework you choose (and the skills and learning curve of your team), human workloads in the AI lifecycle, the required number of devices for distribution, cost effectiveness, and so on.

Currently (April 2018), an AWS EC2 p3.2xlarge is $3.06/hour, an Azure NCv3 VM is $3.06/hour, and a TPU is $6.50/hour (v3 is $8.00/hour) plus the VM instance cost.
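Note that the hourly price alone doesn't settle cost effectiveness; what matters is the cost per trained step, which also depends on throughput. A minimal sketch of the arithmetic (the throughput figures below are hypothetical placeholders, not measured values from these benchmarks):

```python
def cost_per_million_steps(price_per_hour, steps_per_sec):
    """Dollars to run one million training steps at the given throughput."""
    steps_per_hour = steps_per_sec * 3600
    return price_per_hour * (1_000_000 / steps_per_hour)

# Hypothetical comparison: a $3.06/hour GPU instance at 10 steps/sec
# vs. a $6.50/hour TPU at 25 steps/sec -- the pricier device can
# still be cheaper per step if its throughput is high enough.
gpu_cost = cost_per_million_steps(3.06, 10)
tpu_cost = cost_per_million_steps(6.50, 25)
```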


Update History :

02/06/2019  Added for TPU v3