TBD - Training Benchmark for DNNs

TBD is a new benchmark suite for DNN training that currently covers seven major application domains and nine state-of-the-art models. The applications in this suite were selected based on extensive conversations with ML developers and users from both industry and academia. For each application domain we select recent models capable of delivering state-of-the-art results. We intend to continually expand TBD with new applications and models based on feedback and support from the community.

This is a joint project between the EcoSystem Research Group at the University of Toronto and Project Fiddle at Microsoft Research, Redmond.
We also have collaborators from UBC and the University of Michigan.

Our benchmark suite is now open-sourced on GitHub.



| Application | Model | Number of Layers | Dominant Layer | Implementations | Maintainers |
|---|---|---|---|---|---|
| Image classification | ResNet-50 | 50 (152 max) | CONV | TensorFlow, MXNet, CNTK | Hongyu Zhu |
| Image classification | Inception-v3 | 42 | CONV | TensorFlow, MXNet, CNTK | Hongyu Zhu |
| Machine translation | Seq2Seq | 5 | LSTM | TensorFlow, MXNet | Bojian Zhang |
| Machine translation | Transformer | 12 | Attention | TensorFlow | Andrew Pelegris |
| Object detection | Faster R-CNN | 101 | CONV | TensorFlow, MXNet | Hongyu Zhu |
| Speech recognition | Deep Speech 2 | 9 | RNN | MXNet, PyTorch | Kuei-Fang Hsueh, Jiahuang Lin |
| Recommendation system | NCF | 4 | GMF, MLP | PyTorch | Izaak Niksan |
| Adversarial learning | WGAN | 14+14 | CONV | TensorFlow | Andrew Pelegris |
| Deep reinforcement learning | A3C | 4 | CONV | TensorFlow, MXNet | Mohamed Akrout |

(Note that all of the following results were generated on a Quadro P4000 GPU.)

Image Classification

Image classification is the archetypal deep learning application, as this was the first domain in which a deep neural network (AlexNet) proved to be a watershed, beating all prior traditional methods. In our work, we use two recent models, ResNet-50 and Inception-v3, which follow a structure similar to AlexNet's CNN model but improve accuracy through novel algorithmic techniques that enable extremely deep networks.
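What makes such depth trainable in ResNet is the residual ("skip") connection, which lets gradients flow around each small stack of convolutions. Below is a minimal PyTorch sketch of a bottleneck residual block; the layer sizes and the block itself are illustrative assumptions, not the exact configuration used in our benchmark implementations.

```python
# Minimal sketch of a ResNet-style bottleneck residual block (illustrative sizes).
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    def __init__(self, in_channels, mid_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.conv3 = nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut when the shape changes, identity otherwise.
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        # The skip connection is what allows very deep stacks of such blocks to train.
        return self.relu(out + self.shortcut(x))
```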

[Figures: training curve, compute utilization, FP32 utilization, throughput, and memory breakdown for ResNet-50 and Inception-v3.]



Machine Translation

Unlike image processing, machine translation involves the analysis of sequential data and typically relies on RNNs using LSTM cells as its core algorithm. We select NMT and Sockeye, developed by the TensorFlow and Amazon Web Services teams, respectively, as representative RNN-based models in this area. We also include an implementation of the recently introduced Transformer model, which achieves a new state of the art in translation quality using attention layers as an alternative to recurrent layers.
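The core operation the Transformer uses in place of recurrence is scaled dot-product attention: each output position is a weighted sum over all value vectors, with weights derived from query-key similarity. Below is a minimal single-head NumPy sketch (no masking or multi-head projections), intended only to illustrate the computation:

```python
# Minimal single-head scaled dot-product attention, for illustration only.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (sequence_length, d) arrays of queries, keys, and values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # pairwise query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                             # weighted sum of value vectors
```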

[Figures: training curves (in hours and in epochs), compute utilization, FP32 utilization, throughput, and memory breakdown for the machine translation models.]



Object Detection

Object detection applications, such as face detection, are another popular deep learning application and can be thought of as an extension of image classification: an algorithm first breaks an image down into regions of interest and then applies image classification to each region. We chose to include Faster R-CNN, which achieves state-of-the-art results on the Pascal VOC datasets. A training iteration consists of the forward and backward passes of two networks (one for identifying regions and one for classification), weight sharing, and local fine-tuning. The convolution stack in a Faster R-CNN network is usually a standard image classification network; in our work it is a 101-layer ResNet.
MXNet results:
Training quality: 70+% mAP so far
Throughput: 2.29 samples/sec
GPU Compute Utilization: 90.29%
FP32 Utilization: 70.9%

TensorFlow results:
Training quality: 65+% mAP so far
Throughput: 2.32 samples/sec
GPU Compute Utilization: 89.41%
FP32 Utilization: 58.9%
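To make the two-network training iteration concrete, the sketch below runs a single Faster R-CNN training step using torchvision's reference implementation. It is illustrative only: it uses PyTorch rather than the TensorFlow/MXNet implementations benchmarked here, a ResNet-50 backbone rather than ResNet-101, and a dummy image and box.

```python
# One Faster R-CNN training step with torchvision's reference model (illustrative).
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn()  # randomly initialized
model.train()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# One dummy image with one ground-truth box, just to show the shape of the loop.
images = [torch.rand(3, 600, 800)]
targets = [{"boxes": torch.tensor([[100.0, 100.0, 300.0, 400.0]]),
            "labels": torch.tensor([1])}]

# In training mode the model returns a dict of losses covering both networks:
# the region proposal network (RPN) and the per-region classification/regression head.
loss_dict = model(images, targets)
loss = sum(loss_dict.values())

optimizer.zero_grad()
loss.backward()   # a single backward pass fine-tunes the shared backbone and both heads
optimizer.step()
```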


Speech Recognition

Deep Speech 2 is an end-to-end speech recognition model from Baidu Research. It is able to accurately recognize both English and Mandarin Chinese, two very distant languages, with a unified model architecture, and shows great potential for deployment in industry. The Deep Speech 2 model contains two convolutional layers plus seven regular recurrent or Gated Recurrent Unit (GRU) layers, unlike the machine translation RNN models included in our benchmark suite, which use LSTM layers.
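A minimal PyTorch sketch of this layer structure is shown below: a small convolutional front end over the input spectrogram, a stack of bidirectional GRU layers, and a per-timestep character classifier whose output would feed a CTC loss. All layer sizes here are illustrative assumptions, not the benchmark's exact configuration.

```python
# Minimal sketch of the Deep Speech 2 layer structure (illustrative sizes).
import torch
import torch.nn as nn

class DeepSpeech2Sketch(nn.Module):
    def __init__(self, n_features=161, n_hidden=800, n_chars=29, n_rnn_layers=7):
        super().__init__()
        # Two 2-D convolutions over the (frequency, time) spectrogram.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(11, 41), stride=(2, 2), padding=(5, 20)),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(11, 21), stride=(1, 2), padding=(5, 10)),
            nn.ReLU(),
        )
        conv_features = 32 * ((n_features + 1) // 2)  # channels * reduced frequency bins
        # Recurrent stack: GRU layers, unlike the LSTMs used in the translation models.
        self.rnn = nn.GRU(conv_features, n_hidden, num_layers=n_rnn_layers,
                          bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * n_hidden, n_chars)  # per-timestep characters

    def forward(self, spectrogram):
        # spectrogram: (batch, 1, frequency, time)
        x = self.conv(spectrogram)
        batch, channels, freq, time = x.shape
        x = x.permute(0, 3, 1, 2).reshape(batch, time, channels * freq)
        x, _ = self.rnn(x)
        return self.classifier(x)   # (batch, time, characters), fed into a CTC loss
```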

[Figures: compute utilization, FP32 utilization, throughput, and memory breakdown for Deep Speech 2.]



Generative Adversarial Networks

A generative adversarial network (GAN) trains two networks: a generator and a discriminator. The generator is trained to generate data samples that mimic the real samples, and the discriminator is trained to distinguish whether a data sample is genuine or synthesized. GANs are used, for example, to synthetically generate photographs that look at least superficially authentic to human observers. While GANs are powerful generative models, GAN training suffers from instability. WGAN was a milestone in that it made substantial progress towards stable training. Gulrajani et al. recently proposed an improvement to the WGAN that enables stable training of a wide range of GAN architectures. We include this model in our benchmark suite, since it is one of the leading DNN algorithms for unsupervised learning.
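The improvement of Gulrajani et al. replaces WGAN's weight clipping with a gradient penalty on the critic (discriminator). Below is a minimal PyTorch sketch of the resulting critic loss; the `critic` network and the penalty weight are assumptions for illustration, not the benchmark's exact training code.

```python
# Minimal sketch of the WGAN critic loss with gradient penalty (WGAN-GP).
# `critic` is any discriminator network producing one unbounded score per sample.
import torch

def critic_loss(critic, real, fake, gp_weight=10.0):
    # Wasserstein critic objective: the critic is trained to maximize
    # critic(real) - critic(fake), i.e. to minimize this loss term.
    loss = critic(fake).mean() - critic(real).mean()

    # Gradient penalty: evaluate the critic on random interpolations between
    # real and fake samples and push the gradient norm towards 1.
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
    grad_norm = grad.flatten(start_dim=1).norm(2, dim=1)
    penalty = ((grad_norm - 1.0) ** 2).mean()

    return loss + gp_weight * penalty
```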

[Figures: compute utilization, FP32 utilization, throughput, and memory breakdown for WGAN.]



Deep Reinforcement Learning

Deep neural networks also drive recent advances in reinforcement learning, which have contributed to the creation of the first artificial agents to achieve human-level performance across challenging domains such as the game of Go and various classical computer games. We include the A3C algorithm in our benchmark suite, as it has become one of the most popular deep reinforcement learning techniques, surpassing DQN-based training algorithms, and works in both single-machine and distributed settings. A3C relies on asynchronously updated policy and value function networks trained in parallel over several processing threads.
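The sketch below shows the loss each A3C worker would compute locally, expressed in PyTorch: a policy-gradient term weighted by the advantage, a value-regression term, and an entropy bonus. The asynchronous sharing of gradients across worker threads is omitted, and the coefficients are illustrative assumptions.

```python
# Minimal sketch of the per-worker A3C loss (asynchronous weight sharing omitted).
import torch
import torch.nn.functional as F

def a3c_loss(policy_logits, values, actions, returns,
             value_coef=0.5, entropy_coef=0.01):
    # policy_logits: (T, n_actions); values, actions, returns: (T,)
    log_probs = F.log_softmax(policy_logits, dim=-1)
    probs = log_probs.exp()

    advantage = returns - values
    # Policy gradient: raise the log-probability of actions with positive advantage.
    policy_loss = -(log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
                    * advantage.detach()).mean()
    # Value function regression towards the observed returns.
    value_loss = F.mse_loss(values, returns)
    # Entropy bonus keeps the policy exploratory.
    entropy = -(probs * log_probs).sum(dim=-1).mean()

    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```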

[Figures: training curve, compute utilization, FP32 utilization, throughput, and memory breakdown for A3C.]