Image classification is the archetypal
deep learning application, as this was the first domain where a deep
neural network (AlexNet) proved to be a watershed, beating
all prior traditional methods. In our work, we use two very recent
models, Inception-v3 and ResNet, which follow a structure
similar to AlexNet’s CNN model, but improve accuracy through
novel algorithmic techniques that enable extremely deep networks.
Language modeling refers to the task of determining the
probability distribution of a sentence or a sequence of words based on the context provided by the
surrounding words. Language models analyze a body of text, often represented as sequences of word embedding vectors, to predict
the likelihood of certain words or phrases. Language models such as BERT and XLNet can greatly improve the accuracy of
downstream NLP tasks such as machine translation and question answering.
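The idea of assigning a probability to a word sequence can be sketched with a toy count-based bigram model. This is not how BERT or XLNet work: it conditions only on the previous word rather than on the surrounding words. The corpus, the function names, and the add-one smoothing are assumptions chosen purely for illustration.

```python
from collections import Counter
import math

def train_bigram(corpus):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def sentence_log_prob(sentence, unigrams, bigrams):
    """log P(sentence) as a sum of log P(word | previous word), add-one smoothed."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_p = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_p += math.log(p)
    return log_p

# Hypothetical three-sentence corpus.
corpus = ["the cat sat", "the cat ran", "a dog sat"]
unigrams, bigrams = train_bigram(corpus)
seen = sentence_log_prob("the cat sat", unigrams, bigrams)
unseen = sentence_log_prob("sat cat the", unigrams, bigrams)
```

A word order that occurs in the corpus (`seen`) receives a higher log-probability than a scrambled one (`unseen`), which is the behavior a language model is expected to show.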
This benchmark focuses on BERT (Bidirectional Encoder Representations from Transformers), which is a multi-layer bidirectional Transformer that uses only the encoder part. There are many newer language models such as RoBERTa and DistilBERT, but since their architectures have many similarities with
the original BERT model, we believe the benchmarking results for BERT are representative of many such Transformer-based language models.
BERT appears in two training settings: pre-training and fine-tuning. The pre-training phase uses an abundance of text from
sources such as Wikipedia and published books (see the dataset link below), and trains for many epochs. The fine-tuning phase locks
most of the weights of the BERT model, and attaches a few output layers for the downstream task, e.g. question answering. Fine-tuning usually converges quickly because most of the parameters are fixed, as seen in the loss curve in the fine-tuning section.
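The effect of locking weights during fine-tuning can be sketched with a plain SGD step that skips frozen parameters. The parameter names, gradient values, and learning rate below are hypothetical and are not taken from the benchmark code.

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """Apply one SGD update to trainable parameters; frozen ones pass through unchanged."""
    return {
        name: (value - lr * grads[name]) if name in trainable else value
        for name, value in params.items()
    }

# Hypothetical scalar "weights": two frozen encoder layers and one new output head.
params = {"encoder.layer0": 0.5, "encoder.layer1": -0.3, "qa_head": 0.0}
grads = {"encoder.layer0": 0.2, "encoder.layer1": 0.1, "qa_head": -1.0}
trainable = {"qa_head"}

updated = sgd_step(params, grads, trainable)
```

Only `qa_head` moves, so the optimizer searches a much smaller parameter space than in pre-training.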
We run the benchmark for both FP32 and mixed-precision training, as mixed-precision training is commonly used with BERT
due to its high computational demand. We also utilize the Tensor Cores on our RTX 2080 Ti device.
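One reason mixed-precision training keeps some values in FP32 is the narrow range of FP16. The sketch below uses Python's `struct` half-precision format to show a small gradient underflowing to zero, and how multiplying by a scale factor first preserves it. Loss scaling is a common companion to mixed-precision training, but whether and how this benchmark applies it is an assumption here; the gradient value and scale factor are hypothetical.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8           # hypothetical small gradient value
loss_scale = 1024.0   # hypothetical scale factor

unscaled = to_fp16(grad)              # too small for FP16, becomes 0.0
scaled = to_fp16(grad * loss_scale)   # large enough to survive in FP16
recovered = scaled / loss_scale       # unscale in higher precision
```

`recovered` is close to the original gradient, while `unscaled` has lost it entirely.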
Given the large model size of BERT and the size of the dataset, pre-training takes many days even on
a high-end multi-GPU workstation such as the DGX system. Therefore, we decided to skip the complete pre-training of BERT,
but still analyze the convergence for 1 epoch. This partial training run suggests that the model is converging properly.
The profiling results are based on 500 iterations, and assume the same compute behavior for each iteration.
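Under the assumption that every iteration behaves like the profiled ones, a short profiling window can be extrapolated to a full epoch. The helper below is a minimal sketch; the function name, the timings, the dataset size, and the batch size are all hypothetical.

```python
import math

def estimate_epoch_seconds(iteration_seconds, dataset_size, batch_size):
    """Extrapolate epoch time from a profiled window, assuming every
    iteration costs the same as the profiled average."""
    mean_iteration = sum(iteration_seconds) / len(iteration_seconds)
    iterations_per_epoch = math.ceil(dataset_size / batch_size)
    return mean_iteration * iterations_per_epoch

# Hypothetical measurements: 500 profiled iterations at 0.25 s each.
profiled = [0.25] * 500
epoch_seconds = estimate_epoch_seconds(profiled, dataset_size=1_000_000, batch_size=32)
```

The estimate is only as good as the uniformity assumption; warm-up iterations or variable sequence lengths would skew it.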
Figure: FP16 Core Utilization
Object detection, as used in face detection, is another popular deep learning application and
can be thought of as an extension of image classification, where an algorithm usually first breaks down
an image into regions of interest and then applies image classification to each region. We chose to
include Faster R-CNN, which achieves state-of-the-art results on the Pascal VOC datasets, and Mask R-CNN,
which instead uses the large-scale COCO dataset.
Mask R-CNN improves on Faster R-CNN by refining the rectangular bounding boxes to pixel-level
resolution. This is done in part by adding a branch to the network which outputs whether each pixel
is part of a given object. We have chosen ResNet-50 as the pre-trained convolutional stack for this model.
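The difference between a rectangular box and a pixel-level mask can be made concrete with a small example. The sketch below derives the tightest bounding box from a binary mask and compares the two areas. The mask is a hypothetical 4x5 grid, and the code does not reproduce any part of the Mask R-CNN network itself.

```python
def mask_to_box(mask):
    """Return (row_min, col_min, row_max, col_max) of the tightest box
    around the nonzero pixels, or None if the mask is empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        return None
    return min(rows), min(cols), max(rows), max(cols)

# Hypothetical per-pixel output: 1 marks pixels that belong to the object.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]

box = mask_to_box(mask)
box_area = (box[2] - box[0] + 1) * (box[3] - box[1] + 1)
mask_area = sum(sum(row) for row in mask)
```

Here the box covers 9 pixels while the object occupies only 5, which is the information a bounding box alone cannot express.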
Deep Speech 2 is an end-to-end
speech recognition model from Baidu Research. It is able to accurately
recognize both English and Mandarin Chinese, two very
distant languages, with a unified model architecture and shows
great potential for deployment in industry. The Deep Speech 2
model contains two convolutional layers, plus seven regular recurrent
layers or Gated Recurrent Units (GRUs), unlike the machine translation RNN
models included in our benchmark suite, which use LSTM layers.
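For readers unfamiliar with GRUs, the sketch below implements a single GRU step on scalar inputs. Unlike an LSTM, a GRU has two gates (update and reset) and no separate cell state. The code follows one common convention for the final interpolation; some references swap the roles of `z` and `1 - z`. The parameter names and values are hypothetical, and the layer sizes of Deep Speech 2 are not reproduced here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, p):
    """One GRU step for scalar input x and scalar hidden state h."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])              # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])              # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])  # candidate state
    return (1.0 - z) * h + z * h_cand                              # blend old and new

# Hypothetical parameters and a short input sequence.
params = {"wz": 0.5, "uz": -0.2, "bz": 0.0,
          "wr": 0.3, "ur": 0.1, "br": 0.0,
          "wh": 1.0, "uh": 0.7, "bh": 0.0}
h = 0.0
states = []
for x in [1.0, -2.0, 0.5, 3.0]:
    h = gru_step(x, h, params)
    states.append(h)
```

Because each new state is a convex combination of the previous state and a `tanh` output, the hidden state stays within [-1, 1] when it starts at zero.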
Generative Adversarial Networks
A generative adversarial
network (GAN) trains two networks: one generator network and
one discriminator network. The generator is trained to generate
data samples that mimic the real samples, and the discriminator
is trained to distinguish whether a data sample is genuine or synthesized.
GANs are used, for example, to synthetically generate
photographs that look at least superficially authentic to human observers.
While GANs are powerful generative models, GAN training
suffers from instability. The WGAN model is a milestone that makes
great progress towards stable training. Recently Gulrajani et al.
proposed an improvement based on the WGAN to enable stable
training on a wide range of GAN architectures. We include this
model in our benchmark suite, since it is one of the leading DNN
algorithms for unsupervised learning.
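The improvement by Gulrajani et al. adds a gradient penalty to the WGAN critic loss, pushing the norm of the critic's gradient with respect to its input towards 1. The sketch below uses a linear critic D(x) = w · x, an assumption made so that the gradient equals w everywhere and no automatic differentiation is needed. In the actual method the gradient is evaluated at points interpolated between real and generated samples. The sample data are hypothetical; the penalty weight of 10 is the value used in that paper.

```python
import math

def critic(w, x):
    """Linear critic D(x) = w . x (a simplification for illustration)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def wgan_gp_critic_loss(w, real, fake, lam=10.0):
    """Critic loss: E[D(fake)] - E[D(real)] + lam * (||grad_x D|| - 1)^2."""
    wasserstein = mean(critic(w, f) for f in fake) - mean(critic(w, r) for r in real)
    grad_norm = math.sqrt(sum(wi * wi for wi in w))  # gradient of w . x w.r.t. x is w
    penalty = lam * (grad_norm - 1.0) ** 2
    return wasserstein + penalty

# Hypothetical samples.
real = [[1.0, 1.0]]
fake = [[0.0, 0.0]]
loss_unit_norm = wgan_gp_critic_loss([0.6, 0.8], real, fake)   # gradient norm 1, no penalty
loss_large_norm = wgan_gp_critic_loss([3.0, 4.0], real, fake)  # gradient norm 5, penalized
```

A critic with a large gradient norm separates the samples more strongly, but the penalty term dominates its loss, which is how the method discourages such critics.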
Deep Reinforcement Learning
Deep neural networks also drive the recent advances in reinforcement learning,
which have contributed to the creation of the first artificial agents
to achieve human-level performance across challenging domains,
such as the game of Go and various classic computer games. We
include Minigo in our benchmark suite, a minimalist Go engine modeled after AlphaGo Zero and built on MuGo.