TBD - Training Benchmark for DNNs

TBD is a new benchmark suite for DNN training that currently covers seven major application domains and nine different state-of-the-art models. The applications in this suite were selected based on extensive conversations with ML developers and users from both industry and academia. For all application domains, we selected recent models capable of delivering state-of-the-art results. We intend to continually expand TBD with new applications and models based on feedback and support from the community.

This is a joint project between the EcoSystem Research Group at the University of Toronto and Project Fiddle at Microsoft Research, Redmond.
We also have collaborators from UBC and the University of Michigan.

Our benchmark suite is now open-sourced on GitHub.



| Application | Model | Number of Layers | Dominant Layer | Implementations | Maintainers |
|---|---|---|---|---|---|
| Image classification | ResNet-50 / Inception-v3 | 50 (152 max) / 42 | CONV | TensorFlow, MXNet, PyTorch | Xin Li |
| Machine translation | Seq2Seq / Transformer | 5 / 12 | LSTM / Attention | TensorFlow, MXNet, PyTorch | Yu Bo Gao |
| Language modeling | BERT | 24 | Attention | PyTorch | Xin Li |
| Object detection | Mask R-CNN / EfficientDet | 101 | CONV | TensorFlow, PyTorch | Yu Bo Gao |
| Speech recognition | Deep Speech 2 | 9 | RNN | TensorFlow, MXNet, PyTorch | Cong Wei |
| Adversarial learning | WGAN | 14+14 | CONV | TensorFlow | Andrew Pelegris |
| Deep reinforcement learning | MiniGo | n/a | CONV | TensorFlow | Cong Wei |

(Note: all of the following results were generated on an NVIDIA RTX 2080 Ti GPU.)

Image Classification

Image classification is the archetypal deep learning application, as this was the first domain where a deep neural network (AlexNet) proved to be a watershed, beating all prior traditional methods. In our work, we use two very recent models, Inception-v3 and ResNet-50, which follow a structure similar to AlexNet's CNN model but improve accuracy through novel algorithmic techniques that enable extremely deep networks.
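
The key enabler of such depth in ResNet is the residual (skip) connection. Below is a minimal PyTorch sketch of a residual block; channel counts and layer sizes are illustrative, not the exact ResNet-50 configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = ReLU(F(x) + x).

    The identity shortcut lets gradients flow directly through very
    deep stacks, which is what makes 50+ layer networks trainable.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection: add the input back

x = torch.randn(1, 64, 56, 56)
y = ResidualBlock(64)(x)  # shape preserved: (1, 64, 56, 56)
```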

[Figures: training curve, compute utilization, FP32 utilization, and throughput for each of the two image classification models]



Machine Translation

Unlike image processing, machine translation involves the analysis of sequential data and typically relies on RNNs that use LSTM cells as their core algorithm. We select NMT and Sockeye, developed by the TensorFlow and Amazon Web Services teams, respectively, as representative RNN-based models in this area. We also include an implementation of the recently introduced Transformer model, which achieves a new state of the art in translation quality using attention layers as an alternative to recurrent layers.
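
To make the contrast concrete, here is a minimal PyTorch sketch of an LSTM-based encoder-decoder in the spirit of these seq2seq models; the Transformer replaces the recurrence below with attention layers. Vocabulary and hidden sizes are illustrative, not the NMT or Sockeye configurations.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal LSTM encoder-decoder for translation (toy sizes)."""
    def __init__(self, src_vocab: int, tgt_vocab: int, hidden: int = 256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden)
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, tgt_vocab)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        # Encode the source sentence into a fixed-size (h, c) state ...
        _, state = self.encoder(self.src_emb(src))
        # ... and condition the decoder on it (teacher forcing with tgt).
        out, _ = self.decoder(self.tgt_emb(tgt), state)
        return self.proj(out)  # per-step logits over the target vocabulary

logits = Seq2Seq(8000, 8000)(torch.randint(0, 8000, (2, 7)),
                             torch.randint(0, 8000, (2, 9)))
print(logits.shape)  # torch.Size([2, 9, 8000])
```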



Language Modeling

Language Modeling refers to the task of determining the probability distribution of a sentence or a sequence of works based on the context provided by the surround words. Language models analyze a body of text, often represented as sequences of word embedding vectors, to predict the likelihood of certain words or phrases. Language models such as BERT and XLNet can greatly improve the accuracy of other downstream NLP tasks such as machine translation and question answering.
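
As a concrete illustration of this probability view, the sketch below scores a sentence with the classic left-to-right chain-rule factorization, log P(w_1..w_n) = sum_i log P(w_i | w_1..w_{i-1}). Note that BERT itself is trained with a masked variant of this objective rather than left-to-right prediction, and the ToyLM model here is a hypothetical stand-in.

```python
import torch
import torch.nn.functional as F

def sentence_log_prob(model, token_ids: torch.Tensor) -> torch.Tensor:
    """Score a sentence with a left-to-right language model via the
    chain rule. `model` maps a (1, t) prefix of token ids to
    (1, t, vocab) next-token logits."""
    total = torch.tensor(0.0)
    for i in range(1, token_ids.size(1)):
        logits = model(token_ids[:, :i])                  # logits for the prefix
        log_probs = F.log_softmax(logits[:, -1], dim=-1)  # next-token distribution
        total = total + log_probs[0, token_ids[0, i]]     # pick the actual next token
    return total

# Toy stand-in for a real language model.
class ToyLM(torch.nn.Module):
    def __init__(self, vocab: int = 10, hidden: int = 16):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, hidden)
        self.rnn = torch.nn.GRU(hidden, hidden, batch_first=True)
        self.out = torch.nn.Linear(hidden, vocab)
    def forward(self, ids):
        h, _ = self.rnn(self.emb(ids))
        return self.out(h)

print(sentence_log_prob(ToyLM(), torch.tensor([[1, 4, 2, 7]])))
```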

This benchmark focuses on BERT (Bidirectional Encoder Representations from Transformers), a multi-layer bidirectional Transformer that keeps only the encoder part of the original architecture. There are many newer language models, such as RoBERTa and DistilBERT, but since their architectures share many similarities with the original BERT model, we believe the benchmarking results for BERT are representative of many such Transformer-based language models.

BERT is trained in two settings: pre-training and fine-tuning. The pre-training phase uses an abundance of text from sources such as Wikipedia and published books (see the dataset link below) and trains for many epochs. The fine-tuning phase freezes most of the weights of the BERT model and attaches a few output layers for the downstream task, e.g. question answering. Fine-tuning usually converges quickly since most of the parameters are fixed, as seen in the loss curve in the fine-tuning section.
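
A minimal sketch of this fine-tuning recipe in PyTorch, assuming an encoder that maps token ids to per-token features (the nn.Embedding below is just a toy stand-in for BERT):

```python
import torch.nn as nn

class FineTuneHead(nn.Module):
    """Freeze the pre-trained encoder and train only a small task head.
    The encoder is assumed to map token ids to (batch, seq, hidden)."""
    def __init__(self, encoder: nn.Module, hidden: int, num_labels: int):
        super().__init__()
        for p in encoder.parameters():
            p.requires_grad = False               # lock the pre-trained weights
        self.encoder = encoder
        self.head = nn.Linear(hidden, num_labels) # only these weights train

    def forward(self, token_ids):
        feats = self.encoder(token_ids)           # (batch, seq, hidden), assumed
        return self.head(feats[:, 0])             # classify from the first position

enc = nn.Embedding(30522, 768)   # toy stand-in for the BERT encoder
clf = FineTuneHead(enc, hidden=768, num_labels=2)
trainable = [n for n, p in clf.named_parameters() if p.requires_grad]
print(trainable)  # ['head.weight', 'head.bias'] -- only the task head trains
```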

We perform the benchmark for both FP32 and mixed-precision training, as mixed-precision training is commonly used for BERT due to its high computational demand. Mixed-precision training also exercises the Tensor Cores on our RTX 2080 Ti device.
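
For reference, here is what one mixed-precision training step looks like with PyTorch's native AMP API; TBD's BERT implementation may use a different mechanism (e.g. NVIDIA Apex), and `model`, `optimizer`, `loss_fn`, `inputs`, and `targets` are assumed to be defined elsewhere.

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # run eligible ops in FP16 on Tensor Cores
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()     # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)            # unscales gradients, then updates the weights
    scaler.update()                   # adapts the scale factor for the next step
```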

Given the large model size of BERT and the size of its dataset, pre-training takes many days even on a high-end multi-GPU workstation such as a DGX system. We therefore skip the complete pre-training of BERT but still analyze convergence over one epoch; the training curve suggests that the model is converging properly. The profiling results are based on 500 iterations and assume the same compute behavior for each iteration.

[Figures (pre-training): throughput, FP16 core utilization]

[Figures (fine-tuning): training curve, training throughput, FP16 core utilization]


Object Detection

Object detection, used in applications such as face detection, is another popular deep learning task and can be thought of as an extension of image classification: an algorithm usually first breaks an image down into regions of interest and then applies image classification to each region. We chose to include Faster R-CNN, which achieves state-of-the-art results on the Pascal VOC datasets, and Mask R-CNN, which instead uses the large-scale COCO dataset.

Mask R-CNN improves on Faster R-CNN by refining the rectangular bounding boxes to pixel-level resolution. This is done in part by adding a branch to the network that outputs whether each pixel is part of a given object. We chose ResNet-50 as the pre-trained convolutional stack for this model.
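
For readers who want to experiment with this architecture, torchvision ships a reference Mask R-CNN with a ResNet-50 FPN backbone; the sketch below is a usage example of that reference model, not necessarily the exact implementation we benchmark.

```python
import torch
import torchvision

# Randomly initialized reference model (91 classes matches COCO's label set).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(num_classes=91)
model.eval()

images = [torch.rand(3, 480, 640)]   # list of CHW images with values in [0, 1]
with torch.no_grad():
    preds = model(images)
# Each prediction carries boxes, labels, scores, and per-pixel masks --
# the extra mask branch described above.
print(preds[0].keys())  # dict_keys(['boxes', 'labels', 'scores', 'masks'])
```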



Speech Recognition

Deep Speech 2 is an end-to-end speech recognition model from Baidu Research. It is able to accurately recognize both English and Mandarin Chinese, two very distant languages, with a unified model architecture, and it shows great potential for deployment in industry. The Deep Speech 2 model contains two convolutional layers plus seven recurrent layers or Gated Recurrent Units (GRUs), unlike the machine translation RNN models in our benchmark suite, which use LSTM layers.
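
The sketch below captures this rough shape in PyTorch: a small convolutional front end over spectrograms feeding a stack of GRU layers and per-frame character logits (as would be consumed by a CTC loss). All sizes are illustrative, not Baidu's exact configuration.

```python
import torch
import torch.nn as nn

class DeepSpeech2Sketch(nn.Module):
    """Conv front end + 7 bidirectional GRU layers + per-frame logits."""
    def __init__(self, n_mels: int = 161, hidden: int = 512, vocab: int = 29):
        super().__init__()
        self.conv = nn.Sequential(   # 2 conv layers over (frequency, time)
            nn.Conv2d(1, 32, kernel_size=(11, 11), stride=(2, 2), padding=(5, 5)),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(11, 11), stride=(2, 1), padding=(5, 5)),
            nn.ReLU(),
        )
        conv_feats = 32 * (((n_mels + 1) // 2 + 1) // 2)  # frequency dim after striding
        self.rnn = nn.GRU(conv_feats, hidden, num_layers=7,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, vocab)  # per-frame character logits (for CTC)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_mels, time)
        x = self.conv(spec)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # -> (batch, time, features)
        x, _ = self.rnn(x)
        return self.fc(x)

logits = DeepSpeech2Sketch()(torch.randn(1, 1, 161, 200))
print(logits.shape)  # (batch, frames, vocab)
```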

[Figures: throughput, compute utilization, FP32 utilization, and FP16 utilization for Deep Speech 2]



Generative Adversarial Networks

A generative adversarial network (GAN) trains two networks: a generator network and a discriminator network. The generator is trained to generate data samples that mimic real samples, and the discriminator is trained to distinguish whether a data sample is genuine or synthesized. GANs are used, for example, to synthetically generate photographs that look at least superficially authentic to human observers. While GANs are powerful generative models, GAN training suffers from instability. The WGAN model is a milestone that makes great progress towards stable training. Recently, Gulrajani et al. proposed an improvement based on WGAN that enables stable training across a wide range of GAN architectures. We include this model in our benchmark suite, since it is one of the leading DNN algorithms for unsupervised learning.
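
To make the training objective concrete, here is a PyTorch sketch of the critic loss with the gradient penalty of Gulrajani et al.: the critic maximizes the score gap between real and generated samples while its gradient norm is pushed toward 1 along real/fake interpolations. (The suite's WGAN implementation is in TensorFlow; this sketch only illustrates the objective, and the toy critic is a placeholder.)

```python
import torch

def critic_loss_wgan_gp(critic, real, fake, gp_weight: float = 10.0) -> torch.Tensor:
    """WGAN critic objective with gradient penalty (minimized form)."""
    # Wasserstein term (negated, since we minimize).
    loss = critic(fake).mean() - critic(real).mean()

    # Gradient penalty on random interpolations between real and fake samples.
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    mixed = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(mixed).sum(), mixed, create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return loss + gp_weight * ((grad_norm - 1) ** 2).mean()

# Toy critic just to exercise the loss; a real critic is a deep CNN.
critic = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 1))
loss = critic_loss_wgan_gp(critic, torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64))
print(loss.item())
```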

[Figures: compute utilization, FP32 utilization, throughput, and memory breakdown for WGAN]



Deep Reinforcement Learning

Deep neural networks also drive recent advances in reinforcement learning, which have contributed to the creation of the first artificial agents to achieve human-level performance in challenging domains such as the game of Go and various classic computer games. We include MiniGo in our benchmark suite: a minimalist Go engine modeled after AlphaGo Zero and built on MuGo.
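
For intuition, AlphaGo-Zero-style engines such as MiniGo train a single network with two heads: a policy head over moves and a value head predicting the game outcome. The sketch below is an illustrative PyTorch rendition with toy sizes, not MiniGo's actual TensorFlow model.

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Shared convolutional trunk feeding a move-probability (policy)
    head and a win-prediction (value) head."""
    def __init__(self, board: int = 19, channels: int = 64, planes: int = 17):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(planes, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.policy = nn.Sequential(
            nn.Conv2d(channels, 2, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * board * board, board * board + 1),  # all moves + pass
        )
        self.value = nn.Sequential(
            nn.Conv2d(channels, 1, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(board * board, 1), nn.Tanh(),  # predicted outcome in [-1, 1]
        )

    def forward(self, x: torch.Tensor):
        h = self.trunk(x)
        return self.policy(h), self.value(h)

p, v = PolicyValueNet()(torch.randn(1, 17, 19, 19))
print(p.shape, v.shape)  # (1, 362) (1, 1)
```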

[Figures: training curve, throughput, compute utilization, and FP32 utilization for MiniGo]