TBD - Training Benchmark for DNNs

TBD is a new benchmark suite for DNN training that currently covers seven major application domains and nine different state-of-the-art models. The applications in this suite were selected based on extensive conversations with ML developers and users from both industry and academia. For all application domains, we selected recent models capable of delivering state-of-the-art results. We intend to continually expand TBD with new applications and models based on feedback and support from the community.

This is a joint project between the EcoSystem Research Group at the University of Toronto and Project Fiddle at Microsoft Research, Redmond.
We also have collaborators from UBC and the University of Michigan.

Our benchmark suite is now open-sourced on GitHub.



| Application | Model | Number of Layers | Dominant Layer | Implementations | Maintainers |
|---|---|---|---|---|---|
| Image classification | ResNet-50 / Inception-v3 | 50 (152 max) / 42 | CONV | TensorFlow, MXNet, PyTorch | Xin Li |
| Machine translation | Seq2Seq / Transformer | 5 / 12 | LSTM / Attention | TensorFlow, MXNet, PyTorch | Yu Bo Gao |
| Language modeling | BERT | 24 | Attention | PyTorch | Xin Li |
| Object detection | Mask R-CNN / EfficientDet | 101 | CONV | TensorFlow, PyTorch | Yu Bo Gao |
| Speech recognition | Deep Speech 2 | 9 | RNN | TensorFlow, MXNet, PyTorch | Cong Wei |
| Adversarial learning | WGAN | 14+14 | CONV | TensorFlow | Andrew Pelegris |
| Deep reinforcement learning | MiniGo | n/a | CONV | TensorFlow | Cong Wei |

(Note: all of the following results were generated on an NVIDIA RTX 2080 Ti GPU.)

Image Classification

Image classification is the archetypal deep learning application, as this was the first domain where a deep neural network (AlexNet) proved to be a watershed, beating all prior traditional methods. In our work, we use two very recent models, Inception-v3 and ResNet-50, which follow a structure similar to AlexNet's CNN model but improve accuracy through novel algorithmic techniques that enable extremely deep networks.
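
The key enabler of such depth in ResNet is the residual (skip) connection. Below is a minimal PyTorch sketch of a residual block; channel counts and layer sizes are illustrative, not the exact ResNet-50 configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = ReLU(F(x) + x).

    The identity shortcut lets gradients flow directly through very
    deep stacks, which is what makes 50+ layer networks trainable.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection: add the input back

x = torch.randn(1, 64, 56, 56)
y = ResidualBlock(64)(x)  # shape preserved: (1, 64, 56, 56)
```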

[Figures: training curve, compute utilization, FP32 utilization, and throughput for each of the two image classification models]



Machine Translation

Unlike image processing, machine translation involves the analysis of sequential data and typically relies on RNNs that use LSTM cells as their core algorithm. We select NMT and Sockeye, developed by the TensorFlow and Amazon Web Services teams, respectively, as representative RNN-based models in this area. We also include an implementation of the recently introduced Transformer model, which achieves a new state of the art in translation quality using attention layers as an alternative to recurrent layers.
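
To make the contrast concrete, here is a minimal PyTorch sketch of an LSTM-based encoder-decoder in the spirit of these seq2seq models; the Transformer replaces the recurrence below with attention layers. Vocabulary and hidden sizes are illustrative, not the NMT or Sockeye configurations.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal LSTM encoder-decoder for translation (toy sizes)."""
    def __init__(self, src_vocab: int, tgt_vocab: int, hidden: int = 256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden)
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, tgt_vocab)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        # Encode the source sentence into a fixed-size (h, c) state ...
        _, state = self.encoder(self.src_emb(src))
        # ... and condition the decoder on it (teacher forcing with tgt).
        out, _ = self.decoder(self.tgt_emb(tgt), state)
        return self.proj(out)  # per-step logits over the target vocabulary

logits = Seq2Seq(8000, 8000)(torch.randint(0, 8000, (2, 7)),
                             torch.randint(0, 8000, (2, 9)))
print(logits.shape)  # torch.Size([2, 9, 8000])
```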



Language Modeling

Language Modeling refers to the task of determining the probability distribution of a sentence or a sequence of works based on the context provided by the surround words. Language models analyze a body of text, often represented as sequences of word embedding vectors, to predict the likelihood of certain words or phrases. Language models such as BERT and XLNet can greatly improve the accuracy of other downstream NLP tasks such as machine translation and question answering.
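
As a concrete illustration of this probability view, the sketch below scores a sentence with the classic left-to-right chain-rule factorization, log P(w_1..w_n) = sum_i log P(w_i | w_1..w_{i-1}). Note that BERT itself is trained with a masked variant of this objective rather than left-to-right prediction, and the ToyLM model here is a hypothetical stand-in.

```python
import torch
import torch.nn.functional as F

def sentence_log_prob(model, token_ids: torch.Tensor) -> torch.Tensor:
    """Score a sentence with a left-to-right language model via the
    chain rule. `model` maps a (1, t) prefix of token ids to
    (1, t, vocab) next-token logits."""
    total = torch.tensor(0.0)
    for i in range(1, token_ids.size(1)):
        logits = model(token_ids[:, :i])                  # logits for the prefix
        log_probs = F.log_softmax(logits[:, -1], dim=-1)  # next-token distribution
        total = total + log_probs[0, token_ids[0, i]]     # pick the actual next token
    return total

# Toy stand-in for a real language model.
class ToyLM(torch.nn.Module):
    def __init__(self, vocab: int = 10, hidden: int = 16):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, hidden)
        self.rnn = torch.nn.GRU(hidden, hidden, batch_first=True)
        self.out = torch.nn.Linear(hidden, vocab)
    def forward(self, ids):
        h, _ = self.rnn(self.emb(ids))
        return self.out(h)

print(sentence_log_prob(ToyLM(), torch.tensor([[1, 4, 2, 7]])))
```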

This benchmark focuses on BERT (Bidirectional Encoder Representations from Transformers), a multi-layer bidirectional Transformer that keeps only the encoder part of the original architecture. There are many newer language models, such as RoBERTa and DistilBERT, but since their architectures share many similarities with the original BERT model, we believe the benchmarking results for BERT are representative of many such Transformer-based language models.

BERT is trained in two settings: pre-training and fine-tuning. The pre-training phase uses an abundance of text from sources such as Wikipedia and published books (see the dataset link below) and trains for many epochs. The fine-tuning phase freezes most of the weights of the BERT model and attaches a few output layers for the downstream task, e.g. question answering. Fine-tuning usually converges quickly since most of the parameters are fixed, as seen in the loss curve in the fine-tuning section.
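
A minimal sketch of this fine-tuning recipe in PyTorch, assuming an encoder that maps token ids to per-token features (the nn.Embedding below is just a toy stand-in for BERT):

```python
import torch.nn as nn

class FineTuneHead(nn.Module):
    """Freeze the pre-trained encoder and train only a small task head.
    The encoder is assumed to map token ids to (batch, seq, hidden)."""
    def __init__(self, encoder: nn.Module, hidden: int, num_labels: int):
        super().__init__()
        for p in encoder.parameters():
            p.requires_grad = False               # lock the pre-trained weights
        self.encoder = encoder
        self.head = nn.Linear(hidden, num_labels) # only these weights train

    def forward(self, token_ids):
        feats = self.encoder(token_ids)           # (batch, seq, hidden), assumed
        return self.head(feats[:, 0])             # classify from the first position

enc = nn.Embedding(30522, 768)   # toy stand-in for the BERT encoder
clf = FineTuneHead(enc, hidden=768, num_labels=2)
trainable = [n for n, p in clf.named_parameters() if p.requires_grad]
print(trainable)  # ['head.weight', 'head.bias'] -- only the task head trains
```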

We perform the benchmark for both FP32 and mixed-precision training, as mixed-precision training is commonly used for BERT due to its high computational demand. Mixed-precision training also exercises the Tensor Cores on our RTX 2080 Ti device.
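
For reference, here is what one mixed-precision training step looks like with PyTorch's native AMP API; TBD's BERT implementation may use a different mechanism (e.g. NVIDIA Apex), and `model`, `optimizer`, `loss_fn`, `inputs`, and `targets` are assumed to be defined elsewhere.

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # run eligible ops in FP16 on Tensor Cores
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()     # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)            # unscales gradients, then updates the weights
    scaler.update()                   # adapts the scale factor for the next step
```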

Given the large model size of BERT and the size of its dataset, pre-training takes many days even on a high-end multi-GPU workstation such as a DGX system. We therefore skip the complete pre-training of BERT but still analyze convergence over one epoch; the training curve suggests that the model is converging properly. The profiling results are based on 500 iterations and assume the same compute behavior for each iteration.

[Figures (pre-training): throughput, FP16 core utilization]

[Figures (fine-tuning): training curve, training throughput, FP16 core utilization]


Object Detection

Object detection, used in applications such as face detection, is another popular deep learning task and can be thought of as an extension of image classification: an algorithm usually first breaks an image down into regions of interest and then applies image classification to each region. We chose to include Faster R-CNN, which achieves state-of-the-art results on the Pascal VOC datasets, and Mask R-CNN, which instead uses the large-scale COCO dataset.

Mask R-CNN improves on Faster R-CNN by refining the rectangular bounding boxes to pixel-level resolution. This is done in part by adding a branch to the network that outputs whether each pixel is part of a given object. We chose ResNet-50 as the pre-trained convolutional stack for this model.
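
For readers who want to experiment with this architecture, torchvision ships a reference Mask R-CNN with a ResNet-50 FPN backbone; the sketch below is a usage example of that reference model, not necessarily the exact implementation we benchmark.

```python
import torch
import torchvision

# Randomly initialized reference model (91 classes matches COCO's label set).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(num_classes=91)
model.eval()

images = [torch.rand(3, 480, 640)]   # list of CHW images with values in [0, 1]
with torch.no_grad():
    preds = model(images)
# Each prediction carries boxes, labels, scores, and per-pixel masks --
# the extra mask branch described above.
print(preds[0].keys())  # dict_keys(['boxes', 'labels', 'scores', 'masks'])
```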



Speech Recognition

Deep Speech 2 is an end-to-end speech recognition model from Baidu Research. It is able to accurately recognize both English and Mandarin Chinese, two very distant languages, with a unified model architecture, and it shows great potential for deployment in industry. The Deep Speech 2 model contains two convolutional layers plus seven recurrent layers or Gated Recurrent Units (GRUs), unlike the machine translation RNN models in our benchmark suite, which use LSTM layers.
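
The sketch below captures this rough shape in PyTorch: a small convolutional front end over spectrograms feeding a stack of GRU layers and per-frame character logits (as would be consumed by a CTC loss). All sizes are illustrative, not Baidu's exact configuration.

```python
import torch
import torch.nn as nn

class DeepSpeech2Sketch(nn.Module):
    """Conv front end + 7 bidirectional GRU layers + per-frame logits."""
    def __init__(self, n_mels: int = 161, hidden: int = 512, vocab: int = 29):
        super().__init__()
        self.conv = nn.Sequential(   # 2 conv layers over (frequency, time)
            nn.Conv2d(1, 32, kernel_size=(11, 11), stride=(2, 2), padding=(5, 5)),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(11, 11), stride=(2, 1), padding=(5, 5)),
            nn.ReLU(),
        )
        conv_feats = 32 * (((n_mels + 1) // 2 + 1) // 2)  # frequency dim after striding
        self.rnn = nn.GRU(conv_feats, hidden, num_layers=7,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, vocab)  # per-frame character logits (for CTC)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_mels, time)
        x = self.conv(spec)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # -> (batch, time, features)
        x, _ = self.rnn(x)
        return self.fc(x)

logits = DeepSpeech2Sketch()(torch.randn(1, 1, 161, 200))
print(logits.shape)  # (batch, frames, vocab)
```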

[Figures: throughput, compute utilization, FP32 utilization, and FP16 utilization for Deep Speech 2]



Generative Adversarial Networks

A generative adversarial network (GAN) trains two networks: a generator network and a discriminator network. The generator is trained to generate data samples that mimic real samples, and the discriminator is trained to distinguish whether a data sample is genuine or synthesized. GANs are used, for example, to synthetically generate photographs that look at least superficially authentic to human observers. While GANs are powerful generative models, GAN training suffers from instability. The WGAN model is a milestone that makes great progress towards stable training. Recently, Gulrajani et al. proposed an improvement based on WGAN that enables stable training across a wide range of GAN architectures. We include this model in our benchmark suite, since it is one of the leading DNN algorithms for unsupervised learning.
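
To make the training objective concrete, here is a PyTorch sketch of the critic loss with the gradient penalty of Gulrajani et al.: the critic maximizes the score gap between real and generated samples while its gradient norm is pushed toward 1 along real/fake interpolations. (The suite's WGAN implementation is in TensorFlow; this sketch only illustrates the objective, and the toy critic is a placeholder.)

```python
import torch

def critic_loss_wgan_gp(critic, real, fake, gp_weight: float = 10.0) -> torch.Tensor:
    """WGAN critic objective with gradient penalty (minimized form)."""
    # Wasserstein term (negated, since we minimize).
    loss = critic(fake).mean() - critic(real).mean()

    # Gradient penalty on random interpolations between real and fake samples.
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    mixed = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(mixed).sum(), mixed, create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return loss + gp_weight * ((grad_norm - 1) ** 2).mean()

# Toy critic just to exercise the loss; a real critic is a deep CNN.
critic = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 1))
loss = critic_loss_wgan_gp(critic, torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64))
print(loss.item())
```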

[Figures: compute utilization, FP32 utilization, throughput, and memory breakdown for WGAN]



Deep Reinforcement Learning

Deep neural networks also drive recent advances in reinforcement learning, which have contributed to the creation of the first artificial agents to achieve human-level performance in challenging domains such as the game of Go and various classic computer games. We include MiniGo in our benchmark suite: a minimalist Go engine modeled after AlphaGo Zero and built on MuGo.
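
For intuition, AlphaGo-Zero-style engines such as MiniGo train a single network with two heads: a policy head over moves and a value head predicting the game outcome. The sketch below is an illustrative PyTorch rendition with toy sizes, not MiniGo's actual TensorFlow model.

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Shared convolutional trunk feeding a move-probability (policy)
    head and a win-prediction (value) head."""
    def __init__(self, board: int = 19, channels: int = 64, planes: int = 17):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(planes, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.policy = nn.Sequential(
            nn.Conv2d(channels, 2, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * board * board, board * board + 1),  # all moves + pass
        )
        self.value = nn.Sequential(
            nn.Conv2d(channels, 1, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(board * board, 1), nn.Tanh(),  # predicted outcome in [-1, 1]
        )

    def forward(self, x: torch.Tensor):
        h = self.trunk(x)
        return self.policy(h), self.value(h)

p, v = PolicyValueNet()(torch.randn(1, 17, 19, 19))
print(p.shape, v.shape)  # (1, 362) (1, 1)
```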

[Figures: training curve, throughput, compute utilization, and FP32 utilization for MiniGo]