Datasets


| Dataset | Number of Samples | Size | Special |
| --- | --- | --- | --- |
| ImageNet1K | 1.2M | 3x256x256 per image | N/A |
| IWSLT15 | 133k | 20-30 words per sentence on avg. | Vocabulary size of 17,188 |
| Pascal VOC 2007 | 5,011 | Around 500x300 per image | 12,608 annotated objects |
| COCO 2014 | 164,062 | Around 640x400 per image | 886k segmented object instances |
| LibriSpeech | 280k | 1,000 hours in total | N/A |
| Down-sampled ImageNet | 1.2M | 3x64x64 per image | N/A |
| Gym | N/A | 210x160x3 per image (Pong) | Game dependent |
| MovieLens 20M | 20,000,263 | Rating between 1 and 5 | Rounded to intervals of 0.5 |

ImageNet1K

This classic dataset features a collection of 1.2 million labeled images spanning one thousand object categories and serves as the training data for the ImageNet competition. The main task in the competition is to classify the object inside an image. The combined size of all the raw images is around 133 GB. Training models on this dataset can be very time-consuming; for example, training a ResNet-50 model for 90 epochs on an NVIDIA M40 GPU can take 2 weeks.
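
As a rough illustration of how this data is typically consumed, here is a minimal PyTorch loading sketch. It assumes torchvision is installed and that the official ImageNet archives have already been downloaded manually (torchvision cannot fetch them automatically); the root path is a placeholder.

```python
import torchvision.transforms as T
from torchvision.datasets import ImageNet
from torch.utils.data import DataLoader

# Standard 224x224 training crop used by most ImageNet classifiers (e.g. ResNet-50).
transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# "/data/imagenet" is a placeholder; the official archives must already be there.
train_set = ImageNet(root="/data/imagenet", split="train", transform=transform)
train_loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=8)

images, labels = next(iter(train_loader))
print(images.shape, labels.shape)  # e.g. torch.Size([256, 3, 224, 224]) torch.Size([256])
```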


International Workshop on Spoken Language Translation (IWSLT)

This is a machine translation dataset focused on the automatic transcription and translation of TED and TEDx talks, i.e., public speeches covering many different topics. Compared with the WMT dataset mentioned below, this dataset is relatively small (the corpus has 130K sentence pairs), so models should be able to reach decent BLEU scores quickly (within several hours).
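
In practice the corpus is distributed as parallel plain-text files, one sentence per line per language. A minimal reading sketch follows; the file names are hypothetical, and an English-Vietnamese pair is assumed purely for illustration.

```python
# Read a parallel corpus as (source, target) token lists, one sentence per line.
# "train.en" / "train.vi" are hypothetical file names.
def load_parallel(src_path, tgt_path):
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip(src, tgt):
            yield src_line.strip().split(), tgt_line.strip().split()

pairs = list(load_parallel("train.en", "train.vi"))
print(len(pairs), pairs[0])
```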

Dataset for BERT

For Pre-training

The two pre-training datasets consist of raw text from English Wikipedia and from various published books. The text is cleaned (HTML tags and non-text metadata are removed) before being fed into the model.
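
A very rough sketch of that cleaning step is shown below; the regular expressions here are illustrative only, not the exact pipeline used for BERT.

```python
import re

def clean_text(raw_html):
    """Rough illustrative cleaning: strip HTML tags and collapse whitespace.
    The real BERT pipeline also performs sentence splitting and document grouping."""
    text = re.sub(r"<[^>]+>", " ", raw_html)   # drop HTML tags
    text = re.sub(r"\s+", " ", text).strip()   # normalize whitespace
    return text

print(clean_text("<p>BERT is pre-trained on <b>plain</b> text.</p>"))
# -> "BERT is pre-trained on plain text."
```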

For Fine-tuning

As quoted in the original documentation: "Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable."
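
The released JSON files follow a simple nested layout (articles, then paragraphs, then question-answer pairs). A minimal parsing sketch, assuming a local copy of a v1.1-style file (the path is a placeholder):

```python
import json

# Walk the nested SQuAD JSON layout; "dev-v1.1.json" stands in for a local copy.
with open("dev-v1.1.json", encoding="utf-8") as f:
    squad = json.load(f)

for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            question = qa["question"]
            # Each answer is a span of the passage: its text plus a character offset.
            for answer in qa["answers"]:
                start = answer["answer_start"]
                assert context[start:start + len(answer["text"])] == answer["text"]
```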


Workshop on Statistical Machine Translation (WMT)

This is a machine translation dataset compiled from various sources, including news commentaries and parliamentary proceedings. The corpus contains around 4M sentence pairs, and fully training a good model takes at least one day.


PASCAL Visual Object Classes (VOC)

This dataset, produced by a group at Oxford University, includes image data for both segmentation and object detection tasks. Given an input image, the segmentation task is essentially to determine, for each pixel, which object (or the background) it belongs to, while the object detection task is to draw a bounding box around each object in the image and classify it. VOC07 has 20 object classes. We use the object detection task in TBD.
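
A minimal loading sketch via torchvision's built-in VOC wrapper (the root directory is a placeholder; `download=True` fetches the 2007 archives on first use, and in recent torchvision versions the parsed annotation exposes a list of objects):

```python
from torchvision.datasets import VOCDetection
import torchvision.transforms as T

# "/data/voc" is a placeholder root directory.
voc = VOCDetection(root="/data/voc", year="2007", image_set="trainval",
                   download=True, transform=T.ToTensor())

image, target = voc[0]
# Annotations are parsed from the VOC XML files: each object entry carries a
# class name and a bounding box (xmin, ymin, xmax, ymax).
for obj in target["annotation"]["object"]:
    print(obj["name"], obj["bndbox"])
print(image.shape)
```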


COCO

The COCO dataset, which stands for Common Objects in Context, consists of everyday scenes ranging from busy city streets to animals on a hillside. The 2014 version, used by TBD, has labeled and segmented images covering 80 object categories. It contains 82,783 training, 40,504 validation, and 40,775 test images. The 2014 training and validation sets alone include nearly 270k segmented people and 886k segmented object instances in total.
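
A minimal loading sketch using torchvision's COCO wrapper; the paths below are placeholders, and the wrapper additionally requires the pycocotools package.

```python
from torchvision.datasets import CocoDetection
import torchvision.transforms as T

# Placeholder paths to the images and the instance annotations.
coco = CocoDetection(root="/data/coco/train2014",
                     annFile="/data/coco/annotations/instances_train2014.json",
                     transform=T.ToTensor())

image, annotations = coco[0]
# Each annotation dict carries a category id, a bounding box, and a segmentation mask.
print(image.shape, len(annotations), annotations[0]["category_id"], annotations[0]["bbox"])
```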


LibriSpeech ASR

LibriSpeech is a speech recognition dataset derived from audiobook recordings, containing approximately one thousand hours of 16 kHz read English speech. The dataset contains about 280 thousand audio files, each labeled with the corresponding text, and its training data is divided into three parts: a 100-hour set, a 360-hour set, and a 500-hour set. The total size of the decompressed training data is approximately 167 GB.
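
A minimal loading sketch using torchaudio's built-in LibriSpeech wrapper; the root path is a placeholder, and `"train-clean-100"` selects the 100-hour subset.

```python
import torchaudio

# "/data/librispeech" is a placeholder root directory.
dataset = torchaudio.datasets.LIBRISPEECH(root="/data/librispeech",
                                          url="train-clean-100", download=True)

# Each item pairs a 16 kHz waveform with its text transcript and speaker metadata.
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]
print(waveform.shape, sample_rate, transcript)
```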


OpenAI Gym

OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing various games and simulations, and it includes a diverse suite of environments, ranging from easy to difficult: classic control tasks, algorithmic tasks, Atari games, and 2D and 3D robot simulations.
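
A minimal interaction sketch with the Atari Pong environment, using the classic (pre-0.26) Gym API; newer Gym versions return `(obs, info)` from `reset()` and a five-element tuple from `step()`.

```python
import gym

env = gym.make("Pong-v0")
obs = env.reset()
done, total_reward = False, 0.0

while not done:
    action = env.action_space.sample()          # random policy, just to exercise the API
    obs, reward, done, info = env.step(action)  # obs is a 210x160x3 RGB frame
    total_reward += reward

env.close()
print("episode reward:", total_reward)
```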


MovieLens 20M

The ml-20m dataset used for the NCF model consists of 5-star ratings from MovieLens, an online service that recommends movies for its users to watch. The raw dataset has 20,000,263 ratings across 27,278 movies, created by 138,493 users between January 09, 1995 and March 31, 2015. Each user represented in the dataset has rated at least 20 movies.
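
A minimal inspection sketch with pandas; the path below is a placeholder pointing at an extracted copy of the ml-20m archive.

```python
import pandas as pd

# ml-20m ships a ratings.csv with columns userId, movieId, rating, timestamp.
ratings = pd.read_csv("/data/ml-20m/ratings.csv")

print(len(ratings))                                # 20,000,263 rows
print(sorted(ratings["rating"].unique()))          # 5-star scale in 0.5-star steps
print(ratings.groupby("userId").size().min())      # every user has rated at least 20 movies
```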