| Dataset | Number of Samples | Size | Special |
| ImageNet1K | 1.2M | 3x256x256 per image | N/A |
| IWSLT15 | 133k | 20-30 words per sentence on avg. | vocabulary size of 17188 |
| Pascal VOC 2007 | 5011 | around 500x300 per image | 12608 annotated objects |
| LibriSpeech | 280k | 1000 hours in total | N/A |
| Downsampled ImageNet | 1.2M | 3x64x64 per image | N/A |
| Gym | N/A | 210x160x3 per image (Pong) | Game dependent |


ImageNet1K is a classical dataset of 1.2 million labeled images spanning one thousand object categories, used as training data in the ImageNet competition. The main task in the competition is to classify the object inside an image. The raw images total around 133 GB. Training models on this dataset can be very time-consuming. For example, training a ResNet-50 model on this dataset for 90 epochs on an NVIDIA M40 GPU can take 2 weeks.
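To put these figures in perspective, the following back-of-the-envelope sketch derives the average raw image size, the decoded size of one 3x256x256 image, and the implied training throughput. It assumes 1 GB = 10^9 bytes, one byte per channel value, and exactly 14 days of training; those assumptions and all variable names are illustrative rather than taken from the benchmark.

```python
# Figures quoted in the text above.
NUM_IMAGES = 1_200_000
RAW_DATASET_BYTES = 133e9   # assumption: 1 GB = 10^9 bytes
EPOCHS = 90
TRAINING_DAYS = 14          # "2 weeks", taken as exactly 14 days

# Average size of one raw (compressed) image on disk.
avg_raw_kb = RAW_DATASET_BYTES / NUM_IMAGES / 1e3      # about 111 KB

# Size of one decoded 3x256x256 image, assuming one byte per value.
decoded_bytes = 3 * 256 * 256                          # 196608 bytes

# Implied time per epoch and images processed per second.
seconds_per_epoch = TRAINING_DAYS * 24 * 3600 / EPOCHS # 13440 s, about 3.7 hours
images_per_second = NUM_IMAGES / seconds_per_epoch     # about 89 images/s

print(f"average raw image: {avg_raw_kb:.1f} KB")
print(f"decoded image: {decoded_bytes} bytes")
print(f"per epoch: {seconds_per_epoch / 3600:.2f} h, {images_per_second:.1f} images/s")
```

Under these assumptions, a single M40 processes on the order of 90 images per second, which is why a full 90-epoch run takes weeks.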

International Workshop on Spoken Language Translation (IWSLT)

This is a machine translation dataset focused on the automatic transcription and translation of TED and TEDx talks, i.e., public speeches covering many different topics. Compared with the WMT dataset, described below, this dataset is relatively small (the corpus has about 130K sentences), so models should be able to reach decent BLEU scores quickly (within several hours).

Workshop on Statistical Machine Translation (WMT)

This is a machine translation dataset composed of various sources, including news commentaries and parliamentary proceedings. The corpus has around 4M sentences. Fully training a good model takes at least one day.

PASCAL Visual Object Classes (VOC)

This dataset, produced by a group at Oxford University, includes image data for both segmentation and object detection tasks. Given an input image, the segmentation task is to determine, for each pixel, which object (or background) it belongs to; the object detection task is to draw a bounding box around each object in the image and classify each object. VOC07 has 20 classes. In TBD, we use the object detection task.
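The difference between the two tasks is easiest to see in the shape of their outputs. The sketch below contrasts them on a made-up 4x6 image: detection yields one labeled box per object, while segmentation yields one label per pixel. The `Detection` class, the toy coordinates, and the choice of 0 for background are illustrative assumptions, not the dataset's own file format.

```python
from dataclasses import dataclass
from typing import List

# The 20 object classes of PASCAL VOC 2007.
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle",
    "bus", "car", "cat", "chair", "cow",
    "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]

@dataclass
class Detection:
    """Hypothetical record: a class label plus a bounding box in pixels."""
    class_name: str
    xmin: int
    ymin: int
    xmax: int
    ymax: int

# Object detection output: one entry per object in the image.
detections: List[Detection] = [
    Detection("dog", xmin=1, ymin=1, xmax=3, ymax=2),
    Detection("person", xmin=4, ymin=0, xmax=5, ymax=3),
]

# Segmentation output: one label per pixel.
# Assumed encoding: 0 is background, k is VOC_CLASSES[k - 1].
DOG = VOC_CLASSES.index("dog") + 1
PERSON = VOC_CLASSES.index("person") + 1
segmentation_mask = [
    [0, 0,   0,   0,   PERSON, PERSON],
    [0, DOG, DOG, 0,   PERSON, PERSON],
    [0, DOG, DOG, DOG, PERSON, 0],
    [0, 0,   0,   0,   PERSON, PERSON],
]

print(len(detections), "boxes;", len(segmentation_mask), "x",
      len(segmentation_mask[0]), "pixel labels")
```

A detection result grows with the number of objects, whereas a segmentation result always has the same height and width as the input image.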

LibriSpeech ASR

LibriSpeech is a speech recognition dataset derived from audiobook recordings, containing approximately one thousand hours of read English speech sampled at 16 kHz. The dataset contains about 280 thousand audio files, each labeled with the corresponding text. The dataset is divided into three parts: a 100-hour set, a 360-hour set, and a 500-hour set. The decompressed training data can total up to about 167 GB.
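As a quick sanity check on these numbers, the sketch below sums the three subsets and estimates the average length of one audio file. It treats the file count as exactly 280,000 and spreads the subset hours evenly over the files; both are simplifying assumptions, and the variable names are illustrative.

```python
# Subset sizes in hours, as listed above.
SUBSET_HOURS = [100, 360, 500]
NUM_AUDIO_FILES = 280_000   # "about 280 thousand", taken as exact here

# The three subsets add up to 960 hours, i.e. roughly one thousand.
total_hours = sum(SUBSET_HOURS)

# Average duration of one audio file, in seconds (about 12.3 s).
avg_clip_seconds = total_hours * 3600 / NUM_AUDIO_FILES

print(f"total: {total_hours} hours")
print(f"average clip length: {avg_clip_seconds:.1f} s")
```

So a typical training sample is a clip of roughly twelve seconds, under the assumptions stated above.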

OpenAI Gym

OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing various games and simulations. It includes a diverse suite of environments that range from easy to difficult, such as classic control tasks, algorithmic tasks, Atari games, and 2D and 3D robot simulations.
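Every Gym environment is driven through the same agent-environment loop, which is what makes algorithms comparable across tasks. The sketch below imitates that loop without importing Gym: `CorridorEnv` is a hypothetical toy environment that only mimics the classic `reset()` / `step(action)` convention, in which `step` returns an observation, a reward, a done flag, and an info dictionary. Newer Gym releases return a five-element tuple from `step`, so treat this as an illustration of the pattern rather than of any specific version.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment with Gym-style reset/step methods.

    The agent starts at position 0 and must reach position `length`.
    """

    def __init__(self, length=5, max_steps=50):
        self.length = length
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        # Start a new episode and return the initial observation.
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        # Action 1 moves right; any other action moves left (not below 0).
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)
        self.steps += 1
        reached_goal = self.position >= self.length
        reward = 1.0 if reached_goal else 0.0
        done = reached_goal or self.steps >= self.max_steps
        return self.position, reward, done, {}

def run_episode(env, policy):
    """Run one episode and return (total reward, number of steps taken)."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(observation)
        observation, reward, done, _info = env.step(action)
        total_reward += reward
    return total_reward, env.steps

def always_right(observation):
    return 1

def random_policy(observation):
    return random.choice([0, 1])

print(run_episode(CorridorEnv(), always_right))
print(run_episode(CorridorEnv(), random_policy))
```

Because both policies are evaluated through the same `run_episode` function, swapping in a different environment or agent requires no change to the loop itself.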