Datasets
Dataset | Number of Samples | Size | Special |
---|---|---|---|
ImageNet1K | 1.2M | 3x256x256 per image | N/A |
IWSLT15 | 133K | 20-30 words per sentence on avg. | vocabulary size of 17,188 |
Pascal VOC 2007 | 5,011 | around 500x300 per image | 12,608 annotated objects |
COCO 2014 | 164,062 | around 640x400 per image | 886K segmented object instances |
LibriSpeech | 280K | 1,000 hours in total | N/A |
Down-sampled ImageNet | 1.2M | 3x64x64 per image | N/A |
Gym | N/A | 210x160x3 per image (Pong) | Game dependent |
MovieLens 20M | 20,000,263 | Rating between 1 and 5 | Rounded to intervals of 0.5 |
ImageNet1K
This classic dataset features a collection of 1.2 million labeled
images spanning one thousand object categories and serves as the
training data for the ImageNet competition. The main task
in the competition is to classify the object inside an image. The raw images combined take up around 133 GB.
Training models on this dataset can be very time-consuming. For example, training a ResNet-50 model for 90 epochs on an NVIDIA M40 GPU can take two weeks.
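The storage figure above is easy to sanity-check. A quick back-of-envelope estimate (assuming 8-bit RGB pixels at the quoted 3x256x256 resolution) shows how much image compression saves:

```python
# Back-of-envelope estimate of uncompressed ImageNet1K size,
# using the figures quoted in the text: 1.2M images at 3x256x256,
# one byte per channel value (uint8 RGB).
num_images = 1_200_000
bytes_per_image = 3 * 256 * 256
raw_bytes = num_images * bytes_per_image
print(f"raw: {raw_bytes / 1e9:.0f} GB")  # raw: 236 GB
# The on-disk JPEGs total only ~133 GB, so compression roughly
# halves the storage footprint even at this modest resolution.
```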
International Workshop on Spoken Language Translation (IWSLT)
This is a machine translation dataset focused on the automatic
transcription and translation of TED and TEDx talks, i.e.
public speeches covering many different topics.
Compared with the WMT dataset, described below, this dataset
is relatively small (the corpus has 133K sentences), so
models should be able to reach decent BLEU scores quickly (within several hours).
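BLEU, the score mentioned above, measures n-gram overlap between a candidate translation and a reference. A toy single-reference sketch of the idea (real BLEU uses up to 4-grams, multiple references, and corpus-level statistics):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions times a brevity penalty (single reference)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())        # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("the cat sat on the mat".split(),
           "the cat sat on the mat".split()))       # 1.0
```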
Dataset for BERT
For Pre-training
Pre-training uses two corpora: raw text from Wikipedia and from various published books. The text is
cleaned (HTML tags and non-text metadata are removed) before being fed into the model.
For Fine-tuning
As quoted in the original documentation:
"Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable."
source
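The span-based format described in the quote can be illustrated with a small hypothetical record (the field names follow the SQuAD JSON layout; the passage and question are made up):

```python
# A SQuAD-style record: the answer is a span of the passage,
# identified by a character offset into the context.
example = {
    "context": "BERT was introduced by researchers at Google in 2018.",
    "question": "When was BERT introduced?",
    "answers": [{"text": "2018", "answer_start": 48}],
}

ans = example["answers"][0]
start = ans["answer_start"]
span = example["context"][start:start + len(ans["text"])]
assert span == ans["text"]   # the answer really is a substring of the passage
print(span)                  # 2018
```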
Workshop on Statistical Machine Translation (WMT)
This is a machine translation dataset compiled from a
collection of various sources, including news commentaries
and parliament proceedings. The corpus has around 4M sentences.
Fully training a good model takes at least one day.
PASCAL Visual Object Classes (VOC)
This dataset, produced by a group at Oxford University,
includes image data for both segmentation and object detection
tasks. Given an input image, the segmentation task is to determine, for each pixel, which object (or the background) it belongs to,
while the object detection task is to draw a bounding box around each object in the image and classify it. VOC07 has 20 object classes. We use the object detection task in TBD.
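Detection quality on VOC is typically judged by intersection-over-union (IoU) between a predicted box and a ground-truth box; a prediction conventionally counts as correct at IoU >= 0.5. A minimal sketch, using the (xmin, ymin, xmax, ymax) corner format of the VOC annotations:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (xmin, ymin, xmax, ymax),
    the corner format used by the PASCAL VOC annotations."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A box covering the top half of another: intersection 50, union 100.
print(iou((0, 0, 10, 10), (0, 0, 10, 5)))   # 0.5
```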
COCO
The COCO dataset, which stands for Common Objects in Context, consists of everyday scenes ranging from the busy streets of a city to animals on a hillside.
The 2014 version, used by TBD, has 80 object categories of labeled and segmented images. This dataset contains 82,783 training, 40,504 validation, and 40,775
testing images. There are nearly 270K segmented people and 886K total segmented object instances in the 2014 training and validation datasets alone.
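As a quick sanity check, the three splits add up to the sample count listed in the table:

```python
# COCO 2014 split sizes quoted above.
train, val, test = 82_783, 40_504, 40_775
total = train + val + test
print(total)   # 164062, matching the table's "Number of Samples" entry
```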
LibriSpeech ASR
LibriSpeech is a speech recognition dataset derived from
audiobook recordings containing
approximately one thousand hours of 16kHz read English speech. The dataset contains about 280 thousand audio files, each labeled with the corresponding text.
The dataset is divided into three parts: a 100-hour set, a 360-hour set, and a 500-hour set.
The decompressed training data totals approximately 167 GB.
OpenAI Gym
OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing various games. It includes a diverse suite of environments, ranging from easy to difficult: classic control tasks, algorithmic tasks, Atari games, and 2D and 3D robot simulations.
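The Gym interface itself is small: an environment exposes a reset() and, in the classic API, a step() that returns an observation, a reward, a done flag, and an info dict. A sketch of the agent-environment loop, with a toy stub standing in for a real environment such as Pong:

```python
import random

class StubEnv:
    """Stand-in for a Gym environment (e.g. gym.make("Pong-v0")):
    same reset()/step() interface, but with trivial toy dynamics."""
    def __init__(self, episode_len=10):
        self.episode_len = episode_len
        self.t = 0
    def reset(self):
        self.t = 0
        return 0.0                       # initial observation
    def step(self, action):
        self.t += 1
        obs = float(self.t)
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.episode_len
        return obs, reward, done, {}     # classic Gym step signature

env = StubEnv()
obs, total_reward, done = env.reset(), 0.0, False
while not done:                          # one episode with a random policy
    action = random.choice([0, 1])
    obs, reward, done, info = env.step(action)
    total_reward += reward
print(total_reward)
```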
MovieLens 20M
The ml-20m dataset used for the NCF model consists of 5-star ratings from MovieLens, an online service that recommends movies for its users to watch. The raw dataset has 20,000,263 ratings across 27,278 movies, created by 138,493 users between January 09, 1995 and March 31, 2015. Each user represented in the dataset has rated at least 20 movies.
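Since the ratings land on a half-star grid, a model's raw predicted scores are often snapped back onto it. A hypothetical helper illustrating the rounding (the clamp range follows the 1-5 scale quoted in the table above):

```python
def snap_to_half_star(score, lo=1.0, hi=5.0):
    """Snap a raw model score onto the MovieLens rating grid:
    half-star increments, clamped to the valid rating range."""
    snapped = round(score * 2) / 2
    return min(max(snapped, lo), hi)

print(snap_to_half_star(3.7))   # 3.5
print(snap_to_half_star(5.4))   # 5.0
```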