Delve Datasets

Collections of data for developing, evaluating, and comparing learning methods.


U. of T.

An important note to users with version 1.0 of the software.

--------------------------------------------------------------------

The Delve datasets and families are available from this page. Every dataset (or family) has a brief overview page, and many also have detailed documentation. You can download gzipped tar files of the datasets, but you will need the Delve software environment to get the maximum benefit from them. Datasets are categorized as primarily assessment, development, or historical according to their recommended use. Within each category, datasets are further divided into regression and classification according to how their prototasks were created. Details on how to install the downloaded datasets are given below.

There is also a summary table of the datasets.

--------------------------------------------------------------------

Assessment Datasets

Datasets from this section are recommended to be used when reporting results for your learning method. You should run your method once on each task and report the results from that run. That is, you should not use results from the testing data to modify your method, then re-run it.

Regression Datasets

  1. abalone. Download abalone.tar.gz
    Predict the age of abalone from physical measurements. From the UCI repository of machine learning databases.
  2. bank. Download bank-family
    A family of datasets synthetically generated from a simulation of how bank customers choose their banks. Tasks are based on predicting the fraction of bank customers who leave the bank because of full queues.
  3. census-house. Download census-house.tar.gz
    Predicting median house prices from 1990 US census data.
  4. comp-activ. Download comp-activ.tar.gz
    Predict a computer system's activity from system performance measures.
  5. pumadyn family of datasets. Download pumadyn-family
    This is a family of datasets synthetically generated from a realistic simulation of the dynamics of a Unimation Puma 560 robot arm.

Classification Datasets

  1. adult. Download adult.tar.gz
    Predict if an individual's annual income exceeds $50,000 based on census data. From the UCI repository of machine learning databases.
  2. splice. Download splice.tar.gz
    Recognize two classes of splice junctions in a DNA sequence. From the UCI repository of machine learning databases.
  3. titanic. Download titanic.tar.gz
    Information on passengers of the Titanic and whether they survived.

--------------------------------------------------------------------

Development Datasets

We recommend that you use datasets from this section while developing a new learning method, or fine-tuning parameters. That is, you can re-run your method several times on a dataset until you obtain the desired performance. If you do use a dataset in this manner, you should not use it when reporting your method's performance: you should use datasets from the Assessment section.

Regression Datasets

  1. boston. Download boston.tar.gz
    Housing in the Boston, Massachusetts area. From the UCI repository of machine learning databases.
  2. demo. Download demo.tar.gz
    The demo dataset was invented to serve as an example for the Delve manual and as a test case for Delve software and for software that applies a learning procedure to Delve datasets.
  3. kin family of datasets. Download kin-family
    This is a family of datasets synthetically generated from a realistic simulation of the forward kinematics of an 8-link all-revolute robot arm.

Classification Datasets

  1. image-seg. Download image-seg.tar.gz
    Predict the object class of a 3x3 patch from an image of an outdoor scene. From the UCI repository of machine learning databases.
  2. letter. Download letter.tar.gz
    Classify an image as one of 26 upper case letters. The inputs are simple statistical features derived from the pixels in the image. From the UCI repository of machine learning databases.
  3. The mushrooms dataset. Download mushrooms.tar.gz
    Classify hypothetical samples of gilled mushrooms in the Agaricus and Lepiota family as edible or poisonous. From the UCI repository of machine learning databases.

--------------------------------------------------------------------

Historical Datasets

Datasets from this section have been included because they are established in the literature. We have attempted to reproduce the original usage as closely as possible to facilitate comparisons.

Regression Datasets

  1. add10. Download add10.tar.gz
    A synthetic function suggested by Jerome Friedman in his "Multivariate Adaptive Regression Splines" paper.
  2. hwang. Download hwang.tar.gz
    Five real-valued functions of two variables used by Jenq-Neng Hwang et al. and others to test nonparametric regression methods. Both noisy and noise-free prototasks are defined based on these functions.

Classification Datasets

  1. ringnorm. Download ringnorm.tar.gz
    Leo Breiman's ringnorm example. Classify cases as coming from one of two overlapping normal distributions.
  2. twonorm. Download twonorm.tar.gz
    Leo Breiman's two normal example. Classify a case as coming from one of two normal distributions, where one distribution lies within the other.

--------------------------------------------------------------------

Installation of Datasets

Before you can install the datasets, you must build and install the Delve utilities.

Once you've done that, you can install the datasets. This involves simply extracting the files from their tape archives into the proper directory: the installed top-level Delve data directory. By default this directory is "/usr/local/lib/delve/data".

If you used the "--prefix" option with the "configure" command when building the Delve utilities, replace the "/usr/local" part of the above path with that prefix.
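
The prefix rule above can be sketched as a small shell function. This is only an illustration: the function name delve_data_dir is hypothetical and not part of the Delve utilities, and it assumes the data directory always sits at lib/delve/data under the prefix, as the default path suggests.

```shell
#!/bin/sh
# Hypothetical helper (not part of Delve): print the data directory
# that corresponds to a given configure prefix.
delve_data_dir() {
    prefix=${1:-/usr/local}     # fall back to the default prefix
    # Strip one trailing slash so "/opt/delve/" and "/opt/delve" agree.
    printf '%s\n' "${prefix%/}/lib/delve/data"
}
```

For example, `delve_data_dir /opt/delve` prints the path to use in place of "/usr/local/lib/delve/data" when the utilities were configured with that prefix.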

Each tape archive will create a directory with the same base name as the archive file. This directory will contain all the data and specification files Delve needs to generate the tasks. For example, to install the demo dataset in the default location:

mv demo.tar.gz /usr/local/lib/delve/data
cd /usr/local/lib/delve/data
gunzip demo.tar.gz 
tar -xvf demo.tar
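
The same extraction can also be done without moving the archive, by decompressing to standard output and letting tar read the stream. The function below is a minimal sketch of that variant; the name install_dataset is hypothetical and not part of the Delve utilities, and the target directory is assumed to exist already.

```shell
#!/bin/sh
# Hypothetical helper (not part of Delve): unpack a gzipped dataset
# archive into a Delve data directory, leaving the archive in place.
install_dataset() {
    archive=$1      # for example demo.tar.gz
    data_dir=$2     # for example /usr/local/lib/delve/data

    if [ ! -f "$archive" ]; then
        echo "install_dataset: no such archive: $archive" >&2
        return 1
    fi
    if [ ! -d "$data_dir" ]; then
        echo "install_dataset: no such directory: $data_dir" >&2
        return 1
    fi

    # gunzip -c writes the decompressed tar stream to stdout;
    # tar -xf - reads it from stdin and unpacks inside data_dir.
    gunzip -c "$archive" | (cd "$data_dir" && tar -xf -)
}
```

A call such as `install_dataset demo.tar.gz /usr/local/lib/delve/data` should leave a demo directory under the data directory, matching the result of the commands above.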

If you want to install a dataset in a private directory, you can do the following:

  1. Create a directory called delve in your home directory (or anywhere else, for that matter).
  2. In that directory create two more directories: data and methods.
  3. In the delve/data directory, untar the data file as described above.

Once you've done that, you can work in your own private delve directory and you will have access to the datasets you've downloaded, as well as the ones installed in /usr/local/lib/delve/data.
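
The three private-directory steps can be sketched as one shell function. The name setup_private_delve and its arguments are hypothetical, chosen only for illustration; the directory layout (delve, with data and methods inside it) is the one described in the steps.

```shell
#!/bin/sh
# Hypothetical helper (not part of Delve): create a private delve
# directory and optionally unpack one dataset archive into it.
setup_private_delve() {
    delve_dir=${1:-"$HOME/delve"}   # step 1: the private delve directory
    archive=${2:-}                  # optional: a gzipped dataset archive

    # Step 2: create the data and methods directories inside it.
    mkdir -p "$delve_dir/data" "$delve_dir/methods" || return 1

    # Step 3: unpack the archive in delve/data, if one was given.
    if [ -n "$archive" ]; then
        gunzip -c "$archive" | (cd "$delve_dir/data" && tar -xf -)
    fi
}
```

For example, `setup_private_delve "$HOME/delve" demo.tar.gz` would create the layout in your home directory and unpack the demo dataset there.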

Once you've extracted the data, you can safely remove the tar file.


Last modified: Mon May 26 19:28:22 EDT
Comments and questions to: delve@cs.toronto.edu
Copyright