Delve demo dataset detail

The Demo dataset

The "demo" dataset was invented to serve as an example for the Delve manual and as a test case for Delve software and for software that applies a learning procedure to Delve datasets. To those ends, it has a variety of attributes. The rule for generating cases was based on various stereotypical notions, which may or may not have any basis in reality, in an effort to make the characteristics of the data more easily remembered, and not completely arbitrary.

Each case in the dataset describes a person from an imaginary population, with attributes giving the person's sex, age, number of siblings, income, and favourite colour (from among pink, blue, red, green, and purple). The attributes of each person are generated in the order given, since later attributes depend on earlier ones, and independently of the attributes of other persons.

If you want, you can download the dataset demo.tar.gz, or get a list of learning methods that have been run on this dataset.

The dataset was generated by Radford Neal specifically for the Delve project.

Model Design

In detail, the procedure used to generate the attributes is as follows:

Attribute 1: sex

Sex is chosen randomly with probability 0.53 for female.

Attribute 2: age

Age is set to the absolute value of a normal random variate with mean zero, and a standard deviation of 40 for females and 30 for males.
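
As an illustration, the first two attributes could be drawn as in the sketch below. It uses Python's standard random module; the function names and the "F"/"M" codes are assumptions made for this example and are not taken from the Delve software. Age is left as a real number, since the description does not say whether it is rounded.

```python
import random

def sample_sex(rng):
    """Return "F" with probability 0.53, otherwise "M" (codes are illustrative)."""
    return "F" if rng.random() < 0.53 else "M"

def sample_age(sex, rng):
    """Absolute value of a zero-mean normal variate; sd is 40 for females, 30 for males."""
    sd = 40.0 if sex == "F" else 30.0
    return abs(rng.gauss(0.0, sd))

rng = random.Random(0)  # fixed seed only so the example is repeatable
sex = sample_sex(rng)
age = sample_age(sex, rng)
```

Because age is a folded normal, most of the probability mass sits at young ages, with a longer tail for females than for males.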

Attribute 3: siblings

The number of siblings is picked by truncating an exponentially distributed variate to an integer, with the mean of the exponential distribution being 3*(1+age)/(3+age).
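
A minimal sketch of this step follows, again using Python's standard library. The function names are hypothetical. Note that random.expovariate takes a rate (one over the mean), not the mean itself.

```python
import random

def siblings_mean(age):
    """Mean of the exponential variate: 3*(1+age)/(3+age)."""
    return 3.0 * (1.0 + age) / (3.0 + age)

def sample_siblings(age, rng):
    """Draw an exponential variate with the mean above and truncate it to an integer."""
    variate = rng.expovariate(1.0 / siblings_mean(age))  # argument is the rate, 1/mean
    return int(variate)  # the variate is non-negative, so int() rounds down

rng = random.Random(0)
n_siblings = sample_siblings(35.0, rng)
```

The mean is 1 at age zero and approaches 3 as age grows, so older persons tend to have more siblings.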

Attribute 4: income

Income is the sum of employment income and other income, but only the total income is recorded, not the two components.

To determine employment income, an unobserved binary "working" flag is first selected randomly. For males, the probability of working is:

0.9/((1+exp(-(age-20)))*(1+exp((age-60)/5)));

for females, it is:

0.8/((1+exp(-(age-20)))*(1+exp((age-65)/5))).
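
The two probabilities can be sketched as a single function. This assumes the formulas are read with balanced parentheses, that is, a constant divided by the product of the two bracketed terms; the function name and the "F"/"M" codes are illustrative.

```python
import math

def p_working(sex, age):
    """Probability that the unobserved "working" flag is set."""
    # Constants from the text: 0.8 and 65 for females, 0.9 and 60 for males.
    scale, retire_age = (0.8, 65.0) if sex == "F" else (0.9, 60.0)
    entering = 1.0 + math.exp(-(age - 20.0))           # rises sharply around age 20
    leaving = 1.0 + math.exp((age - retire_age) / 5.0)  # falls off around retirement
    return scale / (entering * leaving)

working_probability = p_working("M", 40.0)
```

Under this reading the probability is near zero for children, close to its maximum (0.9 or 0.8) in mid-life, and about half that maximum at age 60 for males and 65 for females.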

If the person is working, their employment income is drawn from the exponential distribution with mean

(C-5000*siblings/(5+siblings)) / (1+exp(-(age-25)/5))

where C is 30000 for females and 40000 for males. If the person is not working, their employment income is zero.

The person's other income has an exponential distribution with mean of age*100. The value randomly picked from this distribution is added to the employment income (if any) to give the total income, which is the only number recorded.
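
The income rule can be sketched as follows. The function names are hypothetical, the "working" flag is passed in as an argument, and the handling of a zero mean (returning zero income at age exactly zero) is an assumption added so that the code cannot divide by zero.

```python
import math
import random

def employment_mean(sex, age, siblings):
    """Mean employment income for a working person."""
    c = 30000.0 if sex == "F" else 40000.0
    return (c - 5000.0 * siblings / (5.0 + siblings)) / (1.0 + math.exp(-(age - 25.0) / 5.0))

def draw_exponential(mean, rng):
    """Exponential variate with the given mean; a mean of zero gives zero (assumption)."""
    return rng.expovariate(1.0 / mean) if mean > 0.0 else 0.0

def sample_income(sex, age, siblings, working, rng):
    """Total income: employment income (zero if not working) plus other income."""
    employment = draw_exponential(employment_mean(sex, age, siblings), rng) if working else 0.0
    other = draw_exponential(age * 100.0, rng)
    return employment + other  # only this total is recorded

rng = random.Random(0)
income = sample_income("F", 40.0, 2, True, rng)
```

For example, a working male aged 25 with no siblings has a mean employment income of 40000/2 = 20000, and a working female aged 25 with five siblings has (30000 - 2500)/2 = 13750.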

Attribute 5: colour

The person's favourite colour is determined as follows. First, an unobserved binary "childlike" value is selected randomly, with the probability of the person being childlike being 1/(1+exp(age-10)). If the person is a childlike female, her favourite colour is pink with probability 0.9, and is otherwise drawn from the following distribution:

pink: 0.08 blue: 0.27 red: 0.31 green: 0.24 purple: 0.10

Note that pink gets a second chance here. If the person is a childlike male, his favourite colour is blue with probability 0.9, and is otherwise drawn from the distribution shown above (with blue getting a second chance). If the person is not childlike, their favourite colour is purple with probability 1/(1+exp(-(income-80000)/10000)), and is otherwise once again drawn from the distribution above (with purple thus getting a second chance).
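
The colour rule can be sketched as below, using Python's standard library. All names are illustrative, and the cap on the exponent in p_childlike is an addition that only guards against floating-point overflow at implausibly large ages.

```python
import math
import random

COLOURS = ["pink", "blue", "red", "green", "purple"]
WEIGHTS = [0.08, 0.27, 0.31, 0.24, 0.10]  # fallback distribution from the text

def p_childlike(age):
    """Probability of being childlike: 1/(1+exp(age-10))."""
    return 1.0 / (1.0 + math.exp(min(age - 10.0, 700.0)))  # cap avoids overflow

def p_purple(income):
    """Probability that a non-childlike person picks purple outright."""
    return 1.0 / (1.0 + math.exp(-(income - 80000.0) / 10000.0))

def sample_colour(sex, age, income, rng):
    childlike = rng.random() < p_childlike(age)  # unobserved flag
    if childlike:
        favoured, p = ("pink" if sex == "F" else "blue"), 0.9
    else:
        favoured, p = "purple", p_purple(income)
    if rng.random() < p:
        return favoured
    # Otherwise fall back to the shared distribution (the favoured colour can recur).
    return rng.choices(COLOURS, weights=WEIGHTS, k=1)[0]

rng = random.Random(0)
colour = sample_colour("F", 6.0, 0.0, rng)
```

Because of the second chance, a childlike female ends up with pink with total probability 0.9 + 0.1*0.08 = 0.908, and a childlike male with blue with total probability 0.9 + 0.1*0.27 = 0.927.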

Prototasks

The demo dataset has five prototasks, named according to the attribute to be predicted: age, colour, income, sex, siblings.
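
As a rough illustration of how a case maps onto a prototask, the sketch below splits one case into inputs and a target. It assumes that the inputs of each prototask are the four remaining attributes, which this page does not state explicitly; the dictionary keys, function name, and example values are all hypothetical.

```python
PROTOTASKS = ("age", "colour", "income", "sex", "siblings")

def split_case(case, prototask):
    """Split a case (a dict keyed by attribute name) into (inputs, target)."""
    if prototask not in PROTOTASKS:
        raise ValueError("unknown prototask: " + prototask)
    # Assumption: every attribute other than the target is used as an input.
    inputs = {name: value for name, value in case.items() if name != prototask}
    return inputs, case[prototask]

# Made-up case, for illustration only.
example_case = {"sex": "F", "age": 31.0, "siblings": 2, "income": 25000.0, "colour": "red"}
inputs, target = split_case(example_case, "income")
```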

Miscellaneous Details

Origin

The origin of the demo dataset is Artificial.

Usage

The dataset is meant to be used for Development (if at all).

Number of Cases

The dataset contains a total of 2048 cases.

Order

The order of the cases within the dataset is uninformative.

Variables

There are 5 attributes in each case of the dataset. They are:

SEX - sex of the person
AGE - age of the person in years
SIBLINGS - number of siblings the person has
INCOME - the person's annual income in dollars
COLOUR - the person's favourite colour

Last Updated 26 September 1996
Comments and questions to: delve@cs.toronto.edu