Assignment 4

Deadline: February 17, 9pm

Late Penalty: See Syllabus

TA: Hojjat Salehinejad

In this assignment, you will build and train an autoencoder for imputation of missing data. In the process, you will:

  1. Clean and process continuous and categorical data for machine learning.
  2. Understand and implement denoising autoencoders.
  3. Tune the hyperparameters of an autoencoder.

What to submit

Submit a PDF file containing all your code and outputs. Do not submit any other files produced by your code.

Completing this assignment using Jupyter Notebook is recommended (though this will not necessarily be the case for all subsequent assignments). If you are using Jupyter Notebook, you can export a PDF file using the menu option File -> Download As -> PDF via LaTeX (.pdf).

In [ ]:
import csv
import numpy as np
import random
import torch
import torch.utils.data

Part 0

We will be using a package called pandas for this assignment. Installation instructions for pandas are available here: https://pandas.pydata.org/pandas-docs/stable/install.html

If you cannot get pandas installed, you may port the pandas code we provided into numpy code.

In [ ]:
import pandas as pd

Part 1. Data Cleaning [12 pt]

The data set we will be using for this assignment is the Adult Data Set provided by UCI Machine Learning Repository [1] available at https://archive.ics.uci.edu/ml/datasets/adult.

Download the file adult.data from the website.

The data set contains census records of adults, including their age, the type of work they do, marital status, etc. We will build a denoising autoencoder on this dataset to impute (or "fill in") missing values in the dataset.

[1] Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Part (a) Loading the Data [1 pt]

Use the function pd.read_csv to load adult.data into a pandas dataframe called df. Make sure that the adult.data file is in the same folder as your notebook or Python code. Report the number of rows (records) in your data frame.

Note that the data file does not have a header row or an index column. The column headers are given to you below.

Hint: You will need to read a bit of the pandas documentation to do this problem: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [ ]:
header = ['age', 'work', 'fnlwgt', 'edu', 'yredu', 'marriage', 'occupation',
 'relationship', 'race', 'sex', 'capgain', 'caploss', 'workhr', 'country']
df = pd.read_csv # ...
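
For reference, a minimal sketch of one way to call pd.read_csv (assuming adult.data is in the same folder as your notebook; the exact arguments you use are up to you):

# adult.data has no header row, so we pass the column names ourselves
df = pd.read_csv("adult.data", names=header, index_col=False)
print(df.shape[0]) # number of records (rows)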

Part (b) Continuous Features [1 pt]

For each of the columns ["age", "yredu", "capgain", "caploss", "workhr"], find the minimum, maximum, and average value across the dataset.

Like numpy arrays and torch tensors, pandas data frames can be sliced. For example, we can display the first 3 rows of the data frame (3 records) below:

In [ ]:
df[:3]

Alternatively, we can slice based on column names, for example df["race"] or df["workhr"], or even select multiple columns at once, like below.

In [ ]:
subdf = df[["age", "yredu", "capgain", "caploss", "workhr"]]
subdf[:3] # show the first 3 records

NumPy works nicely with pandas, as shown below:

In [ ]:
np.sum(subdf["caploss"])

Part (c) Normalizing Continuous Features [1 pt]

Normalize each of the features ["age", "yredu", "capgain", "caploss", "workhr"] so that their values are between 0 and 1. Just like numpy arrays, you can modify data frames. For example, the code

df["age"] = df["age"] + 1

would increase everyone's age by 1.
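
As a sketch, min-max normalization of a single column might look like the following (you would repeat this for each of the listed features):

# min-max normalization: rescale the column so its values lie in [0, 1]
df["age"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())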

Part (d) Categorical Features [1 pt]

What percentage of people in our data set are male? Note that the data labels all have an unfortunate space at the beginning, e.g. " Male" instead of "Male".

What percentage of people in our data set are female?

In [ ]:
# hint: you can do something like this in pandas
sum(df["sex"] == " Male")

Part (e) Missing Values [1 pt]

We will do two things in this part:

  1. We will restrict ourselves to a subset of the features.
  2. We will remove any records (rows) containing missing values, and store them in a second dataframe.

Both of these steps are done for you.

Report the number of records with and without missing features of interest.

In [ ]:
contcols = ["age", "yredu", "capgain", "caploss", "workhr"]
catcols = ["work", "marriage", "occupation", "edu", "relationship", "sex"]
features = contcols + catcols
df = df[features]
In [ ]:
# a record is considered "missing" if any of its categorical features is " ?"
missing = pd.concat([df[c] == " ?" for c in catcols], axis=1).any(axis=1)
df_with_missing = df[missing]   # records with at least one missing value
df_not_missing = df[~missing]   # records with no missing values

Part (f) One-Hot Encoding [1 pt]

What are all the possible values of "work" in df_not_missing? You may find the Python function set useful.
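
For example, one way to list the distinct values of a column, using the set hint above (a sketch):

# distinct values of the "work" column among records with no missing values
set(df_not_missing["work"])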

As discussed in class and in tutorial, we will be using a one-hot encoding to encode each of the categorical variables.

We will use the pandas function get_dummies:

In [ ]:
data = pd.get_dummies(df_not_missing)
In [ ]:
data[:5]

Part (g) One-Hot Encoding [1 pt]

How many columns are in the dataframe data?

Briefly explain where that number comes from. (You don't need to be detailed here.)

Part (h) One-Hot Conversion [3 pt]

We will convert the pandas data frame into a numpy array below. However, in doing so, we lose the column information that a pandas data frame automatically stores.

Complete the function get_categorical_value that will return the named value of a feature given a one-hot embedding. You may find the global variables cat_index and cat_values useful. (Display them and figure out what they are first.)

We will use this function on the output of our autoencoder to interpret its predictions. So the input one-hot vectors might not actually be "one-hot".

In [ ]:
datanp = data.values.astype(np.float32) # convert the dataframe into a numpy array of float32 values
In [ ]:
cat_index = {}  # Mapping of feature -> start index of feature in a record
cat_values = {} # Mapping of feature -> list of categorical values the feature can take

# build up the cat_index and cat_values dictionary
for i, header in enumerate(data.keys()):
    if "_" in header: # categorical header
        feature, value = header.split()
        feature = feature[:-1] # remove the last char; it is always an underscore
        if feature not in cat_index:
            cat_index[feature] = i
            cat_values[feature] = [value]
        else:
            cat_values[feature].append(value)

def get_onehot(record, feature):
    """
    Return the portion of `record` that is the one-hot encoding
    of feature. For example, since the feature "work" is stored
    in the indices [5:12] in each record, calling `get_onehot(record, "work")`
    is equivalent to accessing `record[5:12]`.
    
    Args:
        - record: a numpy array representing one record, formatted
                  the same way as a row in `datanp`
        - feature: a string, should be an element of `catcols`
    """
    start_index = cat_index[feature]
    stop_index = cat_index[feature] + len(cat_values[feature])
    return record[start_index:stop_index]

def get_categorical_value(onehot, feature):
    """
    Return the categorical value name of a feature given
    a one-hot vector representing the feature.
    
    Args:
        - onehot: a numpy array one-hot representation of the feature
        - feature: a string, should be an element of `catcols`
        
    Examples:
    
    >>> get_categorical_value(np.array([0., 0., 0., 0., 0., 1., 0.]), "work")
    'State-gov'
    >>> get_categorical_value(np.array([0.1, 0., 1.1, 0.2, 0., 1., 0.]), "work")
    'Private'
    """
    # TODO

def get_feature(record, feature):
    """
    Return the categorical feature value of a record
    """
    onehot = get_onehot(record, feature)
    return get_categorical_value(onehot, feature)

def get_features(record):
    """
    Return a dictionary of all categorical feature values of a record
    """
    return { f: get_feature(record, f) for f in catcols }
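
Once get_categorical_value is completed, these helpers can be used to interpret a record. For example (a sketch):

# interpret the categorical features of the first record in the numpy data
print(get_features(datanp[0]))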

Part (i) Train/Test Split [2 pt]

Randomly split the data into approximately 70% training, 15% validation and 15% test.

Report the number of items in your training, validation, and test set.

In [ ]:
np.random.seed(50) # set the numpy seed for consistent split

# todo
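
One possible way to produce such a split (a sketch; the names train_data, valid_data, and test_data are only suggestions):

# shuffle the record indices, then take ~70% / ~15% / ~15% chunks
indices = np.random.permutation(len(datanp))
split1 = int(0.70 * len(datanp))
split2 = int(0.85 * len(datanp))
train_data = datanp[indices[:split1]]
valid_data = datanp[indices[split1:split2]]
test_data  = datanp[indices[split2:]]
print(len(train_data), len(valid_data), len(test_data))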

Part 2. Model Setup [4 pt]

Design a fully-connected autoencoder by modifying the encoder and decoder.

There will be a sigmoid activation at the decoder, so that the output of the decoder is between 0 and 1. We will not interpret the output of the sigmoid as a probability.

In [ ]:
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self):
        super(AutoEncoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(57, 57) # TODO
        )
        self.decoder = nn.Sequential(
            nn.Linear(57, 57), # TODO
            nn.Sigmoid() # get to the range (0, 1)
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x
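
As a quick sanity check (a sketch, assuming your data has the expected 57 columns), you can pass a small batch of records through an untrained model and confirm that the output shape matches the input shape:

model = AutoEncoder()
sample = torch.from_numpy(datanp[:4]) # a small batch of 4 records
print(model(sample).shape)            # should match sample.shape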

Part 3. Training [18 pt]

Part (a) [6 pt]

We will train our autoencoder as follows:

  • In each iteration, we will hide one of the categorical features using the zero_out_random_features function
  • We will pass the data with one missing feature through the autoencoder, obtaining a reconstruction
  • We will check how close the reconstruction is to the original data (before the feature was zeroed out)

Complete the code to train the autoencoder, and plot the training and validation loss every few iterations. You may also want to plot the training and validation "accuracy" every few iterations, as defined in part (b). You may also want to checkpoint your model every few epochs.

Use nn.MSELoss() as your loss function. (Side note: you might recognize that this loss function is not ideal for this problem, but we will use it anyways.)

In [ ]:
def zero_out_feature(records, feature):
    """ Set the feature missing in records, by setting the appropriate
    columns of records to 0
    """
    start_index = cat_index[feature]
    stop_index = cat_index[feature] + len(cat_values[feature])
    records[:, start_index:stop_index] = 0
    return records

def zero_out_random_feature(records):
    """ Set one random feature missing in records, by setting the 
    appropriate columns of records to 0
    """
    return zero_out_feature(records, random.choice(catcols))

def train(model, train_loader, valid_loader, num_epochs=5, learning_rate=1e-4):
    """ Training loop. You should update this."""
    torch.manual_seed(42)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    for epoch in range(num_epochs):
        for data in train_loader:
            datam = zero_out_random_feature(data.clone()) # zero out one categorical feature
            recon = model(datam)
            loss = criterion(recon, data)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
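
For example, data loaders can be built from your splits and passed to train like this (a sketch; train_data and valid_data are the names assumed in the split sketch from Part 1(i)):

train_loader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid_data, batch_size=256, shuffle=False)
model = AutoEncoder()
train(model, train_loader, valid_loader, num_epochs=5, learning_rate=1e-4)

You will still need to extend train itself (e.g. append float(loss) to a list each iteration, and record the training and validation metrics every few epochs) so that you have something to plot.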

Part (b) [3 pt]

While plotting training and validation loss is valuable, loss values are harder to compare than accuracy percentages. The reason is that the scale of the loss value changes depending on your batch size. It would be nice to have a measure of "accuracy" in this problem.

Since we will only be imputing missing categorical values, we will define an accuracy measure. For each record and for each categorical feature, we determine whether the model can predict the categorical feature given all the other features of the record.

A function get_accuracy is written for you. It is up to you to figure out how to use the function. You don't need to write anything else in this part. To earn the marks, plot the training and validation accuracy every few iterations/epochs as part of your training curve.

In [ ]:
def get_accuracy(model, data_loader):
    """Return the "accuracy" of the autoencoder model across a data set
    
    Args:
       - model: the autoencoder model, an instance of nn.Module
       - data_loader: an instance of torch.utils.data.DataLoader

    Example (to illustrate how get_accuracy is intended to be called;
             depending on your variable naming, this code might not work
             out of the box):

        >>> model = AutoEncoder()
        >>> vdl = torch.utils.data.DataLoader(data_valid, batch_size=256, shuffle=True)
        >>> get_accuracy(model, vdl)
    """
    total = 0
    acc = 0
    for col in catcols:
        for item in data_loader: # minibatches
            inp = item.detach().numpy()
            out = model(zero_out_feature(item.clone(), col)).detach().numpy()
            for i in range(out.shape[0]): # record in minibatch
                acc += int(get_feature(out[i], col) == get_feature(inp[i], col))
                total += 1
    return acc / total

Part (c) [4 pt]

Run the training code, using reasonable settings for the batch size, learning rate, etc.

Include your training curve in your pdf output.

Part (d) [5 pt]

Tune your hyperparameters, training at least 4 different models.

Do not include all your training curves. Instead, explain what hyperparameters you tried, what their effect was, and what your thought process was as you chose the next set of hyperparameters to try.

Part 4. Testing [6 pt]

Part (a) [1 pt]

Compute the test accuracy across the test set.
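
Assuming a DataLoader built over your test split (a sketch; test_data and test_loader are hypothetical names):

test_loader = torch.utils.data.DataLoader(test_data, batch_size=256, shuffle=False)
print(get_accuracy(model, test_loader))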

Part (b) [2 pt]

Consider an alternative, baseline model that predicts missing data as follows. To predict a missing feature, the baseline model predicts the most common value of that feature in the training set. For example, if the feature "marriage" is missing, then this model's prediction will be the most common value for "marriage" in the training set.

What would be the test accuracy of this baseline model?

It is often helpful to use the performance of a baseline model to judge how well our model is actually performing. No explanation is required for this question, just your calculations.
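
For example, the most common value of each categorical feature can be read off the one-hot training data (a sketch; train_data is the name assumed for the training split from Part 1(i)):

# for each feature, the most common value is the one-hot column with the largest sum
for feature in catcols:
    start = cat_index[feature]
    stop = start + len(cat_values[feature])
    counts = train_data[:, start:stop].sum(axis=0)
    print(feature, cat_values[feature][int(np.argmax(counts))])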

Part (c) [1 pt]

Look at the first item in your test data. Do you think it is reasonable for a human to be able to guess this person's education level based on their other features? Explain.

Part (d) [2 pt]

What is your model's guess of this person's education level, given their other features?