Introduction

Every cell contains the organism’s entire DNA. But levels of gene expression differ between cells. The expression levels of different genes can be measured using a DNA microarray.

Cells in cancer tumors can often be characterized by different levels of gene expression. In this project, you will be working on classifying microarray data as coming from a cell from an acute myeloid leukemia (AML) tumor or an acute lymphoblastic leukemia (ALL) tumor. This kind of process can aid in diagnosing disease.

[Figure] The patchy colours of a tortoiseshell cat are the result of different levels of expression of pigmentation genes in different areas of the skin (from Wikipedia).

The data you will be working with is all available from the Broad Institute. However, for your convenience, we suggest that you work with the following processed data files:

Training set: data_set_ALL_AML_train.csv

Validation set: data_set_ALL_AML_independent.csv

Labels: table_ALL_AML_samples.txt

Labels for validation (edited) : table_indep.txt

General guidelines

You will submit an Rmd file and the pdf file that was knit from it. Your code should be general enough that if the dataset were changed, the code would still run.

You will be graded on the correctness of your code as well as on the professionalism of the presentation of your report. The text of your report should be clear; the figures should do a good job of visualizing the data, and should have appropriate labels and legends. Overall, the document you produce should be easy to read and understand.

When you are asked to compute something, use code (rather than compute things outside of R and then input the answer), and include the code that you used in the report.

Grading scheme:

  • 10 points for a professional and well-formatted report. Points will be taken off if the report is not well-formatted and is difficult to read.

  • 5 points for submitting the Rmd file on Blackboard and for submitting the pdf file on Gradescope. Pages on Gradescope should be assigned to questions.

Part 1: Reading in the Validation Set (10 pts)

We provide the following code for reading in the validation set:

d.valid <- read.csv("data_set_ALL_AML_independent.csv")
d.valid1 <- d.valid[, sort(colnames(d.valid))[37:70]]
d.valid2 <- data.frame(t(d.valid1))
labels <- read.csv("table_indep.txt", sep = "\t", header = F)
d.valid2[, "y"] <- as.character(trimws(labels$V3)) == "ALL"

Explain precisely and completely what every line accomplishes. In your report, each line of the above code should be followed by an explanation, usually consisting of a sentence or two.

Recall that you can use ? to obtain documentation for a particular function (as in ?t).

Grading scheme

  • 1.5 points each for complete and good explanations of each line, 1 point for overall coherence

Part 2: Reading in the Training Set (10 pts)

Write code to read in the training set, and set the labels of the training set. Explain what modifications you had to make to the code above to accomplish this.

Grading scheme

  • 7 points for implementation
  • 3 points for explanation

Part 3: Computing the t-Statistic (5 pts)

The two-sample t-statistic is defined as follows:

\[T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^{2}/N_1 + s_2^{2}/N_2}}.\]
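Translated directly into R, the formula can be sketched as follows (a minimal illustration only; `x1` and `x2` are assumed to be numeric vectors of expression values for the two groups, and `var` computes the sample variance \(s^2\)):

```r
# Two-sample t-statistic for a single gene, following the formula above.
# x1, x2: numeric vectors of expression values for the two groups.
t.stat <- function(x1, x2) {
  (mean(x1) - mean(x2)) /
    sqrt(var(x1) / length(x1) + var(x2) / length(x2))
}
```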

The t-statistic can be used to measure how different two samples are. In this part, we are interested in how different the expression levels of each gene are between ALL and AML tumors.

Write code to compute the t-statistic for the gene expression for ALL and AML, for each of the genes in the dataset. Include the code in your report.

Part 4: The t-statistic and histograms (10 pts)

Select 5 different genes with different t-statistic values. Display the histograms for the expression levels for ALL and AML tumors, for each of the genes. Choose an appropriate way to arrange the histograms. You likely will want to use facets (see Ch. 4 of Data Visualization).
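As a sketch of the faceting approach (assuming a hypothetical long-format data frame `d.long` with columns `expression`, `tumor`, and `gene`, which you would need to construct from the data yourself):

```r
library(ggplot2)

# Faceted histograms: one row per tumor type, one column per gene.
# d.long is assumed to have columns expression, tumor, and gene.
ggplot(d.long, aes(x = expression)) +
  geom_histogram(bins = 20) +
  facet_grid(tumor ~ gene) +
  labs(x = "Expression level", y = "Count")
```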

Grading scheme

  • Good visualization that demonstrates what happens for different values of the t-statistic: 7 points
  • Appropriate captions (including the value of the t-statistic): 3 points

Part 5: Some experiments with logistic regression (10 pts)

  • Fit a logistic regression model to classify ALL vs. AML, using all the available gene expression variables (this can be done using glm(y ~ ., family = binomial, data = train.aml.all)). Do you observe overfitting? Discuss why that might be.

  • Fit three logistic regression models to classify ALL vs. AML, using one, two, and three gene expression variables chosen arbitrarily (for example, you might just choose the first three available variables). Evaluate the models on the training and the validation set, and report what you observe.
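For instance, a one-variable model might be fit and evaluated along these lines (a sketch only; `gene1` stands in for whatever column you choose, and `train.aml.all` and `d.valid2` are assumed to contain a logical column `y`):

```r
# Fit on the training set using a single (arbitrary) gene column.
fit1 <- glm(y ~ gene1, family = binomial, data = train.aml.all)

# Classify with a 0.5 threshold on the predicted probability.
pred.train <- predict(fit1, type = "response") > 0.5
pred.valid <- predict(fit1, newdata = d.valid2, type = "response") > 0.5

mean(pred.train == train.aml.all$y)  # training accuracy
mean(pred.valid == d.valid2$y)       # validation accuracy
```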

Grading scheme

  • Correct code: 5 pts

  • Good reports: 5 pts

Part 6: Variable selection using the t-statistic (30 pts)

T-statistics that are large in absolute value, whether positive or negative, indicate that the expression levels of the gene of interest are very different in ALL and AML tumors.

Experiment with using the variable that corresponds to the t-statistic with the largest absolute value, then the two variables that correspond to the two t-statistics with the largest absolute values, and so on, up to 10 variables.

Report on what you observe. You should use at least one visualization (produced with ggplot).

(Note: while it is possible to automate the process of using different numbers of variables, it is OK to not automate it, and simply use copy-and-paste to construct the 10 regressions.)
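One way to construct the regression for the top k variables is sketched below (assuming `t.stats` is a named numeric vector of per-gene t-statistics whose names match the columns of `train.aml.all`):

```r
k <- 3  # number of top genes to use

# Genes ranked by the absolute value of their t-statistic.
top.k <- names(sort(abs(t.stats), decreasing = TRUE))[1:k]

# Build the formula y ~ gene_a + gene_b + ... programmatically.
form  <- reformulate(top.k, response = "y")
fit.k <- glm(form, family = binomial, data = train.aml.all)
```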

Part 7: Exploring variable selection using the t-statistic (10 pts)

If the t-statistic is used as a way to select the most important variables, the expression of which genes would be considered the most important? To answer this, write code to obtain the names of the 10 most important genes, using the t-statistic as the criterion.

An alternative way of doing this would be to obtain the genes that correspond to the coefficients with the largest absolute values when we run the regression glm(y ~ ., family = binomial, data = train.aml.all). What are the top 10 genes that would be obtained that way?
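The coefficient-based ranking might be extracted as follows (a sketch, assuming `fit.all` is the full model fit with glm(y ~ ., family = binomial, data = train.aml.all); note that when there are far more genes than samples, many coefficients will be NA, which sort drops by default):

```r
# Drop the intercept, then rank genes by |coefficient|.
coefs <- coef(fit.all)[-1]
names(sort(abs(coefs), decreasing = TRUE))[1:10]
```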

Comment on why the second way can be worse than the first way discussed here.

Grading scheme:

  • Correct code: 5 points

  • Good discussion: 5 points

A note

In this project, you are using just one training and one validation set. We decided to forego having a test set since there is very little data to spare. In principle, you would set aside a test set, and evaluate the final model on it.