Statistical modeling has important implications in healthcare settings. In hospital intensive care units (ICUs), physiological data is periodically recorded from patients, and this data is used to predict patient outcomes. For this assignment, we will be working with physiological data recorded from thousands of patients in the ICU, along with information about whether they died in the ICU. We will use the techniques we learned during the semester to predict the risk of ICU death and to analyze the dataset for clinical insights.

The data you will be working with is all available from Physionet. You will be looking at only the data in “Training set A”, which has a set of physiological records, in a series of text files (available as a zip file) and a separate text file of outcomes for all patients.

General guidelines

You will submit an Rmd file and the pdf file that was knit from it. Your code should be general enough that if the dataset were changed, the code would still run.

You will be graded on the correctness of your code as well as on the professionalism of the presentation of your report. The text of your report should be clear; the figures should do a good job of visualizing the data, and should have appropriate labels and legends. Overall, the document you produce should be easy to read and understand.

When you are asked to compute something, use code (rather than compute things outside of R and then input the answer), and include the code that you used in the report.

Grading scheme:

  • 10 points for a professional and well-formatted report. Points will be taken off if the report is not well-formatted and is difficult to read.

  • 5 points for submitting the Rmd file on Blackboard and for submitting the pdf file on Gradescope. Pages on Gradescope should be assigned to questions.

Part 1: Reading in the Data (10 pts)

First, you should download the physiological data and the patient outcomes file. After downloading the zip file, extract all the files from it into a directory, and download Outcomes-a.txt to the same directory. To extract files from a zip file, open it, and drag the files from the window that opens into whatever directory you want.

Read in the files as follows. Your working directory must be set to the directory that contains the text files you extracted. If your filelist comes up empty, re-check what your working directory is.

Note: use something like

setwd("C:/path/to/your/data")   # placeholder path -- substitute your own

to set your working directory while working on the project draft in an R file. When you are ready to put your work in an Rmd file, look at the setup chunk of the handout to see how to set a working directory for an Rmd file.

filelist = list.files(pattern = ".*.txt")
patient.dat = lapply(filelist, function(x) read.csv(x, header = TRUE, stringsAsFactors = FALSE))

We want this data to be read from text files and assembled into a dataframe. To do so, please run the following code, which first defines a function that summarizes one patient's records, then runs that function on all the files and assembles the outputs into a single dataframe.

Look at the function comp.patient and explain what it is doing in your report. Look at the first row of allpat.dat and at the first element of filelist, and explain precisely how the Age and Temp columns were computed. Reproduce the computation for Temp in your report and make sure that the numbers match. EDIT: To reproduce the computation, look up the relevant numbers in the relevant txt file, then enter them in your Rmd file and perform the computation in R.

comp.patient <- function(ind, dat){
  pat.dat <- dat[ind][[1]]
  pat.dat[pat.dat == -1.0] = NaN   # -1 encodes a missing value
  Urine <- mean(pat.dat$Value[pat.dat$Parameter == "Urine"])
  HR <- mean(pat.dat$Value[pat.dat$Parameter == "HR"])
  Temp <- mean(pat.dat$Value[pat.dat$Parameter == "Temp"])
  age <- pat.dat$Value[pat.dat$Parameter == "Age"]
  Gender<-pat.dat$Value[pat.dat$Parameter == "Gender"]
  ICUtype<-pat.dat$Value[pat.dat$Parameter == "ICUType"]
  height<-pat.dat$Value[pat.dat$Parameter == "Height"]
  weight<-mean(pat.dat$Value[pat.dat$Parameter == "Weight"])
  NIDiasABP<-mean(pat.dat$Value[pat.dat$Parameter == "NIDiasABP"])
  SysABP<-mean(pat.dat$Value[pat.dat$Parameter == "SysABP"])
  DiasABP<-mean(pat.dat$Value[pat.dat$Parameter == "DiasABP"])
  pH<-mean(pat.dat$Value[pat.dat$Parameter == "pH"])
  PaCO2<-mean(pat.dat$Value[pat.dat$Parameter == "PaCO2"])
  PaO2<-mean(pat.dat$Value[pat.dat$Parameter == "PaO2"])
  Platelets<-mean(pat.dat$Value[pat.dat$Parameter == "Platelets"])
  MAP<-mean(pat.dat$Value[pat.dat$Parameter == "MAP"])
  K <-mean(pat.dat$Value[pat.dat$Parameter == "K"])
  Na<-mean(pat.dat$Value[pat.dat$Parameter == "Na"])
  FiO2<-mean(pat.dat$Value[pat.dat$Parameter == "FiO2"])
  GCS<-mean(pat.dat$Value[pat.dat$Parameter == "GCS"])
  MechVent<-mean(pat.dat$Value[pat.dat$Parameter == "MechVent"])
  ID <- pat.dat$Value[1]
  return(data.frame(ID, age, Gender, height, ICUtype, weight, Urine, HR, Temp,
                    NIDiasABP, SysABP, DiasABP, pH, PaCO2, PaO2, Platelets,
                    MAP, K, Na, FiO2, GCS, MechVent))
}

### run above function on all files

numpats <- 4000
df<-lapply(c(1:numpats), FUN = comp.patient, dat=patient.dat)

# assemble into a single dataframe

allpat.dat <- Reduce(function(x, y) rbind(x, y),df) 
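As a hedged illustration of the Temp computation described above (the numbers here are made up; the real ones must be looked up in the file named by filelist[1]):

```r
# Toy stand-in for one patient's txt file, which has Parameter/Value rows:
pat <- data.frame(Parameter = c("Age", "Temp", "Temp", "HR"),
                  Value     = c(64, 37.0, 38.0, 80),
                  stringsAsFactors = FALSE)
pat[pat == -1.0] <- NaN                          # -1 encodes a missing value
Temp <- mean(pat$Value[pat$Parameter == "Temp"])
Temp                                             # (37.0 + 38.0) / 2 = 37.5
```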

We read in an additional file – the outcome of the patients whose physiological recordings were loaded above. We append these outcomes to the original dataframe, and replace values that are -1, Inf, -Inf, or NaN with the average value across all patients.

outcome.dat <- read.csv("Outcomes-a.txt")
outcome.dat[outcome.dat == -1] = NaN

#append the outcomes (different files)
full.dat <- cbind(allpat.dat, outcome.dat[,2:ncol(outcome.dat)])
full.dat[full.dat == Inf ] = NaN
full.dat[full.dat == -Inf ] = NaN
#set everything that's NaN to the mean of that column:
for(i in 1:ncol(full.dat)){
  full.dat[is.nan(full.dat[,i]), i] <- mean(full.dat[,i], na.rm = TRUE)
}

Part 2: Run logistic regression (10 pts)

Divide your data into training, EDIT: validation, and testing sets. Discard some of the data from the training set so that there are approximately equal numbers of datapoints where the In.hospital_death value is 1 and where it is 0.
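One way this split and downsampling might be sketched (the 60/20/20 proportions and the synthetic stand-in data are assumptions for illustration; in the assignment, use full.dat from Part 1):

```r
set.seed(0)
# Synthetic stand-in for full.dat so the sketch runs on its own:
full.dat <- data.frame(HR = rnorm(200, 80, 10),
                       In.hospital_death = rbinom(200, 1, 0.15))

n   <- nrow(full.dat)
idx <- sample(n)                                # random permutation of rows
train <- full.dat[idx[1:floor(0.6 * n)], ]
valid <- full.dat[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]
test  <- full.dat[idx[(floor(0.8 * n) + 1):n], ]

# Downsample the majority class so deaths and survivals are balanced:
died      <- train[train$In.hospital_death == 1, ]
survived  <- train[train$In.hospital_death == 0, ]
train.bal <- rbind(died, survived[sample(nrow(survived), nrow(died)), ])
```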

Use the features HR, Gender, age, temperature, weight, height, PaO2, and PaCO2, and fit a logistic regression model. Is your performance (on both the training and the validation sets) better than the base rate?
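A minimal sketch of the fit and the base-rate comparison (the synthetic stand-in data and the names train.bal and valid are assumptions; substitute your own balanced training and validation sets):

```r
set.seed(1)
# Synthetic stand-ins so the sketch runs on its own:
make.fake <- function(n) data.frame(
  HR = rnorm(n, 80, 10), Gender = rbinom(n, 1, 0.5), age = rnorm(n, 65, 12),
  Temp = rnorm(n, 37, 0.5), weight = rnorm(n, 80, 15),
  height = rnorm(n, 170, 10), PaO2 = rnorm(n, 150, 40),
  PaCO2 = rnorm(n, 40, 8), In.hospital_death = rbinom(n, 1, 0.5))
train.bal <- make.fake(200)
valid     <- make.fake(100)

fit  <- glm(In.hospital_death ~ HR + Gender + age + Temp + weight + height +
              PaO2 + PaCO2, data = train.bal, family = binomial)
pred <- predict(fit, newdata = valid, type = "response")   # P(death)
acc  <- mean((pred > 0.5) == valid$In.hospital_death)
base <- max(mean(valid$In.hospital_death),
            1 - mean(valid$In.hospital_death))  # predict-the-majority rate
c(accuracy = acc, base.rate = base)
```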

Part 3: Logistic regression on the original training set (5 pts)

Run logistic regression as in Part 2, but without discarding data from the training set. Report your observations.

Part 4: ROC curve (10 pts)

Write a function that, for a given threshold, calculates both the False Positive Rate and True Positive Rate (proportion of deaths correctly identified as such) for your regression model.
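One possible shape for this function (a sketch; pred is assumed to be a vector of predicted death probabilities and truth the corresponding 0/1 outcomes):

```r
fpr.tpr <- function(thresh, pred, truth){
  pos <- pred >= thresh                            # predicted deaths
  TPR <- sum(pos & truth == 1) / sum(truth == 1)   # deaths correctly flagged
  FPR <- sum(pos & truth == 0) / sum(truth == 0)   # survivors wrongly flagged
  c(FPR = FPR, TPR = TPR)
}

fpr.tpr(0.5, pred = c(0.9, 0.2, 0.7, 0.1), truth = c(1, 0, 1, 0))
# FPR = 0, TPR = 1 on this toy input
```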

For 100 threshold values equally spaced from 0 to 1, plot the True Positive Rate vs. the False Positive Rate. Use the validation set.

This plot is known as an ROC curve.

EDIT: Use the fit obtained in Part 2 (you can try using the fit from Part 3 and observe that the results are not nearly as good.)
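The sweep and plot might look like this sketch (toy predictions are generated so the code runs on its own; in the assignment, use the Part 2 model's predicted probabilities on the validation set):

```r
set.seed(2)
# Toy predictions/outcomes for illustration:
truth <- rbinom(300, 1, 0.3)
pred  <- plogis(2 * truth - 1 + rnorm(300))      # probabilities tied to truth

thresholds <- seq(0, 1, length.out = 100)
roc <- t(sapply(thresholds, function(th){
  pos <- pred >= th
  c(FPR = sum(pos & truth == 0) / sum(truth == 0),
    TPR = sum(pos & truth == 1) / sum(truth == 1))
}))
plot(roc[, "FPR"], roc[, "TPR"], type = "l",
     xlab = "False Positive Rate", ylab = "True Positive Rate",
     main = "ROC curve (validation set)")
```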

Part 5: Interpreting the ROC curve (5 pts)

Using the plot generated in Part 4, what is the False Positive Rate associated with correctly identifying 90% of patients at risk for death in the ICU? Why might a high false positive rate be appropriate in this setting? You can read the answer off the ROC curve plot.

Part 6(a): Modelling Doctors’ Decision-Making (15 pts)

For this part, produce a short report that answers all the questions below. Include R code that produces the numbers that you need.

At the beginning of their shift, a doctor reviews their patients’ files, and decides what intervention is needed for each patient. In the following Parts, we will be trying to improve this process. We will consider a simplified version of what is going on. Suppose that if the doctor intervenes correctly, the patient will not die; suppose that the doctor has 60 minutes to look through 25 patient files; and suppose that the probability of missing the correct treatment if the doctor spends \(t\) minutes on reviewing a file is

\[P(\textrm{fail}) = \exp(-t^2/100).\]
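In R, this failure probability is simply:

```r
# Probability of missing the correct treatment after t minutes of review:
p.fail <- function(t) exp(-t^2 / 100)
p.fail(10)    # exp(-1), about 0.368
```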

If the doctor reviews all the files, spends an equal amount of time on each file, and there are 10 patients who will die without the correct intervention, how many patients are expected to die if the doctor intervenes when they see that it's needed? What is the percentage of patients who are expected to die, out of 25?

Suppose now that the doctor is looking through all the patient files in the validation set. They would have proportionately more time: \((N/25)\times 60\) minutes in total (where \(N\) is the total number of patients in the set). How many patients would be expected to die, if the doctor intervenes correctly when they know they should do that?

Now, suppose that the doctor only reviews the files of patients for whom the probability of death according to the model is greater than \(50\%\). This would give the doctor more time to look through each file, but the doctor would never be able to intervene in the cases of patients whose probability of death is smaller than \(50\%\). How many patients would be expected to die?

Part 6(b): Modelling Doctors’ Decision-Making (15 pts)

In this Part, you will explore the policy implications of using our model in an understaffed hospital.

Suppose that we are considering a policy of only reviewing the files of patients whose probability of death is above a threshold thr. Each file would be given an equal amount of time, and the total amount of time would be \((N/25)\times 60\) minutes.

Using the model from 6(a), plot the total number of expected deaths under the policy vs. the threshold. Using the plot, what is the best threshold to use that would minimize the number of deaths?

You should compute the expected number of deaths for the thresholds seq(0, 1, 0.001).

Use the validation set.
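A skeleton of the sweep (a sketch under stated assumptions: died marks the patients who would die without a correct intervention, p.death is the model's predicted probability of death, and the toy values below exist only so the code runs on its own; the policy bookkeeping here is one plausible reading, not the required solution):

```r
set.seed(3)
# Toy stand-ins; in the assignment, use the validation set's outcomes and
# the model's predicted probabilities:
died    <- rbinom(250, 1, 0.3)
p.death <- plogis(2 * died - 1 + rnorm(250))

N          <- length(p.death)
total.time <- (N / 25) * 60
p.fail     <- function(t) exp(-t^2 / 100)

expected.deaths <- function(thr){
  reviewed <- p.death > thr
  t.per <- if (sum(reviewed) > 0) total.time / sum(reviewed) else 0
  # Reviewed at-risk patients die only if the doctor misses the treatment;
  # unreviewed at-risk patients always die:
  sum(died[reviewed]) * p.fail(t.per) + sum(died[!reviewed])
}

thrs   <- seq(0, 1, 0.001)
deaths <- sapply(thrs, expected.deaths)
best.thr <- thrs[which.min(deaths)]
plot(thrs, deaths, type = "l",
     xlab = "Threshold", ylab = "Expected number of deaths")
```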

Part 6(c): Modelling Doctors’ Decision-Making (5 pts)

On the test set, compare the total number of expected deaths under the best policy that was selected in 6(b) to reviewing each patient’s file.

Part 7 (10 pts)

Output the summary of the fit from Part 3. If you pre-registered the null hypothesis that the temperature does not influence the probability of death in the ICU, what can you conclude?

What is the interpretation of the coefficient that corresponds to Temp?

(Note: checking the assumptions of logistic regression is tricky in this scenario; if the assumptions are not met, we of course cannot run the test, but you shouldn’t worry about this.)

Bonus (10 pts)

Formulate a plausibly-true hypothesis about the dataset. Produce a visualization that demonstrates that the hypothesis might be correct. Then test that hypothesis. EDIT: use a statistical test. You must check the model assumptions.