Generating bag of words feature vectors

Consider the following code for generating data on the presence of keywords. Draw a graphical model that corresponds to the generative process in the code below, and explain what's going on in the code.

Generating the x’s:

set.seed(0)
N <- 20

# Spam type 1: Resetting account password
password.sp.type1 <- rbinom(n = N, size = 1, prob = 0.95)
review.sp.type1 <- rbinom(n = N, size = 1, prob = 0.05)
send.sp.type1 <- rbinom(n = N, size = 1, prob = 0.4)
us.sp.type1 <- rbinom(n = N, size = 1, prob = 0.2)
your.sp.type1 <- rbinom(n = N, size = 1, prob = 0.5)
account.sp.type1 <- rbinom(n = N, size = 1, prob = 0.9)

# Spam type 2: free account
password.sp.type2 <- rbinom(n = N, size = 1, prob = 0.1)
review.sp.type2 <- rbinom(n = N, size = 1, prob = 0.5)
send.sp.type2 <- rbinom(n = N, size = 1, prob = 0.1)
us.sp.type2 <- rbinom(n = N, size = 1, prob = 0.1)
your.sp.type2 <- rbinom(n = N, size = 1, prob = 0.75)
account.sp.type2 <- rbinom(n = N, size = 1, prob = 0.9)

# Non-spam: paper review
password.nsp <- rbinom(n = N, size = 1, prob = 0.1)
review.nsp <- rbinom(n = N, size = 1, prob = 0.9)
send.nsp <- rbinom(n = N, size = 1, prob = 0.05)
us.nsp <- rbinom(n = N, size = 1, prob = 0.7)
your.nsp <- rbinom(n = N, size = 1, prob = 0.5)
account.nsp <- rbinom(n = N, size = 1, prob = 0.1)

Make the training and validation sets. The validation set here represents “new” data that’s not used in fitting the model.

make.data.frame <- function(ind){
  data.frame(password = c(password.sp.type1[ind],
                          password.sp.type2[ind],
                          password.nsp[ind]),

             review = c(review.sp.type1[ind],
                        review.sp.type2[ind],
                        review.nsp[ind]),

             send = c(send.sp.type1[ind],
                      send.sp.type2[ind],
                      send.nsp[ind]),

             us = c(us.sp.type1[ind],
                    us.sp.type2[ind],
                    us.nsp[ind]),

             your = c(your.sp.type1[ind],
                      your.sp.type2[ind],
                      your.nsp[ind]),

             account = c(account.sp.type1[ind],
                         account.sp.type2[ind],
                         account.nsp[ind]))
}

ind.train <- 1:(N/2)
df.train <- make.data.frame(ind.train)

ind.test <- (N/2+1):N
df.test <- make.data.frame(ind.test)

Suppose y = 1 corresponds to “spam” and y = 0 corresponds to “not spam”. Write code to include a y column in df.train and df.test.
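One possible way to add the labels (a sketch that relies on the row order produced by make.data.frame: first the two spam groups, then the non-spam group):

```r
# make.data.frame stacks N/2 rows of spam type 1, N/2 of spam type 2,
# and N/2 non-spam rows, in that order, so the labels follow that layout
n.per.group <- length(ind.train)
df.train$y <- c(rep(1, 2 * n.per.group), rep(0, n.per.group))
df.test$y  <- c(rep(1, 2 * n.per.group), rep(0, n.per.group))
```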

Predictive logistic regression model

Write code to fit the logistic regression model. Display the logistic regression model coefficients, and write down a formula for using them to compute the probability that an email is spam.
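A sketch of one way to fit it with glm, assuming df.train has a y column (1 = spam, 0 = not spam):

```r
# Logistic regression of the spam label on all six keyword indicators;
# assumes df.train already contains a y column with the labels
fit <- glm(y ~ password + review + send + us + your + account,
           family = binomial, data = df.train)
coef(fit)  # the fitted coefficients beta_0, beta_1, ..., beta_6
```

With fitted coefficients \(\beta_0, \beta_1, \dots, \beta_6\), the model's spam probability for an email \(x\) is \(P(\text{spam} = 1|x) = 1/\left(1 + \exp\left(-(\beta_0 + \sum_j \beta_j x_j)\right)\right)\).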

Likelihood

Write down the probability of observing the following email, if it is known that it is a spam email:

password = 0
review = 1
send = 1
us = 1
your = 0
account = 0

(spam = 1)

Now write down the probability of observing the following email, if it is known that it is not a spam email:

password = 0
review = 1
send = 1
us = 1
your = 0
account = 0

(spam = 0)
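Under the generative model, each of these probabilities is a product of independent Bernoulli terms. A sketch of the arithmetic (the 50/50 mixture over the two spam types is an assumption, motivated by the code drawing equal numbers from each):

```r
# Keyword probabilities in the order: password, review, send, us, your, account
p.sp.type1 <- c(0.95, 0.05, 0.40, 0.20, 0.50, 0.90)
p.sp.type2 <- c(0.10, 0.50, 0.10, 0.10, 0.75, 0.90)
p.nsp      <- c(0.10, 0.90, 0.05, 0.70, 0.50, 0.10)

x <- c(password = 0, review = 1, send = 1, us = 1, your = 0, account = 0)

# P(x | p) for independent Bernoulli keywords: product of p^x * (1-p)^(1-x)
prob.email <- function(x, p) prod(p^x * (1 - p)^(1 - x))

prob.email(x, p.nsp)                                               # P(x | not spam)
0.5 * prob.email(x, p.sp.type1) + 0.5 * prob.email(x, p.sp.type2)  # P(x | spam)
```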

Now, compute the likelihoods (i.e., \(P(y|x)\)) for each of the points in the training set. The likelihoods should look as follows:

##    prob.y.eq.0.given.x prob.y.eq.1 y Prob.y.given.x
## 1         2.021194e-02  0.97978806 1     0.97978806
## 2         6.300158e-09  0.99999999 1     0.99999999
## 3         1.863231e-09  1.00000000 1     1.00000000
## 4         3.556766e-10  1.00000000 1     1.00000000
## 5         1.148048e-01  0.88519524 1     0.88519524
## 6         2.021194e-02  0.97978806 1     0.97978806
## 7         3.556766e-10  1.00000000 1     1.00000000
## 8         2.021194e-02  0.97978806 1     0.97978806
## 9         3.922452e-03  0.99607755 1     0.99607755
## 10        1.171418e-08  0.99999999 1     0.99999999
## 11        2.555303e-01  0.74446974 1     0.74446974
## 12        3.235241e-02  0.96764759 1     0.96764759
## 13        9.215564e-02  0.90784436 1     0.90784436
## 14        5.894053e-02  0.94105947 1     0.94105947
## 15        5.894053e-02  0.94105947 1     0.94105947
## 16        7.697569e-01  0.23024312 1     0.23024312
## 17        5.894053e-02  0.94105947 1     0.94105947
## 18        5.894053e-02  0.94105947 1     0.94105947
## 19        9.215564e-02  0.90784436 1     0.90784436
## 20        5.259343e-01  0.47406566 1     0.47406566
## 21        5.894053e-02  0.94105947 0     0.05894053
## 22        6.426108e-01  0.35738919 0     0.64261081
## 23        9.187290e-01  0.08127102 0     0.91872898
## 24        7.882846e-01  0.21171543 0     0.78828457
## 25        9.833940e-01  0.01660598 0     0.98339402
## 26        9.833940e-01  0.01660598 0     0.98339402
## 27        7.697569e-01  0.23024312 0     0.76975688
## 28        9.833940e-01  0.01660598 0     0.98339402
## 29        7.697569e-01  0.23024312 0     0.76975688
## 30        9.187290e-01  0.08127102 0     0.91872898

Explain why the following formula gives the likelihood of an email with \(\text{spam} = y\) when the model's predicted spam probability is \(p\) (i.e., \(P_{\text{model}}(\text{spam} = 1) = p\)):

\(y\times p + (1 - y)\times (1 - p)\)

Use the formula above to compute the likelihood of the test set and the training set. Compare those quantities.
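One way to sketch this, taking the likelihood of a set to be the product of the per-point likelihoods (fit, df.train, and df.test with a y column are assumed from the earlier exercises):

```r
# Likelihood of a whole set: product over points of y*p + (1-y)*(1-p)
set.likelihood <- function(fit, df){
  p <- predict(fit, newdata = df, type = "response")
  prod(df$y * p + (1 - df$y) * (1 - p))
}

set.likelihood(fit, df.train)
set.likelihood(fit, df.test)  # typically lower: the model was not fit to this data
```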

Correct Classification Rate

Suppose that we predict 1 for \(prob > 0.5\), and 0 otherwise. What is the correct classification rate for the training and validation sets? The correct classification rate is the proportion of the time that the output of the classifier matches the ground truth.
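A sketch of the computation, again assuming a fitted model fit and labelled data frames:

```r
# Proportion of points where thresholding P(y = 1 | x) at 0.5 matches y
classification.rate <- function(fit, df){
  pred <- as.numeric(predict(fit, newdata = df, type = "response") > 0.5)
  mean(pred == df$y)
}

classification.rate(fit, df.train)
classification.rate(fit, df.test)
```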

Independence

Generate a large training set. Using that training set, compute

\(P(review = 1|spam = 1)\)

\(P(review = 1|spam = 1, your = 1)\)

\(P(review = 1|spam = 1, your = 0)\)

Interpret the results in terms of the conditional independence of the appearance of “your” and the appearance of “review”.
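One way to estimate these by counting, regenerating just the relevant keywords for a large pooled spam sample (the seed and the sample size here are arbitrary assumptions):

```r
set.seed(1)
N.big <- 100000

# Draw "review" and "your" for N.big emails of each spam type, then pool;
# probabilities come from the generators at the top of the section
review.spam <- c(rbinom(N.big, 1, 0.05), rbinom(N.big, 1, 0.50))
your.spam   <- c(rbinom(N.big, 1, 0.50), rbinom(N.big, 1, 0.75))

mean(review.spam)                  # P(review = 1 | spam = 1)
mean(review.spam[your.spam == 1])  # P(review = 1 | spam = 1, your = 1)
mean(review.spam[your.spam == 0])  # P(review = 1 | spam = 1, your = 0)
```

If the two keywords were conditionally independent given spam = 1, all three estimates would agree; here they differ because "your" carries information about which spam type generated the email.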

Naive Bayes

Using the counts from the dataset (rather than using logistic regression), compute \(P(review = 1|spam = 1)\) and \(P(review = 0|spam = 1)\), as well as \(P(your = 1|spam = 1)\) and \(P(your = 0|spam = 1)\). Your answer should be of the form \(P(review = 0|spam = 1) = \frac{A}{B}\) where \(A\) and \(B\) are numbers.
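A counting sketch, assuming df.train has a y column (1 = spam, 0 = not spam):

```r
# Restrict to the spam rows and count keyword appearances
spam.rows <- df.train[df.train$y == 1, ]
n.spam <- nrow(spam.rows)

sum(spam.rows$review == 1) / n.spam  # P(review = 1 | spam = 1)
sum(spam.rows$review == 0) / n.spam  # P(review = 0 | spam = 1)
sum(spam.rows$your == 1) / n.spam    # P(your = 1 | spam = 1)
sum(spam.rows$your == 0) / n.spam    # P(your = 0 | spam = 1)
```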