Review: string literals vs. identifiers

Let’s start with numerics: those are clearly data.

5
## [1] 5

You can store data in variables:

a <- 5
a
## [1] 5

Characters (strings) are also data, and they work the same way:

"Hi"
## [1] "Hi"
a <- "Hello"
a
## [1] "Hello"

The data is the text itself. To tell R that we are referring to the text Hello and not to a variable named Hello, we enclose the text in quotes.
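
To see the difference, compare what happens with and without the quotes (this assumes we have not created a variable named Hello; above we only stored the text "Hello" in the variable a):

Hello
## Error: object 'Hello' not found
"Hello"
## [1] "Hello"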

We can print values like this:

a <- "Hello1"
cat(a)
## Hello1
cat("Hello2")
## Hello2

On the other hand, try this:

cat("a")
## a

This also means that the following would not make sense:

print_val <- function("a"){
  cat(a)
}

This does not make sense because a is meant to be a variable here (the function's parameter), so it should not be enclosed in quotes.
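
A version that does make sense leaves the parameter unquoted; the quotes appear only when we call the function with some literal text:

print_val <- function(a){
  cat(a)
}
print_val("Hello")
## Hello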

Aside: cat could be considered a disquotation function (https://en.wiktionary.org/wiki/disquotation), in the sense used in the philosophy of language.

Quick intro to sample

sample gives us a random set of items from a vector. By default, the sample is taken “without replacement.” That means that we imagine that we have all the items in the vector in a bag, and we pull them one by one from the bag until we get size items.

set.seed(0)
sample(c(1, 2, 3, 4), size = 2)
## [1] 4 1
sample(c(1, 2, 3, 4), size = 2)
## [1] 2 4
sample(c(1, 2, 3, 4), size = 2)
## [1] 4 1

If we sample with replacement, we put each item we pull out of the bag back into the bag before the next draw. That means we might get repeated items in the sample:

sample(c(1, 2, 3, 4), size = 2, replace = TRUE)
## [1] 4 4
sample(c(1, 2, 3, 4), size = 2, replace = TRUE)
## [1] 3 3
sample(c(1, 2, 3, 4), size = 2, replace = TRUE)
## [1] 1 1
sample(c(1, 2, 3, 4), size = 2, replace = TRUE)
## [1] 1 3

If we don’t specify the size, we’ll get a sample of the same size as the vector. So in effect we’ll get a random permutation (i.e., rearrangement) of the vector, assuming we sample without replacement.

sample(c(1, 2, 3, 4))
## [1] 2 3 1 4

set.seed(0)

Computers are not able to produce truly random numbers. Instead, they produce a sequence of random-seeming numbers, called pseudo-random numbers. set.seed(0) means R will start producing that random-seeming sequence from element 0, and set.seed(n) means we start from element n. People will sometimes use the current time for n if they want the numbers they produce to seem completely random.

We will usually go with set.seed(0), because that way the output of functions that use randomness will be the same at every run, so our program will produce the same outputs every time (albeit outputs that will still seem random in some sense). Note that invoking set.seed(0) again will make R start producing random numbers “from the beginning” again:

set.seed(0)
sample(c(1, 2, 3, 4, 5, 6, 7, 8, 9), 1)
## [1] 9
sample(c(1, 2, 3, 4, 5, 6, 7, 8, 9), 1)
## [1] 3
sample(c(1, 2, 3, 4, 5, 6, 7, 8, 9), 1)
## [1] 4
sample(c(1, 2, 3, 4, 5, 6, 7, 8, 9), 1)
## [1] 6
set.seed(0) # start again -- note that the same numbers are produced
sample(c(1, 2, 3, 4, 5, 6, 7, 8, 9), 1)
## [1] 9
sample(c(1, 2, 3, 4, 5, 6, 7, 8, 9), 1)
## [1] 3
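
As an aside, seeding with the current time, as mentioned above, could look like this (just a sketch; any integer works as the argument to set.seed):

set.seed(as.integer(Sys.time()))  # a different seed on every run, so a different pseudo-random sequence on every run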

Splitting the titanic dataset into three sets

First, let’s produce a vector that contains all the row numbers in titanic, in some order.

set.seed(0)
titanic <- read.csv("titanic.csv")
idx <- sample(1:nrow(titanic))

Now, we’ll produce three non-overlapping sets:

titanic.train <- titanic[idx[1:50],]
titanic.valid <- titanic[idx[600:699],]
titanic.test <- titanic[idx[700:800],]

(Why are the sets non-overlapping?)
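
A quick sanity check that the three slices of idx share no row numbers:

any(idx[1:50] %in% idx[600:699])
## [1] FALSE
any(idx[1:50] %in% idx[700:800])
## [1] FALSE
any(idx[600:699] %in% idx[700:800])
## [1] FALSE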

We’ll use those sets after a small digression.

What do small sets look like?

Let’s write a function that displays the survival statistics for each passenger class:

library(dplyr)    # for %>%, group_by and summarize (may already be loaded)
library(ggplot2)  # for ggplot and geom_bar

PlotClassSurv <- function(titanic.dat){
  titanic.dat$Pclass <- as.factor(titanic.dat$Pclass)  # treat passenger class as categorical
  # proportion of survivors within each passenger class
  dat <- titanic.dat %>% group_by(Pclass) %>% summarize(Prop.Surv = mean(Survived))
  ggplot(data = dat, mapping = aes(x = Pclass, y = Prop.Surv)) +
    geom_bar(stat = "identity")
}
idx <- sample(1:nrow(titanic))
PlotClassSurv(titanic[idx[1:20],])

idx <- sample(1:nrow(titanic))
PlotClassSurv(titanic[idx[1:20],])

idx <- sample(1:nrow(titanic))
PlotClassSurv(titanic[idx[1:20],])

idx <- sample(1:nrow(titanic))
PlotClassSurv(titanic[idx[1:20],])

idx <- sample(1:nrow(titanic))
PlotClassSurv(titanic[idx[1:20],])

We get a different plot every time. That is because we get a different small set every time. For the entire dataset, the plot would be:

PlotClassSurv(titanic)

The fact that for a small set we get a different plot every time suggests that we may run into a problem: if we base our predictions on such a small set, we will likely not predict correctly on new data. Let’s experiment with this.

Here is a function to compute the correct classification rate:

GetCorrectClassificationRate <- function(fit, dat){
  # a predicted probability of survival above 0.5 counts as predicting "survived"
  pred <- predict(fit, newdata = dat, type = "response") > .5
  # proportion of rows where that prediction matches the actual outcome
  return(mean(pred == dat$Survived))
}
set.seed(0)
idx <- sample(1:length(titanic$Survived))
titanic.train <- titanic[idx[1:15],]
titanic.valid <- titanic[idx[600:699],]
titanic.test <- titanic[idx[700:800],]
fit <- glm(Survived ~ Pclass,family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit, titanic.train)
## [1] 0.7333333
GetCorrectClassificationRate(fit, titanic.test)
## [1] 0.5445545
idx <- sample(1:length(titanic$Survived))
titanic.train <- titanic[idx[1:15],]
titanic.valid <- titanic[idx[600:699],]
titanic.test <- titanic[idx[700:800],]
fit <- glm(Survived ~ Pclass, family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit, titanic.train)
## [1] 0.7333333
GetCorrectClassificationRate(fit, titanic.test)
## [1] 0.6633663
idx <- sample(1:length(titanic$Survived))
titanic.train <- titanic[idx[1:15],]
titanic.valid <- titanic[idx[600:699],]
titanic.test <- titanic[idx[700:800],]
fit <- glm(Survived ~ Pclass, family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit, titanic.train)
## [1] 0.7333333
GetCorrectClassificationRate(fit, titanic.test)
## [1] 0.5841584

We generally get higher rates on the training set (on which we fit the model) than on the test set (which is like new data).

This phenomenon is even more pronounced if we use Pclass as a categorical variable: that’s because if (for example) we have just one person in second class, we can predict perfectly what’s going to happen to them in the training set. But that kind of prediction won’t necessarily be good for people in the test set who are also in second class, since their survival outcomes might be different.
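
Here is a minimal sketch of what that looks like in code, reusing the last split above and converting Pclass to a factor inside the formula (fit.cat is just a name chosen for illustration):

fit.cat <- glm(Survived ~ as.factor(Pclass), family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit.cat, titanic.train)
GetCorrectClassificationRate(fit.cat, titanic.test)
# caveat: if some class happens to be missing from the tiny training set,
# predict() will stop because the test set then contains "new" factor levels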

An extreme example would be predicting survival based on a person’s full name – you would get perfect performance on the training set, but you would not even be able to run your model on data with different names.

The training set and the test set

The training set is the set used to fit the model – that is the set for which we minimize the cost by adjusting the model coefficients.

The test set is the set used to evaluate the performance of the model. The test set should not overlap with the training set, so that the examples in the test set effectively look like new examples to the model. Performance on new examples is what we care about, and performance on the test set is a proxy for that.

Overfitting

Overfitting means obtaining good performance on the training set at the expense of good performance on the test set. Another way of putting it is that the model’s generalization is poor if there is overfitting.

Overfitting is more likely the smaller the training set is. That is because in a small set, patterns that seem genuine may just be the result of random chance. (Again, think of everyone in second class surviving when there is only one second-class passenger in the set.)

Overfitting is also more likely if we use categorical variables, particularly with large numbers of categories. Think for example of predicting based on the last name – if you happen to have just one Jones in the dataset, and they survived, it doesn’t mean that other Joneses will survive, but the model will think so.
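
As a sketch of that last-name idea (this assumes the Name column is formatted like "Braund, Mr. Owen Harris", as in the usual Titanic data; Surname and fit.name are names introduced here just for illustration):

titanic.train$Surname <- sub(",.*", "", titanic.train$Name)  # everything before the first comma
titanic.test$Surname <- sub(",.*", "", titanic.test$Name)
fit.name <- glm(Survived ~ Surname, family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit.name, titanic.train)   # essentially perfect: the model has memorized each passenger
# GetCorrectClassificationRate(fit.name, titanic.test)  # would stop with an error: the test set contains surnames the model never saw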

Cross-validation

We sometimes want to try different models, and pick the best one. Generally, we would evaluate the models by measuring the performance on the validation set, pick the best one, and then evaluate the final model on the test set.

If we had used the validation set as the test set, we would be overestimating how well the model would perform on new data – by picking the best model on the validation set, we would in effect be overfitting to it.
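
A minimal sketch of this workflow, assuming the dataset has a Sex column (as in the standard Titanic data) and comparing just two candidate models:

fit1 <- glm(Survived ~ Pclass, family = binomial, data = titanic.train)
fit2 <- glm(Survived ~ Pclass + Sex, family = binomial, data = titanic.train)
# compare the candidate models on the validation set
GetCorrectClassificationRate(fit1, titanic.valid)
GetCorrectClassificationRate(fit2, titanic.valid)
# suppose fit2 does better on the validation set: we then report its performance
# on the test set, once, as our estimate of performance on new data
GetCorrectClassificationRate(fit2, titanic.test)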