Review of PS1

We reviewed the solutions at http://guerzhoy.mycpanel.princeton.edu/201s19/pre/P1/p1_soln.html.

$ vs. select

The $ operator allows us to extract a column from a table, as a vector:

gapminder$continent[1:100]  # outputting the first 100 elements so as not to waste space
##   [1] Asia     Asia     Asia     Asia     Asia     Asia     Asia    
##   [8] Asia     Asia     Asia     Asia     Asia     Europe   Europe  
##  [15] Europe   Europe   Europe   Europe   Europe   Europe   Europe  
##  [22] Europe   Europe   Europe   Africa   Africa   Africa   Africa  
##  [29] Africa   Africa   Africa   Africa   Africa   Africa   Africa  
##  [36] Africa   Africa   Africa   Africa   Africa   Africa   Africa  
##  [43] Africa   Africa   Africa   Africa   Africa   Africa   Americas
##  [50] Americas Americas Americas Americas Americas Americas Americas
##  [57] Americas Americas Americas Americas Oceania  Oceania  Oceania 
##  [64] Oceania  Oceania  Oceania  Oceania  Oceania  Oceania  Oceania 
##  [71] Oceania  Oceania  Europe   Europe   Europe   Europe   Europe  
##  [78] Europe   Europe   Europe   Europe   Europe   Europe   Europe  
##  [85] Asia     Asia     Asia     Asia     Asia     Asia     Asia    
##  [92] Asia     Asia     Asia     Asia     Asia     Asia     Asia    
##  [99] Asia     Asia    
## Levels: Africa Americas Asia Europe Oceania

We output just the first 100 elements of the vector. We could use gapminder$continent to get all of them.

On the other hand, if we select one column using select, we will still get a data frame:

gapminder %>% select(continent)
## # A tibble: 1,704 x 1
##    continent
##    <fct>    
##  1 Asia     
##  2 Asia     
##  3 Asia     
##  4 Asia     
##  5 Asia     
##  6 Asia     
##  7 Asia     
##  8 Asia     
##  9 Asia     
## 10 Asia     
## # ... with 1,694 more rows

(Note that when a tibble such as gapminder is printed, only the first 10 rows are shown by default, so we did not need to restrict the number of rows manually as we did with the vector.)
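To make the difference concrete, here is a minimal sketch that checks what each approach returns. The table toy is a hypothetical stand-in for gapminder (it is not part of the problem set), and pull is the dplyr function that, like $, returns the column as a vector.

```r
library(dplyr)

# Hypothetical toy table standing in for gapminder
toy <- data.frame(
  country = c("Afghanistan", "Albania", "Algeria"),
  continent = factor(c("Asia", "Europe", "Africa"))
)

col_vec <- toy$continent               # a factor, i.e., a vector
col_df  <- toy %>% select(continent)   # a data frame with one column

is.data.frame(col_vec)   # FALSE
is.data.frame(col_df)    # TRUE

# pull() extracts the column as a vector inside a pipeline, like $
identical(toy %>% pull(continent), col_vec)   # TRUE
```

The distinction matters when the result is passed on: functions such as mean or sapply expect a vector, while dplyr verbs such as filter or group_by expect a data frame.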

sapply, again

Here is the example we saw last time:

special_square <- function(x){
  if(x == 42){
    return(0)
  }else{
    return(x^2)
  }
}

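We couldn’t call special_square(c(1, 2, 42, 3)) directly because special_square contains an if-statement involving x, and if expects a single TRUE or FALSE value rather than a vector of four of them. (R 4.2.0 and later raise an error in this case; older versions used only the first element and issued a warning.) So instead, we used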

sapply(c(1, 2, 42, 3), FUN = special_square)
## [1] 1 4 0 9
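As an aside, the same result can be obtained without sapply by replacing the if-statement with the vectorized ifelse function. This is only a sketch of an alternative, not the approach used in the lecture, and the name special_square_vec is made up for illustration.

```r
# Vectorized variant (hypothetical name): ifelse tests every element of x
special_square_vec <- function(x){
  return(ifelse(x == 42, 0, x^2))
}

res <- special_square_vec(c(1, 2, 42, 3))
res
## [1] 1 4 0 9
```

Here ifelse evaluates the condition x == 42 element by element and picks 0 or x^2 for each position, so the function accepts a whole vector.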

Now, consider a slightly more complicated situation:

mult <- function(x, y){
  return(x * y)
}

The function mult takes two arguments. Suppose we want to multiply every element of c(1, 2, 3) by 7. Here is how we can do it:

v <- c(1, 2, 3)
z <- 7
sapply(v, FUN = mult, z)
## [1]  7 14 21

The result multiplies each of v[1], v[2], and v[3] by z. The first argument of sapply is the vector whose elements are passed, one at a time, to the first parameter of the function. Any arguments that follow FUN are passed along to mult as additional arguments. In our case, z gets assigned to the parameter y.
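The same call can be written in a couple of equivalent ways, which may make it clearer where z ends up. The sketch below repeats the definitions so that it runs on its own; the variable names named, wrapped, and direct are chosen here only for illustration.

```r
mult <- function(x, y){
  return(x * y)
}

v <- c(1, 2, 3)
z <- 7

# Name the extra argument so it is explicit that z goes to y
named <- sapply(v, FUN = mult, y = z)

# Wrap the call in an anonymous function
wrapped <- sapply(v, FUN = function(x) mult(x, z))

# Since * is already vectorized, mult also works on the whole vector
direct <- mult(v, z)

named
## [1]  7 14 21
```

All three give the same vector. sapply is only really needed when the function, like special_square, cannot handle a whole vector at once.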

The bug in the OKCupid dataset analysis

Consider this code from last lecture.

ContainsWord <- function(query, text){
  return(length(grep(query, text)) > 0)
}
profile.aug <- profiles %>% mutate(contains.love = sapply(profiles$essay0, FUN=ContainsWord, "love"))

Running this code results in an error. The elements of the first argument of sapply go to the first parameter of ContainsWord, so each element of the vector profiles$essay0 is passed into query, and "love" is passed into text.
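One way to see where each value goes is to swap in a function that only reports its arguments. The helper show_args below is hypothetical and has the same parameter order as the buggy ContainsWord; the two strings stand in for essays.

```r
# Hypothetical helper: same parameter order as the buggy ContainsWord
show_args <- function(query, text){
  return(paste0("query=", query, ", text=", text))
}

# USE.NAMES = FALSE keeps sapply from labelling the result with the inputs
routed <- sapply(c("essay one", "essay two"), FUN = show_args, "love",
                 USE.NAMES = FALSE)
routed
## [1] "query=essay one, text=love" "query=essay two, text=love"
```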

A call such as grep(profiles$essay0[1], "love") makes no sense: we want to look for the pattern "love" in the essay, not the other way around. One solution is to simply switch the parameters query and text in function(query, text). Then, when we run grep, the pattern is "love" and the text being searched is the essay.

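library(dplyr)        # for %>%, mutate, group_by, and summarize
library(okcupiddata)  # for the profiles data frame

# text comes first so that sapply passes each essay into it
ContainsWord <- function(text, query){
  return(length(grep(query, text)) > 0)
}

# "love" is the second argument to ContainsWord, so it goes to query
profile.aug <- profiles %>% mutate(contains.love = sapply(profiles$essay0, FUN=ContainsWord, "love"))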

profile.aug %>% group_by(income) %>% 
  summarize(love.prop = mean(contains.love))
## # A tibble: 13 x 2
##     income love.prop
##      <int>     <dbl>
##  1   20000     0.182
##  2   30000     0.187
##  3   40000     0.181
##  4   50000     0.198
##  5   60000     0.181
##  6   70000     0.173
##  7   80000     0.185
##  8  100000     0.162
##  9  150000     0.158
## 10  250000     0.188
## 11  500000     0.167
## 12 1000000     0.119
## 13      NA     0.180
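Switching the parameters is not the only possible fix. The sketch below shows two alternatives on a small made-up vector essays, which stands in for profiles$essay0: keeping the original parameter order and passing "love" by name, and calling grepl, which is already vectorized over the text. Neither is the approach taken in the lecture.

```r
# Hypothetical stand-ins for profiles$essay0
essays <- c("i love hiking", "coffee and code", NA, "dog lover")

# Original parameter order, as in the buggy version
ContainsWord <- function(query, text){
  return(length(grep(query, text)) > 0)
}

# Because query is given by name, each essay fills the remaining parameter, text
by_name <- sapply(essays, FUN = ContainsWord, query = "love", USE.NAMES = FALSE)

# grepl returns one TRUE/FALSE per element, so no sapply is needed;
# fixed = TRUE treats "love" as a literal string, not a regular expression
direct <- grepl("love", essays, fixed = TRUE)

by_name
## [1]  TRUE FALSE FALSE  TRUE
```

Note that both versions match substrings, so "dog lover" counts as containing "love", and that a missing essay (NA) counts as not containing it.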