We reviewed the solutions (http://guerzhoy.mycpanel.princeton.edu/201s19/pre/P1/p1_soln.html).
The $ operator allows us to extract a column from a table as a vector:
gapminder$continent[1:100] # outputting the first 100 elements so as not to waste space
## [1] Asia Asia Asia Asia Asia Asia Asia
## [8] Asia Asia Asia Asia Asia Europe Europe
## [15] Europe Europe Europe Europe Europe Europe Europe
## [22] Europe Europe Europe Africa Africa Africa Africa
## [29] Africa Africa Africa Africa Africa Africa Africa
## [36] Africa Africa Africa Africa Africa Africa Africa
## [43] Africa Africa Africa Africa Africa Africa Americas
## [50] Americas Americas Americas Americas Americas Americas Americas
## [57] Americas Americas Americas Americas Oceania Oceania Oceania
## [64] Oceania Oceania Oceania Oceania Oceania Oceania Oceania
## [71] Oceania Oceania Europe Europe Europe Europe Europe
## [78] Europe Europe Europe Europe Europe Europe Europe
## [85] Asia Asia Asia Asia Asia Asia Asia
## [92] Asia Asia Asia Asia Asia Asia Asia
## [99] Asia Asia
## Levels: Africa Americas Asia Europe Oceania
We output just the first 100 elements of the vector. We could use gapminder$continent to get all of them.
On the other hand, if we select one column using select, we will still get a data frame:
gapminder %>% select(continent)
## # A tibble: 1,704 x 1
## continent
## <fct>
## 1 Asia
## 2 Asia
## 3 Asia
## 4 Asia
## 5 Asia
## 6 Asia
## 7 Asia
## 8 Asia
## 9 Asia
## 10 Asia
## # ... with 1,694 more rows
(Note that when a data frame is output, not all the rows are printed by default, so we did not need to artificially restrict the number of rows outputted like we did with the vector.)
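The same distinction can be seen without any packages. The following is a minimal sketch using a small made-up data frame (the name toy and its values are hypothetical, not taken from gapminder). Base R's single-bracket indexing stands in for select here, on the assumption that what matters is only that both keep the data-frame structure.

```r
# A small hypothetical data frame (not the gapminder data)
toy <- data.frame(
  country   = c("Canada", "France", "Japan"),
  continent = c("Americas", "Europe", "Asia"),
  stringsAsFactors = FALSE
)

# `$` extracts the column as a plain vector
as_vector <- toy$continent

# Single-bracket indexing by name keeps a one-column data frame
as_frame <- toy["continent"]

print(class(as_vector))  # "character"
print(class(as_frame))   # "data.frame"
```

Which form is more convenient depends on what comes next: functions such as mean or sapply expect a vector, while further pipeline steps usually expect a data frame.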
sapply, again
Here is the example we saw last time:
special_square <- function(x){
  if(x == 42){
    return(0)
  }else{
    return(x^2)
  }
}
We couldn’t use special_square(c(1, 2, 42, 3)) because special_square contains an if-statement involving x, and the condition of an if-statement must be a single TRUE or FALSE value, not a vector. So instead, we used
sapply(c(1, 2, 42, 3), FUN = special_square)
## [1] 1 4 0 9
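For a function this simple, a vectorized version is also possible. The sketch below is not from the lecture: it uses base R's ifelse, which evaluates the condition element by element, and the name special_square_vec is made up for illustration.

```r
# Hypothetical vectorized variant of special_square:
# ifelse() tests each element of x separately, so no sapply is needed.
special_square_vec <- function(x) {
  ifelse(x == 42, 0, x^2)
}

result <- special_square_vec(c(1, 2, 42, 3))
print(result)
```

sapply remains the more general tool, since it works with any function written for a single value, whereas ifelse only fits cases that reduce to one condition with two outcomes.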
Now, consider a slightly more complicated situation:
mult <- function(x, y){
  return(x * y)
}
The function mult takes two arguments. Suppose we want to multiply every element of c(1, 2, 3) by 7. Here is how we can do it:
v <- c(1, 2, 3)
z <- 7
sapply(v, FUN = mult, z)
## [1] 7 14 21
The result is v[1], v[2], and v[3], each multiplied by z. The first argument of sapply is the vector whose elements we want to pass, one at a time, to the function. After the FUN argument, we can supply additional arguments that are passed on to mult. In our case, z gets assigned to the parameter y.
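Two equivalent ways of writing the same call may make the argument passing easier to see. This is a sketch rather than the lecture's code: the variable names by_name and by_wrapper are made up, and mult, v, and z are redefined so the block runs on its own.

```r
mult <- function(x, y) {
  return(x * y)
}

v <- c(1, 2, 3)
z <- 7

# Name the extra argument explicitly, so it is clear that z goes to y
by_name <- sapply(v, FUN = mult, y = z)

# Or wrap the call in an anonymous function that spells out both arguments
by_wrapper <- sapply(v, FUN = function(elem) mult(elem, z))

print(by_name)
print(by_wrapper)
```

Naming the extra argument costs a few characters but removes any doubt about which parameter receives it.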
Consider this code from last lecture.
ContainsWord <- function(query, text){
  return(length(grep(query, text)) > 0)
}
profile.aug <- profiles %>% mutate(contains.love = sapply(profiles$essay0, FUN=ContainsWord, "love"))
Running this code results in an error. The reason is that the elements of the first argument of sapply go to the first parameter of ContainsWord. So each element of the vector profiles$essay0 is passed into query, and "love" is passed into text.
But grep(profiles$essay0[1], "love") makes no sense: we want to look for the pattern "love" in the essay, not the other way around. One solution is to simply switch the parameters query and text in function(query, text). Then, when we run grep, we’ll be doing something that makes sense.
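library(dplyr)        # for %>%, mutate, group_by, summarize
library(okcupiddata)  # for the profiles data frame
# text comes first, so that sapply passes each essay into text
ContainsWord <- function(text, query){
  return(length(grep(query, text)) > 0)
}
profile.aug <- profiles %>% mutate(contains.love = sapply(profiles$essay0, FUN=ContainsWord, "love"))
# proportion of profiles mentioning "love", by income bracket
profile.aug %>% group_by(income) %>%
  summarize(love.prop = mean(contains.love))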
## # A tibble: 13 x 2
## income love.prop
## <int> <dbl>
## 1 20000 0.182
## 2 30000 0.187
## 3 40000 0.181
## 4 50000 0.198
## 5 60000 0.181
## 6 70000 0.173
## 7 80000 0.185
## 8 100000 0.162
## 9 150000 0.158
## 10 250000 0.188
## 11 500000 0.167
## 12 1000000 0.119
## 13 NA 0.180
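Switching the parameters is not the only option. As a sketch under assumptions (the essays vector below is made up and is not the okcupiddata profiles), we can keep the original parameter order and pass "love" by name, so that each essay falls into the remaining parameter text. For this particular task, base R's grepl, which is already vectorized over the text, gives the same answer without sapply.

```r
# Original parameter order: query first, text second
ContainsWord <- function(query, text) {
  return(length(grep(query, text)) > 0)
}

# Hypothetical stand-in for profiles$essay0
essays <- c("I love hiking", "I enjoy cooking", "love and tacos")

# query is matched by name, so each essay is passed to text
with_sapply <- sapply(essays, FUN = ContainsWord, query = "love",
                      USE.NAMES = FALSE)

# grepl checks every element of essays in a single call
with_grepl <- grepl("love", essays)

print(with_sapply)
print(with_grepl)
```

USE.NAMES = FALSE is set only to keep the printed result short, since sapply otherwise labels each result with the full text of the essay.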