## dplyr functions, on the board

Functions we reviewed:

- `filter`
- `mutate`
- `select`
- `group_by`
- `summarize`

Make sure you can articulate how `group_by` works with `summarize`, and that you are clear on the differences between `filter`, `select`, and `summarize`.
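One way to keep those last three straight is to note which part of the data frame each one acts on. Here is a quick sketch (not from the notes) using dplyr's built-in `starwars` data:

```r
library(dplyr)

# filter keeps a subset of the *rows*
starwars %>% filter(species == "Droid")

# select keeps a subset of the *columns*
starwars %>% select(name, height)

# summarize collapses the rows down to (here) a single summary row
starwars %>% summarize(mean.height = mean(height, na.rm = TRUE))
```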
## The dplyr function `n`

`n()` computes the number of rows in the group. For example, we can compute the number of different names in the data frame `babynames`, for each year, like this:
```r
library(babynames)
babynames %>% group_by(year) %>% summarize(total.names = n())
## # A tibble: 136 x 2
##     year total.names
##    <dbl>       <int>
##  1  1880        2000
##  2  1881        1935
##  3  1882        2127
##  4  1883        2084
##  5  1884        2297
##  6  1885        2294
##  7  1886        2392
##  8  1887        2373
##  9  1888        2651
## 10  1889        2590
## # ... with 126 more rows
```
We group the rows by year (one group for 1880, one for 1881, etc.), and then, for each group, compute the number of rows. Since each row corresponds to a different name, this gives us the number of names in each year.
Now, let’s compute the number of distinct names per capita in a different way from before:
```r
name.per.cap <- babynames %>% filter(sex == "F") %>%
                select(year, name, count = n) %>%
                group_by(year) %>%
                summarize(tot.pop = sum(count), tot.names = n()) %>%
                mutate(names.per.capita = tot.names/tot.pop)
```
We renamed the column `n` to `count` to avoid confusion; we did not have to do that.

The only thing to explain here is `tot.pop = sum(count)`. To estimate the total population, we sum up all the counts for the individual names (inside each group, with each group corresponding to one year).
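To see exactly what `sum(count)` and `n()` compute within each group, here is a toy example (the data are made up for illustration):

```r
library(dplyr)
toy <- tibble(year  = c(1880, 1880, 1880, 1881, 1881),
              count = c(100, 50, 25, 70, 30))
toy %>% group_by(year) %>%
  summarize(tot.pop = sum(count), tot.names = n())
## For 1880: tot.pop = 175 (i.e., 100 + 50 + 25), tot.names = 3
## For 1881: tot.pop = 100 (i.e., 70 + 30),       tot.names = 2
```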
There are arguments either way about whether to look at names per capita at all:

- We are interested in how diverse the names are in the US. In some sense, the population is irrelevant: if there are 50,000 names available to choose from, that's interesting regardless of the fact that the population is very large.
- On the other hand, if there are only 1,000 babies born, there cannot be more than 1,000 unique names. Computing names per capita seems to mitigate this: if there are 1,000 newborns and each of them has a unique name, that indicates that the names are very diverse.
- Perhaps a compromise would be to compute the number of unique names per cultural or linguistic group. (Although that is difficult to define.)
Suppose we want to round \(n/prop\) to the nearest 100. Here is how we might go about that if we start with \(10557\):

\(10557 \xrightarrow{\times 0.01} 105.57 \xrightarrow{\text{round}} 106 \xrightarrow{\times 100} 10600\)

In R, this would be written as `100 * round(0.01 * 10557)`. We can use this trick to round the population estimated using `n` and `prop` as follows:
```r
babynames %>% mutate(total.pop.rounded = 100 * round(0.01 * n/prop))
```
(Actually, we could accomplish the same using `round(10557, digits = -2)`. See `?round` for details.)
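The two rounding tricks agree; here is a quick check:

```r
x <- 10557
100 * round(0.01 * x)   # scale down, round, scale back up
## [1] 10600
round(x, digits = -2)   # round directly to the nearest 100
## [1] 10600
```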
## sapply

Suppose we want to square a bunch of numbers contained in a vector. We could do that using

```r
x = c(1, 2, 4, 5)
x ** 2
## [1]  1  4 16 25
```
Let’s now try to use a function:
```r
square <- function(x){
  return(x ** 2)
}
x = c(1, 2, 4, 5)
square(x)
## [1]  1  4 16 25
```
This still works: our `x` gets passed to the function, where `x ** 2` is computed by squaring every element of the vector.
But here we’ll see a problem:
```r
special_square <- function(x){
  if(x == 42){
    return(42)
  }else{
    return(x ** 2)
  }
}
x = c(1, 2, 4, 5)
# special_square(x) # would produce an error!
```
The problem here is that we cannot run `if(c(1, 2, 4, 5) == 42)`. That's because `if(c(F, F, F, F))` just doesn't make sense: there has to be exactly one logical value inside the `if`.
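(An aside, not covered above: base R's `ifelse` is a vectorized version of this if/else pattern. It evaluates the condition for every element at once, so no loop or `sapply` is needed for this particular function.)

```r
x <- c(1, 2, 42, 5)
# the condition x == 42 is checked element by element
ifelse(x == 42, 42, x ** 2)
## [1]  1  4 42 25
```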
We can use `sapply` instead:
```r
x = c(1, 2, 42, 5)
sapply(x, FUN = special_square)
## [1]  1  4 42 25
```
The function `sapply` applies the function specified after `FUN =` to every element of `x`. This avoids the problem we had before.
## grep

Here is how the function `grep` works: it takes in a query string and a character vector. It outputs the indices of the elements in which the query is contained:
```r
grep("school", "medical school")
## [1] 1
grep("school", c("medical school", "school of life"))
## [1] 1 2
grep("law", c("medical school", "school of life", "princeton"))
## integer(0)
```
We can use `length` in order to see whether the query is contained in any of the strings in the vector or not:
```r
length(grep("life", "medical school")) > 0 # FALSE
## [1] FALSE
length(grep("life", "school of life")) > 0 # TRUE
## [1] TRUE
```
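(An alternative worth knowing, though not used in these notes: base R's `grepl` returns a logical value for every element of the vector directly, so the `length` trick is not needed.)

```r
grepl("life", c("medical school", "school of life"))
## [1] FALSE  TRUE
```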
## ContainsWord

Let's now write a function that determines whether a query word is contained in a text:

```r
ContainsWord <- function(text, query){
  return(length(grep(query, text)) > 0)
}
```
Try using this function to obtain `TRUE` and `FALSE`.
(What are some issues with this function?)
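(A hint for one of them: `grep` matches substrings, not whole words.)

```r
ContainsWord <- function(text, query){
  return(length(grep(query, text)) > 0)
}
# "love" is a substring of "glove", so this returns TRUE
# even though the *word* "love" does not appear
ContainsWord("a glove compartment", "love")
## [1] TRUE
```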
Let's now apply this to the OkCupid dataset. Suppose that we want to find the percentage of profiles that contain the word `"love"` in them, for each income bracket.
```r
library(okcupiddata)
profile.aug <- profiles %>% mutate(contains.love = sapply(profiles$essay0, FUN = ContainsWord, "love"))
profile.aug %>% group_by(income) %>%
  summarize(love.prop = mean(contains.love))
## # A tibble: 13 x 2
##     income love.prop
##      <int>     <dbl>
##  1   20000     0.182
##  2   30000     0.187
##  3   40000     0.181
##  4   50000     0.198
##  5   60000     0.181
##  6   70000     0.173
##  7   80000     0.185
##  8  100000     0.162
##  9  150000     0.158
## 10  250000     0.188
## 11  500000     0.167
## 12 1000000     0.119
## 13      NA     0.180
```
There is just one thing here that's new: if we compute the mean of a vector of logical values, we get the proportion of `TRUE`s. That's because `TRUE`s get converted to `1`s, and `FALSE`s get converted to `0`s. (Exercise: explain why this means that the mean of the vector will be the proportion of `TRUE`s.)
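A quick check of this conversion:

```r
mean(c(TRUE, TRUE, FALSE, TRUE))   # same as mean(c(1, 1, 0, 1))
## [1] 0.75
```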
Note that `sapply` applies the function `ContainsWord` with the `text` being each element of `profiles$essay0`, and the `query` being `"love"`. This is determined by the order of the parameters in `function(text, query)`: the first parameter gets assigned elements of the first argument of `sapply` (i.e., `profiles$essay0`), and the other parameters (just `query` in this case) get assigned the arguments after the `FUN` argument in `sapply(profiles$essay0, FUN = ContainsWord, "love")`.
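The same argument-matching can be seen on a small made-up example (the function `add_k` is hypothetical, defined just for illustration):

```r
add_k <- function(x, k){
  return(x + k)
}
# x is bound to each element of c(1, 2, 3) in turn;
# the trailing 10 is passed along as k every time
sapply(c(1, 2, 3), FUN = add_k, 10)
## [1] 11 12 13
```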