Week 2 Lecture 2

Review of `dplyr` functions, on the board

Functions we reviewed:

filter
mutate
select
group_by
summarize

Make sure you can articulate how group_by works with summarize, and that you are clear on the differences between filter, select, and summarize.
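As a quick way to check your understanding, here is a minimal sketch on a small made-up data frame. The data frame toy and its values are hypothetical and are not part of the lecture data.

```r
library(dplyr)

# Hypothetical toy data (not from the lecture)
toy <- data.frame(
  year  = c(2000, 2000, 2001, 2001, 2001),
  name  = c("Ann", "Bob", "Ann", "Cat", "Dan"),
  count = c(10, 20, 30, 40, 50)
)

# filter keeps a subset of the rows
kept.rows <- toy %>% filter(year == 2001)

# select keeps a subset of the columns
kept.cols <- toy %>% select(year, count)

# summarize without groups collapses the whole data frame to one row
one.row <- toy %>% summarize(total = sum(count))

# group_by followed by summarize gives one row per group
per.year <- toy %>% group_by(year) %>% summarize(total = sum(count))
```

Here per.year has one row for 2000 and one for 2001, because summarize is applied separately within each group created by group_by.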

New `dplyr` function: `n`

n() computes the number of rows in the group. For example, we can compute the number of different names in the data frame babynames, for each year, like this:

library(dplyr)      # provides %>%, group_by, summarize, and n
library(babynames)
babynames %>% group_by(year) %>% summarize(total.names = n())

## # A tibble: 136 x 2
##     year total.names
##    <dbl>       <int>
##  1  1880        2000
##  2  1881        1935
##  3  1882        2127
##  4  1883        2084
##  5  1884        2297
##  6  1885        2294
##  7  1886        2392
##  8  1887        2373
##  9  1888        2651
## 10  1889        2590
## # ... with 126 more rows

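We grouped the rows by year (one group for 1880, one for 1881, etc.), and then, for each group, computed the number of rows. Since each row corresponds to a different name, we computed the number of names in each year. (Strictly speaking, each row of babynames is a combination of a name and a sex, so a name given to both girls and boys in the same year is counted twice.)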
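To see what n() counts, here is a small sketch on made-up data shaped like babynames. The data frame toy and its values are hypothetical. It also uses n_distinct, another dplyr function not covered on the board, to count each name once per year.

```r
library(dplyr)

# Hypothetical toy data with the same kind of columns as babynames
toy <- data.frame(
  year = c(1880, 1880, 1880, 1881, 1881),
  sex  = c("F", "M", "F", "F", "M"),
  name = c("Mary", "Mary", "Anna", "Mary", "John"),
  n    = c(5, 1, 3, 4, 6)
)

counts <- toy %>%
  group_by(year) %>%
  summarize(total.rows     = n(),               # number of rows in the group
            distinct.names = n_distinct(name))  # each name counted once
```

In this toy data, 1880 has three rows but only two distinct names, because "Mary" appears once for each sex.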

Now, let’s compute the number of distinct names per capita in a different way from before:

name.per.cap <-  babynames %>% filter(sex=="F") %>%   
                 select(year, name, count = n) %>% 
                 group_by(year) %>% 
                 summarize(tot.pop = sum(count), tot.names = n()) %>% 
                 mutate(names.per.capita = tot.names/tot.pop)

We renamed the column n to count to avoid confusion with the function n(). We did not have to do that.

The only thing to explain here is tot.pop = sum(count). To estimate the total population, we are summing up all the counts for the individual names (inside each group, with each group corresponding to one year).

Should we compute names per capita or total names?

There are arguments either way.

Why compute the total number of names?

We are interested in how diverse the names are in the US. In some sense, the population is irrelevant. If there are 50,000 names available to choose from, that’s interesting regardless of the fact that the population is very large.

Why compute the number of names per capita?

If there are only 1,000 babies born, there cannot be more than 1,000 unique names. Computing names per capita seems to mitigate this: if there are 1,000 newborns and each of them has a unique name, that indicates that the names are very diverse.

Maybe we should do neither?

Perhaps a compromise would be to compute the number of unique names per cultural or linguistic group. (Although that is difficult to define.)

An aside on rounding

Suppose we want to round $n/prop$ to the nearest 100. Here is how we might go about that if we start with $10557$:

$10557\xrightarrow{\times 0.01} 105.57 \xrightarrow{\text{round}}106 \xrightarrow{\times 100}10600$

In R, this would be written as 100 * round(0.01 * 10557). We can use this trick in order to round the population estimated using n and prop as follows:

babynames %>% mutate(total.pop.rounded = 100*round(0.01*n/prop))

(Actually, we could accomplish the same using round(10557, digits = -2). See ?round for details.)
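As a quick check, the two approaches can be compared in base R. The variable names below are just for illustration.

```r
x <- 10557

# Scale down, round to the nearest integer, scale back up
by.hand <- 100 * round(0.01 * x)

# Negative digits round to the left of the decimal point
built.in <- round(x, digits = -2)

by.hand == built.in
```

Both give 10600 for this input.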

`sapply`

Suppose we want to square a bunch of numbers contained in a vector. We could do that using

x = c(1, 2, 4, 5)
x ** 2

## [1]  1  4 16 25

Let’s now try to use a function:

square <- function(x){
  return(x ** 2)
}
x = c(1, 2, 4, 5)
square(x)

## [1]  1  4 16 25

This still works: our x gets passed to the function, where x ** 2 is computed by squaring every element of the vector.

But here we’ll see a problem:

special_square <- function(x){
  if(x == 42){
    return(42)
  }else{
    return(x ** 2)
  }
}

x = c(1, 2, 4, 5)
# special_square(x) # would produce an error!

The problem here is that we cannot run if(c(1, 2, 4, 5) == 42). That’s because if(c(F, F, F, F)) just doesn’t make sense: there has to be one logical value inside the if.

We can use sapply instead:

x = c(1, 2, 42, 5)
sapply(x, FUN = special_square)

## [1]  1  4 42 25

The function sapply applies the function specified after FUN = to every element of x. This avoids the problem we had before.
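For comparison, here is a sketch of two other ways to get the same result in base R. These were not covered in the lecture: vapply works like sapply but asks you to state the type of each result, and ifelse is a vectorized version of if/else.

```r
special_square <- function(x) {
  if (x == 42) {
    return(42)
  } else {
    return(x ** 2)
  }
}

x <- c(1, 2, 42, 5)

# One call to special_square per element of x
via.sapply <- sapply(x, FUN = special_square)

# Same idea, but we promise that each result is a single number
via.vapply <- vapply(x, FUN = special_square, FUN.VALUE = numeric(1))

# Vectorized alternative: the test is evaluated for every element at once
via.ifelse <- ifelse(x == 42, 42, x ** 2)
```

All three produce 1, 4, 42, 25 for this x.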

`grep`

Here is how the function grep works: it takes a query string and a character vector, and outputs the indices of the elements of the vector that contain the query:

grep("school", "medical school")

## [1] 1

grep("school", c("medical school", "school of life"))

## [1] 1 2

grep("law", c("medical school", "school of life", "princeton"))

## integer(0)

We can use length to check whether the query string is contained in any element of the vector:

length(grep("life", "medical school")) > 0 # False

## [1] FALSE

length(grep("life", "school of life")) > 0 # True

## [1] TRUE

`ContainsWord`

Let’s now write a function that determines whether a query word is contained in a text

ContainsWord <- function(text, query){
  return(length(grep(query, text)) > 0)
}

Try using this function to obtain TRUE and FALSE.

(What are some issues with this function?)
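One possible issue, among others, is that grep looks for the query anywhere in the text, so "love" is also found inside "glove", and the match is case-sensitive. The sketch below shows this, along with a hypothetical variant, ContainsWholeWord (not from the lecture), that assumes the query is a plain word with no special regular-expression characters.

```r
ContainsWord <- function(text, query) {
  return(length(grep(query, text)) > 0)
}

# Hypothetical variant: match the query only as a whole word, ignoring case.
# \\b marks a word boundary; perl = TRUE selects Perl-style regular expressions.
ContainsWholeWord <- function(text, query) {
  pattern <- paste0("\\b", query, "\\b")
  return(any(grepl(pattern, text, ignore.case = TRUE, perl = TRUE)))
}

substring.match <- ContainsWord("I lost my glove", "love")       # matches inside "glove"
case.miss       <- ContainsWord("I LOVE dogs", "love")           # case-sensitive
whole.word      <- ContainsWholeWord("I lost my glove", "love")  # no whole-word match
any.case        <- ContainsWholeWord("I LOVE dogs", "love")      # case is ignored
```

Whether whole-word, case-insensitive matching is what you want depends on the question being asked of the data.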

The OkCupid dataset

Let’s now apply this to the OkCupid dataset. Suppose that we want to find the percentage of profiles that contain the word "love" in them, for each income bracket.

library(okcupiddata)


profile.aug <- profiles %>% mutate(contains.love = sapply(profiles$essay0, FUN=ContainsWord, "love"))

profile.aug %>% group_by(income) %>% 
                summarize(love.prop = mean(contains.love))

## # A tibble: 13 x 2
##     income love.prop
##      <int>     <dbl>
##  1   20000     0.182
##  2   30000     0.187
##  3   40000     0.181
##  4   50000     0.198
##  5   60000     0.181
##  6   70000     0.173
##  7   80000     0.185
##  8  100000     0.162
##  9  150000     0.158
## 10  250000     0.188
## 11  500000     0.167
## 12 1000000     0.119
## 13      NA     0.180

There is just one thing here that’s new. If we compute the mean of a vector of logical values, we’ll get the proportion of Ts. That’s because Ts get converted to 1s, and Fs get converted to 0s. (Exercise: explain why this means that the mean of the vector will be the proportion of Ts).
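A small base R sketch of this conversion, using a made-up logical vector:

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)

as.ones.and.zeros <- as.numeric(flags)  # TRUE becomes 1, FALSE becomes 0
number.true       <- sum(flags)         # adding the 1s counts the TRUEs
prop.true         <- mean(flags)        # sum divided by the length of the vector
```

Here three of the four values are TRUE, so the mean is 3/4 = 0.75.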

Note that sapply applies the function ContainsWord with the text being each element of profiles$essay0, and the query being "love". This is determined by the order of the variables in function(text, query). The first variable gets assigned elements of the first argument of sapply (i.e., profiles$essay0), and the other variables (just query in this case) get assigned every argument after the FUN argument in sapply(profiles$essay0, FUN=ContainsWord, ..
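Here is a minimal sketch of how the extra argument is passed along, using a short made-up character vector in place of profiles$essay0. Naming the extra argument (query = "love") and setting USE.NAMES = FALSE are optional choices made here for readability, not something the lecture code does.

```r
ContainsWord <- function(text, query) {
  return(length(grep(query, text)) > 0)
}

# Hypothetical stand-in for profiles$essay0
essays <- c("i love hiking", "new to the city", "love to cook")

# Each element of essays becomes text; "love" is passed along as query
by.position <- sapply(essays, FUN = ContainsWord, "love")

# Same call with the extra argument named, and without the texts as names
by.name <- sapply(essays, FUN = ContainsWord, query = "love", USE.NAMES = FALSE)
```

When sapply is given a character vector, it labels each result with the corresponding string by default. USE.NAMES = FALSE drops those labels, which may be convenient when the strings are long.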