Week 3 Lecture 1

Number of distinct names per capita, every year, for `sex=="F"`

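library(dplyr)      # provides %>%, filter, group_by, summarize, mutate, select
library(babynames)  # provides the babynames data frame

# For each year: number of distinct female names divided by number of female births
name.per.cap <- babynames %>% filter(sex=="F") %>%  
                   group_by(year) %>% 
                   summarize(distinct.names = n_distinct(name),total.by.year = sum(n)) %>%
                   mutate(distinct.per.capita = distinct.names/total.by.year) %>% 
                   select(year, distinct.per.capita)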

plot(name.per.cap$year, name.per.cap$distinct.per.capita)

Can we get the data for “M” and “F” at the same time?

name.per.cap.2 <- babynames %>% 
                   group_by(year, sex) %>% 
                   summarize(distinct.names = n_distinct(name),total.by.year = sum(n)) %>%
                   mutate(distinct.per.capita = distinct.names/total.by.year) %>% 
                   select(year, sex, distinct.per.capita)
name.per.cap.2

## # A tibble: 276 x 3
## # Groups:   year [138]
##     year sex   distinct.per.capita
##    <dbl> <chr>               <dbl>
##  1  1880 F                 0.0104 
##  2  1880 M                 0.00958
##  3  1881 F                 0.0102 
##  4  1881 M                 0.00990
##  5  1882 F                 0.00953
##  6  1882 M                 0.00967
##  7  1883 F                 0.00938
##  8  1883 M                 0.00984
##  9  1884 F                 0.00908
## 10  1884 M                 0.00983
## # … with 266 more rows

Should we compute names per capita or total names?

There are arguments either way.

Why compute the total number of names?

We are interested in how diverse the names are in the US. In some sense, the population is irrelevant. If there are 50,000 names available to choose from, that’s interesting regardless of the fact that the population is very large.

Why compute the number of names per capita?

If there are only 1,000 babies born, there cannot be more than 1,000 unique names. Computing names per capita mitigates this: if there are 1,000 newborns and each of them has a unique name, that indicates that the names are very diverse, even though the total number of names is small.
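To make the difference between the two measures concrete, here is a small sketch on made-up data. The data frame toy and its counts are hypothetical and are not taken from babynames; the point is only that the total number of names and the number of names per capita can move in opposite directions.

```r
# Hypothetical toy data: one row per (year, name), n = number of babies
toy <- data.frame(
  year = c(1900, 1900, 1900, 2000, 2000, 2000, 2000),
  name = c("Mary", "Anna", "Ruth", "Emma", "Mia", "Zoe", "Ava"),
  n    = c(50, 30, 20, 400, 300, 200, 100)
)

# Total number of distinct names in each year
distinct.names <- tapply(toy$name, toy$year, function(x) length(unique(x)))

# Total number of babies in each year
total.by.year <- tapply(toy$n, toy$year, sum)

# Distinct names per capita
distinct.per.capita <- distinct.names / total.by.year

distinct.names        # 3 names in 1900, 4 names in 2000
distinct.per.capita   # 0.03 in 1900, 0.004 in 2000
```

In this toy example the total number of names goes up while the number of names per capita goes down, so the two measures would tell different stories.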

Maybe we should do neither?

Perhaps a compromise would be to compute the number of unique names per cultural or linguistic group. (Although that is difficult to define.)

`sapply`

Suppose we want to square a bunch of numbers contained in a vector. We could do that using

vec = c(1, 2, 4, 5)
vec ** 2

## [1]  1  4 16 25

Let’s now try to use a function:

square <- function(x){
  x ** 2
}
vec = c(1, 2, 4, 5)
square(vec)

## [1]  1  4 16 25

This still works: our vec gets passed to the function, and vec ** 2 is computed by squaring every element of the vector.

But here we’ll see a problem:

special.square <- function(x){
  if(x == 42){
    42
  }else{
    x ** 2
  }
}

vec = c(1, 2, 4, 5)
# special.square(vec) # would produce an error!

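The problem here is that we cannot run if(c(1, 2, 4, 5) == 42). The comparison is done element by element, so it is the same as if(c(F, F, F, F)), and that just doesn’t make sense: there has to be exactly one logical value inside the if. (R 4.2.0 and later report an error here; earlier versions gave a warning and used only the first element.)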
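A quick way to see what the if receives is to run the comparison on its own. This is only an illustration; the name cond is introduced here and is not part of the lecture code.

```r
vec <- c(1, 2, 4, 5)

# The comparison is done element by element, so the result has four values
cond <- vec == 42
cond          # FALSE FALSE FALSE FALSE
length(cond)  # 4, but if() needs a condition of length 1

# any() collapses the comparison to a single logical value
any(cond)     # FALSE
```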

We can use sapply instead:

vec = c(1, 2, 42, 5)
sapply(X = vec, FUN = special.square)

## [1]  1  4 42 25

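The function sapply applies the function given as FUN to every element of the vector given as X, one element at a time, and collects the results into a vector. Each call to special.square therefore sees a single number, so the condition inside the if has length one. This avoids the problem we had before.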
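For comparison, here are two other base R ways to get the same result. They are alternatives to the sapply call, not the method used in the lecture: vapply also checks the type of each result, and ifelse makes the choice element by element without a helper function. The names res.vapply and res.ifelse are introduced here for illustration.

```r
special.square <- function(x){
  if(x == 42){
    42
  }else{
    x ** 2
  }
}

vec <- c(1, 2, 42, 5)

# vapply works like sapply, but also checks that every result is a single number
res.vapply <- vapply(X = vec, FUN = special.square, FUN.VALUE = numeric(1))

# ifelse chooses between the two values separately for each element
res.ifelse <- ifelse(vec == 42, 42, vec ** 2)

res.vapply  # 1 4 42 25
res.ifelse  # 1 4 42 25
```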

`grep`

Here is how the function grep works: it takes a query string (a pattern, which by default is interpreted as a regular expression) and a character vector. It returns the indices of the elements of the vector that contain the query:

grep("school", "medical school")

## [1] 1

grep("school", c("medical school", "school of life"))

## [1] 1 2

grep("law", c("medical school", "school of life", "princeton"))

## integer(0)

We can use length to check whether the query is contained in any element of the character vector:

length(grep("life", "medical school")) > 0 # False

## [1] FALSE

length(grep("life", "school of life")) > 0 # True

## [1] TRUE
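As an aside, base R also has grepl, which returns TRUE or FALSE for each element instead of indices. The sketch below shows how it relates to the length(grep(...)) > 0 idiom; the names per.element and any.match are introduced here for illustration.

```r
texts <- c("medical school", "school of life")

# grepl answers the question separately for each element
per.element <- grepl("life", texts)
per.element   # FALSE TRUE

# any() reduces the per-element answers to one value for the whole vector,
# which is what length(grep(...)) > 0 computes
any.match <- any(per.element)
any.match                            # TRUE
length(grep("life", texts)) > 0      # TRUE
```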

`contains.word`

Let’s now write a function that determines whether a query word is contained in a text:

contains.word <- function(text, query){
  length(grep(query, text)) > 0
}

Try using this function to obtain both TRUE and FALSE.

(What are some issues with this function?)
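Two possible issues are sketched below: the query also matches inside longer words, and matching is case-sensitive. The function contains.word.strict is a hypothetical variant written for this illustration, not part of the lecture code. It assumes the query contains only letters, since regular expression special characters in the query would change the pattern.

```r
contains.word <- function(text, query){
  length(grep(query, text)) > 0
}

# Issue 1: the query matches inside longer words
contains.word("I lost my glove", "love")       # TRUE

# Issue 2: matching is case-sensitive
contains.word("LOVE is all you need", "love")  # FALSE

# Hypothetical variant: match the query as a whole word and ignore case
contains.word.strict <- function(text, query){
  pattern <- paste0("\\b", query, "\\b")   # \\b marks a word boundary
  any(grepl(pattern, text, ignore.case = TRUE, perl = TRUE))
}

contains.word.strict("I lost my glove", "love")       # FALSE
contains.word.strict("LOVE is all you need", "love")  # TRUE
```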

The OkCupid dataset

Let’s now apply this to the OkCupid dataset. Suppose that we want to find the proportion of profiles whose first essay (essay0) contains the word "love", for each income bracket.

library(okcupiddata)


profile.aug <- profiles %>% mutate(contains.love = sapply(X = profiles$essay0, FUN = contains.word, "love"))

profile.aug %>% group_by(income) %>% 
                summarize(love.prop = mean(contains.love))

## # A tibble: 13 x 2
##     income love.prop
##      <int>     <dbl>
##  1   20000     0.182
##  2   30000     0.187
##  3   40000     0.181
##  4   50000     0.198
##  5   60000     0.181
##  6   70000     0.173
##  7   80000     0.185
##  8  100000     0.162
##  9  150000     0.158
## 10  250000     0.188
## 11  500000     0.167
## 12 1000000     0.119
## 13      NA     0.180

There is just one thing here that’s new. If we compute the mean of a vector of logical values, we’ll get the proportion of Ts. That’s because Ts get converted to 1s, and Fs get converted to 0s. (Exercise: explain why this means that the mean of the vector will be the proportion of Ts).
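A small check of this on a made-up vector (flags is a hypothetical example, not OkCupid data):

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)

as.numeric(flags)           # 1 0 1 1
sum(flags)                  # 3, the number of TRUEs
sum(flags) / length(flags)  # 0.75
mean(flags)                 # 0.75, the same proportion
```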

Note that sapply applies the function contains.word with the text being each element of profiles$essay0, and the query being "love". This is determined by the order of the arguments in function(text, query). The first argument gets assigned the elements of the first argument of sapply (i.e., profiles$essay0), one at a time, and the remaining arguments (just query in this case) get assigned the extra arguments that come after FUN in sapply(profiles$essay0, FUN=contains.word, ...).
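The same mechanism can be tried on a short made-up vector. Here essays is a hypothetical stand-in for profiles$essay0, and USE.NAMES = FALSE is added only so that the result is printed without the essay texts as names.

```r
contains.word <- function(text, query){
  length(grep(query, text)) > 0
}

# Hypothetical stand-in for profiles$essay0
essays <- c("i love hiking", "new in town", "coffee lover")

# Each element of essays becomes text; "love" is passed along as query
by.position <- sapply(X = essays, FUN = contains.word, "love", USE.NAMES = FALSE)

# Naming the extra argument makes the call easier to read
by.name <- sapply(X = essays, FUN = contains.word, query = "love", USE.NAMES = FALSE)

by.position  # TRUE FALSE TRUE
by.name      # TRUE FALSE TRUE
```

Passing the extra argument by name (query = "love") gives the same result as passing it by position, and it does not depend on remembering the order of the arguments in contains.word.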
