n_distinct

n_distinct(df) is the number of distinct rows in a data frame df. (Recall that for vectors, we could use length(unique(v)).)

Here, we are computing the number of distinct names for each combination of sex and year:

babynames %>% group_by(sex, year) %>% summarize(distinct_names_year_sex = n_distinct(name))
## # A tibble: 276 x 3
## # Groups:   sex [2]
##    sex    year distinct_names_year_sex
##    <chr> <dbl>                   <int>
##  1 F      1880                     942
##  2 F      1881                     938
##  3 F      1882                    1028
##  4 F      1883                    1054
##  5 F      1884                    1172
##  6 F      1885                    1197
##  7 F      1886                    1282
##  8 F      1887                    1306
##  9 F      1888                    1474
## 10 F      1889                    1479
## # … with 266 more rows

You can also pass several columns, as in n_distinct(name, year), which counts the distinct combinations of name and year.
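As a small illustration of these counts, here is a sketch on a made-up data frame. The data frame toy and its values are hypothetical and are not taken from babynames; the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data, not from babynames
toy <- data.frame(
  name = c("Ann", "Ann", "Bob", "Bob", "Ann"),
  year = c(2000, 2001, 2000, 2000, 2000)
)

# Number of distinct names
n.names <- n_distinct(toy$name)

# The base R equivalent for a single vector
n.names.base <- length(unique(toy$name))

# Number of distinct (name, year) combinations
n.pairs <- n_distinct(toy$name, toy$year)

# Base R cross-check: count the distinct rows of the data frame
n.rows <- nrow(unique(toy))
```

Here the two ways of counting names should agree, and the number of (name, year) combinations should match the number of distinct rows, because toy has only those two columns.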

which.max

which.max(c(10, 20, 3, 10)) returns the location (i.e., index) of the first maximum, which is 2 here (see last week’s lectures for a more complicated way of achieving this).
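A short sketch of how which.max behaves when the maximum occurs more than once, and of how its result is typically used as an index. The vectors scores and years are made up for illustration.

```r
scores <- c(10, 20, 3, 20)

# Index of the first maximum only
first.max <- which.max(scores)

# Indices of every element equal to the maximum
all.max <- which(scores == max(scores))

# Typical use: look up a label at the position of the maximum
years <- c(1880, 1881, 1882, 1883)   # hypothetical labels
year.of.max <- years[which.max(scores)]
```

The same indexing pattern could be used on a summarized data frame, for example to find the year in which a computed column is largest.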

Number of distinct names per capita, every year, for sex=="F"

name.per.cap <- babynames %>% filter(sex=="F") %>%  
                   group_by(year) %>% 
                   summarize(distinct.names = n_distinct(name),total.by.year = sum(n)) %>%
                   mutate(distinct.per.capita = distinct.names/total.by.year) %>% 
                   select(year, distinct.per.capita)

plot(name.per.cap$year, name.per.cap$distinct.per.capita)

Can we get the data for “M” and “F” at the same time?

name.per.cap.2 <- babynames %>% 
                   group_by(year, sex) %>% 
                   summarize(distinct.names = n_distinct(name),total.by.year = sum(n)) %>%
                   mutate(distinct.per.capita = distinct.names/total.by.year) %>% 
                   select(year, sex, distinct.per.capita)
name.per.cap.2
## # A tibble: 276 x 3
## # Groups:   year [138]
##     year sex   distinct.per.capita
##    <dbl> <chr>               <dbl>
##  1  1880 F                 0.0104 
##  2  1880 M                 0.00958
##  3  1881 F                 0.0102 
##  4  1881 M                 0.00990
##  5  1882 F                 0.00953
##  6  1882 M                 0.00967
##  7  1883 F                 0.00938
##  8  1883 M                 0.00984
##  9  1884 F                 0.00908
## 10  1884 M                 0.00983
## # … with 266 more rows

Should we compute names per capita or total names?

There are arguments either way.

Why compute the total number of names?

We are interested in how diverse the names are in the US. In some sense, the population is irrelevant. If there are 50,000 names available to choose from, that’s interesting regardless of the fact that the population is very large.

Why compute the number of names per capita?

If there are only 1,000 babies born, there cannot be more than 1,000 unique names. Computing names per capita mitigates this: if there are 1,000 newborns and each of them has a unique name, that indicates that the names are very diverse.

Maybe we should do neither?

Perhaps a compromise would be to compute the number of unique names per cultural or linguistic group. (Although that is difficult to define.)

sapply

Suppose we want to square a bunch of numbers contained in a vector. We could do that using

vec = c(1, 2, 4, 5)
vec ** 2
## [1]  1  4 16 25

Let’s now try to use a function:

square <- function(x){
  x ** 2
}
vec = c(1, 2, 4, 5)
square(vec)
## [1]  1  4 16 25

This still works: our vec gets passed to the function, and vec ** 2 is computed by squaring every element of the vector.

But here we’ll see a problem:

special.square <- function(x){
  if(x == 42){
    42
  }else{
    x ** 2
  }
}

vec = c(1, 2, 4, 5)
# special.square(vec) # would produce an error (only a warning in R versions before 4.2.0)

The problem here is that we cannot run if(c(1, 2, 4, 5) == 42). The comparison produces four logical values, and if(c(F, F, F, F)) just doesn’t make sense: there has to be exactly one logical value inside the if.
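To make the issue concrete, this sketch inspects what the comparison returns. Reducing the result with any() is one way to get the single logical value that if requires; it is shown only as an illustration and answers a different question than squaring each element.

```r
vec <- c(1, 2, 4, 5)

# The comparison is done element by element, so it returns four values
cond <- vec == 42
n.values <- length(cond)

# any() collapses the four values into a single TRUE/FALSE,
# which is what if() needs
if (any(cond)) {
  message.text <- "some element is 42"
} else {
  message.text <- "no element is 42"
}
```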

We can use sapply instead:

vec = c(1, 2, 42, 5)
sapply(X = vec, FUN = special.square)
## [1]  1  4 42 25

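The function sapply calls the function given as FUN = once for every element of the vector given as X =, and collects the results. Each call receives a single value, so the if inside special.square sees exactly one logical value. This avoids the problem we had before.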
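For comparison, here is a sketch that produces the same result with the vectorized base R function ifelse. This alternative is not part of the notes above; it is included only to show that the sapply result matches an element-by-element computation.

```r
special.square <- function(x) {
  if (x == 42) {
    42
  } else {
    x ** 2
  }
}

vec <- c(1, 2, 42, 5)

# One call to special.square per element
by.sapply <- sapply(X = vec, FUN = special.square)

# Vectorized alternative: choose 42 or the square for each element
by.ifelse <- ifelse(vec == 42, 42, vec ** 2)
```

sapply is the more general tool, because it works with any function of a single value, while ifelse only covers a simple two-way choice.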

grep

Here is how the function grep works: it takes in a query string (interpreted as a regular expression) and a character vector. It outputs the indices of the elements of the vector that contain the query:

grep("school", "medical school")
## [1] 1
grep("school", c("medical school", "school of life"))
## [1] 1 2
grep("law", c("medical school", "school of life", "princeton"))
## integer(0)

We can use length to check whether the query is contained in at least one of the strings in the vector:

length(grep("life", "medical school")) > 0 # False
## [1] FALSE
length(grep("life", "school of life")) > 0 # True
## [1] TRUE

contains.word

Let’s now write a function that determines whether a query word is contained in a text:

contains.word <- function(text, query){
  length(grep(query, text)) > 0
}

Try using this function to obtain TRUE and FALSE.

(What are some issues with this function?)
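A few of the issues can be seen by experimenting. The sketch below shows that the function matches substrings, is case sensitive, and treats the query as a regular expression. The function contains.whole.word is a hypothetical variant written for this illustration; it addresses the first two points only.

```r
contains.word <- function(text, query) {
  length(grep(query, text)) > 0
}

# Substring match: "glove" contains "love"
substring.hit <- contains.word("I lost a glove", "love")

# Case sensitivity: "Love" does not match "love"
case.miss <- contains.word("Love is all you need", "love")

# The query is a regular expression: "." matches any character
regex.hit <- contains.word("abc", "a.c")

# Hypothetical stricter variant: whole words only, ignoring case.
# \\b marks a word boundary; the query is still treated as a regex.
contains.whole.word <- function(text, query) {
  pattern <- paste0("\\b", query, "\\b")
  length(grep(pattern, text, ignore.case = TRUE, perl = TRUE)) > 0
}

strict.glove <- contains.whole.word("I lost a glove", "love")
strict.love <- contains.whole.word("Love is all you need", "love")
```

Another point worth checking is what happens when text is a vector with several elements: the function then returns a single value for the whole vector rather than one value per element.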

The OkCupid dataset

Let’s now apply this to the OkCupid dataset. Suppose that we want to find the percentage of profiles that contain the word "love" in them, for each income bracket.

library(okcupiddata)


profile.aug <- profiles %>% mutate(contains.love = sapply(X = profiles$essay0, FUN = contains.word, "love"))

profile.aug %>% group_by(income) %>% 
                summarize(love.prop = mean(contains.love))
## # A tibble: 13 x 2
##     income love.prop
##      <int>     <dbl>
##  1   20000     0.182
##  2   30000     0.187
##  3   40000     0.181
##  4   50000     0.198
##  5   60000     0.181
##  6   70000     0.173
##  7   80000     0.185
##  8  100000     0.162
##  9  150000     0.158
## 10  250000     0.188
## 11  500000     0.167
## 12 1000000     0.119
## 13      NA     0.180

There is just one thing here that’s new. If we compute the mean of a vector of logical values, we’ll get the proportion of Ts. That’s because Ts get converted to 1s, and Fs get converted to 0s. (Exercise: explain why this means that the mean of the vector will be the proportion of Ts).
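The conversion can be checked on a small made-up vector. In this sketch, flags is hypothetical; the sum counts the TRUE values, and dividing by the length gives the proportion.

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)

# TRUE becomes 1 and FALSE becomes 0
as.numbers <- as.numeric(flags)

# The sum is therefore the number of TRUE values
n.true <- sum(flags)

# The mean is the sum divided by the length, i.e. the proportion of TRUE values
prop.true <- mean(flags)
```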

Note that sapply applies the function contains.word with the text being each element of profiles$essay0, and the query being "love". This is determined by the order of the parameters in function(text, query). The first parameter is assigned each element of the first argument of sapply (i.e., profiles$essay0), and the remaining parameters (just query in this case) are assigned the arguments that follow FUN in sapply(profiles$essay0, FUN = contains.word, ...).
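The argument passing can be tried on a small made-up vector instead of the full dataset. In this sketch, essays is hypothetical, and the extra argument is passed both by position and by name; both forms should give the same values.

```r
contains.word <- function(text, query) {
  length(grep(query, text)) > 0
}

# Hypothetical stand-in for profiles$essay0, including a missing essay
essays <- c("i love hiking", "coffee and books", NA)

# Extra argument passed by position: it becomes query
by.position <- sapply(X = essays, FUN = contains.word, "love")

# Extra argument passed by name, which makes the intent explicit;
# USE.NAMES = FALSE stops sapply from labelling results with the essays
by.name <- sapply(X = essays, FUN = contains.word, query = "love",
                  USE.NAMES = FALSE)
```

Naming the extra argument is a matter of style, but it can make the call easier to read when the function has several parameters.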