n_distinct
n_distinct(df) is the number of distinct rows in a data frame df. (Recall that for vectors, we could use length(unique(v)).)
Here, we are computing the number of distinct names for each combination of sex and year:
babynames %>% group_by(sex, year) %>% summarize(distinct_names_year_sex = n_distinct(name))
## # A tibble: 276 x 3
## # Groups: sex [2]
## sex year distinct_names_year_sex
## <chr> <dbl> <int>
## 1 F 1880 942
## 2 F 1881 938
## 3 F 1882 1028
## 4 F 1883 1054
## 5 F 1884 1172
## 6 F 1885 1197
## 7 F 1886 1282
## 8 F 1887 1306
## 9 F 1888 1474
## 10 F 1889 1479
## # … with 266 more rows
You can also pass several columns, as in n_distinct(name, year), which counts the distinct combinations of name and year.
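To see what these counts mean on a small case, here is a minimal sketch on a made-up data frame (toy is hypothetical, not part of babynames). It uses base R equivalents so that it runs without any packages; n_distinct should give the same three numbers.

```r
# Hypothetical toy data, not taken from babynames
toy <- data.frame(
  name = c("Mary", "Mary", "Anna", "Anna", "John"),
  year = c(1880, 1881, 1880, 1880, 1880),
  stringsAsFactors = FALSE
)

# Distinct values of one column: what n_distinct(name) counts
distinct_names <- length(unique(toy$name))

# Distinct rows of the data frame: what n_distinct(toy) counts
distinct_rows <- nrow(unique(toy))

# Distinct (name, year) combinations: what n_distinct(name, year) counts
distinct_pairs <- nrow(unique(toy[, c("name", "year")]))

print(c(distinct_names, distinct_rows, distinct_pairs))
```

The row ("Anna", 1880) appears twice, so it is counted only once among the distinct rows.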
which.max
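which.max(c(10, 20, 3, 10)) returns the location (i.e., index) of the maximum, which is 2 here (see last week’s lectures for a more complicated way of achieving this). If the maximum occurs more than once, the index of its first occurrence is returned.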
babynames %>% group_by(year, sex) %>%
  summarize(most_popular_name = name[which.max(n)])
## # A tibble: 276 x 3
## # Groups: year [138]
## year sex most_popular_name
## <dbl> <chr> <chr>
## 1 1880 F Mary
## 2 1880 M John
## 3 1881 F Mary
## 4 1881 M John
## 5 1882 F Mary
## 6 1882 M John
## 7 1883 F Mary
## 8 1883 M John
## 9 1884 F Mary
## 10 1884 M John
## # … with 266 more rows
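The pattern name[which.max(n)] can be tried out on two small vectors. This is a sketch with made-up values (names_vec and the counts are hypothetical); it also shows how ties are handled.

```r
# Hypothetical counts and the names they belong to
n <- c(10, 20, 3, 10)
names_vec <- c("Anna", "Mary", "Ida", "Emma")

# Index of the largest count, then the name at that index
idx <- which.max(n)
most_popular <- names_vec[idx]

# With ties, which.max returns only the first position of the maximum
tied <- c(5, 9, 9)
first_max <- which.max(tied)

# which(x == max(x)) returns every position of the maximum
all_max <- which(tied == max(tied))

print(most_popular)
print(all_max)
```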
sex=="F"
name.per.cap <- babynames %>% filter(sex == "F") %>%
  group_by(year) %>%
  summarize(distinct.names = n_distinct(name), total.by.year = sum(n)) %>%
  mutate(distinct.per.capita = distinct.names / total.by.year) %>%
  select(year, distinct.per.capita)
plot(name.per.cap$year, name.per.cap$distinct.per.capita)
Can we get the data for “M” and “F” at the same time?
name.per.cap.2 <- babynames %>%
  group_by(year, sex) %>%
  summarize(distinct.names = n_distinct(name), total.by.year = sum(n)) %>%
  mutate(distinct.per.capita = distinct.names / total.by.year) %>%
  select(year, sex, distinct.per.capita)
name.per.cap.2
## # A tibble: 276 x 3
## # Groups: year [138]
## year sex distinct.per.capita
## <dbl> <chr> <dbl>
## 1 1880 F 0.0104
## 2 1880 M 0.00958
## 3 1881 F 0.0102
## 4 1881 M 0.00990
## 5 1882 F 0.00953
## 6 1882 M 0.00967
## 7 1883 F 0.00938
## 8 1883 M 0.00984
## 9 1884 F 0.00908
## 10 1884 M 0.00983
## # … with 266 more rows
Is the number of distinct names per capita a better measure than the raw number of distinct names? There are arguments either way.
We are interested in how diverse the names are in the US. In some sense, the population is irrelevant. If there are 50,000 names available to choose from, that’s interesting regardless of the fact that the population is very large.
If there are only 1,000 babies born, there cannot be more than 1,000 unique names. Computing names per capita seems to mitigate this: if there are 1,000 newborns and each of them has a unique name, that indicates that the names are very diverse.
Perhaps a compromise would be to compute the number of unique names per cultural or linguistic group. (Although that is difficult to define.)
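To make the per-capita argument concrete, here is a small sketch on made-up data (the toy data frame and its counts are hypothetical). In the first year every baby has a unique name; in the second year 1,000 babies share two names.

```r
# Hypothetical data: one row per (year, name), n = number of babies
toy <- data.frame(
  year = c(2000, 2000, 2000, 2001, 2001),
  name = c("Ava", "Ben", "Cy", "Ava", "Ben"),
  n    = c(1, 1, 1, 500, 500),
  stringsAsFactors = FALSE
)

# Number of distinct names in each year
distinct.names <- tapply(toy$name, toy$year, function(v) length(unique(v)))

# Number of babies in each year
total.by.year <- tapply(toy$n, toy$year, sum)

# Distinct names per baby born that year
distinct.per.capita <- distinct.names / total.by.year
print(distinct.per.capita)
```

The raw counts (3 names versus 2 names) look similar, while the per-capita values (1 versus 0.002) differ sharply, which is the tension described above.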
sapply
Suppose we want to square a bunch of numbers contained in a vector. We could do that using
vec = c(1, 2, 4, 5)
vec ** 2
## [1] 1 4 16 25
Let’s now try to use a function:
square <- function(x){
  x ** 2
}
vec = c(1, 2, 4, 5)
square(vec)
## [1] 1 4 16 25
This still works: our vec gets passed to the function, and vec ** 2 is computed by squaring every element of the vector.
But here we’ll see a problem:
special.square <- function(x){
  if(x == 42){
    42
  }else{
    x ** 2
  }
}
vec = c(1, 2, 4, 5)
# special.square(vec) # would produce an error!
The problem here is that we cannot run if(c(1, 2, 4, 5) == 42). That’s because if(c(F, F, F, F)) just doesn’t make sense: there has to be exactly one logical value inside the if.
We can use sapply instead:
vec = c(1, 2, 42, 5)
sapply(X = vec, FUN = special.square)
## [1] 1 4 42 25
The function sapply applies the function specified after FUN = to every element of the vector specified after X =, one element at a time. Each call therefore sees a single value inside the if, which avoids the problem we had before.
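One way to picture what sapply is doing is to write the same computation as an explicit loop. The sketch below does that; it also shows ifelse, a vectorized alternative that was not used above and is included only for comparison.

```r
special.square <- function(x){
  if(x == 42){
    42
  }else{
    x ** 2
  }
}

vec <- c(1, 2, 42, 5)

# Roughly what sapply does: call the function once per element
out.loop <- numeric(length(vec))
for(i in seq_along(vec)){
  out.loop[i] <- special.square(vec[i])
}

# The sapply version from above
out.sapply <- sapply(X = vec, FUN = special.square)

# Alternative: ifelse tests every element and picks a value for each
out.ifelse <- ifelse(vec == 42, 42, vec ** 2)

print(out.loop)
```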
grep
Here is how the function grep works: it takes a query string (which is treated as a pattern) and a vector of strings. It outputs the indices of the strings in which the query is found:
grep("school", "medical school")
## [1] 1
grep("school", c("medical school", "school of life"))
## [1] 1 2
grep("law", c("medical school", "school of life", "princeton"))
## integer(0)
We can use length in order to see whether the query is contained in any of the strings in the vector:
length(grep("life", "medical school")) > 0 # False
## [1] FALSE
length(grep("life", "school of life")) > 0 # True
## [1] TRUE
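For comparison, here is a short sketch of two related base R tools that were not used above: the value = TRUE argument of grep, and grepl, which returns one logical value per string instead of indices.

```r
texts <- c("medical school", "school of life", "princeton")

# Indices of the strings that contain the query
idx <- grep("school", texts)

# The matching strings themselves
matches <- grep("school", texts, value = TRUE)

# One TRUE/FALSE per string
flags <- grepl("school", texts)

# Same question as length(grep(...)) > 0: is the query found anywhere?
any.law <- any(grepl("law", texts))

print(flags)
```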
contains.word
Let’s now write a function that determines whether a query word is contained in a text:
contains.word <- function(text, query){
  length(grep(query, text)) > 0
}
Try using this function to obtain TRUE and FALSE.
(What are some issues with this function?)
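As a starting point for that question, here are a few calls worth trying. The stricter variant at the end (contains.word.strict) is a hypothetical illustration, assuming we want whole-word, case-insensitive matching; it is one possible fix, not the only one.

```r
contains.word <- function(text, query){
  length(grep(query, text)) > 0
}

# Matches inside longer words: "glove" contains "love"
in.glove <- contains.word("I bought a glove", "love")

# Case matters: "Love" is not matched by "love"
capital <- contains.word("Love is all", "love")

# A vector of texts gives a single answer, not one answer per text
whole.vector <- contains.word(c("no match", "love it"), "love")

# Hypothetical stricter variant: whole words only, ignoring case
contains.word.strict <- function(text, query){
  grepl(paste0("\\b", query, "\\b"), text, ignore.case = TRUE, perl = TRUE)
}

strict.glove <- contains.word.strict("I bought a glove", "love")
strict.capital <- contains.word.strict("Love is all", "love")

print(c(in.glove, capital, whole.vector, strict.glove, strict.capital))
```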
Let’s now apply this to the OkCupid dataset. Suppose that we want to find the proportion of profiles that contain the word "love" in them, for each income bracket.
library(okcupiddata)
profile.aug <- profiles %>%
  mutate(contains.love = sapply(X = profiles$essay0, FUN = contains.word, query = "love"))
profile.aug %>% group_by(income) %>%
  summarize(love.prop = mean(contains.love))
## # A tibble: 13 x 2
## income love.prop
## <int> <dbl>
## 1 20000 0.182
## 2 30000 0.187
## 3 40000 0.181
## 4 50000 0.198
## 5 60000 0.181
## 6 70000 0.173
## 7 80000 0.185
## 8 100000 0.162
## 9 150000 0.158
## 10 250000 0.188
## 11 500000 0.167
## 12 1000000 0.119
## 13 NA 0.180
There is just one thing here that’s new. If we compute the mean of a vector of logical values, we’ll get the proportion of TRUEs. That’s because TRUEs get converted to 1s, and FALSEs get converted to 0s. (Exercise: explain why this means that the mean of the vector will be the proportion of TRUEs.)
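A quick check on a made-up logical vector (flags is hypothetical) shows the conversion at work:

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)

# TRUE becomes 1 and FALSE becomes 0
as.num <- as.numeric(flags)

# The sum counts the TRUEs; dividing by the length gives their proportion
prop.by.hand <- sum(flags) / length(flags)

# mean does the same thing in one step
prop.mean <- mean(flags)

print(prop.mean)
```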
Note that sapply applies the function contains.word with the text being each element of profiles$essay0, and the query being "love". This is determined by the order of the variables in function(text, query). The first variable gets assigned elements of the first argument of sapply (i.e., profiles$essay0), and the other variables (just query in this case) get assigned the arguments that come after the FUN argument in sapply(profiles$essay0, FUN = contains.word, ...).
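The way sapply hands over its arguments can be checked on a few made-up texts. In this sketch, essays is a hypothetical stand-in for profiles$essay0, and USE.NAMES = FALSE is added only to keep the printed result free of names.

```r
contains.word <- function(text, query){
  length(grep(query, text)) > 0
}

# Hypothetical stand-in for profiles$essay0
essays <- c("i love hiking", "coffee and code", "love to travel")

# Each element of essays becomes text; "love" is passed on as query
has.love <- sapply(X = essays, FUN = contains.word, query = "love",
                   USE.NAMES = FALSE)

# The same three calls written out by hand
by.hand <- c(contains.word(essays[1], "love"),
             contains.word(essays[2], "love"),
             contains.word(essays[3], "love"))

# Proportion of texts containing the query
love.prop <- mean(has.love)

print(has.love)
```

Naming the extra argument (query = "love") is optional here, but it makes it explicit which variable of contains.word receives it.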