dplyr
To use pipes, you first need to install and load the dplyr package. Use the following:
install.packages("dplyr")
library(dplyr)
Let’s define two functions:
f <- function(x){
  return(x^2) # the same as x*x
}
g <- function(y){
  return(2*y)
}
We can compute f(g(5)) and g(f(5)). Note that those are not the same: \(f(g(5)) = f(10) = 100\) but \(g(f(5)) = g(25) = 50\).
Now, we’ll compute the same quantities using pipes:
5 %>% f %>% g
## [1] 50
This is the same as computing g(f(5)). The way to think about it is this: we start with 5, then apply f to 5 and obtain f(5), and then apply g to 5 %>% f (i.e., f(5)) and obtain g(f(5)).
To compute f(g(5)), we can use:
5 %>% g %>% f
## [1] 100
This is the same as
5 %>% g() %>% f()
## [1] 100
If the only thing that we’re sending to f is g(5), the parentheses are optional in this notation.
This is useful mostly because it’s easier to read something like x %>% f1 %>% f2 %>% f3 %>% f4 than to read f4(f3(f2(f1(x)))).
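As a small check of the reading order described above, the following sketch compares a nested call with its piped form. The helper names square_it and double_it are made up for this illustration (they play the roles of f and g), and the second half shows a case where parentheses are needed because the function receives an extra argument.

```r
library(dplyr)  # dplyr re-exports the %>% pipe from magrittr

# Hypothetical helpers standing in for f and g
square_it <- function(x) x^2
double_it <- function(y) 2 * y

# Nested form is read inside-out; piped form is read left to right
nested <- double_it(square_it(5))
piped  <- 5 %>% square_it %>% double_it
identical(nested, piped)  # both compute double_it(square_it(5))

# When a step takes an extra argument, the parentheses are required:
# x %>% f(a) means f(x, a)
log2_nested <- log(sqrt(16), base = 2)
log2_piped  <- 16 %>% sqrt %>% log(base = 2)
all.equal(log2_nested, log2_piped)
```

The piped value is inserted as the first argument of each function on the right, which is why log(base = 2) still receives the output of sqrt.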
dplyr to wrangle data
filter
The filter function is used to select rows from a data frame. The following are all equivalent. They are different ways of selecting rows for which the year is 1880 and sex is "F".
filter(babynames, year == 1880, sex == "F")
idx <- (babynames$year == 1880) & (babynames$sex == "F")
babynames[idx, ]
babynames[(babynames$year == 1880) & (babynames$sex == "F"), ]
babynames %>% filter(year == 1880 & sex == "F")
babynames %>% filter(year == 1880, sex == "F")
The last way is probably preferable, though different people might have different preferences.
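To see the equivalence on something small enough to inspect by eye, here is a sketch on a made-up data frame. The name toy and all of its values are hypothetical and are not taken from babynames.

```r
library(dplyr)

# Hypothetical toy data with the same column names as babynames
toy <- data.frame(
  year = c(1880, 1880, 1881, 1881),
  sex  = c("F", "M", "F", "M"),
  name = c("Mary", "John", "Mary", "John"),
  n    = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# dplyr: conditions separated by commas are combined with "and"
with_filter <- toy %>% filter(year == 1880, sex == "F")

# Base R: build a logical index and use it to pick rows
with_index <- toy[toy$year == 1880 & toy$sex == "F", ]

with_filter
with_index
```

Both approaches keep only the single row with year 1880 and sex "F".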
arrange
arrange is used to produce a data frame sorted in whatever way you specify. See what happens when you run
View(babynames %>% arrange(name))
The resultant data frame is sorted alphabetically by name. If you want to sort things in descending order, use
View(babynames %>% arrange(desc(name)))
Note that, similarly to filter and other functions, this does not change babynames: it just produces a new data frame with the same contents, sorted in the way we specify. You can also sort by year, and then sort by name within the same year. Try running and viewing the following:
babynames %>% arrange(year, name)
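The following sketch illustrates both points on a hypothetical three-row data frame (toy and its values are made up): sorting by two columns, and the fact that the original data frame is left untouched.

```r
library(dplyr)

# Hypothetical toy data, deliberately out of order
toy <- data.frame(
  year = c(1881, 1880, 1880),
  name = c("Anna", "Mary", "Anna"),
  stringsAsFactors = FALSE
)

# Sort by year first, then by name within each year
sorted <- toy %>% arrange(year, name)
sorted

# Sort by name in descending (reverse alphabetical) order
sorted_desc <- toy %>% arrange(desc(name))
sorted_desc

# The original data frame is unchanged
toy
```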
rename
The following produces a data frame with the same contents as babynames, but with the n column now named number:
babynames %>% rename(number = n)
select
select is used to produce a data frame with only the specified columns (the columns are in the order that you specify):
babynames %>% select(year, sex, number=n)
(Note that we also renamed the column n to number.)
The previous line is equivalent to:
babynames %>% rename(number = n) %>% select(year, sex, number)
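The equivalence can be checked on a small example. In this sketch, toy is a hypothetical data frame with made-up values, and the two pipelines are compared with all.equal.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  year = c(1880, 1881),
  sex  = c("F", "M"),
  name = c("Mary", "John"),
  n    = c(10, 20),
  stringsAsFactors = FALSE
)

# Rename inside select
one_step <- toy %>% select(year, sex, number = n)

# Rename first, then select
two_steps <- toy %>% rename(number = n) %>% select(year, sex, number)

names(one_step)
all.equal(one_step, two_steps)
```

Note that the name column is dropped in both cases, because select keeps only the columns that are listed.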
mutate
Let’s try to estimate the total number of recorded female newborns in a given year. First, let’s do it with an example. If there were 7065 Marys born in the year, and Marys constitute 0.07 of all newborn female babies, then we can use the following to find the total number of female newborns \(N\):
\(0.07N = 7065\), so \(N = 7065/0.07 \approx 100929\).
We can use mutate to compute a new column:
b <- babynames %>% mutate(total_by_year = round(n/prop))
Here, we compute n/prop (working within the data frame babynames). This means we divide the corresponding entries in n (the total number of babies with a particular name) by the corresponding entries in prop (the proportion that the babies with that name constitute within their sex). We then round the numbers, since the total population should always be an integer.
We stored the resultant data frame in the variable b.
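As a sanity check, the same mutate call can be applied to a one-row data frame that holds the numbers from the Mary example above. The data frame name mary is made up for this sketch.

```r
library(dplyr)

# One row with the numbers from the worked example: 7065 Marys, proportion 0.07
mary <- data.frame(name = "Mary", n = 7065, prop = 0.07,
                   stringsAsFactors = FALSE)

# Divide n by prop row by row and round to a whole number of babies
mary_total <- mary %>% mutate(total_by_year = round(n / prop))
mary_total$total_by_year  # 100929, matching the hand calculation
```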
distinct
dplyr’s distinct is similar to the function unique, which we already saw, but it operates on data frames. It takes in a data frame, and returns all the distinct rows. That is, duplicate rows are not included in the returned data frame.
b <- babynames %>% mutate(total_by_year = round(n/prop))
b %>% select(sex, year, total_by_year) %>% distinct
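Here is a minimal sketch of what distinct does, using a hypothetical data frame toy whose rows are deliberately duplicated (the values are made up).

```r
library(dplyr)

# Hypothetical toy data: each (sex, year, total) row appears twice
toy <- data.frame(
  sex   = c("F", "F", "M", "M"),
  year  = c(1880, 1880, 1880, 1880),
  total = c(100, 100, 120, 120),
  stringsAsFactors = FALSE
)

# Keep one copy of each distinct row
unique_rows <- toy %>% distinct()
unique_rows
nrow(unique_rows)  # 2
```

This mirrors the babynames pipeline above: after select, every name in the same sex and year carries the same kind of row, so distinct collapses them.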
group_by and summarize
Without group_by, the function summarize simply computes a function of a column. For example:
babynames %>% summarize(total=sum(n))
## # A tibble: 1 x 1
## total
## <int>
## 1 340851912
This produces a new data frame, which contains the sum of column n.
The function group_by groups the rows of the data frame by the values in the specified columns, for the purposes of using summarize. For example, consider the following:
babynames %>% group_by(year, sex) %>% summarize(total=sum(n))
This groups the rows of babynames into:
rows with year==1880, sex=="F"
…
rows with year==1881, sex=="M"
…
rows with year==2000, sex=="F"
…
Then, when we apply summarize, we compute the sum of the column n in each of the groups of rows. So we get as many rows as there are distinct (year, sex) combinations.
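The grouping can be traced by hand on a small example. In this sketch, toy is a hypothetical data frame with made-up counts; it has four distinct (year, sex) combinations, so the summary has four rows.

```r
library(dplyr)

# Hypothetical toy data: two names for (1880, F), one row for each other group
toy <- data.frame(
  year = c(1880, 1880, 1880, 1881, 1881),
  sex  = c("F", "F", "M", "F", "M"),
  n    = c(10, 5, 20, 7, 8),
  stringsAsFactors = FALSE
)

# One output row per (year, sex) group, holding the sum of n in that group
totals <- toy %>%
  group_by(year, sex) %>%
  summarize(total = sum(n)) %>%
  ungroup()

totals
```

The (1880, F) group contributes 10 + 5 = 15, and every other group contains a single row. Recent versions of dplyr may print a message about the grouping of the result; the ungroup() call removes any remaining grouping.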
We can compute several different outputs:
babynames %>% summarize(mean=mean(n), median=median(n))
## # A tibble: 1 x 2
## mean median
## <dbl> <int>
## 1 183. 12
Recall:
The mean is the sum divided by the count.
The median is the number x such that half the set is less than or equal to x and half the set is greater than or equal to x.
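A quick sketch of both definitions on a short, made-up vector v:

```r
# Hypothetical values, chosen so that the mean and the median differ
v <- c(3, 1, 10, 2)

# Mean: the sum divided by the count
mean_by_hand <- sum(v) / length(v)
mean_by_hand == mean(v)

# Median: with an even count, R averages the two middle values of the sorted vector
sort(v)            # 1 2 3 10
median_v <- median(v)
median_v           # 2.5, halfway between 2 and 3
```

One large value (10) pulls the mean up to 4 while the median stays at 2.5, which is the same effect seen in the babynames output above, where the mean of n is far larger than its median.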
n_distinct
n_distinct(df) is the number of distinct rows in a data frame df. (Recall that for vectors, we could use length(unique(v)).)
Here, we are computing the number of distinct names for each sex:
babynames %>% group_by(sex) %>% summarize(distinct_names_year_sex = n_distinct(name))
## # A tibble: 2 x 2
## sex distinct_names_year_sex
## <chr> <int>
## 1 F 65658
## 2 M 39728
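The same computation on a hypothetical five-row data frame (toy and its values are made up) makes it easy to count the answer by hand, and shows that n_distinct agrees with length(unique(v)) on a vector.

```r
library(dplyr)

# Hypothetical toy data: "Mary" repeats for F, "John" repeats for M
toy <- data.frame(
  sex  = c("F", "F", "F", "M", "M"),
  name = c("Mary", "Anna", "Mary", "John", "John"),
  stringsAsFactors = FALSE
)

# On a vector, n_distinct matches length(unique(...))
n_distinct(toy$name)
length(unique(toy$name))

# Number of distinct names within each sex
per_sex <- toy %>%
  group_by(sex) %>%
  summarize(distinct_names = n_distinct(name))
per_sex
```

There are two distinct names for F (Mary and Anna) and one for M (John).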
which.max
which.max(c(10, 20, 3, 10)) returns the location (i.e., index) of the maximum, which here is 2 because 20 is the largest value (see last week’s lectures for a more complicated way of achieving this).
babynames %>% summarize(most_popular_name = name[which.max(n)])
View(babynames %>% group_by(year, sex) %>% summarize(most_popular_name = name[which.max(n)]))
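The pattern name[which.max(n)] picks out the name sitting in the same row as the largest n. The sketch below applies it to a hypothetical data frame toy with made-up counts, grouped by year only to keep the example short.

```r
library(dplyr)

# Index of the largest element
which.max(c(10, 20, 3, 10))  # 2

# Hypothetical toy data: Mary is most common in 1880, Anna in 1881
toy <- data.frame(
  year = c(1880, 1880, 1881, 1881),
  name = c("Mary", "Anna", "Anna", "Mary"),
  n    = c(30, 10, 25, 5),
  stringsAsFactors = FALSE
)

# Within each year, take the name at the position where n is largest
top_names <- toy %>%
  group_by(year) %>%
  summarize(most_popular_name = name[which.max(n)])
top_names
```

If several rows tie for the maximum, which.max returns the first of them, so only one name per group is reported.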
sex=="F"
(To be discussed again on Thursday)
a <- babynames %>% group_by(sex, year) %>%
  summarize(distinct = n_distinct(name), total_by_year = sum(n)) %>%
  filter(sex == "F") %>%
  mutate(distinct_per_capita = distinct/total_by_year) %>%
  ungroup %>%
  select(year, distinct_per_capita)
plot(a$year, a$distinct_per_capita)