tidyverse
/dplyr
to wrangle datalibrary(tidyverse)
library(babynames)
filter
The filter
function is used to select rows from a data frame. The following are all equivalent. They are different ways of selecting rows for which the year
is 1880 and sex
is "F"
.
filter(babynames, year == 1880, sex == "F")
idx <- (babynames$year == 1880) & (babynames$sex == "F")
babynames[idx, ]
babynames[(babynames$year == 1880) & (babynames$sex == "F"), ]
babynames %>% filter(year == 1880 & sex == "F")
babynames %>% filter(year == 1880, sex == "F")
The last way is probably preferable, though different people might have different preferences.
arrange
arrange
is used to produce a data frame sorted in whatever way you want to specify. See what happens when you run
View(babynames %>% arrange(name))
The resultant data frame is sorted alphabetically by name
. If you want to sort things in descending order, use
View(babynames %>% arrange(desc(name))
Note that, similarly to filter
and other functions, this does not change babynames
: it just produces a new data frame with the same contents, sorted in the way we specify. You can also sort by year
, and then also sort by name
, within the same year. Try running and viewing the following:
babynames %>% arrange(year, name)
rename
The following produces a data frame with the same contents as babynames
, but with the n
column now named number
:
babynames %>% rename(number = n)
select
select
is used to produce a data frame only with the specified columns (the columns are in the order that you specify):
babynames %>% select(year, sex, number=n)
(Note that we also renamed the column n
to number
).
The previous line is equivalent to:
babynames %>% rename(number = n) %>% select(year, sex, number)
mutate
Let’s try to estimate the total number of recorded female newborns in a given year. First, let’s do it with an example. If there were 7065 Marys born in the year, and Marys constitute 0.07 of all newboarn female babies, then we can use the following to find the total number of female newborns \(N\):
\(0.07N = 7065\), so \(N = 7065/0.07 \approx 100929\).
We can use mutate
to compute a new column:
b.total <- babynames %>% mutate(total_by_year = round(n/prop))
Here, we compute n/prop
(working within the data frame babynames
). This means we divided the corresponding entries in n
(the total number of babies with a particular name) and prop
(the proportion that the babies with that name constitue within their sex). We then rounded the numbers, since the total population should always be an integer.
We stored the resultant data frame in the variable b.total
. Let’s take a look at some of the rows of b.total
:
b.total
## # A tibble: 1,924,665 x 6
## year sex name n prop total_by_year
## <dbl> <chr> <chr> <int> <dbl> <dbl>
## 1 1880 F Mary 7065 0.0724 97605
## 2 1880 F Anna 2604 0.0267 97605
## 3 1880 F Emma 2003 0.0205 97605
## 4 1880 F Elizabeth 1939 0.0199 97605
## 5 1880 F Minnie 1746 0.0179 97605
## 6 1880 F Margaret 1578 0.0162 97605
## 7 1880 F Ida 1472 0.0151 97605
## 8 1880 F Alice 1414 0.0145 97605
## 9 1880 F Bertha 1320 0.0135 97605
## 10 1880 F Sarah 1288 0.0132 97605
## # … with 1,924,655 more rows
distinct
dplyr
’s distinct
is similar to the function unique
, which we already saw, but it operates on data frames. It takes in a data frame, and returns all the distinct rows. That is, duplicate rows are not included in the returned data frame.
Let’s try this:
b.total <- babynames %>% mutate(total_by_year = round(n/prop))
b.total %>% select(sex, year, total_by_year) %>% distinct
## # A tibble: 36,099 x 3
## sex year total_by_year
## <chr> <dbl> <dbl>
## 1 F 1880 97605
## 2 F 1880 97604
## 3 F 1880 97606
## 4 F 1880 97603
## 5 F 1880 97607
## 6 F 1880 97602
## 7 F 1880 97609
## 8 F 1880 97599
## 9 M 1880 118400
## 10 M 1880 118399
## # … with 36,089 more rows
What can you conclude from this?
We can also give arguments to distinct
. (For precise language fans: an argument is a value we pass to the function; a parameter is the variable we assign the argument to).
babynames %>% distinct(year)
## # A tibble: 138 x 1
## year
## <dbl>
## 1 1880
## 2 1881
## 3 1882
## 4 1883
## 5 1884
## 6 1885
## 7 1886
## 8 1887
## 9 1888
## 10 1889
## # … with 128 more rows
babynames %>% distinct(name, year)
## # A tibble: 1,756,284 x 2
## name year
## <chr> <dbl>
## 1 Mary 1880
## 2 Anna 1880
## 3 Emma 1880
## 4 Elizabeth 1880
## 5 Minnie 1880
## 6 Margaret 1880
## 7 Ida 1880
## 8 Alice 1880
## 9 Bertha 1880
## 10 Sarah 1880
## # … with 1,756,274 more rows
There are as many distinct combination of (name, year) as there are rows in babynames
! (Why?)
group_by
and summarize
Without group_by
, the function summarize
simply computes a function of a column. For example:
babynames %>% summarize(total=sum(n))
## # A tibble: 1 x 1
## total
## <int>
## 1 348120517
This produces a new data frame, which contains the sum of the column n
.
The function group_by
groups the rows of the data frame by the values in the specified columns, for the purposes of using summarize
. For example, consider the following:
babynames %>% group_by(year, sex) %>% summarize(total=sum(n))
This groups the rows of babynames
into rows with year==1880, sex==“F” … rows with year==1881, sex==“M” … … … rows with year==2000, sex==“F” …
Then, when we apply summarize
, we compute the sum of the column n
in each of the groups of rows. So we get as many rows as we have different (year, sex) combinations.
We can compute several different outputs:
babynames %>% summarize(mean=mean(n), median=median(n))
## # A tibble: 1 x 2
## mean median
## <dbl> <int>
## 1 181. 12
Recall:
The mean is the sum divided by the count
The median is the number x such that half the set is smaller or equal than x and half the set is larger or equal to x
group_by
and mutate
Here is an example of using mutate
togerther with group_by
:
a <- babynames %>% group_by(year, sex) %>%
mutate(prop.of.total = n/sum(n))
a[1:10, ]
## # A tibble: 10 x 6
## # Groups: year, sex [1]
## year sex name n prop prop.of.total
## <dbl> <chr> <chr> <int> <dbl> <dbl>
## 1 1880 F Mary 7065 0.0724 0.0776
## 2 1880 F Anna 2604 0.0267 0.0286
## 3 1880 F Emma 2003 0.0205 0.0220
## 4 1880 F Elizabeth 1939 0.0199 0.0213
## 5 1880 F Minnie 1746 0.0179 0.0192
## 6 1880 F Margaret 1578 0.0162 0.0173
## 7 1880 F Ida 1472 0.0151 0.0162
## 8 1880 F Alice 1414 0.0145 0.0155
## 9 1880 F Bertha 1320 0.0135 0.0145
## 10 1880 F Sarah 1288 0.0132 0.0142
When computing n/sum(n)
, we compute sum(n)
for each group of rows – the total number of boys or girls in a given year. So for each row, we divide the entry in n
by the approximate sum total of the relevant population (e.g. the number of boys registered in 1880).