Using tidyverse/dplyr to wrangle data

library(tidyverse)
library(babynames)

filter

The filter function is used to select rows from a data frame. The following are all equivalent. They are different ways of selecting rows for which the year is 1880 and sex is "F".

filter(babynames, year == 1880, sex == "F")

idx <- (babynames$year == 1880) & (babynames$sex == "F")

babynames[idx, ]

babynames[(babynames$year == 1880) & (babynames$sex == "F"), ]

babynames %>% filter(year == 1880 & sex == "F")

babynames %>% filter(year == 1880, sex == "F")

The last way is probably preferable, though different people might have different preferences.

arrange

arrange is used to produce a data frame sorted in whatever way you want to specify. See what happens when you run

View(babynames %>% arrange(name))

The resultant data frame is sorted alphabetically by name. If you want to sort things in descending order, use

View(babynames %>% arrange(desc(name))

Note that, similarly to filter and other functions, this does not change babynames: it just produces a new data frame with the same contents, sorted in the way we specify. You can also sort by year, and then also sort by name, within the same year. Try running and viewing the following:

babynames %>% arrange(year, name)

rename

The following produces a data frame with the same contents as babynames, but with the n column now named number:

babynames %>% rename(number = n)

select

select is used to produce a data frame only with the specified columns (the columns are in the order that you specify):

babynames %>% select(year, sex, number=n)

(Note that we also renamed the column n to number).

The previous line is equivalent to:

babynames %>% rename(number = n) %>% select(year, sex, number)

mutate

Let’s try to estimate the total number of recorded female newborns in a given year. First, let’s do it with an example. If there were 7065 Marys born in the year, and Marys constitute 0.07 of all newboarn female babies, then we can use the following to find the total number of female newborns \(N\):

\(0.07N = 7065\), so \(N = 7065/0.07 \approx 100929\).

We can use mutate to compute a new column:

b.total <- babynames %>% mutate(total_by_year = round(n/prop))

Here, we compute n/prop (working within the data frame babynames). This means we divided the corresponding entries in n (the total number of babies with a particular name) and prop (the proportion that the babies with that name constitue within their sex). We then rounded the numbers, since the total population should always be an integer.

We stored the resultant data frame in the variable b.total. Let’s take a look at some of the rows of b.total:

b.total
## # A tibble: 1,924,665 x 6
##     year sex   name          n   prop total_by_year
##    <dbl> <chr> <chr>     <int>  <dbl>         <dbl>
##  1  1880 F     Mary       7065 0.0724         97605
##  2  1880 F     Anna       2604 0.0267         97605
##  3  1880 F     Emma       2003 0.0205         97605
##  4  1880 F     Elizabeth  1939 0.0199         97605
##  5  1880 F     Minnie     1746 0.0179         97605
##  6  1880 F     Margaret   1578 0.0162         97605
##  7  1880 F     Ida        1472 0.0151         97605
##  8  1880 F     Alice      1414 0.0145         97605
##  9  1880 F     Bertha     1320 0.0135         97605
## 10  1880 F     Sarah      1288 0.0132         97605
## # … with 1,924,655 more rows

distinct

dplyr’s distinct is similar to the function unique, which we already saw, but it operates on data frames. It takes in a data frame, and returns all the distinct rows. That is, duplicate rows are not included in the returned data frame.

Let’s try this:

b.total <- babynames %>% mutate(total_by_year = round(n/prop))
b.total %>% select(sex, year, total_by_year) %>% distinct
## # A tibble: 36,099 x 3
##    sex    year total_by_year
##    <chr> <dbl>         <dbl>
##  1 F      1880         97605
##  2 F      1880         97604
##  3 F      1880         97606
##  4 F      1880         97603
##  5 F      1880         97607
##  6 F      1880         97602
##  7 F      1880         97609
##  8 F      1880         97599
##  9 M      1880        118400
## 10 M      1880        118399
## # … with 36,089 more rows

What can you conclude from this?

We can also give arguments to distinct. (For precise language fans: an argument is a value we pass to the function; a parameter is the variable we assign the argument to).

babynames %>% distinct(year)
## # A tibble: 138 x 1
##     year
##    <dbl>
##  1  1880
##  2  1881
##  3  1882
##  4  1883
##  5  1884
##  6  1885
##  7  1886
##  8  1887
##  9  1888
## 10  1889
## # … with 128 more rows
babynames %>% distinct(name, year)
## # A tibble: 1,756,284 x 2
##    name       year
##    <chr>     <dbl>
##  1 Mary       1880
##  2 Anna       1880
##  3 Emma       1880
##  4 Elizabeth  1880
##  5 Minnie     1880
##  6 Margaret   1880
##  7 Ida        1880
##  8 Alice      1880
##  9 Bertha     1880
## 10 Sarah      1880
## # … with 1,756,274 more rows

There are as many distinct combination of (name, year) as there are rows in babynames! (Why?)

group_by and summarize

Without group_by, the function summarize simply computes a function of a column. For example:

babynames  %>% summarize(total=sum(n))
## # A tibble: 1 x 1
##       total
##       <int>
## 1 348120517

This produces a new data frame, which contains the sum of the column n.

The function group_by groups the rows of the data frame by the values in the specified columns, for the purposes of using summarize. For example, consider the following:

babynames %>% group_by(year, sex) %>% summarize(total=sum(n))

This groups the rows of babynames into rows with year==1880, sex==“F” … rows with year==1881, sex==“M” … … … rows with year==2000, sex==“F” …

Then, when we apply summarize, we compute the sum of the column n in each of the groups of rows. So we get as many rows as we have different (year, sex) combinations.

We can compute several different outputs:

babynames %>% summarize(mean=mean(n), median=median(n))
## # A tibble: 1 x 2
##    mean median
##   <dbl>  <int>
## 1  181.     12

Recall:

The mean is the sum divided by the count

The median is the number x such that half the set is smaller or equal than x and half the set is larger or equal to x

group_by and mutate

Here is an example of using mutate togerther with group_by:

a <- babynames %>% group_by(year, sex) %>% 
                   mutate(prop.of.total = n/sum(n))
a[1:10, ]
## # A tibble: 10 x 6
## # Groups:   year, sex [1]
##     year sex   name          n   prop prop.of.total
##    <dbl> <chr> <chr>     <int>  <dbl>         <dbl>
##  1  1880 F     Mary       7065 0.0724        0.0776
##  2  1880 F     Anna       2604 0.0267        0.0286
##  3  1880 F     Emma       2003 0.0205        0.0220
##  4  1880 F     Elizabeth  1939 0.0199        0.0213
##  5  1880 F     Minnie     1746 0.0179        0.0192
##  6  1880 F     Margaret   1578 0.0162        0.0173
##  7  1880 F     Ida        1472 0.0151        0.0162
##  8  1880 F     Alice      1414 0.0145        0.0155
##  9  1880 F     Bertha     1320 0.0135        0.0145
## 10  1880 F     Sarah      1288 0.0132        0.0142

When computing n/sum(n), we compute sum(n) for each group of rows – the total number of boys or girls in a given year. So for each row, we divide the entry in n by the approximate sum total of the relevant population (e.g. the number of boys registered in 1880).