Week 2 Lecture 1

Pipes and `dplyr`

To use pipes, first install and load the dplyr package (the installation step is only needed once):

install.packages("dplyr")
library(dplyr)

Composition of functions

Let’s define two functions:

f <- function(x){
  return(x**2) # the same as x*x
}

g <- function(y){
  return(2*y)
}

We can compute f(g(5)) and g(f(5)). Note that those are not the same: $f(g(5)) = f(10) = 100$ but $g(f(5)) = g(25) = 50$.
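As a quick check of the two compositions above, the nested calls can be run directly. This sketch redefines f and g so that it runs on its own; it writes the square as x^2, which R treats the same as x**2.

```r
f <- function(x) x^2    # square the input
g <- function(y) 2 * y  # double the input

f(g(5))  # g first: g(5) = 10, then f(10) = 100
g(f(5))  # f first: f(5) = 25, then g(25) = 50
```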

Now, we’ll compute the same quantities using pipes:

5 %>% f %>% g

## [1] 50

This is the same as computing g(f(5)). The way to think about it is this: we start with 5, then apply f to 5 and obtain f(5), and then apply g to 5 %>% f (i.e., f(5)) and obtain g(f(5)).

To compute f(g(5)), we can use:

5 %>% g %>% f

## [1] 100

This is the same as

5 %>% g() %>% f()

## [1] 100

If g(5) is the only argument that we’re passing to f, the parentheses are optional in this notation.

This is useful mostly because it’s easier to read something like x %>% f1 %>% f2 %>% f3 %>% f4 than to read f4(f3(f2(f1(x)))).
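The parentheses stop being optional once a function needs more than the piped value: the value on the left of %>% becomes the first argument, and any further arguments go inside the parentheses. A small sketch, using base R's round purely as an illustration (the vector x is made up):

```r
library(dplyr)  # provides %>%

x <- c(1.234, 5.678)

piped  <- x %>% round(1)  # x is passed as the first argument of round
nested <- round(x, 1)     # the equivalent ordinary call

identical(piped, nested)  # TRUE
```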

Using `dplyr` to wrangle data

`filter`

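The filter function is used to select rows from a data frame. The examples below use the babynames data frame, which is assumed to be loaded already (for instance with library(babynames)). The following are all equivalent. They are different ways of selecting rows for which the year is 1880 and sex is "F".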

filter(babynames, year == 1880, sex == "F")

idx <- (babynames$year == 1880) & (babynames$sex == "F")
babynames[idx, ]

babynames[(babynames$year == 1880) & (babynames$sex == "F"), ]

babynames %>% filter(year == 1880 & sex == "F")

babynames %>% filter(year == 1880, sex == "F")

The last way is probably preferable, though different people might have different preferences.
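To see the equivalence on something small enough to inspect by eye, here is a sketch on a made-up data frame. The name toy and all of its values are invented for illustration; only the column names mirror babynames.

```r
library(dplyr)

# Invented stand-in for babynames (same column names, made-up values)
toy <- data.frame(
  year = c(1880, 1880, 1880, 1881),
  sex  = c("F", "F", "M", "F"),
  name = c("Mary", "Anna", "John", "Mary"),
  n    = c(70, 26, 97, 69)
)

with_filter <- toy %>% filter(year == 1880, sex == "F")
with_index  <- toy[(toy$year == 1880) & (toy$sex == "F"), ]

with_filter$name  # "Mary" "Anna"
with_index$name   # "Mary" "Anna"
```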

`arrange`

arrange is used to produce a data frame sorted in whatever way you want to specify. See what happens when you run

View(babynames %>% arrange(name))

The resultant data frame is sorted alphabetically by name. If you want to sort things in descending order, use

View(babynames %>% arrange(desc(name)))

Note that, similarly to filter and other functions, this does not change babynames: it just produces a new data frame with the same contents, sorted in the way we specify. You can also sort by year, and then also sort by name, within the same year. Try running and viewing the following:

babynames %>% arrange(year, name)
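The same behavior can be checked on a tiny made-up data frame (toy and its values are invented for illustration). The sketch sorts by year and then by name within a year, and confirms that the original data frame is left untouched.

```r
library(dplyr)

# Invented data, deliberately out of order
toy <- data.frame(
  year = c(1881, 1880, 1880),
  name = c("Mary", "John", "Anna"),
  n    = c(69, 97, 26)
)

sorted  <- toy %>% arrange(year, name)
by_desc <- toy %>% arrange(desc(name))

sorted$name   # "Anna" "John" "Mary"
by_desc$name  # "Mary" "John" "Anna"
toy$name      # unchanged: "Mary" "John" "Anna"
```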

`rename`

The following produces a data frame with the same contents as babynames, but with the n column now named number:

babynames %>% rename(number = n)

`select`

select is used to produce a data frame only with the specified columns (the columns are in the order that you specify):

babynames %>% select(year, sex, number=n)

(Note that we also renamed the column n to number).

The previous line is equivalent to:

babynames %>% rename(number = n) %>% select(year, sex, number)
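A small sketch that compares the two forms on a made-up data frame (toy and its values are invented for illustration):

```r
library(dplyr)

# Invented stand-in for babynames
toy <- data.frame(
  year = c(1880, 1880),
  sex  = c("F", "M"),
  name = c("Mary", "John"),
  n    = c(70, 97)
)

one_step <- toy %>% select(year, sex, number = n)
two_step <- toy %>% rename(number = n) %>% select(year, sex, number)

names(one_step)  # "year" "sex" "number"
names(two_step)  # "year" "sex" "number"
```

Both results drop the name column and carry the values of n under the new column name number.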

`mutate`

Let’s try to estimate the total number of recorded female newborns in a given year. First, let’s do it with an example. If there were 7065 Marys born in the year, and Marys constitute 0.07 of all newborn female babies, then we can use the following to find the total number of female newborns $N$:

$0.07N = 7065$, so $N = 7065/0.07 \approx 100929$.

We can use mutate to compute a new column:

b <- babynames %>% mutate(total_by_year = round(n/prop))

Here, we compute n/prop (working within the data frame babynames). This means we divide the corresponding entries in n (the total number of babies with a particular name) by those in prop (the proportion that the babies with that name constitute within their sex). We then round the numbers, since the total population should always be an integer.

We stored the resultant data frame in the variable b.
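The Mary example from above can be reproduced with mutate on a one-row data frame. This is only a sketch: toy and b_toy are names made up here, and the values are the rounded figures used in the worked example, not a row of babynames.

```r
library(dplyr)

# One invented row holding the numbers from the worked example
toy <- data.frame(name = "Mary", n = 7065, prop = 0.07)

b_toy <- toy %>% mutate(total_by_year = round(n / prop))

b_toy$total_by_year  # 100929
```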

`distinct`

dplyr’s distinct is similar to the function unique, which we already saw, but it operates on data frames. It takes in a data frame, and returns all the distinct rows. That is, duplicate rows are not included in the returned data frame.

b <- babynames %>% mutate(total_by_year = round(n/prop))
b %>% select(sex, year, total_by_year) %>% distinct
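To see what distinct does after the name column is dropped, here is a sketch on made-up data. The data frame toy and its totals are invented; the two female rows are given the same total on purpose so that they collapse into a single row.

```r
library(dplyr)

# Invented data: two names share the same (sex, year, total_by_year)
toy <- data.frame(
  sex           = c("F", "F", "M"),
  year          = c(1880, 1880, 1880),
  name          = c("Mary", "Anna", "John"),
  total_by_year = c(1000, 1000, 1200)
)

totals <- toy %>% select(sex, year, total_by_year) %>% distinct()

nrow(totals)  # 2: one row for F and one row for M
```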

`group_by` and `summarize`

Without group_by, the function summarize simply computes a function of a column. For example:

babynames  %>% summarize(total=sum(n))

## # A tibble: 1 x 1
##       total
##       <int>
## 1 340851912

This produces a new data frame, which contains the sum of column n.

The function group_by groups the rows of the data frame by the values in the specified columns, for the purposes of using summarize. For example, consider the following:

babynames %>% group_by(year, sex) %>% summarize(total=sum(n))

This groups the rows of babynames into:

rows with year==1880, sex=="F"
…
rows with year==1881, sex=="M"
…
…
…
rows with year==2000, sex=="F"
…

Then, when we apply summarize, we compute the sum of the column n in each of the groups of rows. So we get as many rows as we have different (year, sex) combinations.
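On a data frame small enough to add up by hand, the grouping looks like this. The sketch uses an invented data frame toy, not babynames, with three different (year, sex) combinations.

```r
library(dplyr)

# Invented stand-in for babynames
toy <- data.frame(
  year = c(1880, 1880, 1880, 1881),
  sex  = c("F", "F", "M", "F"),
  name = c("Mary", "Anna", "John", "Mary"),
  n    = c(70, 26, 97, 69)
)

totals <- toy %>% group_by(year, sex) %>% summarize(total = sum(n))

nrow(totals)  # 3, one row per (year, sex) combination
totals$total  # 96 97 69
```

The first total, 96, is the sum of the two rows with year 1880 and sex "F" (70 + 26).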

We can compute several different outputs:

babynames %>% summarize(mean=mean(n), median=median(n))

## # A tibble: 1 x 2
##    mean median
##   <dbl>  <int>
## 1  183.     12

Recall:

The mean is the sum divided by the count.
The median is the number x such that half the set is smaller than or equal to x and half the set is larger than or equal to x.
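Both definitions can be checked in base R on a short vector. The values below are made up; they are skewed so that, as in the output above, the mean ends up far larger than the median.

```r
v <- c(1, 2, 12, 100, 800)  # invented, heavily skewed values

sum(v) / length(v)  # 183, the sum divided by the count
mean(v)             # 183
median(v)           # 12, the middle value once v is sorted
```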

`n_distinct`

n_distinct(df) is the number of distinct rows in a data frame df. (Recall that for vectors, we could use length(unique(v)).)

Here, we are computing the number of distinct names for each sex:

babynames %>% group_by(sex) %>% summarize(distinct_names_year_sex = n_distinct(name))

## # A tibble: 2 x 2
##   sex   distinct_names_year_sex
##   <chr>                   <int>
## 1 F                       65658
## 2 M                       39728
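Inside summarize, n_distinct is applied to a column, which is a vector. A short sketch on a made-up vector shows that it agrees with the length(unique(v)) approach:

```r
library(dplyr)

v <- c("Mary", "Anna", "Mary", "John", "Anna")  # invented values

n_distinct(v)      # 3
length(unique(v))  # 3
```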

`which.max`

which.max(c(10, 20, 3, 10)) returns the location (i.e., index) of the maximum (see last week’s lectures for a more complicated way of achieving this).

What is the most popular name every year, by sex?

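# Without grouping: the name in the single row with the largest n overall
babynames %>% summarize(most_popular_name = name[which.max(n)])
# With grouping: the name with the largest n within each (year, sex) group
View(babynames %>% group_by(year, sex) %>% summarize(most_popular_name = name[which.max(n)]))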
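The indexing trick name[which.max(n)] can be traced on a small example. The sketch below first runs which.max on the vector from the text, then applies the grouped version to an invented data frame toy (not babynames). If the maximum occurs more than once, which.max returns the index of its first occurrence.

```r
library(dplyr)

which.max(c(10, 20, 3, 10))  # 2, since the maximum (20) is the second entry

# Invented stand-in for babynames
toy <- data.frame(
  year = c(1880, 1880, 1880, 1881),
  sex  = c("F", "F", "M", "F"),
  name = c("Mary", "Anna", "John", "Mary"),
  n    = c(70, 26, 97, 69)
)

top <- toy %>%
  group_by(year, sex) %>%
  summarize(most_popular_name = name[which.max(n)])

top$most_popular_name  # "Mary" "John" "Mary"
```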

Number of distinct names per capita, every year, for `sex=="F"`

(To be discussed again on Thursday)

a <- babynames %>% group_by(sex,year) %>% 
  summarize(distinct = n_distinct(name),total_by_year = sum(n)) %>% 
  filter(sex=="F") %>% 
  mutate(distinct_per_capita = distinct/total_by_year) %>% 
  ungroup %>% 
  select(year, distinct_per_capita)

plot(a$year, a$distinct_per_capita)
