Review of vector subsets.

We again define the salary offers:

offer <- c(241, 590, 533, 425, 261)

Here is a vector of logicals, with TRUE indicating that the element is greater than 400, and FALSE indicating that it’s not.

offer > 400
## [1] FALSE  TRUE  TRUE  TRUE FALSE

We can also get this vector:

offer < 550
## [1]  TRUE FALSE  TRUE  TRUE  TRUE

We can combine those vectors using &:

offer > 400 & offer < 550
## [1] FALSE FALSE  TRUE  TRUE FALSE

Here, TRUE corresponds to elements that are both greater than 400 and smaller than 550. We can now retrieve the elements that are between 400 and 550 using

offer[offer > 400 & offer < 550]
## [1] 533 425

Recall that FALSE means “drop” and TRUE means “keep”, so we keep just the elements strictly between 400 and 550 (not including 400 and 550 themselves).

What about the elements of offer outside the range of 400..550? One way to go is

offer[!(offer > 400 & offer < 550)]
## [1] 241 590 261

The ! operator turns TRUE into FALSE and vice versa, so we keep the elements we dropped before, and drop the elements we kept.

Another way to obtain the same result is

offer[offer <= 400 | offer >= 550]
## [1] 241 590 261

Note that we are now using <= and >= rather than < and >. That’s because we want offer <= 400 | offer >= 550 to be TRUE when the element is exactly 400 or exactly 550, to be consistent with offer[!(offer > 400 & offer < 550)]. (Think this through if it’s not obvious.)
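If thinking it through does not settle it, we can also just ask R. The following is a small sketch (the variable names inside, outside.not, outside.or and edge are made up for illustration) that checks that the two ways of writing "outside the range" agree, and shows what goes wrong at the boundary values if we use < and > instead:

```r
offer <- c(241, 590, 533, 425, 261)

inside <- offer > 400 & offer < 550

# Negating the combined condition ...
outside.not <- !inside
# ... should match flipping each comparison and replacing & with |
outside.or <- offer <= 400 | offer >= 550

identical(outside.not, outside.or)
## [1] TRUE

# Hypothetical boundary values, to see why <= and >= are needed
edge <- c(400, 550)
!(edge > 400 & edge < 550)
## [1] TRUE TRUE
edge <= 400 | edge >= 550
## [1] TRUE TRUE
edge < 400 | edge > 550    # disagrees with the negated version
## [1] FALSE FALSE
```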

Parallel vectors

Let’s go back to offer. Actually, those were the average offers to various specialties of doctors:

offer <- c(241, 590, 533, 425, 261)
spec <- c("family doc", "cardiologist", "orthopedic", "dermatologist", "psychiatrist")

So, for example, the offers to cardiologists were $590k per year in 2018. Suppose we want to automatically figure out which specialty makes the most money.

So, if we knew it were specialty number 2, we’d go spec[2]. But how do we figure out the 2?

Let’s try another tack. We’d like to compute spec[c(F, T, F, F, F)]. Can we compute the logical vector? Here’s an idea:

offer == max(offer)
## [1] FALSE  TRUE FALSE FALSE FALSE

Combining those ideas, we’ll get

spec[offer == max(offer)]
## [1] "cardiologist"

So how do we get the 2? We can first generate the vector

1:length(offer)  # generate c(1, 2, 3, ..., length(offer))
## [1] 1 2 3 4 5

And now, we can simply go

(1:length(offer))[offer == max(offer)]
## [1] 2

The following is really unnecessarily complicated, but we could have

spec[(1:length(offer))[offer == max(offer)]]
## [1] "cardiologist"
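As an aside, base R has built-in functions that do the index-finding step for us. Here is a short sketch using which and which.max. It is an alternative to the approach above, not a replacement for understanding it:

```r
offer <- c(241, 590, 533, 425, 261)
spec <- c("family doc", "cardiologist", "orthopedic", "dermatologist", "psychiatrist")

which(offer == max(offer))   # positions where the logical vector is TRUE
## [1] 2
which.max(offer)             # position of the (first) maximum
## [1] 2
spec[which.max(offer)]
## [1] "cardiologist"
```

If several elements tie for the maximum, which(offer == max(offer)) returns all of their positions, while which.max(offer) returns only the first one.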

Data frames

Data frames are R’s way of storing tables (note that the salary data we had was also actually a table).

Defining your own data frames

You can define a data frame using the following syntax:

offers <- data.frame(amount = c(241, 590, 533, 425, 261),
                     spec = c("family doc", "cardiologist", "orthopedic", "dermatologist", "psychiatrist"))

Note that amount and spec are the names of the columns in the new data frame. We define data frames column-by-column.

The value of offers will be displayed as follows in the console:

offers
##   amount          spec
## 1    241    family doc
## 2    590  cardiologist
## 3    533    orthopedic
## 4    425 dermatologist
## 5    261  psychiatrist
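Since each column of a data frame is a vector, the parallel-vector tricks from the previous section should carry over directly. A minimal sketch follows; stringsAsFactors = FALSE is added here as a precaution so that spec stays a character vector on older versions of R, and is not part of the definition above:

```r
offers <- data.frame(amount = c(241, 590, 533, 425, 261),
                     spec = c("family doc", "cardiologist", "orthopedic", "dermatologist", "psychiatrist"),
                     stringsAsFactors = FALSE)

nrow(offers)   # number of rows
## [1] 5
ncol(offers)   # number of columns
## [1] 2

# The columns are parallel vectors, so we can index one with a condition on the other
offers$spec[offers$amount == max(offers$amount)]
## [1] "cardiologist"
```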

Let’s load a data frame (you must previously have successfully run install.packages("babynames")).

library(babynames) #Load the data frame babynames into R

Here are the first several rows of the data frame

head(babynames) # Display the first 6 rows of a data frame
## # A tibble: 6 x 5
##    year sex   name          n   prop
##   <dbl> <chr> <chr>     <int>  <dbl>
## 1  1880 F     Mary       7065 0.0724
## 2  1880 F     Anna       2604 0.0267
## 3  1880 F     Emma       2003 0.0205
## 4  1880 F     Elizabeth  1939 0.0199
## 5  1880 F     Minnie     1746 0.0179
## 6  1880 F     Margaret   1578 0.0162

We can access, for example, row 2 and column “year” of the table like so:

babynames[2, "year"]
## # A tibble: 1 x 1
##    year
##   <dbl>
## 1  1880

If we want to access all of row 2, we can omit the second part:

babynames[2, ]
## # A tibble: 1 x 5
##    year sex   name      n   prop
##   <dbl> <chr> <chr> <int>  <dbl>
## 1  1880 F     Anna   2604 0.0267

We can access rows 2 through 6 like so:

babynames[2:6, ]
## # A tibble: 5 x 5
##    year sex   name          n   prop
##   <dbl> <chr> <chr>     <int>  <dbl>
## 1  1880 F     Anna       2604 0.0267
## 2  1880 F     Emma       2003 0.0205
## 3  1880 F     Elizabeth  1939 0.0199
## 4  1880 F     Minnie     1746 0.0179
## 5  1880 F     Margaret   1578 0.0162

We can take only the columns "n" and "year", like this:

babynames[2:6, c("n", "year")]
## # A tibble: 5 x 2
##       n  year
##   <int> <dbl>
## 1  2604  1880
## 2  2003  1880
## 3  1939  1880
## 4  1746  1880
## 5  1578  1880

Finally, if we want a particular column as a vector (rather than as a data frame), we can do the following (keep track of the quotes: the column name after $ is not quoted):

babynames[5:20, ]$name
##  [1] "Minnie"   "Margaret" "Ida"      "Alice"    "Bertha"   "Sarah"   
##  [7] "Annie"    "Clara"    "Ella"     "Florence" "Cora"     "Martha"  
## [13] "Laura"    "Nellie"   "Grace"    "Carrie"

A note about $ vs. [, "colname"]

Note that df[, "colname"] sometimes yields a vector (for an ordinary data frame created with data.frame) and sometimes yields a data frame (for a tibble, such as babynames). df$colname will always yield a vector. So if you want a vector, use the $ operator.
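The difference can be checked directly. The sketch below uses a tiny data frame with made-up contents (df and tb are illustrative names). The tibble part runs only if the tibble package is installed, which is an assumption about your setup:

```r
df <- data.frame(year = c(1880, 1880),
                 name = c("Mary", "Anna"),
                 stringsAsFactors = FALSE)

is.data.frame(df[, "name"])                 # a single column is dropped to a vector
## [1] FALSE
is.data.frame(df[, "name", drop = FALSE])   # drop = FALSE keeps the data frame
## [1] TRUE
df$name                                     # always a vector
## [1] "Mary" "Anna"

# With a tibble, [, "name"] stays a table, while $ still gives a vector
if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::as_tibble(df)
  print(is.data.frame(tb[, "name"]))   # TRUE
  print(is.character(tb$name))         # TRUE
}
```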

Working with data frames

Let’s now write code that finds the most common female name in 1999:

babies.1999 <- babynames[babynames$year == 1999 & babynames$sex == "F", ]
max.name.count <- max(babies.1999$n)
(babies.1999$name)[max.name.count == babies.1999$n]
## [1] "Emily"

Let’s make this into a more general function:

most.common.name <- function(babynames, year, sex){
  b.year.sex <- babynames[babynames$year == year & babynames$sex == sex,  ]
  (b.year.sex$name)[b.year.sex$n == max(b.year.sex$n)]
}
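To try the function without loading the full data set, we can call it on a small stand-in table. In the sketch below, toy and all of its counts are invented for illustration; only the column names match babynames:

```r
most.common.name <- function(babynames, year, sex){
  b.year.sex <- babynames[babynames$year == year & babynames$sex == sex,  ]
  (b.year.sex$name)[b.year.sex$n == max(b.year.sex$n)]
}

# Hypothetical data with the same column names as babynames
toy <- data.frame(year = c(1999, 1999, 1999, 2000),
                  sex  = c("F", "F", "M", "F"),
                  name = c("Emily", "Hannah", "Jacob", "Emily"),
                  n    = c(300, 200, 350, 250),
                  stringsAsFactors = FALSE)

most.common.name(toy, 1999, "F")
## [1] "Emily"
most.common.name(toy, 1999, "M")
## [1] "Jacob"
```

With the real data, the call would be most.common.name(babynames, 1999, "F"), which should reproduce the "Emily" result from above.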

Pipes and the tidyverse

To use pipes, first you need to install and load the tidyverse library. Use the following:

install.packages("tidyverse")
library(tidyverse)

Composition of functions

Let’s define two functions:

f <- function(x){
  x^2
}

g <- function(y){
  y + 1
}

We can compute f(g(5)) and g(f(5)). Note that those are not the same: \(f(g(5)) = f(6) = 36\) but \(g(f(5)) = g(25) = 26\).

Now, we’ll compute the same quantities using pipes:

5 %>% f %>% g 
## [1] 26

This is the same as computing g(f(5)). The way to think about it is this: we start with 5, then apply f to 5 and obtain f(5), and then apply g to 5 %>% f (i.e., f(5)) and obtain g(f(5)).

To compute f(g(5)), we can use:

5 %>% g %>% f 
## [1] 36

This is the same as

5 %>% g() %>% f()
## [1] 36

If the only thing that we’re sending to f is g(5), the parentheses are optional in this notation.

This is useful mostly because it’s easier to read something like x %>% f1 %>% f2 %>% f3 %>% f4 than to read f4(f3(f2(f1(x)))).
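The parentheses stop being optional once a function needs more than one argument: x %>% f(y) is read as f(x, y), with the piped value going in as the first argument. A small sketch, assuming the magrittr package is installed (it supplies %>%, which library(tidyverse) also makes available); the vector x is made up:

```r
library(magrittr)  # provides %>%

x <- c(1.234, 5.678)

x %>% round(1)           # same as round(x, 1)
## [1] 1.2 5.7
x %>% round(1) %>% sum   # same as sum(round(x, 1))
## [1] 6.9
```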

Using tidyverse/dplyr to wrangle data

filter

The filter function is used to select rows from a data frame. The following are all equivalent. They are different ways of selecting rows for which the year is 1880 and sex is "F".

filter(babynames, year == 1880, sex == "F")

idx <- (babynames$year == 1880) & (babynames$sex == "F")
babynames[idx, ]

babynames[(babynames$year == 1880) & (babynames$sex == "F"), ]

babynames %>% filter(year == 1880 & sex == "F")

babynames %>% filter(year == 1880, sex == "F")

The last way is probably preferable, though different people might have different preferences.
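To convince ourselves that filter and logical indexing pick out the same rows, we can compare them on a small table. This is only a sketch: toy and its counts are invented, and dplyr is assumed to be installed:

```r
library(dplyr)

# Hypothetical data with some of the babynames columns
toy <- data.frame(year = c(1880, 1880, 1881),
                  sex  = c("F", "M", "F"),
                  name = c("Mary", "John", "Anna"),
                  n    = c(30, 20, 10),
                  stringsAsFactors = FALSE)

by.filter <- toy %>% filter(year == 1880, sex == "F")
by.index  <- toy[(toy$year == 1880) & (toy$sex == "F"), ]

by.filter$name
## [1] "Mary"
identical(by.filter$name, by.index$name)
## [1] TRUE
```

One caveat: the two approaches agree here because the table has no missing values. When the condition evaluates to NA for some row, filter drops that row, while logical indexing returns a row of NAs.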

select

We can use select as follows:

b.y.n <- babynames %>% select(year, name)
b.y.n[1:20, ]
## # A tibble: 20 x 2
##     year name     
##    <dbl> <chr>    
##  1  1880 Mary     
##  2  1880 Anna     
##  3  1880 Emma     
##  4  1880 Elizabeth
##  5  1880 Minnie   
##  6  1880 Margaret 
##  7  1880 Ida      
##  8  1880 Alice    
##  9  1880 Bertha   
## 10  1880 Sarah    
## 11  1880 Annie    
## 12  1880 Clara    
## 13  1880 Ella     
## 14  1880 Florence 
## 15  1880 Cora     
## 16  1880 Martha   
## 17  1880 Laura    
## 18  1880 Nellie   
## 19  1880 Grace    
## 20  1880 Carrie

We only kept the year and name columns here.
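select can also be told which columns to drop rather than which to keep, by putting a minus sign in front of the column name. A minimal sketch on an invented table (toy, kept and dropped are illustrative names; dplyr is assumed to be installed):

```r
library(dplyr)

# Hypothetical data with some of the babynames columns
toy <- data.frame(year = c(1880, 1880, 1881),
                  sex  = c("F", "M", "F"),
                  name = c("Mary", "John", "Anna"),
                  n    = c(30, 20, 10),
                  stringsAsFactors = FALSE)

kept <- toy %>% select(year, name)   # keep only these two columns
names(kept)
## [1] "year" "name"

dropped <- toy %>% select(-n)        # keep everything except n
ncol(dropped)
## [1] 3
```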

summarize and group_by

First, let’s use summarize to compute the average number of babies per name:

babynames %>% summarize(b.pername = mean(n))
## # A tibble: 1 x 1
##   b.pername
##       <dbl>
## 1      181.

Inside of summarize, we can refer to columns in the data frame we are processing by name, without the $ operator. (And you shouldn’t be using the $ operator.) This is basically the same as not using summarize and computing mean(babynames$n) – not that interesting.

The real power of summarize is in using group_by – we can group rows and compute a function such as mean for every group of rows separately. For example, we could compute the average number of babies per name for different sexes separately.

babynames %>% group_by(sex) %>% 
              summarize(b.pername = mean(n))
## # A tibble: 2 x 2
##   sex   b.pername
##   <chr>     <dbl>
## 1 F          151.
## 2 M          223.

We see that the number of babies per name for "F" is much smaller than for "M" – female names are more diverse with fewer babies per name, and male names are less diverse with more babies per name.
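summarize can compute several things per group at once, which lets us combine the ideas so far: for each group, find the mean count and the most common name. The sketch below uses an invented table (toy and its counts are made up, and top.name is a hypothetical column name), with which.max picking one name per group:

```r
library(dplyr)

# Hypothetical data with some of the babynames columns
toy <- data.frame(year = c(1999, 1999, 1999, 1999),
                  sex  = c("F", "F", "M", "M"),
                  name = c("Emily", "Hannah", "Jacob", "Michael"),
                  n    = c(300, 200, 350, 250),
                  stringsAsFactors = FALSE)

res <- toy %>% group_by(sex) %>%
               summarize(b.pername = mean(n),
                         top.name = name[which.max(n)])

res$b.pername   # one mean per group, in the order F, M
## [1] 250 300
res$top.name    # most common name within each group
## [1] "Emily" "Jacob"
```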