We again define the salary offers:
offer <- c(241, 590, 533, 425, 261)
Here is a vector of logicals, with TRUE indicating that the element is greater than 400, and FALSE indicating that it’s not.
offer > 400
## [1] FALSE TRUE TRUE TRUE FALSE
We can also get this vector:
offer < 550
## [1] TRUE FALSE TRUE TRUE TRUE
We can combine those vectors using &:
offer > 400 & offer < 550
## [1] FALSE FALSE TRUE TRUE FALSE
Here, TRUE corresponds to elements that are both greater than 400 and smaller than 550. We can now retrieve the elements that are between 400 and 550 using
offer[offer > 400 & offer < 550]
## [1] 533 425
Recall that FALSE means “drop” and TRUE means “keep”, so we just keep the elements between 400 and 550 (not including 400 and 550).
What about the elements of offer outside the range of 400..550? One way to go is
offer[!(offer > 400 & offer < 550)]
## [1] 241 590 261
The ! operator turns TRUE into FALSE and vice versa, so we keep the elements we dropped before, and drop the elements we kept.
Another way to obtain the same result is
offer[offer <= 400 | offer >= 550]
## [1] 241 590 261
Note that we are now using <= rather than < (and >= rather than >). That’s because we want offer <= 400 | offer >= 550 to be TRUE when an element is exactly 400 (or exactly 550), to be consistent with offer[!(offer > 400 & offer < 550)]. (Think this through if it’s not obvious.)
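One way to check that the two expressions agree, including at the boundaries, is to compare the logical vectors directly. This is only a sketch: the vector test.vals below is made up for illustration (it is not part of the salary data) and was chosen so that it contains exactly 400 and exactly 550.

```r
# Hypothetical test values that include both boundary points, 400 and 550
test.vals <- c(241, 400, 425, 550, 590)

negated <- !(test.vals > 400 & test.vals < 550)
rewritten <- test.vals <= 400 | test.vals >= 550

negated
rewritten

# identical() is TRUE only if the two logical vectors match element by element
identical(negated, rewritten)

# With strict inequalities, the boundary values 400 and 550 are missed,
# so this version does not match the negated expression
wrong <- test.vals < 400 | test.vals > 550
identical(negated, wrong)
```

The first comparison gives TRUE and the second gives FALSE, because the strict version treats 400 and 550 differently.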
Let’s go back to offer. Actually, those were the average offers to various specialties of doctors:
offer <- c(241, 590, 533, 425, 261)
spec <- c("family doc", "cardiologist", "orthopedic", "dermatologist", "psychiatrist")
So, for example, the offers to cardiologists were $590k per year in 2018. Suppose we want to automatically figure out which specialty makes the most money.
So, if we knew it were specialty number 2, we’d go spec[2]. But how do we figure out the 2?
Let’s try another tack. We’d like to compute spec[c(F, T, F, F, F)]. Can we compute the logical vector? Here’s an idea:
offer == max(offer)
## [1] FALSE TRUE FALSE FALSE FALSE
Combining those ideas, we’ll get
spec[offer == max(offer)]
## [1] "cardiologist"
So how do we get the 2? We can first generate the vector
1:length(offer) # generate c(1, 2, 3, ..., length(offer))
## [1] 1 2 3 4 5
And now, we can simply go
(1:length(offer))[offer == max(offer)]
## [1] 2
The following is really unnecessarily complicated, but we could also write
spec[(1:length(offer))[offer == max(offer)]]
## [1] "cardiologist"
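As an aside, base R also has built-in functions that return positions directly, so the index 2 can be obtained without building 1:length(offer) by hand. The sketch below uses which and which.max, which are not used elsewhere in these notes.

```r
offer <- c(241, 590, 533, 425, 261)
spec <- c("family doc", "cardiologist", "orthopedic", "dermatologist", "psychiatrist")

# which() returns the positions at which a logical vector is TRUE
which(offer == max(offer))

# which.max() returns the position of the first maximum
which.max(offer)

# Either position can be used to index spec
spec[which.max(offer)]
```

If several specialties were tied for the top offer, which(offer == max(offer)) would return all of their positions, while which.max(offer) would return only the first.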
Data frames are R’s way of storing tables (note that the salary data we had was also actually a table).
You can define a data frame using the following syntax:
offers <- data.frame(amount = c(241, 590, 533, 425, 261),
                     spec = c("family doc", "cardiologist", "orthopedic", "dermatologist", "psychiatrist"))
Note that amount and spec are the names of columns in the new data frame. We define data frames column-by-column.
The value of offers will be displayed as follows in the console:
offers
## amount spec
## 1 241 family doc
## 2 590 cardiologist
## 3 533 orthopedic
## 4 425 dermatologist
## 5 261 psychiatrist
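Since a data frame is defined column by column, it can also be built from vectors that already exist. The sketch below reuses the offer and spec vectors from earlier and then checks the column names and dimensions of the result.

```r
offer <- c(241, 590, 533, 425, 261)
spec <- c("family doc", "cardiologist", "orthopedic", "dermatologist", "psychiatrist")

# Each argument becomes one column; the argument name becomes the column name
offers <- data.frame(amount = offer, spec = spec)

names(offers)  # column names
dim(offers)    # number of rows, then number of columns
```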
Let’s load a data frame (you must previously have successfully run install.packages("babynames")).
library(babynames) #Load the data frame babynames into R
Here are the first several rows of the data frame:
head(babynames) # Display the first 6 rows of a data frame
## # A tibble: 6 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
We can access, for example, row 2 and column “year” of the table like so:
babynames[2, "year"]
## # A tibble: 1 x 1
## year
## <dbl>
## 1 1880
If we want to access all of row 2, we can omit the second part:
babynames[2, ]
## # A tibble: 1 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Anna 2604 0.0267
We can access rows 2 through 6 like so:
babynames[2:6, ]
## # A tibble: 5 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Anna 2604 0.0267
## 2 1880 F Emma 2003 0.0205
## 3 1880 F Elizabeth 1939 0.0199
## 4 1880 F Minnie 1746 0.0179
## 5 1880 F Margaret 1578 0.0162
We can take only the columns "n" and "year", like this:
babynames[2:6, c("n", "year")]
## # A tibble: 5 x 2
## n year
## <int> <dbl>
## 1 2604 1880
## 2 2003 1880
## 3 1939 1880
## 4 1746 1880
## 5 1578 1880
Finally, if we want a particular column as a vector (rather than as a data frame), we can do the following (make sure to keep track of the quotes)
babynames[5:20, ]$name
## [1] "Minnie" "Margaret" "Ida" "Alice" "Bertha" "Sarah"
## [7] "Annie" "Clara" "Ella" "Florence" "Cora" "Martha"
## [13] "Laura" "Nellie" "Grace" "Carrie"
$ vs. [, "colname"]
Note that sometimes df[, "colname"] will yield a vector (for an ordinary data frame created with data.frame), and sometimes it will yield a data frame (for a tibble, such as babynames). df$colname will always yield a vector. So if you want a vector, use the $ operator.
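The difference can be seen on a small example. This is a sketch: the data frame df and its columns x and y are made up for illustration, and the tibble part runs only if the tibble package happens to be installed.

```r
# A small made-up base R data frame (names are illustrative only)
df <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))

is.data.frame(df[, "x"])                # FALSE: a single column is dropped to a vector
is.data.frame(df[, "x", drop = FALSE])  # TRUE: drop = FALSE keeps the data frame
is.vector(df$x)                         # TRUE

# Tibbles (such as babynames) keep the data frame; run only if tibble is installed
if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::as_tibble(df)
  print(is.data.frame(tb[, "x"]))       # TRUE
  print(is.vector(tb$x))                # TRUE
}
```

In both cases the $ operator returns a plain vector, which is why it is the safer choice when a vector is needed.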
Let’s now write code that finds the most common female name in 1999:
babies.1999 <- babynames[babynames$year == 1999 & babynames$sex == "F", ]
max.name.count <- max(babies.1999$n)
(babies.1999$name)[max.name.count == babies.1999$n]
## [1] "Emily"
Let’s make this into a more general function:
most.common.name <- function(babynames, year, sex){
  b.year.sex <- babynames[babynames$year == year & babynames$sex == sex, ]
  (b.year.sex$name)[b.year.sex$n == max(b.year.sex$n)]
}
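To see the function in action without loading the full data set, here is a sketch that calls it on a tiny made-up table. The data frame toy and all of its names and counts are invented for illustration; only the column names match babynames.

```r
most.common.name <- function(babynames, year, sex){
  b.year.sex <- babynames[babynames$year == year & babynames$sex == sex, ]
  (b.year.sex$name)[b.year.sex$n == max(b.year.sex$n)]
}

# Made-up counts for illustration; not real babynames data
toy <- data.frame(year = c(2000, 2000, 2000, 2000, 2001),
                  sex  = c("F", "F", "M", "M", "F"),
                  name = c("Ann", "Bea", "Cal", "Dev", "Ann"),
                  n    = c(30, 50, 40, 20, 10),
                  stringsAsFactors = FALSE)

most.common.name(toy, 2000, "F")  # "Bea"
most.common.name(toy, 2000, "M")  # "Cal"
```

With the real data loaded, most.common.name(babynames, 1999, "F") should reproduce the "Emily" result obtained above.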
tidyverse
To use pipes, you first need to install and load the tidyverse package. Use the following:
install.packages("tidyverse")
library(tidyverse)
Let’s define two functions:
f <- function(x){
  x^2
}
g <- function(y){
  y + 1
}
We can compute f(g(5)) and g(f(5)). Note that those are not the same: \(f(g(5)) = f(6) = 36\) but \(g(f(5)) = g(25) = 26\).
Now, we’ll compute the same quantities using pipes:
5 %>% f %>% g
## [1] 26
This is the same as computing g(f(5)). The way to think about it is this: we start with 5, then apply f to 5 and obtain f(5), and then apply g to 5 %>% f (i.e., f(5)) and obtain g(f(5)).
To compute f(g(5)), we can use:
5 %>% g %>% f
## [1] 36
This is the same as
5 %>% g() %>% f()
## [1] 36
If the only thing that we’re sending to f is g(5), the parentheses are optional in this notation.
This is useful mostly because it’s easier to read something like x %>% f1 %>% f2 %>% f3 %>% f4 than to read f4(f3(f2(f1(x)))).
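To make the readability point concrete, here is a sketch comparing a nested call with the same computation written as a chain. So that it runs without any packages, it uses base R's native pipe |>, which is an assumption about your setup (it needs R 4.1 or later) and, unlike %>%, always requires the parentheses. With the tidyverse loaded, the same chain should work with %>% in place of |>. The vector x is made up for illustration.

```r
# Made-up input for illustration
x <- c(1, 4, 9, 16)

# Nested form: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped form: read left to right (native pipe; assumes R >= 4.1)
piped <- x |> sqrt() |> mean() |> round(1)

nested
piped
identical(nested, piped)
```

Both forms take square roots, average them, and round to one decimal place; only the order in which the steps appear on the page differs.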
tidyverse/dplyr to wrangle data
filter
The filter function is used to select rows from a data frame. The following are all equivalent. They are different ways of selecting rows for which the year is 1880 and sex is "F".
filter(babynames, year == 1880, sex == "F")
idx <- (babynames$year == 1880) & (babynames$sex == "F")
babynames[idx, ]
babynames[(babynames$year == 1880) & (babynames$sex == "F"), ]
babynames %>% filter(year == 1880 & sex == "F")
babynames %>% filter(year == 1880, sex == "F")
The last way is probably preferable, though different people might have different preferences.
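Separating conditions with a comma inside filter means “and”. For “or”, the condition has to be written out with |. The sketch below illustrates both on a tiny made-up data frame (toy and its values are invented for illustration, and the dplyr package is assumed to be installed).

```r
library(dplyr)

# Made-up rows for illustration; not real babynames data
toy <- data.frame(year = c(1880, 1880, 1881, 1881),
                  sex  = c("F", "M", "F", "M"),
                  name = c("Mary", "John", "Mary", "John"),
                  n    = c(70, 90, 60, 80))

# Comma-separated conditions are combined with "and"
both <- toy %>% filter(year == 1880, sex == "F")

# For "or", write the condition explicitly with |
either <- toy %>% filter(year == 1880 | sex == "F")

nrow(both)    # 1
nrow(either)  # 3
```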
select
We can use select as follows:
b.y.n <- babynames %>% select(year, name)
b.y.n[1:20, ]
## # A tibble: 20 x 2
## year name
## <dbl> <chr>
## 1 1880 Mary
## 2 1880 Anna
## 3 1880 Emma
## 4 1880 Elizabeth
## 5 1880 Minnie
## 6 1880 Margaret
## 7 1880 Ida
## 8 1880 Alice
## 9 1880 Bertha
## 10 1880 Sarah
## 11 1880 Annie
## 12 1880 Clara
## 13 1880 Ella
## 14 1880 Florence
## 15 1880 Cora
## 16 1880 Martha
## 17 1880 Laura
## 18 1880 Nellie
## 19 1880 Grace
## 20 1880 Carrie
We only kept the year and name columns here.
summarize and group_by
First, let’s use summarize to compute the average number of babies per name:
babynames %>% summarize(b.pername = mean(n))
## # A tibble: 1 x 1
## b.pername
## <dbl>
## 1 181.
Inside of summarize, we can refer to columns in the data frame we are processing by name, without the $ operator. (And you shouldn’t be using the $ operator.) This is basically the same as not using summarize and computing mean(babynames$n), which is not that interesting.
The real power of summarize is in using group_by: we can group rows and compute a function such as mean for every group of rows separately. For example, we could compute the average number of babies per name for different sexes separately.
babynames %>% group_by(sex) %>%
  summarize(b.pername = mean(n))
## # A tibble: 2 x 2
## sex b.pername
## <chr> <dbl>
## 1 F 151.
## 2 M 223.
We see that the average number of babies per name for "F" is much smaller than for "M": female names are more diverse, with fewer babies per name, and male names are less diverse, with more babies per name.
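To check what group_by followed by summarize computes, it can help to run it on a table small enough to verify by hand. The sketch below uses a made-up data frame (toy and its values are invented for illustration, and dplyr is assumed to be installed) and compares the result with base R's tapply.

```r
library(dplyr)

# Made-up counts for illustration; not real babynames data
toy <- data.frame(sex  = c("F", "F", "F", "M", "M"),
                  name = c("Ann", "Bea", "Cat", "Dan", "Eli"),
                  n    = c(10, 20, 30, 40, 60))

# One output row per group, with one column per summary
by.sex <- toy %>% group_by(sex) %>%
  summarize(b.pername = mean(n), total = sum(n))
by.sex

# The same group means computed with base R, for comparison
tapply(toy$n, toy$sex, mean)
```

Here the "F" group averages (10 + 20 + 30) / 3 = 20 and the "M" group averages (40 + 60) / 2 = 50, which both approaches should report.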