“Syntax” is the set of rules according to which R statements must be constructed. For example, as we saw in the previous lecture, to construct an if-statement, you must write things as
if(<CONDITION1>){
  <STATEMENT1>
  <STATEMENT2>
  ...
}else if(<CONDITION2>){
  <STATEMENT3>
  <STATEMENT4>
  ...
}else if(<CONDITION3>){
  <STATEMENT5>
  <STATEMENT6>
  ...
}...else{
  <STATEMENT7>
  ...
}
The parentheses around each condition and the curly braces around each group of statements are mandatory in this form. Violating a syntax rule usually makes R report an error.
(Syntax is distinct from semantics – the rules that determine the meaning of R statements. This is analogous to how the terms are used for human languages.)
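As a concrete instance of the template above, here is a small sketch. The variable names offer.amount and level and the cutoffs 500 and 300 are made up for illustration.

```r
offer.amount <- 425

# Exactly one branch runs: the first one whose condition is TRUE
if(offer.amount > 500){
  level <- "high"
}else if(offer.amount > 300){
  level <- "medium"
}else{
  level <- "low"
}

level
## [1] "medium"
```

Note that each else sits on the same line as the closing curly brace before it, as in the template.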
Here is an example of a vector:
offer <- c(241, 590, 533, 425, 261)
A vector is a sequence of values of the same type (you can also have vectors of characters). Here is how you can access elements of a vector:
offer[1]
## [1] 241
offer[4]
## [1] 425
You can find the length of a vector (i.e., the number of its elements) like so:
length(offer)
## [1] 5
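To illustrate the remark about vectors of characters, here is a short sketch. The names initials and mixed are made up for this example. It also shows what "of the same type" means in practice: if you mix numbers and characters in c(), R converts everything to characters.

```r
initials <- c("a", "b", "c")   # a vector of characters
initials[2]
## [1] "b"
length(initials)
## [1] 3

# Mixing types: the number 241 is converted to the character "241"
mixed <- c(241, "cardiologist")
mixed
## [1] "241"          "cardiologist"
class(mixed)
## [1] "character"
```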
Aside: technically, even a single number in R is a vector. It’s just a vector of length 1.
a <- 42
a[1]
## [1] 42
You can even do the following, if you insist:
42[1]
## [1] 42
Here are some things we can do with vectors:
sort(offer) # Compute a sorted version of the vector, in increasing order
## [1] 241 261 425 533 590
unique(c(1, 2, 1, 4, 5, 2)) # Get a vector with every element of the input appearing once
## [1] 1 2 4 5
max(offer)
## [1] 590
min(offer)
## [1] 241
Here is how we can compute the max without using the max function:
sort(offer)[length(offer)]
## [1] 590
(Explain this!)
We can also perform operations on all the elements of a vector.
offer > 533
## [1] FALSE TRUE FALSE FALSE FALSE
offer == 533
## [1] FALSE FALSE TRUE FALSE FALSE
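Comparisons are not the only operations that work this way. As a small sketch (these two lines are an addition, not part of the original examples), arithmetic is also applied to every element:

```r
offer <- c(241, 590, 533, 425, 261)

# Multiply every element by 1000 (thousands of dollars to dollars)
in.dollars <- offer * 1000
in.dollars
## [1] 241000 590000 533000 425000 261000

# Subtract the smallest offer from every element
above.min <- offer - min(offer)
above.min
## [1]   0 349 292 184  20
```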
A powerful way of extracting elements from a vector is by indexing the vector using logical vectors. Here is how we can get the second and third elements of the vector:
offer[c(F, T, T, F, F)]
## [1] 590 533
Of course, we could compute the logical vector above using
offer > 500
## [1] FALSE TRUE TRUE FALSE FALSE
Combining those, we can go
offer[offer > 500]
## [1] 590 533
Note what we did: we extracted the elements that are larger than 500.
Suppose we wanted something fancier: extracting all elements between 400 and 550. In other words, suppose we want the elements that are both larger than 400 and smaller than 550. To achieve this, we would want to combine the conditions “larger than 400” and “smaller than 550”.
We can do this using logical operators
AND: a & b. TRUE only if a is TRUE and b is TRUE. FALSE otherwise
OR: a | b. TRUE if at least one of a or b is TRUE, FALSE otherwise
NOT: !a. TRUE if a is FALSE, FALSE if a is TRUE
Some examples:
pie <- TRUE
icecream <- FALSE
pie | icecream
## [1] TRUE
pie <- FALSE
icecream <- FALSE
pie | icecream
## [1] FALSE
pie <- TRUE
icecream <- FALSE
pie & icecream
## [1] FALSE
pie <- TRUE
icecream <- TRUE
pie | icecream
## [1] TRUE
Note: this is not quite how it works in English. If I say I will have pie or ice cream, and then have both, what I said wasn’t true. But for R, the expression pie | icecream is TRUE. Technically, | is called “inclusive OR” (as opposed to the “exclusive OR” we usually mean in English).
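If you do want the exclusive OR, base R provides the function xor(). It is not used elsewhere in these notes; the sketch below is only meant to show the contrast with |.

```r
pie <- TRUE
icecream <- TRUE

pie | icecream        # inclusive OR: TRUE when at least one is TRUE
## [1] TRUE
xor(pie, icecream)    # exclusive OR: TRUE when exactly one is TRUE
## [1] FALSE
xor(TRUE, FALSE)
## [1] TRUE
```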
pie <- FALSE
icecream <- TRUE
pie | icecream
## [1] TRUE
pie <- FALSE
icecream <- FALSE
pie | icecream
## [1] FALSE
pie <- TRUE
!pie
## [1] FALSE
So with all that said, we can go back to our original task:
offer[offer > 400 & offer < 550]
## [1] 533 425
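To see why this works, it can help to compute the pieces one at a time. The intermediate names above.400, below.550, and both are made up for this sketch. Note that & combines the two logical vectors element by element.

```r
offer <- c(241, 590, 533, 425, 261)

above.400 <- offer > 400
above.400
## [1] FALSE  TRUE  TRUE  TRUE FALSE

below.550 <- offer < 550
below.550
## [1]  TRUE FALSE  TRUE  TRUE  TRUE

# TRUE only in the positions where both vectors are TRUE
both <- above.400 & below.550
both
## [1] FALSE FALSE  TRUE  TRUE FALSE

offer[both]
## [1] 533 425
```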
Now, how could we get the other elements? Here is one way:
offer[!(offer > 400 & offer < 550)]
## [1] 241 590 261
and here is another:
offer[offer <= 400 | offer >= 550]
## [1] 241 590 261
Why do both make sense? Why do we use >= and not > in the last one?
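One way to check (not explain!) that the two conditions agree is to compare the logical vectors directly. The names negated and rewritten are made up for this sketch; identical() is a base R function that tests whether two objects are exactly the same.

```r
offer <- c(241, 590, 533, 425, 261)

negated   <- !(offer > 400 & offer < 550)
rewritten <- offer <= 400 | offer >= 550

negated
## [1]  TRUE  TRUE FALSE FALSE  TRUE
identical(negated, rewritten)
## [1] TRUE
```

Of course, agreeing on this one vector does not prove the two conditions always agree; that still needs an argument.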
Let’s go back to offer. Actually, those were the average offers to various specialties of doctors:
offer <- c(241, 590, 533, 425, 261)
spec <- c("family doc", "cardiologist", "orthopedic", "dermatologist", "psychiatrist")
So, for example, the offers to cardiologists were $590k per year in 2018. Suppose we want to automatically figure out which specialty makes the most money.
If we knew it were specialty number 2, we’d go spec[2]. But how do we figure out the 2?
Let’s try another tack. We’d like to compute spec[c(F, T, F, F, F)]. Can we compute the logical vector? Here’s an idea:
offer == max(offer)
## [1] FALSE TRUE FALSE FALSE FALSE
Combining those ideas, we’ll get
spec[offer == max(offer)]
## [1] "cardiologist"
So how do we get the 2? We can first generate the vector
1:length(offer) # generate c(1, 2, 3, ..., length(offer))
## [1] 1 2 3 4 5
And now, we can simply go
(1:length(offer))[offer == max(offer)]
## [1] 2
The following is unnecessarily complicated, but we could also write
spec[(1:length(offer))[offer == max(offer)]]
## [1] "cardiologist"
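For comparison, base R has built-in functions that do the index computation for us: which() returns the positions where a logical vector is TRUE, and which.max() returns the position of the (first) largest element. These functions are not used elsewhere in these notes; the sketch below just shows that they give the same answers as the approach above.

```r
offer <- c(241, 590, 533, 425, 261)
spec <- c("family doc", "cardiologist", "orthopedic", "dermatologist", "psychiatrist")

which(offer == max(offer))   # positions where the condition is TRUE
## [1] 2
which.max(offer)             # position of the largest element
## [1] 2
spec[which.max(offer)]
## [1] "cardiologist"
```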
Data frames are R’s way of storing tables (note that the salary data we had was also actually a table). Let’s load a data frame (you must previously have successfully run install.packages("babynames")).
library(babynames) #Load the data frame babynames into R
Here are the first several rows of the data frame
head(babynames) # Display the first 6 rows of a data frame
## # A tibble: 6 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
We can access, for example, row 2 and column “year” of the table like so:
babynames[2, "year"]
## # A tibble: 1 x 1
## year
## <dbl>
## 1 1880
If we want to access all of row 2, we can omit the second part:
babynames[2, ]
## # A tibble: 1 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Anna 2604 0.0267
We can access rows 2 through 6 like so:
babynames[2:6, ]
## # A tibble: 5 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Anna 2604 0.0267
## 2 1880 F Emma 2003 0.0205
## 3 1880 F Elizabeth 1939 0.0199
## 4 1880 F Minnie 1746 0.0179
## 5 1880 F Margaret 1578 0.0162
We can take only the columns "n" and "year", like this:
babynames[2:6, c("n", "year")]
## # A tibble: 5 x 2
## n year
## <int> <dbl>
## 1 2604 1880
## 2 2003 1880
## 3 1939 1880
## 4 1746 1880
## 5 1578 1880
Finally, if we want a particular column as a vector (rather than as a data frame), we can do the following (make sure to keep track of the quotes)
babynames[5:20, ]$name
## [1] "Minnie" "Margaret" "Ida" "Alice" "Bertha" "Sarah"
## [7] "Annie" "Clara" "Ella" "Florence" "Cora" "Martha"
## [13] "Laura" "Nellie" "Grace" "Carrie"
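The same indexing ideas work on any data frame, not just babynames. As a sketch that does not need the babynames package, we can put the salary data from earlier into a small data frame built with data.frame(). The name doctors is made up for this example. A data frame built this way prints a little differently from the tibble output shown above.

```r
doctors <- data.frame(
  spec  = c("family doc", "cardiologist", "orthopedic", "dermatologist", "psychiatrist"),
  offer = c(241, 590, 533, 425, 261),
  stringsAsFactors = FALSE   # keep spec as characters
)

doctors[2, ]                         # all of row 2
doctors[2:3, c("spec", "offer")]     # rows 2 through 3, two named columns
doctors$offer                        # one column, as a vector
## [1] 241 590 533 425 261

# Logical indexing on rows, then take the spec column as a vector
doctors[doctors$offer > 500, ]$spec
## [1] "cardiologist" "orthopedic"
```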
Let’s now write code that finds the most common name in 1999:
babies.1999 <- babynames[babynames$year == 1999, ]
max.name.count <- max(babies.1999$n)
babies.1999[max.name.count == babies.1999$n, "name"]
## # A tibble: 1 x 1
## name
## <chr>
## 1 Jacob
And now, let’s write a function to find the most common name for a given year and sex:
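MostCommonName <- function(babynames, year, sex){
  # Keep only the rows for the requested year and sex
  babies.match <- babynames[babynames$year == year & babynames$sex == sex, ]
  # Largest count among those rows
  max.name.count <- max(babies.match$n)
  # Return the name(s) whose count equals the largest count
  return(babies.match[max.name.count == babies.match$n, "name"])
}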
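Here is a sketch of how the function might be called. To keep it runnable without the babynames package, it uses a tiny made-up table called toy with the same column names; the names and counts in it are invented for illustration and are not real data. The function definition is repeated so that the example runs on its own.

```r
# Invented data with the same columns as babynames (prop omitted)
toy <- data.frame(
  year = c(2000, 2000, 2000, 2000, 2001),
  sex  = c("F", "F", "M", "M", "F"),
  name = c("Ava", "Mia", "Leo", "Max", "Ava"),
  n    = c(30, 50, 40, 20, 10),
  stringsAsFactors = FALSE
)

MostCommonName <- function(babynames, year, sex){
  babies.match <- babynames[babynames$year == year & babynames$sex == sex, ]
  max.name.count <- max(babies.match$n)
  return(babies.match[max.name.count == babies.match$n, "name"])
}

MostCommonName(toy, 2000, "F")
## [1] "Mia"
MostCommonName(toy, 2000, "M")
## [1] "Leo"
```

With the real data, the call would look like MostCommonName(babynames, 1999, "M"). Because babynames is a tibble, the result there is printed as a one-column table, as in the 1999 example, rather than as a plain character vector.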
Comments
Comments are meant to help humans understand R code. Anything that follows the pound sign # on a line is ignored by R. You can use comments to explain your code (to your partner, preceptor, or your future self), or to temporarily disable parts of your code.
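A short sketch of both uses, reusing the offer vector from earlier:

```r
# Average offers, in thousands of dollars per year (an explanatory comment)
offer <- c(241, 590, 533, 425, 261)

largest <- max(offer)   # a comment can also follow code on the same line
# smallest <- min(offer)   (this line is disabled: R skips it entirely)

largest
## [1] 590
```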