SML201 Week 1 Lecture 2

Comments

Comments are meant to help humans understand R code. Anything that follows the pound sign # on a line is ignored by R. You can use comments to explain your code (to your partner, preceptor, or your future self), or to temporarily disable parts of your code.

cat("hi") # This part will be ignored by R

## hi

# You can write comments here too
cat("Things are happening again")

## Things are happening again
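
As a small sketch of the "temporarily disable" use mentioned above (the variable x is made up for illustration), putting # in front of a line keeps R from running it:

```r
x <- 10
# x <- x + 1   # disabled: R ignores this whole line, so x is not changed
cat(x)         # prints 10
```

Deleting the leading # turns the line back on.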

Syntax

“Syntax” is the set of rules according to which R statements must be constructed. For example, as we saw in the previous lecture, to construct an if-statement, you must write things as

if(<CONDITION1>){
  <STATEMENT1>
  <STATEMENT2>
  ...
}else if(<CONDITION2>){
  <STATEMENT3>
  <STATEMENT4>
  ...
}else if(<CONDITION3>){
  <STATEMENT5>
  <STATEMENT6>
  ...
}...else{
  <STATEMENT7>
  ...
}

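The parentheses around each condition are mandatory, and the curly braces are needed whenever a branch contains more than one statement (it is a good habit to always write them). Violating syntax rules usually results in R producing an error.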
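
For instance, a concrete if-statement that follows the template might look like the sketch below. The names offer.value and label and the cutoffs are made up for illustration.

```r
offer.value <- 425          # hypothetical value to classify

if(offer.value > 500){
  label <- "high"
}else if(offer.value > 300){
  label <- "medium"         # this branch runs, since 425 is not > 500 but is > 300
}else{
  label <- "low"
}

cat(label)                  # prints medium
```

Only the first branch whose condition is TRUE is run; the rest are skipped.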

(Syntax is distinct from semantics, the rules that determine the meaning of R statements. This is analogous to how the terms are used for human languages.)

Vectors

Here is an example of a vector

offer <- c(241, 590, 533, 425, 261)

A vector is a sequence of values of the same type (you can also have vectors of characters). Here is how you can access elements of a vector:

offer[1]

## [1] 241

offer[4]

## [1] 425

You can find the length of a vector (i.e., the number of elements in it) like so:

length(offer)

## [1] 5
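
The same operations work on vectors of characters. Here is a short sketch (the vector fruits is made up for illustration). It also shows what happens if we try to mix types: R converts everything to a single type.

```r
fruits <- c("apple", "banana", "cherry")  # a vector of characters
fruits[2]        # "banana"
length(fruits)   # 3

mixed <- c(1, "a")  # the number 1 is converted to the string "1"
class(mixed)        # "character"
```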

Aside: technically, R has no separate type for a single number. A value like 42 is just a vector of length 1.

a <- 42
a[1]

## [1] 42

You can even do the following, if you insist:

42[1]

## [1] 42

Here are some things we can do with vectors:

sort(offer) # Compute a sorted version of the vector, in increasing order

## [1] 241 261 425 533 590

unique(c(1, 2, 1, 4, 5, 2)) # Get a vector with every element of the input appearing once

## [1] 1 2 4 5

max(offer)

## [1] 590

min(offer)

## [1] 241

Here is how we can compute the max without using the max function:

sort(offer)[length(offer)]

## [1] 590

(Explain this!)
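
One way to check your explanation is to break the expression into steps. This is just a sketch; the names sorted.offer and n are made up for illustration.

```r
offer <- c(241, 590, 533, 425, 261)

sorted.offer <- sort(offer)   # 241 261 425 533 590 (increasing order)
n <- length(offer)            # 5, the position of the last element
sorted.offer[n]               # 590, the last (and so the largest) element

# An equivalent approach: sort in decreasing order and take the first element
sort(offer, decreasing = TRUE)[1]   # 590
```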

We can also perform operations on all the elements of a vector.

offer > 533

## [1] FALSE  TRUE FALSE FALSE FALSE

offer == 533

## [1] FALSE FALSE  TRUE FALSE FALSE
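
Comparisons are not the only operations that apply to every element. As a sketch, arithmetic works the same way (the names doubled and above.min are made up for illustration):

```r
offer <- c(241, 590, 533, 425, 261)

doubled <- offer * 2             # 482 1180 1066 850 522
above.min <- offer - min(offer)  # 0 349 292 184 20
```

In each case the result is a new vector of the same length as offer; offer itself is not changed.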

Indexing vectors with logical vectors

A powerful way of extracting elements from a vector is by indexing the vector using logical vectors. Here is how we can get the second and third elements of the vector (T and F are shorthand for TRUE and FALSE):

offer[c(F, T, T, F, F)]

## [1] 590 533

Of course, we could compute the logical vector above using

offer > 500

## [1] FALSE  TRUE  TRUE FALSE FALSE

Combining those, we can go

offer[offer > 500]

## [1] 590 533

Note what we did: we extracted the elements that are larger than 500.
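
Logical vectors are also handy for counting. The sketch below relies on the fact that R treats TRUE as 1 and FALSE as 0 in arithmetic; the names n.large and frac.large are made up for illustration.

```r
offer <- c(241, 590, 533, 425, 261)

n.large <- sum(offer > 500)      # 2: how many elements are larger than 500
frac.large <- mean(offer > 500)  # 0.4: the fraction of elements larger than 500
any(offer > 600)                 # FALSE: no element is larger than 600
```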

Operating on logical values

Suppose we wanted something fancier: extracting all elements between 400 and 550. In other words, suppose we want the elements that are both larger than 400 and smaller than 550. To achieve this, we would want to combine the conditions “larger than 400” and “smaller than 550”.

We can do this using logical operators

AND:    a & b. TRUE only if a is TRUE and b is TRUE. FALSE otherwise
OR:     a | b. TRUE if at least one of a or b is TRUE, FALSE otherwise
NOT:    !a. TRUE if a is FALSE, FALSE if a is TRUE

Some examples:

pie <- TRUE
icecream <- FALSE
pie | icecream

## [1] TRUE

pie <- FALSE
icecream <- FALSE
pie | icecream

## [1] FALSE

pie <- TRUE
icecream <- FALSE
pie & icecream

## [1] FALSE

pie <- TRUE
icecream <- TRUE
pie | icecream

## [1] TRUE

Note: this is not quite how it works in English. If I say I will have pie or ice cream, and then have both, that means what I said wasn’t true. But for R, the expression pie | icecream is TRUE when both are TRUE. Technically, | is called “inclusive OR” (as opposed to the “exclusive OR” we usually mean in English).

pie <- FALSE
icecream <- TRUE
pie | icecream

## [1] TRUE

pie <- FALSE
icecream <- FALSE
pie | icecream

## [1] FALSE

pie <- TRUE
!pie

## [1] FALSE
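
The operators above also work element by element on logical vectors, which lets us see all four combinations at once. The sketch below also uses base R's xor() function, which computes the exclusive OR mentioned in the note; the vectors a and b are made up for illustration.

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

and.result <- a & b      # TRUE FALSE FALSE FALSE
or.result  <- a | b      # TRUE TRUE TRUE FALSE
not.result <- !a         # FALSE FALSE TRUE TRUE
xor.result <- xor(a, b)  # FALSE TRUE TRUE FALSE (TRUE when exactly one is TRUE)
```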

So with all that said, we can go back to our original task:

offer[offer > 400 & offer < 550]

## [1] 533 425

Now, how could we get the other elements? Here is one way:

offer[!(offer > 400 & offer < 550)]

## [1] 241 590 261

and here is another:

offer[offer <= 400 | offer >= 550]

## [1] 241 590 261

Why do both make sense? Why do we use >= and not > in the last one?
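
Before answering, we can at least confirm that the two conditions agree on our data. This is a sketch using base R's identical(); the names inside, way1 and way2 are made up for illustration.

```r
offer <- c(241, 590, 533, 425, 261)

inside <- offer > 400 & offer < 550   # FALSE FALSE TRUE TRUE FALSE
way1 <- !inside                       # negate the combined condition
way2 <- offer <= 400 | offer >= 550   # negate each part and switch & to |

identical(way1, way2)                 # TRUE
```

Note that this only checks the five values in offer. Try a vector that contains exactly 400 or 550 to see what goes wrong if > is used in place of >=.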

Parallel vectors

Let’s go back to offer. Actually, those were the average offers to various specialties of doctors:

offer <- c(241, 590, 533, 425, 261)
spec <- c("family doc", "cardiologist", "orthopedic", "dermatologist", "psychiatrist")

So, for example, the average offer to cardiologists was $590k per year in 2018 (the values in offer are in thousands of dollars). Suppose we want to automatically figure out which specialty makes the most money.

So, if we knew it were specialty number 2, we’d go spec[2]. But how do we figure out the 2?

Let’s try another tack. We’d like to compute spec[c(F, T, F, F, F)]. Can we compute the logical vector? Here’s an idea:

offer == max(offer)

## [1] FALSE  TRUE FALSE FALSE FALSE

Combining those ideas, we’ll get

spec[offer == max(offer)]

## [1] "cardiologist"

So how do we get the 2? We can first generate the vector

1:length(offer)  # generate c(1, 2, 3, ..., length(offer))

## [1] 1 2 3 4 5

And now, we can simply go

(1:length(offer))[offer == max(offer)]

## [1] 2

The following is really unnecessarily complicated, but we could have

spec[(1:length(offer))[offer == max(offer)]]

## [1] "cardiologist"
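
As an aside, base R has built-in functions that do this lookup directly: which() returns the positions at which a logical vector is TRUE, and which.max() returns the position of the (first) largest element. A sketch, with best.index as a made-up name:

```r
offer <- c(241, 590, 533, 425, 261)
spec <- c("family doc", "cardiologist", "orthopedic", "dermatologist", "psychiatrist")

which(offer == max(offer))       # 2, same as (1:length(offer))[offer == max(offer)]
best.index <- which.max(offer)   # 2
spec[best.index]                 # "cardiologist"
```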

Data frames

Data frames are R’s way of storing tables (note that the salary data we had was also actually a table). Let’s load a data frame (you must previously have successfully run install.packages("babynames")).

library(babynames) # Load the babynames package, which provides the data frame babynames

Here are the first several rows of the data frame

head(babynames) # Display the first 6 rows of a data frame

## # A tibble: 6 x 5
##    year sex   name          n   prop
##   <dbl> <chr> <chr>     <int>  <dbl>
## 1  1880 F     Mary       7065 0.0724
## 2  1880 F     Anna       2604 0.0267
## 3  1880 F     Emma       2003 0.0205
## 4  1880 F     Elizabeth  1939 0.0199
## 5  1880 F     Minnie     1746 0.0179
## 6  1880 F     Margaret   1578 0.0162

We can access, for example, row 2 and column “year” of the table like so:

babynames[2, "year"]

## # A tibble: 1 x 1
##    year
##   <dbl>
## 1  1880

If we want to access all of row 2, we can leave the column part empty (but keep the comma):

babynames[2, ]

## # A tibble: 1 x 5
##    year sex   name      n   prop
##   <dbl> <chr> <chr> <int>  <dbl>
## 1  1880 F     Anna   2604 0.0267

We can access rows 2 through 6 like so:

babynames[2:6, ]

## # A tibble: 5 x 5
##    year sex   name          n   prop
##   <dbl> <chr> <chr>     <int>  <dbl>
## 1  1880 F     Anna       2604 0.0267
## 2  1880 F     Emma       2003 0.0205
## 3  1880 F     Elizabeth  1939 0.0199
## 4  1880 F     Minnie     1746 0.0179
## 5  1880 F     Margaret   1578 0.0162

We can take only the columns "n" and "year", like this:

babynames[2:6, c("n", "year")]

## # A tibble: 5 x 2
##       n  year
##   <int> <dbl>
## 1  2604  1880
## 2  2003  1880
## 3  1939  1880
## 4  1746  1880
## 5  1578  1880

Finally, if we want a particular column as a vector (rather than as a data frame), we can do the following (keep track of the quotes: the column name after $ is written without them)

babynames[5:20, ]$name

##  [1] "Minnie"   "Margaret" "Ida"      "Alice"    "Bertha"   "Sarah"   
##  [7] "Annie"    "Clara"    "Ella"     "Florence" "Cora"     "Martha"  
## [13] "Laura"    "Nellie"   "Grace"    "Carrie"
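
To practice the indexing operations above without installing anything, we can build a small data frame from the salary vectors. This is only a sketch: the name salaries is made up for illustration, and data.frame() creates a plain base-R data frame rather than a tibble like babynames. One visible difference is that selecting a single column with [ , ] from a plain data frame gives back a vector.

```r
offer <- c(241, 590, 533, 425, 261)
spec <- c("family doc", "cardiologist", "orthopedic", "dermatologist", "psychiatrist")

# Each vector becomes a column; row i holds spec[i] and offer[i]
salaries <- data.frame(spec = spec, offer = offer, stringsAsFactors = FALSE)

salaries[2, "offer"]                     # 590
salaries[2, ]                            # all of row 2
salaries[2:4, c("spec", "offer")]        # rows 2 through 4, both columns
salaries$spec                            # the spec column as a character vector
salaries[salaries$offer > 500, "spec"]   # "cardiologist" "orthopedic"
```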

Working with data frames

Let’s now write code that finds the most common name in 1999:

babies.1999 <- babynames[babynames$year == 1999, ]
max.name.count <- max(babies.1999$n)
babies.1999[max.name.count == babies.1999$n, "name"]

## # A tibble: 1 x 1
##   name 
##   <chr>
## 1 Jacob

And now, let’s write a function to find the most common name in a given year, with a given sex:

MostCommonName <- function(babynames, year, sex){
  babies.match <- babynames[babynames$year == year & babynames$sex == sex, ]
  max.name.count <- max(babies.match$n)
  return(babies.match[max.name.count == babies.match$n, "name"])
}
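
Here is a sketch of how the function could be called. To keep the example self-contained, it repeats the function and runs it on a tiny made-up table, toy.names, whose counts are invented for illustration (they are not the real babynames numbers). With the real data, the call would be MostCommonName(babynames, 1999, "M").

```r
MostCommonName <- function(babynames, year, sex){
  # Keep only the rows for the requested year and sex
  babies.match <- babynames[babynames$year == year & babynames$sex == sex, ]
  # Find the largest count, then return the name(s) with that count
  max.name.count <- max(babies.match$n)
  return(babies.match[max.name.count == babies.match$n, "name"])
}

# Hypothetical toy data with the same column names as babynames
toy.names <- data.frame(
  year = c(1999, 1999, 1999, 1999, 2000, 2000),
  sex  = c("M", "M", "F", "F", "M", "F"),
  name = c("Jacob", "Michael", "Emily", "Hannah", "Jacob", "Emily"),
  n    = c(300, 250, 280, 200, 310, 270),
  stringsAsFactors = FALSE
)

top.boy  <- MostCommonName(toy.names, 1999, "M")   # "Jacob"
top.girl <- MostCommonName(toy.names, 1999, "F")   # "Emily"
```

Because the function indexes with a logical vector, it returns every name tied for the largest count, not just one of them.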