Week 3 Lecture 2

Named arguments

We’ll be working with the minus function, defined below:

minus <- function(a, b){
  a - b
}

Up until now, we have mostly been calling functions using positional arguments, which are matched to the parameters by their order:

minus(5, 2)

## [1] 3

It is possible to use named arguments when calling the function:

minus(a = 5, b = 2)

## [1] 3

When using named arguments, we can list them in any order:

minus(b = 2, a = 5)

## [1] 3

(Note: this is one of the cases in which we use a single = sign, rather than <- for assignment or == for comparison.)
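Named and unnamed arguments can also be mixed in one call. The short sketch below redefines minus so that it runs on its own. R matches the named arguments first, and the remaining unnamed values then fill the parameters that are still unmatched, in order.

```r
minus <- function(a, b) {
  a - b
}

# b is matched by name; the unnamed 5 fills the first unmatched parameter, a.
mixed <- minus(b = 2, 5)
mixed
## [1] 3

# Swapping the names changes the result, because a and b play different roles.
swapped <- minus(a = 2, b = 5)
swapped
## [1] -3
```

Mixing the two styles works, but naming every argument is usually easier to read.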

Review of `sapply`

sapply is a way to apply a function to every element of a vector. For example, suppose we want to apply my.abs to every element of the vector vec:

my.abs <- function(x){
  if(x >= 0){
    x
  } else{
    -x
  }
}

vec <- c(5, -6, 7, -8)
sapply(X = vec, FUN = my.abs)

## [1] 5 6 7 8
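As a rough sketch of what sapply does in this case, here is the same computation written as an explicit loop. The names result and i are our own choices for illustration; sapply itself is implemented differently, but it produces the same values here.

```r
my.abs <- function(x) {
  if (x >= 0) {
    x
  } else {
    -x
  }
}

vec <- c(5, -6, 7, -8)

# Call my.abs once per element and collect the results in a new vector.
result <- numeric(length(vec))
for (i in seq_along(vec)) {
  result[i] <- my.abs(vec[i])
}
result
## [1] 5 6 7 8
```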

Note that we used named arguments when calling sapply above. (Strictly speaking, we didn’t have to: sapply(vec, my.abs) works as well. But using named arguments makes things clearer in this case, as you’ll see in a moment.)

Suppose we want to compute elem - 2 for every element elem in the vector vec using the function minus, by calling sapply. We can do this as follows:

sapply(X = vec, FUN = minus, b = 2)

## [1]   3  -8   5 -10

Here, b = 2 is passed along to minus on every call, so the 2 is always sent into b. But note that X = vec is special: sapply passes each element of vec to minus as the first unnamed argument. Since b is already matched by name, every element of vec is sent to a.
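To see that it is the naming that decides where the elements go, here is a small sketch in which we name a instead of b. Each element of vec then fills b, so the call computes 2 - elem. The anonymous function in the second call (with a parameter we chose to call elem) spells out the original computation explicitly.

```r
minus <- function(a, b) {
  a - b
}

vec <- c(5, -6, 7, -8)

# a is fixed at 2 by name, so each element of vec is sent to b: 2 - elem.
two.minus.elem <- sapply(X = vec, FUN = minus, a = 2)
two.minus.elem
## [1] -3  8 -5 10

# The same computation as sapply(X = vec, FUN = minus, b = 2), written out explicitly.
elem.minus.two <- sapply(X = vec, FUN = function(elem) minus(a = elem, b = 2))
elem.minus.two
## [1]   3  -8   5 -10
```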

The `n` function

The n function can be used inside of summarize, mutate and filter. Its job is to count the number of rows in each group (or in the whole data frame, if the data frame is not grouped).

The total number of entries in the data frame:

library(dplyr)
library(babynames)
babynames %>% rename(num = n) %>% 
              summarize(total.entries = n())

## # A tibble: 1 x 1
##   total.entries
##           <int>
## 1       1924665

This is not quite the same as the number of distinct name-year combinations, because the same name can appear twice in the same year, once for each sex:

babynames %>% summarize(name.year.count = n_distinct(year, name))

## # A tibble: 1 x 1
##   name.year.count
##             <int>
## 1         1756284

But it is the same as the number of distinct year-name-sex combinations:

babynames %>% summarize(name.year.sex.count = n_distinct(year, name, sex))

## # A tibble: 1 x 1
##   name.year.sex.count
##                 <int>
## 1             1924665
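The difference between the counts is easier to see on a tiny example. The data frame toy below is made up for illustration (it is not taken from babynames): the name "Jordan" appears for both sexes in 2000, so there are three rows but only two distinct name-year combinations.

```r
library(dplyr)

# Hypothetical toy data, made up for this example.
toy <- tibble(
  year = c(2000, 2000, 2001),
  sex  = c("F", "M", "F"),
  name = c("Jordan", "Jordan", "Jordan"),
  num  = c(10, 20, 15)
)

counts <- toy %>%
  summarize(rows = n(),
            name.year = n_distinct(year, name),
            name.year.sex = n_distinct(year, name, sex))

# rows = 3, name.year = 2, name.year.sex = 3
counts
```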

Total number of names given to more than 250,000 babies in the years after 2000 (the filter year > 2000 keeps 2001 onward):

babynames %>% rename(num = n) %>% 
              filter(year > 2000) %>% 
              group_by(name) %>% 
              summarize(total.babies = sum(num)) %>% 
              filter(total.babies > 250000) %>% 
              summarize(num.names = n())

## # A tibble: 1 x 1
##   num.names
##       <int>
## 1        21
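The examples above call n() on ungrouped data, so it counts all the rows. As a sketch of how n() behaves after group_by, here is a small made-up data frame (toy and its columns are hypothetical) used with summarize, mutate and filter.

```r
library(dplyr)

# Hypothetical toy data, made up for this example.
toy <- tibble(
  name = c("Ann", "Ann", "Ann", "Bob"),
  year = c(2001, 2002, 2003, 2001)
)

# summarize: one row per group, with the number of rows in that group (Ann: 3, Bob: 1).
per.name <- toy %>% group_by(name) %>% summarize(entries = n())

# mutate: every row keeps its place and gets the size of its own group.
with.size <- toy %>% group_by(name) %>% mutate(entries = n())

# filter: keep only the rows that belong to groups with more than one row (the three Ann rows).
frequent <- toy %>% group_by(name) %>% filter(n() > 1)
```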

The OKCupid data reprise

(See Week 3 Lecture 1)

A first look at DataViz with `ggplot`

Note: this part of the lecture is adapted from Kieran Healy’s Data Visualization Ch. 3.

Let’s plot the life expectancy vs. the GDP per capita. (Note: the standard terminology is “y vs. x” – the variable that goes on the y-axis is first.)

library(ggplot2)
library(gapminder)
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + 
       geom_point()

We could plot GDP per capita vs. life expectancy as well. Generally, we think of the variable plotted on the y-axis as something we might (possibly) predict, and the variable plotted on the x-axis as something we could (possibly) manipulate. Things are not super clear-cut in this case, but you could imagine a quasi-manipulation of the GDP per capita – think of a small country discovering a large oil field. (That actually happened!)

The mapping argument specifies the aesthetic mapping: for example, we say that the life expectancy is mapped to the y-axis.

ggplot allows us to add layers using +. First, let’s see how we can display the data as a smooth curve.

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + 
  geom_smooth(method = "loess", se = F)

(se = F, short for se = FALSE, means we are not displaying the “ribbon” that indicates the uncertainty about the best way to draw the curve.)

Now, let’s use + to display both the scatterplot layer and the smooth curve layer:

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() + 
  geom_smooth(method = "loess", se = F)

We can also display the ribbon:

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() + 
  geom_smooth(method = "loess")

(Note that we didn’t write se = T. T is the default value of se.)
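Because layers are combined with +, a plot can also be stored in a variable and extended step by step. The sketch below assumes the ggplot2 and gapminder packages are installed; the variable names p, p.points and p.both are our own.

```r
library(ggplot2)
library(gapminder)

# The base plot: data and aesthetic mapping, but no layers yet.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))

# Each + adds one layer on top of the previous ones.
p.points <- p + geom_point()
p.both <- p.points + geom_smooth(method = "loess", se = FALSE)

# Printing the object draws the plot.
p.both
```

Storing the base plot this way avoids retyping the data and mapping when trying out different layers.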

loess is one kind of smooth curve. Another kind is a straight line:

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() + 
  geom_smooth(method = "lm")

The best straight line doesn’t seem very good!

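We might decide to display the data by plotting x on a log scale. On a log scale, equal distances along the axis correspond to equal ratios of x, so the x scale gets more “compressed” for larger x’s. The effect on the plot is that the crowded region at small x’s gets “stretched” out, while the sparse region at large x’s gets squeezed together.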

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() + 
  geom_smooth(method = "lm") + 
  scale_x_log10()
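A quick numerical sketch of what the log scale does to the spacing. The values in gdp are round numbers chosen for illustration, not taken from gapminder.

```r
# Illustrative values, each 10 times larger than the previous one.
gdp <- c(100, 1000, 10000, 100000)

# On the original scale, the gaps between neighbours grow quickly.
gaps.raw <- diff(gdp)
gaps.raw
## [1]   900  9000 90000

# On the log10 scale, the values are evenly spaced.
log10(gdp)
## [1] 2 3 4 5
gaps.log <- diff(log10(gdp))
gaps.log
## [1] 1 1 1
```

So a step from 100 to 1,000 takes up as much room on the log axis as a step from 10,000 to 100,000.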

We can make the plot nicer by specifying that the units of gdpPercap are dollars:

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() + 
  geom_smooth(method = "lm") + 
  scale_x_log10(labels = scales::dollar)
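The labels argument takes a function that turns the numeric axis breaks into text. To get a feel for what scales::dollar does, we can call it directly on a few numbers (assuming the scales package is installed; the input values are our own).

```r
library(scales)

# dollar() formats numbers as dollar amounts with thousands separators.
formatted <- dollar(c(1000, 10000, 100000))
formatted
## [1] "$1,000"   "$10,000"  "$100,000"
```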

Note that we could have instead plotted lifeExp vs. log10(gdpPercap). This would make labelling the axes more annoying though, because the x-axis would show the log10 values rather than dollar amounts:

ggplot(data = gapminder, mapping = aes(x = log10(gdpPercap), y = lifeExp)) + 
  geom_point() + 
  geom_smooth(method = "lm")

We can also set a fixed color for each layer:

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + 
    geom_point(color = "darkorange") +
    geom_smooth(method = "loess", color = "black") +
    scale_x_log10()

Here is how to label the axes, title, etc. using labs, and how to make the points more transparent using alpha = 0.3:

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  scale_x_log10(labels = scales::dollar) +
  labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
         title = "Economic Growth and Life Expectancy",
         subtitle = "Data points are country-years",
         caption = "Source: Gapminder.")

We can also transform the design of the plot using various pre-made themes:

library(ggthemes)
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_x_log10(labels = scales::dollar) + 
  theme_economist() +
  labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
         title = "Economic Growth and Life Expectancy",
         subtitle = "Data points are country-years",
         caption = "Source: Gapminder.")

Using color to display variables

So far, we only mapped variables to either the y or the x axis. But what if we want to display more than two variables on the same plot? One possibility is to map the third variable to color. For example, we might like to display the data for 1982, and to indicate which point came from which continent:

library(ggthemes)

gap.1982 <- gapminder %>% filter(year == 1982)
ggplot(data = gap.1982, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() + 
  scale_x_log10(labels = scales::dollar) +
  theme_economist() + 
  scale_color_economist()
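It is worth contrasting mapping a variable to color with setting a fixed color. The sketch below assumes the ggplot2, dplyr and gapminder packages are installed; the names p.mapped and p.set are our own.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

gap.1982 <- gapminder %>% filter(year == 1982)

# Mapping: color is inside aes(), so it varies with the continent column
# and ggplot draws a legend.
p.mapped <- ggplot(data = gap.1982,
                   mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10(labels = scales::dollar)

# Setting: color is outside aes(), so every point gets the same fixed color
# and there is no legend.
p.set <- ggplot(data = gap.1982,
                mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "darkorange") +
  scale_x_log10(labels = scales::dollar)
```

The rule of thumb: a data column goes inside aes(), while a fixed value like "darkorange" goes outside it, as an argument to the geom.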