A first look at DataViz with ggplot

Note: this part of the lecture is adapted from Kieran Healy’s Data Visualization Ch. 3.

Let’s plot the life expectancy vs. the GDP per capita. (Note: the standard terminology is “y vs. x” – the variable that goes on the y-axis is first.)

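library(ggplot2)    # provides ggplot(), aes(), and the geoms
library(gapminder)  # provides the gapminder data set
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()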

We could plot GDP per capita vs. life expectancy as well. Generally, we think of the variable plotted on the y-axis as something we might (possibly) predict, and the variable plotted on the x-axis as something we could (possibly) manipulate. Things are not super clear-cut in this case, but you could imagine a quasi-manipulation of the GDP per capita – think of a small country discovering a large oil field. (That actually happened!)

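The mapping argument specifies the aesthetic mapping, that is, which variable is displayed through which visual property of the plot. Here, for example, we map life expectancy to the y-axis and GDP per capita to the x-axis.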
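To make the idea of a mapping concrete, here is a minimal sketch. The names m and p are only for illustration. It shows that aes() by itself merely records which variable goes with which aesthetic, and that nothing is drawn until a geom layer is added.

```r
library(ggplot2)
library(gapminder)

# aes() only records variable-to-aesthetic pairings; it draws nothing.
m <- aes(x = gdpPercap, y = lifeExp)
names(m)  # the aesthetics that have been mapped: "x" and "y"

# A plot object with data and a mapping but no layers: printing it
# shows empty axes for gdpPercap and lifeExp.
p <- ggplot(data = gapminder, mapping = m)
p

# Adding a geom layer is what actually displays the data.
p + geom_point()
```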

ggplot allows us to add layers using +. First, let’s see how we can display the data as a smooth curve.

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + 
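  geom_smooth(method = "loess", se = FALSE)

(se = FALSE means we are not displaying the “ribbon” that indicates the uncertainty about the best way to draw the curve. F is shorthand for FALSE, but it is an ordinary variable that can be reassigned, so spelling out FALSE is safer.)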

Now, let’s use + to display both the scatterplot layer and the smooth curve layer:

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() + 
  geom_smooth(method = "loess", se = FALSE)

We can also display the ribbon:

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() + 
  geom_smooth(method = "loess") 

(Note that we didn’t write se = TRUE: TRUE is the default value of se.)
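One way to check a default value yourself, sketched here with base R's formals(), is to look at the function's signature directly; the help page ?geom_smooth gives the same information.

```r
library(ggplot2)

# Look up the default value of the se argument in geom_smooth()'s signature.
formals(geom_smooth)$se
```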

loess is one kind of smooth curve. Another kind is a straight line:

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() + 
  geom_smooth(method = "lm") 

The best straight line doesn’t seem very good!

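We might decide to display the data by plotting x on a log scale: equal distances along the axis then correspond to equal ratios (for example, each tenfold increase in x) rather than equal differences. That means that the x scale gets more “compressed” for larger x’s. The effect on the plot is that the region of small x’s, where most of the points are crowded together, gets “stretched” out.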

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() + 
  geom_smooth(method = "lm") + 
  scale_x_log10()
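The arithmetic behind the log scale can be checked directly. The sketch below uses made-up GDP values (gdp is an illustrative name) to show that each tenfold increase takes up the same distance on a log10 axis. It then compares straight-line fits on the raw and the log10 scale using R². Since ggplot applies the scale transformation before computing the smooth, the line in the plot above should correspond to the second fit.

```r
library(gapminder)

# Each tenfold increase in GDP per capita is one unit on a log10 axis.
gdp <- c(100, 1000, 10000, 100000)  # illustrative values, not from the data
log10(gdp)
diff(log10(gdp))

# Compare straight-line fits on the raw and on the log10 scale.
fit_raw <- lm(lifeExp ~ gdpPercap, data = gapminder)
fit_log <- lm(lifeExp ~ log10(gdpPercap), data = gapminder)
summary(fit_raw)$r.squared
summary(fit_log)$r.squared
```

A higher R² for the second fit would agree with the visual impression that the straight line describes the data better once x is on a log scale.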

We can make the plot nicer by specifying that the units of gdpPercap are dollars:

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() + 
  geom_smooth(method = "lm") + 
  scale_x_log10(labels = scales::dollar)

Note that we could have instead plotted lifeExp vs log10(gdpPercap). This would make labelling the axes more annoying though:

ggplot(data = gapminder, mapping = aes(x = log10(gdpPercap), y = lifeExp)) + 
  geom_point() + 
  geom_smooth(method = "lm")
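To see why the labelling is more annoying, here is a sketch of what it might take to get dollar labels back. With log10() inside aes(), the axis shows exponents, so the labels have to be rebuilt by hand from powers of ten. The break positions in exponents are an illustrative choice, not something from the original lecture.

```r
library(ggplot2)
library(gapminder)

# The axis is in units of log10(dollars), so 3, 4, 5 mean $1,000 to $100,000.
exponents <- 3:5  # illustrative break positions
ggplot(data = gapminder, mapping = aes(x = log10(gdpPercap), y = lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_x_continuous(breaks = exponents,
                     labels = scales::dollar(10^exponents)) +
  labs(x = "GDP per capita (log scale)")
```

With scale_x_log10(labels = scales::dollar), the breaks and labels are handled for us.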

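We can also set fixed visual properties such as color. These go in the geom call as arguments outside aes(), since they are not mapped to a variable: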

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + 
  geom_point(color = "darkorange") +
  geom_smooth(method = "loess", color = "black") +
  scale_x_log10()
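It matters whether color is given outside or inside aes(). The sketch below contrasts the two (p_set and p_map are illustrative names). Outside aes(), the color is a fixed value. Inside aes(), the string is treated as data, so ggplot chooses its own color and adds a legend.

```r
library(ggplot2)
library(gapminder)

# Setting: color is a fixed value outside aes(), so every point is orange.
p_set <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "darkorange") +
  scale_x_log10()
p_set

# Mapping a constant by mistake: inside aes(), "darkorange" is treated as a
# one-level variable, so the points get a default color plus a legend.
p_map <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(mapping = aes(color = "darkorange")) +
  scale_x_log10()
p_map
```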

Here is how to label the axes, add a title, etc. using labs(), and how to make the points more transparent using alpha = 0.3:

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  scale_x_log10(labels = scales::dollar) +
  labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data points are country-years",
       caption = "Source: Gapminder.")

We can also change the overall look of the plot using pre-made themes, here theme_economist() from the ggthemes package:

library(ggthemes)
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red") +
  scale_x_log10(labels = scales::dollar) +
  theme_economist() +
  labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data points are country-years",
       caption = "Source: Gapminder.")
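Themes do not require an extra package. As a small sketch, ggplot2 itself ships with several, such as theme_minimal(), and they are added to the plot in the same way:

```r
library(ggplot2)
library(gapminder)

# theme_minimal() comes with ggplot2, so ggthemes is not needed here.
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  scale_x_log10(labels = scales::dollar) +
  theme_minimal()
```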

Using color to display variables

So far, we have only mapped variables to either the x- or the y-axis. But what if we want to display more than two variables on the same plot? One possibility is to map the third variable to color. For example, we might like to display the data for 1982, and to indicate which continent each point comes from:

library(dplyr)  # provides filter() and the %>% pipe
gap.1982 <- gapminder %>% filter(year == 1982)
ggplot(data = gap.1982, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() + 
  scale_x_log10(labels = scales::dollar)
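Where the color mapping is placed affects the other layers. In this sketch of one possible variation, color is mapped only inside geom_point(). The points are then colored by continent, while a geom_smooth() layer still fits a single line to all of the 1982 data. With color = continent in the ggplot() call, the smoother would inherit the mapping and draw one line per continent.

```r
library(ggplot2)
library(gapminder)
library(dplyr)

gap.1982 <- gapminder %>% filter(year == 1982)

# color is mapped only for the points; the smoother sees no grouping
# and therefore fits one line to all countries.
ggplot(data = gap.1982, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(mapping = aes(color = continent)) +
  geom_smooth(method = "lm", color = "black") +
  scale_x_log10(labels = scales::dollar)
```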