We’ll be working with the minus function, defined below:
minus <- function(a, b){
  a - b
}
Up until now, we have mostly been calling functions using positional arguments:
minus(5, 2)
## [1] 3
It is possible to use named arguments when calling the function:
minus(a = 5, b = 2)
## [1] 3
When using named arguments, we can list them in any order:
minus(b = 2, a = 5)
## [1] 3
(Note: this is one of the cases in which we use a single = sign.)
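Positional and named arguments can also be mixed in one call. A minimal sketch, reusing minus from above (the names mixed and swapped are introduced here only for illustration): R first matches the named arguments, then fills the remaining parameters with the unnamed values in order.

```r
minus <- function(a, b){
  a - b
}

# b is matched by name, so the unnamed 5 fills the remaining parameter a
mixed <- minus(b = 2, 5)
mixed
## [1] 3

# With no names given, the order of the values decides: a = 2, b = 5
swapped <- minus(2, 5)
swapped
## [1] -3
```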
sapply
sapply is a way to apply a function to every element of a vector. For example, suppose we want to apply my.abs to every element of the vector vec:
my.abs <- function(x){
  if(x >= 0){
    x
  } else{
    -x
  }
}
vec <- c(5, -6, 7, -8)
sapply(X = vec, FUN = my.abs)
## [1] 5 6 7 8
Note that we used named arguments when using sapply above. (Strictly speaking, we didn’t have to: sapply(vec, my.abs) works as well, but using named arguments in this case makes things clearer, as you’ll see in a moment.)
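To see what sapply is doing here, it may help to write the same computation as an explicit loop. This is only a sketch of the idea, not how sapply is actually implemented; the name result is introduced here for illustration.

```r
my.abs <- function(x){
  if(x >= 0){
    x
  } else{
    -x
  }
}

vec <- c(5, -6, 7, -8)

# Call my.abs on one element at a time and store each answer
result <- numeric(length(vec))
for(i in seq_along(vec)){
  result[i] <- my.abs(vec[i])
}
result
## [1] 5 6 7 8
```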
Suppose we want to compute elem - 2 for every element elem in the vector vec using the function minus, by calling sapply. We can do this as follows:
sapply(X = vec, FUN = minus, b = 2)
## [1] 3 -8 5 -10
Here we specify that the 2 is sent into b. But note that X = vec is special: each element of vec is given to minus as its first unnamed argument, and since b is already taken, every element of vec is sent to a.
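Two variations may make this clearer. The following is a sketch that reuses minus and vec from above; the names explicit and flipped are introduced here for illustration. Wrapping the call in an anonymous function spells out where each element goes, and naming a instead of b sends each element to b.

```r
minus <- function(a, b){
  a - b
}
vec <- c(5, -6, 7, -8)

# Same computation as sapply(X = vec, FUN = minus, b = 2), written out explicitly
explicit <- sapply(X = vec, FUN = function(elem) minus(a = elem, b = 2))
explicit
## [1]   3  -8   5 -10

# If a is named instead, each element fills the remaining parameter b: 2 - elem
flipped <- sapply(X = vec, FUN = minus, a = 2)
flipped
## [1] -3  8 -5 10
```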
n function
The n function can be used inside of summarize, mutate and filter. Its job is to count the number of rows in each group (or in the whole data frame, if it is not grouped).
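Here is a small sketch of n() in each of the three verbs. The data frame toy is made up for this illustration and is not part of the lecture data.

```r
library(dplyr)

# Hypothetical toy data, made up for this illustration
toy <- tibble(name = c("Ann", "Ann", "Bob", "Bob", "Bob"),
              year = c(2000, 2001, 2000, 2001, 2002))

# summarize: one row per group, holding the number of rows in that group
per.name <- toy %>%
  group_by(name) %>%
  summarize(rows = n())

# mutate: every row is kept and gets its group's row count
with.count <- toy %>%
  group_by(name) %>%
  mutate(rows = n()) %>%
  ungroup()

# filter: keep only the groups that have more than 2 rows
big.groups <- toy %>%
  group_by(name) %>%
  filter(n() > 2) %>%
  ungroup()
```

In this toy data, per.name has one row for Ann (2 rows) and one for Bob (3 rows), and big.groups keeps only Bob's rows.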
The total number of entries in the data frame (the babynames column n is renamed to num so that it is not confused with the n function):
babynames %>% rename(num = n) %>%
summarize(total.entries = n())
## # A tibble: 1 x 1
## total.entries
## <int>
## 1 1924665
(This is not quite the same as the number of distinct name-year combinations:)
babynames %>% summarize(name.year.count = n_distinct(year, name))
## # A tibble: 1 x 1
## name.year.count
## <int>
## 1 1756284
But it is the same as the number of distinct name-year-sex combinations:
babynames %>% summarize(name.year.sex.count = n_distinct(year, name, sex))
## # A tibble: 1 x 1
## name.year.sex.count
## <int>
## 1 1924665
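The counts differ because the same name can appear in the same year once for each sex. A toy sketch of the same effect, using a made-up data frame toy rather than babynames:

```r
library(dplyr)

# Hypothetical toy data: "Alex" appears in 2000 once for each sex
toy <- tibble(year = c(2000, 2000, 2000, 2001),
              name = c("Alex", "Alex", "Sam", "Sam"),
              sex  = c("F", "M", "F", "F"))

counts <- toy %>%
  summarize(rows = n(),
            name.year = n_distinct(year, name),
            name.year.sex = n_distinct(year, name, sex))
counts
```

Here rows and name.year.sex are both 4, while name.year is 3, because the two Alex rows from 2000 collapse into one name-year combination.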
The number of names that were given to more than 250,000 babies in total after the year 2000:
babynames %>% rename(num = n) %>%
filter(year > 2000) %>%
group_by(name) %>%
summarize(total.babies = sum(num)) %>%
filter(total.babies > 250000) %>%
summarize(num.names = n())
## # A tibble: 1 x 1
## num.names
## <int>
## 1 21
(See Week 3 Lecture 1)
ggplot
Note: this part of the lecture is adapted from Kieran Healy’s Data Visualization Ch. 3.
Let’s plot the life expectancy vs. the GDP per capita. (Note: the standard terminology is “y vs. x” – the variable that goes on the y-axis is first.)
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
geom_point()
We could plot GDP per capita vs. life expectancy as well. Generally, we think of the variable plotted on the y-axis as something we might (possibly) predict, and the variable plotted on the x-axis as something we could (possibly) manipulate. Things are not super clear-cut in this case, but you could imagine a quasi-manipulation of the GDP per capita – think of a small country discovering a large oil field. (That actually happened!)
The mapping argument is the aesthetic mapping: it says which variable goes with which visual property of the plot. Here, for example, we say that we map the life expectancy to the y axis.
ggplot allows us to add layers using +. First, let’s see how we can display the data as a smooth curve.
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
geom_smooth(method = "loess", se = F)
(se = F means we are not displaying the “ribbon” that indicates the uncertainty about the best way to draw the curve.)
Now, let’s use + to display both the scatterplot layer and the smooth curve layer:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
geom_smooth(method = "loess", se = F)
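Since + returns a new plot object, the same plot can also be built up step by step. This is a sketch, assuming the ggplot2 and gapminder packages are installed; the names base.plot, points.only and points.and.curve are introduced here for illustration.

```r
library(ggplot2)
library(gapminder)

# The base plot holds the data and the aesthetic mapping, but no layers yet
base.plot <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))

# Each + returns a new plot with one more layer; base.plot itself is unchanged
points.only <- base.plot + geom_point()
points.and.curve <- points.only + geom_smooth(method = "loess", se = FALSE)

# Printing a plot object draws it
points.and.curve
```

Storing the base plot this way makes it easy to try different layers on top of the same data and mapping.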
We can also display the ribbon:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
geom_smooth(method = "loess")
(Note that we didn’t write se = T. T is the default value of se.)
loess is one kind of smooth curve. Another kind is a straight line:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
geom_smooth(method = "lm")
The best straight line doesn’t seem very good!
We might decide to display the data by plotting x on a log scale: that means that the x scale gets more “compressed” for larger x’s. The effect on the plot is that the crowded region at small x’s gets “stretched” out, while the points at large x’s are pulled closer together.
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_log10()
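A quick numeric sketch of what the log scale does, using made-up GDP per capita values (not taken from gapminder): values that are each 10 times the previous one end up equally spaced.

```r
# Hypothetical GDP per capita values, each 10 times the previous one
gdp <- c(100, 1000, 10000, 100000)

# Gaps on the original scale grow quickly
raw.gaps <- diff(gdp)
raw.gaps
## [1]   900  9000 90000

# Gaps on the log10 scale are all the same size
log.gaps <- diff(log10(gdp))
log.gaps
## [1] 1 1 1
```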
We can make the plot nicer by specifying that the units of gdpPercap are dollars:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_log10(labels = scales::dollar)
Note that we could have instead plotted lifeExp vs log10(gdpPercap). This would make labelling the axes more annoying though:
ggplot(data = gapminder, mapping = aes(x = log10(gdpPercap), y = lifeExp)) +
geom_point() +
geom_smooth(method = "lm")
We can set colors etc.:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
geom_point(color = "darkorange") +
geom_smooth(method = "loess", color = "black") +
scale_x_log10()
Here is how to label the axes etc. using labs and make the points more transparent using alpha = 0.3:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y=lifeExp)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm") +
scale_x_log10(labels = scales::dollar) +
labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
title = "Economic Growth and Life Expectancy",
subtitle = "Data points are country-years",
caption = "Source: Gapminder.")
We can also transform the design of the plot using various pre-made themes:
library(ggthemes)
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y=lifeExp)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_log10(labels = scales::dollar) +
theme_economist() +
labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
title = "Economic Growth and Life Expectancy",
subtitle = "Data points are country-years",
caption = "Source: Gapminder.")
So far, we only mapped variables to either the y or the x axis. But what if we want to display more than two variables on the same plot? One possibility is to map the third variable to color. For example, we might like to display the data for 1982, and to indicate which point came from which continent:
library(ggthemes)
gap.1982 <- gapminder %>% filter(year == 1982)
ggplot(data = gap.1982, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point() +
scale_x_log10(labels = scales::dollar) +
theme_economist() +
scale_color_economist()
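It may help to contrast mapping a variable to color with setting a fixed color, as we did earlier with color = "darkorange". This is a sketch, assuming the ggplot2, gapminder and dplyr packages are installed; the names mapped and fixed are introduced here for illustration.

```r
library(ggplot2)
library(gapminder)
library(dplyr)

gap.1982 <- gapminder %>% filter(year == 1982)

# Mapping: color is inside aes(), so it varies with the continent variable
mapped <- ggplot(data = gap.1982,
                 mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10(labels = scales::dollar)

# Setting: color is outside aes(), so every point gets the same fixed color
fixed <- ggplot(data = gap.1982,
                mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "darkorange") +
  scale_x_log10(labels = scales::dollar)
```

Printing mapped draws one color per continent together with a legend, while printing fixed draws all points in dark orange with no legend.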