Suppose we’re interested in the influence of Harry Potter on baby names.
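library(tidyverse)  # provides %>%, filter(), select(), pivot_wider(), pivot_longer(), ggplot()
library(babynames)
# Keep boys named Harry and girls named Hermione, from 1980 onward
b <- babynames %>%
  filter(((name == "Harry") & (sex == "M")) | ((name == "Hermione") & (sex == "F"))) %>%
  filter(year >= 1980) %>%
  select(year, name, n)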
b
## # A tibble: 55 x 3
## year name n
## <dbl> <chr> <int>
## 1 1980 Harry 859
## 2 1981 Harry 836
## 3 1982 Harry 767
## 4 1983 Harry 753
## 5 1984 Harry 727
## 6 1985 Harry 781
## 7 1986 Harry 677
## 8 1987 Harry 710
## 9 1988 Harry 671
## 10 1989 Harry 747
## # … with 45 more rows
The data frame b is in “tidy data” form: every row contains one measurement (the number of babies born, for a particular name and a particular year). This is also known as “long form.”
So far in this class, pretty much all of our data has been in “long form.” It might be difficult to imagine how else we could store the data in b in a data frame.
Here is a way:
wide <- b %>% pivot_wider(names_from = name, values_from = n)
wide
## # A tibble: 38 x 3
## year Harry Hermione
## <dbl> <int> <int>
## 1 1980 859 NA
## 2 1981 836 NA
## 3 1982 767 NA
## 4 1983 753 NA
## 5 1984 727 NA
## 6 1985 781 NA
## 7 1986 677 NA
## 8 1987 710 NA
## 9 1988 671 NA
## 10 1989 747 NA
## # … with 28 more rows
What we did was take the values in the column n and distribute them across two columns: one column for each category in the column b$name.
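To see the mechanics on something small enough to check by hand, here is a minimal sketch on a toy table. The table toy and its counts are made up for illustration and are not taken from babynames. The year 2000 has no Hermione row, so the widened table gets an NA in that cell.

```r
library(tidyr)

# Toy counts (made-up values, not from babynames).
# Note that there is no row for Hermione in 2000.
toy <- tibble(
  year = c(2000, 2001, 2001),
  name = c("Harry", "Harry", "Hermione"),
  n    = c(10, 12, 7)
)

# One output column per distinct value of `name`;
# the cells are filled with the matching values of `n`.
toy_wide <- toy %>% pivot_wider(names_from = name, values_from = n)
toy_wide
```

The result has one row per year, and any name-year combination that was absent from the long table shows up as NA.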
We can go back to long form using pivot_longer:
long <- wide %>% pivot_longer(cols = c(Harry, Hermione), names_to = "name", values_to = "n")
long
## # A tibble: 76 x 3
## year name n
## <dbl> <chr> <int>
## 1 1980 Harry 859
## 2 1980 Hermione NA
## 3 1981 Harry 836
## 4 1981 Hermione NA
## 5 1982 Harry 767
## 6 1982 Hermione NA
## 7 1983 Harry 753
## 8 1983 Hermione NA
## 9 1984 Harry 727
## 10 1984 Hermione NA
## # … with 66 more rows
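We created a new column name, and gathered the values from the columns Harry and Hermione into one column n. Note that long has 76 rows while b had only 55: the round trip through wide form produced a row for every year and name, including the combinations that were absent from b. That is where the NAs come from: data for Hermione before 2000 is indeed missing (because babynames only includes a name in a given year if at least 5 babies were given it).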
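If we would rather get back only the rows that carry a count, pivot_longer has a values_drop_na argument for that. The sketch below uses a made-up toy table (toy_wide and its values are hypothetical, not from babynames).

```r
library(tidyr)

# Toy wide table (made-up values); Hermione has no count for 2000.
toy_wide <- tibble(
  year     = c(2000, 2001),
  Harry    = c(10, 12),
  Hermione = c(NA, 7)
)

# values_drop_na = TRUE drops the rows whose value would be NA,
# so only the name-year combinations with a count are kept.
toy_long <- toy_wide %>%
  pivot_longer(cols = c(Harry, Hermione), names_to = "name", values_to = "n",
               values_drop_na = TRUE)
toy_long
```

Whether to keep the NA rows depends on what comes next: dropping them gives the compact long form back, while keeping them makes the gaps explicit.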
We can now display the data:
ggplot(long) +
  geom_smooth(mapping = aes(x = year, y = n, color = name), method = "loess")
## Warning: Removed 21 rows containing non-finite values (stat_smooth).
We can replace the NAs with 0’s if we like. is.na(vec) returns a vector of TRUE and FALSE values, where TRUE marks the positions that hold NA:
a <- long$n
a[is.na(a)] <- 0
long$n <- a
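The same three steps can be traced on a short vector. This is a minimal sketch; counts and its values are made up for illustration.

```r
# Toy vector of counts (made-up values) with two missing entries.
counts <- c(5, NA, 8, NA)

# is.na() returns one TRUE/FALSE per element of its argument.
missing <- is.na(counts)
missing

# Indexing with a logical vector selects only the TRUE positions,
# so the assignment overwrites the NAs and leaves the rest alone.
counts[missing] <- 0
counts
```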
Let’s display this again:
ggplot(long) +
  geom_smooth(mapping = aes(x = year, y = n, color = name), method = "loess")
Note that ggplot doesn’t work that well with wide tables: for the long table we got a legend automatically by mapping color to name, but we couldn’t do that for the wide table, since it has no name column to map color to.
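For comparison, here is one possible way to plot a wide table directly. It is a sketch on a made-up toy table (toy_wide and its values are hypothetical), and it uses geom_line instead of geom_smooth to keep the example small. Each name column needs its own layer, and the legend labels have to be typed by hand.

```r
library(ggplot2)

# Toy wide table (made-up values, not from babynames).
toy_wide <- data.frame(
  year     = 2000:2004,
  Harry    = c(10, 12, 11, 9, 8),
  Hermione = c(5, 7, 9, 12, 15)
)

# There is no `name` column to map to color, so every column gets its own
# layer, with a constant string standing in for the legend label.
p <- ggplot(toy_wide, aes(x = year)) +
  geom_line(aes(y = Harry, color = "Harry")) +
  geom_line(aes(y = Hermione, color = "Hermione")) +
  labs(y = "n", color = "name")
# Run print(p) to draw the plot.
```

With more names this means more copied layers, which is the practical reason to pivot to long form before plotting.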