Tidy data

Suppose we’re interested in the influence of Harry Potter on baby names.

library(tidyverse)
library(babynames)

# Keep boys named Harry and girls named Hermione, from 1980 onward
b <- babynames %>% 
  filter((name == "Harry" & sex == "M") | (name == "Hermione" & sex == "F")) %>% 
  filter(year >= 1980) %>% 
  select(year, name, n)

b
## # A tibble: 55 x 3
##     year name      n
##    <dbl> <chr> <int>
##  1  1980 Harry   859
##  2  1981 Harry   836
##  3  1982 Harry   767
##  4  1983 Harry   753
##  5  1984 Harry   727
##  6  1985 Harry   781
##  7  1986 Harry   677
##  8  1987 Harry   710
##  9  1988 Harry   671
## 10  1989 Harry   747
## # … with 45 more rows

The data frame b is in “tidy data” form: every row contains one measurement (the number of babies born with a particular name in a particular year). This is also known as “long form.”

So far in this class, pretty much all of our data has been in “long form.” It might be difficult to imagine how else we could store the data in b in a data frame.

Here is a way:

wide <- b %>% pivot_wider(names_from = name, values_from = n)
wide
## # A tibble: 38 x 3
##     year Harry Hermione
##    <dbl> <int>    <int>
##  1  1980   859       NA
##  2  1981   836       NA
##  3  1982   767       NA
##  4  1983   753       NA
##  5  1984   727       NA
##  6  1985   781       NA
##  7  1986   677       NA
##  8  1987   710       NA
##  9  1988   671       NA
## 10  1989   747       NA
## # … with 28 more rows

What we did was take the values in the column n and distribute them across two columns: one column for each distinct value in the column b$name.

We can go back to long form using pivot_longer:
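To see the mechanics on something small enough to check by eye, here is a minimal sketch using a made-up three-row table. The names toy and toy_wide and all the counts are invented for illustration; they are not taken from babynames.

```r
library(tibble)
library(tidyr)

# Made-up counts, for illustration only (not from babynames)
toy <- tribble(
  ~year, ~name, ~n,
  2000,  "Ann", 10L,
  2001,  "Ann", 12L,
  2001,  "Bob",  7L
)

# One new column per distinct value of `name`, filled with the values from `n`
toy_wide <- toy %>% pivot_wider(names_from = name, values_from = n)
toy_wide
```

There is no row for Bob in 2000, so the Bob column gets an NA for that year. This is the same thing that happens to the Hermione column in wide.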

long <- wide %>% pivot_longer(cols = c(Harry, Hermione), names_to = "name", values_to = "n")
long
## # A tibble: 76 x 3
##     year name         n
##    <dbl> <chr>    <int>
##  1  1980 Harry      859
##  2  1980 Hermione    NA
##  3  1981 Harry      836
##  4  1981 Hermione    NA
##  5  1982 Harry      767
##  6  1982 Hermione    NA
##  7  1983 Harry      753
##  8  1983 Hermione    NA
##  9  1984 Harry      727
## 10  1984 Hermione    NA
## # … with 66 more rows

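We created a new column name, and gathered the values from the columns Harry and Hermione into one column n. Note that long has 76 rows (38 years × 2 names), while b had only 55: the round trip through wide form made the missing year/name combinations explicit. That is where the NAs come from: data for Hermione before 2000 is indeed missing (because babynames only has data for 5 births or more a year).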
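If we would rather get back exactly the rows we started with, pivot_longer has a values_drop_na argument that drops the rows whose value is NA. A minimal sketch on a made-up wide table (toy_wide and its counts are invented for illustration):

```r
library(tibble)
library(tidyr)

# Made-up wide table with one missing cell (not from babynames)
toy_wide <- tibble(
  year = c(2000, 2001),
  Ann  = c(10L, 12L),
  Bob  = c(NA, 7L)
)

# values_drop_na = TRUE discards the rows whose value would be NA
toy_long <- toy_wide %>%
  pivot_longer(cols = c(Ann, Bob), names_to = "name", values_to = "n",
               values_drop_na = TRUE)
toy_long
```

Whether to keep or drop the NA rows depends on what comes next: keeping them is convenient if we plan to fill them in, dropping them gives the more compact table.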

We can now display the data:

ggplot(long) + 
  geom_smooth(mapping = aes(x = year, y = n, color = name), method = "loess")
## Warning: Removed 21 rows containing non-finite values (stat_smooth).

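We can replace the NAs with 0’s if we like (keeping in mind that a missing value here means fewer than 5 births, not necessarily none). is.na(vec) returns a logical vector with TRUE wherever vec is NA and FALSE elsewhere, so we can use it to select the entries to overwrite: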

a <- long$n
a[is.na(a)] <-  0
long$n <- a
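The same replacement can also be written inside a pipeline. The sketch below uses tidyr’s replace_na together with dplyr’s mutate on a made-up table (toy_long and its counts are invented for illustration); it is an alternative to the base R indexing above, not a different result.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Made-up long table with one missing count (not from babynames)
toy_long <- tibble(
  year = c(2000, 2000, 2001, 2001),
  name = c("Ann", "Bob", "Ann", "Bob"),
  n    = c(10L, NA, 12L, 7L)
)

# Logical vector marking which entries of n are missing
is.na(toy_long$n)

# Replace every NA in n with 0, leaving the other values untouched
toy_filled <- toy_long %>% mutate(n = replace_na(n, 0L))
toy_filled
```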

Let’s display this again:

ggplot(long) + 
  geom_smooth(mapping = aes(x = year, y = n, color = name), method = "loess")

Tidy data and ggplot

Note that ggplot doesn’t work that well with wide tables. With the long table, mapping color to name gave us one curve per name and a legend automatically. We couldn’t do that with the wide table: it has no name column to map color to, since the names are column headers rather than values.
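To make the contrast concrete, here is a rough sketch of both approaches on a made-up table (toy_wide, toy_long, and the counts are invented for illustration). With the wide table we need one layer per column and have to pick the colors by hand; with the long table a single layer is enough.

```r
library(ggplot2)
library(tibble)
library(tidyr)

# Made-up wide table (not from babynames)
toy_wide <- tibble(
  year = 2000:2003,
  Ann  = c(10, 12, 9, 11),
  Bob  = c(5, 7, 8, 6)
)

# Wide: one layer per column, colors set by hand, no legend is produced
p_wide <- ggplot(toy_wide) +
  geom_line(mapping = aes(x = year, y = Ann), color = "red") +
  geom_line(mapping = aes(x = year, y = Bob), color = "blue")

# Long: a single layer; color is mapped to name, so ggplot builds the legend
toy_long <- toy_wide %>%
  pivot_longer(cols = c(Ann, Bob), names_to = "name", values_to = "n")
p_long <- ggplot(toy_long) +
  geom_line(mapping = aes(x = year, y = n, color = name))
```

Adding a third name to the long version requires no change to the plotting code, while the wide version would need another geom_line layer.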