Data frames that conform to the principle of tidy data have one row per data point and one column per variable. The data frame gapminder
conforms to this principle:
gapminder
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,694 more rows
Note that there is one data point (i.e., measurement) per row. Each column corresponds to a different quantity that’s being measured. (We could countries etc.). Here is an example of data that is not tidy. (We are not discussing spread
in detail right now.)
spread(gapminder %>% select(country, year, lifeExp), key = country, value = lifeExp)[, 1:7]
## # A tibble: 12 x 7
## year Afghanistan Albania Algeria Angola Argentina Australia
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1952 28.8 55.2 43.1 30.0 62.5 69.1
## 2 1957 30.3 59.3 45.7 32.0 64.4 70.3
## 3 1962 32.0 64.8 48.3 34 65.1 70.9
## 4 1967 34.0 66.2 51.4 36.0 65.6 71.1
## 5 1972 36.1 67.7 54.5 37.9 67.1 71.9
## 6 1977 38.4 68.9 58.0 39.5 68.5 73.5
## 7 1982 39.9 70.4 61.4 39.9 69.9 74.7
## 8 1987 40.8 72 65.8 39.9 70.8 76.3
## 9 1992 41.7 71.6 67.7 40.6 71.9 77.6
## 10 1997 41.8 73.0 69.2 41.0 73.3 78.8
## 11 2002 42.1 75.7 71.0 41.0 74.3 80.4
## 12 2007 43.8 76.4 72.3 42.7 75.3 81.2
Here, the life expectancies are recorded in multiple columns rather than one, and each row holds multiple measurements. This is a more concise way of presenting the information if all we need is to show the life expectancies in different countries in different years. This format is called “wide”, as opposed to the “long” (tidy) format of the original gapminder
data frame (you can see why from the shapes of the two tables).
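To make the wide/long distinction concrete, here is a minimal sketch of a round trip between the two formats. It assumes the tibble and tidyr packages are installed, and it uses a small hand-built table (a few values copied from the output above) rather than the full gapminder data. The names wide, long and wide_again are made up for this illustration. It uses gather, the counterpart of spread; newer versions of tidyr also offer pivot_longer and pivot_wider for the same job.

```r
library(tibble)
library(tidyr)

# A small wide table: one row per year, one column per country.
# Values are copied from the first two rows of the wide output above.
wide <- tribble(
  ~year, ~Afghanistan, ~Albania, ~Algeria,
  1952L,         28.8,     55.2,     43.1,
  1957L,         30.3,     59.3,     45.7
)

# Wide -> long: every column except year is folded into a
# (country, lifeExp) pair, giving one measurement per row.
long <- gather(wide, key = "country", value = "lifeExp", -year)

# Long -> wide: spread the country column back out into separate columns.
wide_again <- spread(long, key = country, value = lifeExp)

print(long)
print(wide_again)
```

The long table has one row per country-year pair (6 rows here, from 2 years and 3 countries), which is the tidy layout; the wide table packs the same 6 measurements into 2 rows.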