Week 3 Lecture 2

Tidy data

Data frames that conform to the principle of tidy data have one row per data point and one column per variable. The data frame gapminder conforms to this principle:

gapminder

## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows
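One way to check the "one row per data point" property is to count how many rows there are for each country-year combination. The following is only a sketch: it assumes the gapminder and dplyr packages are installed, and the name obs_counts is introduced here for illustration.

```r
library(dplyr)
library(gapminder)

# Count the rows for each country-year combination
obs_counts <- gapminder %>% count(country, year)

# In tidy data, each combination should appear exactly once,
# so this is expected to be TRUE
all(obs_counts$n == 1)
```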

Note that there is one data point (i.e., measurement) per row, and each column corresponds to a different quantity that’s being measured. (We could countries etc.) Here is an example of data that is not tidy. (We are not discussing spread in detail right now.)

spread(gapminder %>% select(country, year, lifeExp), key = country, value = lifeExp)[, 1:7]

## # A tibble: 12 x 7
##     year Afghanistan Albania Algeria Angola Argentina Australia
##    <int>       <dbl>   <dbl>   <dbl>  <dbl>     <dbl>     <dbl>
##  1  1952        28.8    55.2    43.1   30.0      62.5      69.1
##  2  1957        30.3    59.3    45.7   32.0      64.4      70.3
##  3  1962        32.0    64.8    48.3   34        65.1      70.9
##  4  1967        34.0    66.2    51.4   36.0      65.6      71.1
##  5  1972        36.1    67.7    54.5   37.9      67.1      71.9
##  6  1977        38.4    68.9    58.0   39.5      68.5      73.5
##  7  1982        39.9    70.4    61.4   39.9      69.9      74.7
##  8  1987        40.8    72      65.8   39.9      70.8      76.3
##  9  1992        41.7    71.6    67.7   40.6      71.9      77.6
## 10  1997        41.8    73.0    69.2   41.0      73.3      78.8
## 11  2002        42.1    75.7    71.0   41.0      74.3      80.4
## 12  2007        43.8    76.4    72.3   42.7      75.3      81.2

Here, the life expectancies are recorded in multiple columns, not just one, and there are multiple measurements per row. This is a more concise way of presenting the information if all we need to do is show the life expectancies in different countries in different years. This format is called “wide”, as opposed to the “long” (tidy) format of the original gapminder data frame. Comparing the shapes of the two printouts shows where the names come from: the wide table has few rows and many columns, while the long table has many rows and few columns.
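To make the relationship between the two formats concrete, here is a minimal sketch that converts a small table from long to wide and back. It uses pivot_wider() and pivot_longer() from tidyr, which are the newer counterparts of spread() and gather(); that is a substitution made for this example, not the function used above. The four values are copied from the printouts above, and the names life_exp, wide and long are illustrative.

```r
library(dplyr)
library(tidyr)

# A small long (tidy) table: one row per country-year measurement
life_exp <- tibble(
  country = c("Afghanistan", "Afghanistan", "Albania", "Albania"),
  year    = c(1952L, 1957L, 1952L, 1957L),
  lifeExp = c(28.8, 30.3, 55.2, 59.3)
)

# Long -> wide: one column per country, one row per year
wide <- life_exp %>%
  pivot_wider(names_from = country, values_from = lifeExp)

# Wide -> long: gather the country columns back into country/lifeExp pairs,
# then sort so the rows are in the same order as the original table
long <- wide %>%
  pivot_longer(cols = -year, names_to = "country", values_to = "lifeExp") %>%
  select(country, year, lifeExp) %>%
  arrange(country, year)

wide
long
```

The round trip returns the same measurements, which shows that the two formats carry the same information and differ only in layout.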