Histograms and bar charts with ggplot

library(ggplot2)
titanic <- read.csv("http://guerzhoy.princeton.edu/201s20/titanic.csv")

Let’s plot a bar chart of the passengers on the Titanic. On the x-axis, we’ll display the survival category (survived/did not survive). On the y-axis, we’ll display the count.

ggplot(data = titanic, mapping = aes(x = Survived)) + 
  geom_bar(aes(y = ..count..))

We can instead display a proportion:

ggplot(data = titanic, mapping = aes(x = Survived)) + 
  geom_bar(aes(y = ..prop..))

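The variable Survived is stored as a number (0 or 1), so the x-axis is drawn as a continuous scale, which doesn’t really make sense for a category. Let’s create a character column, cSurvived, with descriptive labels: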

titanic[, "cSurvived"] <- ""
titanic[titanic$Survived==1, "cSurvived"] <- "Survived"
titanic[titanic$Survived==0, "cSurvived"] <- "Died"

ggplot(data = titanic, mapping = aes(x = cSurvived)) + 
  geom_bar(mapping = aes(y = ..count..), fill="tomato2") + xlab("Survival status")
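
The same recoding can also be written with factor(), which attaches the labels in one step and fixes the order of the categories. This is only a minimal sketch on a made-up vector; survived_toy and status are hypothetical names that do not appear in the Titanic data.

```r
# Hypothetical 0/1 codes standing in for titanic$Survived
survived_toy <- c(0, 1, 1, 0)

# levels lists the codes and labels gives the matching text, in the same order
status <- factor(survived_toy, levels = c(0, 1), labels = c("Died", "Survived"))

print(status)
print(table(status))
```

Applied to the real data, this would read titanic$cSurvived <- factor(titanic$Survived, levels = c(0, 1), labels = c("Died", "Survived")). ggplot orders the bars of a factor by its levels, so the order given in levels is the order on the x-axis.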

Instead of the count, we might like to display the proportion of each of the categories out of the total. The proportion is just the ..count.. divided by the size of the dataset.

ggplot(data = titanic, mapping = aes(x = cSurvived)) + 
  geom_bar(mapping = aes(y = ..prop.., group = 1), fill="tomato1") + xlab("Survival status")

We need to specify group = 1 here. Otherwise, each value of cSurvived would be its own group, so the proportion of the people who died would be computed out of all the people who died, and every bar would have height 1. Setting group = 1 means that all the rows in the data frame are treated as one group.
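
The arithmetic behind group = 1 can be reproduced by hand in base R. The sketch below uses a made-up data frame (toy, with three deaths and one survivor) rather than the Titanic file, and the variable names are hypothetical; it only illustrates which denominator is used in each case.

```r
# Hypothetical toy data, not the Titanic file: 3 deaths and 1 survivor
toy <- data.frame(cSurvived = c("Died", "Died", "Died", "Survived"))

# Number of rows in each category
counts <- table(toy$cSurvived)

# Default grouping: each x category is its own group, so every
# proportion is the category count divided by that same count
prop_default <- counts / counts

# group = 1: all rows form one group, so the denominator is the total
prop_one_group <- counts / sum(counts)

print(prop_default)
print(prop_one_group)
```

With the default grouping every proportion is 1, which is why the bars would all have the same height; with a single group the proportions are shares of the whole dataset.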

Here is another example:

ggplot(data = titanic, mapping = aes(x = cSurvived, fill = as.factor(Pclass))) + 
            geom_bar(position = "dodge", mapping = aes(y = ..prop.., group = as.factor(Pclass))) + xlab("")

Here, the proportion is taken within each passenger class: out of the passengers in class 1, out of the passengers in class 2, and out of the passengers in class 3. (Which proportions add up to 1?)
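
One way to check the question above is to redo the arithmetic by hand. The sketch below uses a made-up data frame of six passengers (toy is hypothetical, not the Titanic file) and base R's prop.table() to compute proportions within each class.

```r
# Hypothetical toy data: six passengers across three classes
toy <- data.frame(
  Pclass    = c(1, 1, 2, 2, 2, 3),
  cSurvived = c("Survived", "Died", "Died", "Died", "Survived", "Died")
)

# Rows are classes, columns are survival statuses
tab <- table(toy$Pclass, toy$cSurvived)

# margin = 1 divides each row by its own total,
# mirroring group = as.factor(Pclass) in the plot
within_class <- prop.table(tab, margin = 1)

print(rowSums(within_class))  # one total per class
print(colSums(within_class))  # one total per survival status
```

The per-class totals come out to 1, while the per-status totals do not: the bars of one colour add up to 1, not the bars within one survival category.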

The default position is “stack”:

ggplot(data = titanic, mapping = aes(x = cSurvived, fill = as.factor(Pclass))) + 
            geom_bar(position = "stack", mapping = aes(y = ..prop.., group = as.factor(Pclass))) + xlab("")

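This is still not ideal: the proportions add up to 1 separately within each class, so the height of a stacked bar is not a proportion of all the passengers. Suppose we want to show both the overall proportion of Died/Survived and the share of each passenger class within each bar.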

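library(dplyr)
# Treat passenger class as a category rather than a number
titanic$Pclass <- as.character(titanic$Pclass)
t1 <- titanic %>% group_by(Pclass, cSurvived) %>% 
                  summarize(num = n()) %>%   # passengers in each class/status combination
                  ungroup() %>%              # drop the grouping so sum(num) is the overall total
                  mutate(p = num/sum(num))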

ggplot(data = t1, mapping = aes(x = cSurvived)) + 
            geom_bar(stat = "identity", position = "stack", mapping = aes(y = p, fill = Pclass) ) + xlab("")

We used stat = "identity" to simply display the numbers in the p column, which we computed ourselves, instead of the special ..count.. or ..prop.. values, which wouldn’t allow us this flexibility. The default for geom_bar is stat = "count".
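
As a side note, ggplot2 also provides geom_col(), which is shorthand for geom_bar(stat = "identity"). The sketch below compares the two on a made-up summary table; toy_summary and its values are hypothetical and only mimic the shape of t1.

```r
library(ggplot2)

# Hypothetical pre-computed summary, in the same shape as t1
toy_summary <- data.frame(
  Pclass    = c("1", "1", "2", "2"),
  cSurvived = c("Died", "Survived", "Died", "Survived"),
  p         = c(0.2, 0.3, 0.4, 0.1)
)

# stat = "identity": bar heights are read directly from the p column
p_identity <- ggplot(toy_summary, aes(x = cSurvived, y = p, fill = Pclass)) +
  geom_bar(stat = "identity")

# geom_col() draws the same bars without naming the stat
p_col <- ggplot(toy_summary, aes(x = cSurvived, y = p, fill = Pclass)) +
  geom_col()

# Compare the y positions that each plot actually draws
same_heights <- isTRUE(all.equal(layer_data(p_identity)$y, layer_data(p_col)$y))
print(same_heights)
```

Which form to use is a matter of taste; stat = "identity" makes the contrast with the default stat = "count" explicit.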

Because p is computed out of all the passengers, the heights of the two bars now add up to 1, like in the original proportion plot, which is probably more informative.

A more typical situation is plotting the histogram of a continuous variable like age.

ggplot(data = titanic,  mapping = aes(x = Age)) +
      geom_histogram(bins = 10)

Varying the number of bins allows us to display the data more appropriately: too many bins means we’ll see patterns that are only there because the sample size is too small; too few bins means we won’t see trends that are actually in the data.

ggplot(data = titanic,  mapping = aes(x = Age)) +
      geom_histogram(bins = 100)

ggplot(data = titanic,  mapping = aes(x = Age)) +
      geom_histogram(bins = 3)
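
The effect of the number of bins can also be seen in the counts themselves. The sketch below bins a made-up vector of ages with base R's cut(); the ages are hypothetical, and ggplot2 chooses its own bin boundaries, so the exact counts in a geom_histogram plot may differ from these.

```r
# Hypothetical ages, not taken from the Titanic file
ages <- c(2, 4, 19, 21, 22, 24, 25, 28, 31, 35, 40, 58)

# cut() splits the range of ages into equal-width intervals;
# table() then counts how many ages land in each one
few_bins  <- table(cut(ages, breaks = 3))
many_bins <- table(cut(ages, breaks = 10))

print(few_bins)
print(many_bins)

# With only 12 observations, some of the 10 narrow bins stay empty
n_empty <- sum(many_bins == 0)
print(n_empty)
```

The empty bins are an artifact of the small sample, not a feature of the ages. geom_histogram also accepts binwidth (for example binwidth = 5 for five-year bins) as an alternative to bins when a fixed bin width is easier to interpret.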

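We can display overlapping histograms. We specify position = "identity" so that the histograms are drawn on top of each other rather than stacked (the default for geom_histogram), and alpha = 0.4 to indicate that the histograms are partially transparent.

ggplot(data = titanic, mapping = aes(x = Age, fill = Sex)) +
  geom_histogram(alpha = 0.4, bins = 10, position = "identity")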