6 Graphing with ggplot2

We’ve made some simple graphs earlier, in this lesson we will use the package ggplot2 to make simple and elegant looking graphs.

The ‘gg’ part of ggplot2 stands for ‘grammar of graphics’ which is the idea that most graphs can be made using the same few ‘pieces’. We’ll get into those pieces during this lesson. For a useful cheatsheet for this package see here

When working with new data, It’s often useful to quickly graph the data to try to understand what you’re working with. It is also useful when understanding how much to trust the data.

The data we will work on is data about alcohol consumption in U.S. states from 1977-2017 from the National Institute of Health. It contains the per capita alcohol consumption for each state for every year. Their method to determine per capita consumption is amount of alcohol sold / number of people aged 14+ living in the state. More details on the data are available here.

Now we need to load the data.

The name of the data is quite long so for convenience let’s copy it to a new object with a better name, alcohol.

The original data has every state, region, and the US as a whole. For this lesson we’re using data subsetted to just include states. For now let’s just look at Pennsylvania.

6.1 What does the data look like?

Before graphing, it’s helpful to see what the data includes. An important thing to check is what variables are available and the units of these variables.

So each row of the data is a single year of data for Pennsylvania. It includes alcohol consumption for wine, liquor, beer, and total drinks - both as gallons of ethanol (a hard unit to interpret) and more traditional measures such as glasses of wine or number of beers. The original data only included the gallons of ethanol data which I converted to the more understandable units. If you encounter data with odd units, it is a good idea to convert it to something easier to understand - especially if you intend to show someone else the data or results!

6.2 Graphing data

To make a plot using ggplot(), all you need to do is specify the data set and the variables you want to plot. From there you add on pieces of the graph using the + symbol and then specify what you want added.

For ggplot() we need to specify 4 things

  1. The data set
  2. The x-axis variable
  3. The y-axis variable
  4. The type of graph - e.g. line, point, etc.

Some useful types of graphs are

  • geom_point() - A point graph, can be used for scatter plots
  • geom_line() - A line graph
  • geom_smooth() - Adds a regression line to the graph
  • geom_bar() - A barplot

6.3 Time-Series Plots

Let’s start with a time-series of beer consumption in Pennsylvania. In time-series plots the x-axis is always the time variable while the y-axis is the variable whose trend over time is what we’re interested in. When you see a graph showing crime rates over time, this is the type of graph you’re looking at.

The code below starts by writing our data set name. Then says what our x- and y-axis variables are called. The x- and y-axis variables are within parentheses of the function called aes(). aes() stands for aesthetic and what’s included inside here describes how the graph will look. It’s not intuitive to remember, but you need to included this.

Note that on the x-axis it prints out every single year and makes it completely unreadable. That is because the “year” column is a character type, so R thinks each year is its own category. It prints every single year because it thinks we want every category shown. To fix this we can make the column numeric and ggplot() will be smarter about printing fewer years.

When we run it we get our graph. It includes the variable names for each axis and shows the range of data through the tick marks. What is missing is the actual data. For that we need to specify what type of graph it is. We literally add it with the + followed by the type of graph we want. Make sure that the + is at the end of a line, not the start of one. Starting a line with the + will not work.

Let’s start with point and line graphs.

We can also combine different types of graphs.

It looks like there’s a huge change in beer consumption over time. But look at where they y-axis starts. It starts around 280 so really that change is only ~60 beers. That’s because when graphs don’t start at 0, it can make small changes appear big. We can fix this by forcing the y-axis to begin at 0. We can add expand_limits(y = 0) to the graph to say that the value 0 must always appear on the y-axis, even if no data is close to that value.

Now that graphs shows what looks like nearly no change even though that is also not true. Which graph is best? It’s hard to say.

Inside the types of graphs we can change how it is displayed. As with using plot(), we can specify the color and size of our lines or points.

Some other useful features are changing the axis labels and the graph title. Unlike in plot() we do not need to include it in the () of ggplot() but use their own functions to add them to the graph.

  • xlab() - x-axis label
  • ylab() - y-axis label
  • ggtitle() - graph title

Many time-series plots show multiple variables over the same time period (e.g. murder and robbery over time). There are ways to change the data itself to make creating graphs like this easier, but let’s stick with the data we currently have and just change ggplot().

Start with a normal line graph, this time looking at wine.

Then include a second geom_line() with its own aes() for the second variable.

A problem with this is that both lines are the same color. We need to set a color for each line, and do so within aes(). Instead of providing a color name, we need to provide the name the color will have in the legend. Do so for both lines.

We can change the legend title by using the function labs() and changing the value color to what we want the legend title to be.

Finally, a useful option to to move the legend from the side to the bottom is setting the theme() function to move the legend.position to “bottom”. This will allow the graph to be wider.

6.4 Scatter Plots

Making a scatter plot simply requires changing the x-axis from year to another numerical variable and using geom_point().

This graph shows us that when liquor consumption increases, beer consumption also tends to increase.

While scatterplots can help show the relationship between variables, we lose the information of how consumption changes over time.

6.5 Color blindness

Please keep in mind that some people are color blind so graphs (or maps which we will learn about soon) will be hard to read for these people if we choose the incorrect colors. A helpful site for choosing colors for graphs is colorbrewer2.org

This site let’s you select which type of colors you want (sequential and diverging such as shades in a hotspot map, and qualitative such as for data like what we used in this lesson). In the “Only show:” section you can set it to “colorblind safe” to restrict it to colors that allow people with color blindness to read your graph. To the right of this section it shows the HEX codes for each color (a HEX code is just a code that a computer can read and know exactly which color it is).

Let’s use an example of a color blind friendly color from the “qualitative” section of ColorBrewer. We have three options on this page (we can change how many colors we want but it defaults to showing 3): green (HEX = #1b9e77), orange (HEX = #d95f02), and purple (HEX = #7570b3). We’ll use the orange and purple colors. To manually set colors in ggplot() we use scale_color_manual(values = c()) and include a vector of color names or HEX codes inside the c(). Doing that using the orange and purple HEX codes will change our graph colors to these two colors.