15 More graphing with ggplot2
For this chapter you’ll need the following file, which is available for download here: fatal-police-shootings-data.csv.
In this lesson we will continue to explore graphing using ggplot2. The data we will use is microdata on officer-involved shootings that resulted in a death in the United States since January 1st, 2015. This data has been compiled and released by The Washington Post, so it will be a useful exercise in exploring data from non-government sources. This data is useful for our purposes as it has a number of variables related to the person who was shot, allowing us to practice making many types of graphs. Each row of data is a different person who was shot and killed by the police, and each column gives us information about the individual or the shooting, such as their age, whether they carried any weapon, and the shooting location.
To explore the data on The Washington Post’s website, see here. To examine their methodology, see here.
The data initially comes as a .csv file so we’ll use the read_csv() function from the readr package.
library(readr)
shootings <- read_csv("data/fatal-police-shootings-data.csv")
Since read_csv() reads files into a tibble object, we’ll turn it into a data.frame so head() shows every single column.
shootings <- as.data.frame(shootings)
15.1 Exploring data
Now that we have the data read in, let’s look at it.
ncol(shootings)
# [1] 14
nrow(shootings)
# [1] 4371
The data has 14 variables and covers 4,371 shootings. Let’s check out some of the variables, first using head(), then using summary() and table().
head(shootings)
# id name date manner_of_death armed age gender race
# 1 3 Tim Elliot 2015-01-02 shot gun 53 M A
# 2 4 Lewis Lee Lembke 2015-01-02 shot gun 47 M W
# 3 5 John Paul Quintero 2015-01-03 shot and Tasered unarmed 23 M H
# 4 8 Matthew Hoffman 2015-01-04 shot toy weapon 32 M W
# 5 9 Michael Rodriguez 2015-01-04 shot nail gun 39 M H
# 6 11 Kenneth Joe Brown 2015-01-04 shot gun 18 M W
# city state signs_of_mental_illness threat_level flee
# 1 Shelton WA TRUE attack Not fleeing
# 2 Aloha OR FALSE attack Not fleeing
# 3 Wichita KS FALSE other Not fleeing
# 4 San Francisco CA TRUE attack Not fleeing
# 5 Evans CO FALSE attack Not fleeing
# 6 Guthrie OK FALSE attack Not fleeing
# body_camera
# 1 FALSE
# 2 FALSE
# 3 FALSE
# 4 FALSE
# 5 FALSE
# 6 FALSE
Each row is a single shooting, and it includes variables such as the victim’s name, the date of the shooting, demographic information about that person, the city and state where the shooting occurred, and some information about the incident. It is clear from these first 6 rows that most variables are categorical, so we can’t use summary() on them. Let’s use summary() on the date and age columns and then use table() for the rest.
summary(shootings$date)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# "2015-01-02" "2016-02-07" "2017-03-16" "2017-03-18" "2018-04-11" "2019-06-25"
summary(shootings$age)
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 6.00 27.00 35.00 36.84 45.00 91.00 182
From this we can see that the data runs from early January 2015 through mid-2019.18 From the age column we can see that the average age is about 37, with most people around that range. Now we can use table() to see how often each value appears in each variable. We don’t want to do this for the city or name columns as there would be too many values, but it will work for the other columns. Let’s start with the “manner_of_death” column.
table(shootings$manner_of_death)
#
# shot shot and Tasered
# 4146 225
To turn these counts into percentages we can divide the results by the number of rows in our data and multiply by 100.
table(shootings$manner_of_death) / nrow(shootings) * 100
#
# shot shot and Tasered
# 94.852437 5.147563
Now it is clear that in about 95% of shootings officers used only a gun, while in about 5% they also used a Taser. As this is data on deaths from officer shootings, this is unsurprising. Let’s take a look at whether the victim was armed.
table(shootings$armed) / nrow(shootings) * 100
#
# air conditioner ax
# 0.02287806 0.48043926
# barstool baseball bat
# 0.02287806 0.27453672
# baseball bat and bottle baseball bat and fireplace poker
# 0.02287806 0.02287806
# baton bayonet
# 0.09151224 0.02287806
# BB gun bean-bag gun
# 0.06863418 0.02287806
# beer bottle blunt object
# 0.06863418 0.11439030
# bow and arrow box cutter
# 0.02287806 0.22878060
# brick carjack
# 0.04575612 0.02287806
# chain chain saw
# 0.04575612 0.04575612
# chainsaw chair
# 0.02287806 0.04575612
# claimed to be armed contractor's level
# 0.02287806 0.02287806
# cordless drill crossbow
# 0.02287806 0.20590254
# crowbar fireworks
# 0.06863418 0.02287806
# flagpole flashlight
# 0.02287806 0.02287806
# garden tool glass shard
# 0.02287806 0.06863418
# gun gun and car
# 55.43353924 0.11439030
# gun and knife gun and sword
# 0.34317090 0.02287806
# gun and vehicle guns and explosives
# 0.04575612 0.06863418
# hammer hand torch
# 0.22878060 0.02287806
# hatchet hatchet and gun
# 0.18302448 0.04575612
# incendiary device knife
# 0.04575612 14.96225120
# lawn mower blade machete
# 0.04575612 0.86936628
# machete and gun meat cleaver
# 0.02287806 0.06863418
# metal hand tool metal object
# 0.02287806 0.09151224
# metal pipe metal pole
# 0.25165866 0.04575612
# metal rake metal stick
# 0.02287806 0.06863418
# motorcycle nail gun
# 0.02287806 0.02287806
# oar pellet gun
# 0.02287806 0.02287806
# pen pepper spray
# 0.02287806 0.02287806
# pick-axe piece of wood
# 0.06863418 0.06863418
# pipe pitchfork
# 0.13726836 0.04575612
# pole pole and knife
# 0.04575612 0.04575612
# rock samurai sword
# 0.09151224 0.02287806
# scissors screwdriver
# 0.06863418 0.18302448
# sharp object shovel
# 0.11439030 0.06863418
# spear stapler
# 0.02287806 0.02287806
# straight edge razor sword
# 0.06863418 0.34317090
# Taser tire iron
# 0.41180508 0.02287806
# toy weapon unarmed
# 3.54609929 6.36010066
# undetermined unknown weapon
# 4.30107527 1.25829330
# vehicle vehicle and gun
# 1.57858614 0.02287806
# walking stick wrench
# 0.02287806 0.02287806
This is fairly hard to interpret as it is sorted alphabetically when we’d prefer it to be sorted by the most common weapons. It also doesn’t round the numbers, so many digits past the decimal point are shown. Let’s solve these two issues using sort() and round(). We could just wrap our initial code inside each of these functions, but to avoid making the code too complicated, we’ll save the result in a vector called “temp” and incrementally use sort() and round() on that. We’ll set the parameter decreasing to TRUE in the sort() function so that the values are in descending order of how common they are. And we’ll round to two decimal places by setting the parameter digits to 2 in round().
temp <- table(shootings$armed) / nrow(shootings) * 100
temp <- sort(temp, decreasing = TRUE)
temp <- round(temp, digits = 2)
temp
#
# gun knife
# 55.43 14.96
# unarmed undetermined
# 6.36 4.30
# toy weapon vehicle
# 3.55 1.58
# unknown weapon machete
# 1.26 0.87
# ax Taser
# 0.48 0.41
# gun and knife sword
# 0.34 0.34
# baseball bat metal pipe
# 0.27 0.25
# box cutter hammer
# 0.23 0.23
# crossbow hatchet
# 0.21 0.18
# screwdriver pipe
# 0.18 0.14
# blunt object gun and car
# 0.11 0.11
# sharp object baton
# 0.11 0.09
# metal object rock
# 0.09 0.09
# BB gun beer bottle
# 0.07 0.07
# crowbar glass shard
# 0.07 0.07
# guns and explosives meat cleaver
# 0.07 0.07
# metal stick pick-axe
# 0.07 0.07
# piece of wood scissors
# 0.07 0.07
# shovel straight edge razor
# 0.07 0.07
# brick chain
# 0.05 0.05
# chain saw chair
# 0.05 0.05
# gun and vehicle hatchet and gun
# 0.05 0.05
# incendiary device lawn mower blade
# 0.05 0.05
# metal pole pitchfork
# 0.05 0.05
# pole pole and knife
# 0.05 0.05
# air conditioner barstool
# 0.02 0.02
# baseball bat and bottle baseball bat and fireplace poker
# 0.02 0.02
# bayonet bean-bag gun
# 0.02 0.02
# bow and arrow carjack
# 0.02 0.02
# chainsaw claimed to be armed
# 0.02 0.02
# contractor's level cordless drill
# 0.02 0.02
# fireworks flagpole
# 0.02 0.02
# flashlight garden tool
# 0.02 0.02
# gun and sword hand torch
# 0.02 0.02
# machete and gun metal hand tool
# 0.02 0.02
# metal rake motorcycle
# 0.02 0.02
# nail gun oar
# 0.02 0.02
# pellet gun pen
# 0.02 0.02
# pepper spray samurai sword
# 0.02 0.02
# spear stapler
# 0.02 0.02
# tire iron vehicle and gun
# 0.02 0.02
# walking stick wrench
# 0.02 0.02
Now it is a little easier to interpret. In over half of the cases the victim was carrying a gun. A knife was involved 15% of the time. And 6% of the time they were unarmed. In about 4% of cases there is no data on any weapon. That leaves roughly 20% of cases where one of the many rarer weapons was used, including some that overlap with one of the more common categories.
Think about how you’d graph this data. There are 85 unique values in this column, though fewer than ten of them are common enough to appear more than 1% of the time. Should we graph all of them? No, that would overwhelm any graph. For a useful graph we would need to combine many of these into a single category - possibly called “other weapons.” And how do we deal with values that could fall into multiple larger categories? There is not always a clear answer for these types of questions. It depends on what data you’re interested in, the goal of the graph, the target audience, and personal preference.
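If you did want to collapse the rare values, here is a minimal sketch of one possible approach. The 1% cutoff, the “Other weapons” label, and the new armed_grouped column are all arbitrary choices made for this example, not part of The Washington Post’s data.
# Find the weapon values that appear in at least 1% of shootings
# (the 1% cutoff is an arbitrary choice for this illustration)
weapon_percents <- table(shootings$armed) / nrow(shootings) * 100
common_weapons  <- names(weapon_percents[weapon_percents >= 1])

# Copy the column and lump every rare, non-missing value into "Other weapons"
shootings$armed_grouped <- shootings$armed
shootings$armed_grouped[!is.na(shootings$armed_grouped) &
                        !shootings$armed_grouped %in% common_weapons] <- "Other weapons"

table(shootings$armed_grouped)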
Let’s keep exploring the data by looking at gender and race.
table(shootings$gender) / nrow(shootings) * 100
#
# F M
# 4.667124 95.218485
Nearly all of the shootings are of men. Given that we saw most shootings involved a person with a weapon, and that most violent crimes are committed by men, this shouldn’t be too surprising.
temp <- table(shootings$race) / nrow(shootings) * 100
temp <- sort(temp)
temp <- round(temp, digits = 2)
temp
#
# O N A H B W
# 0.87 1.46 1.62 16.45 22.90 44.89
White people are the largest racial group killed by police, followed by Black people and Hispanic people. In fact, about twice as many White people are killed as Black people, and about 2.5 times as many White people as Hispanic people. Does this mean that the oft-repeated claim that Black people are killed at disproportionate rates is wrong? No. This data simply shows the number of people killed; it doesn’t give any indication of the rate of death per group. You’d need to merge it with Census data on each group’s population to determine a rate per racial group. And even that would be insufficient, since people are, for example, stopped by police at different rates.
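To make that point concrete, here is a minimal sketch of what a rate calculation could look like, assuming you had population counts for each group. The population numbers below are rough placeholders for illustration only, not actual Census figures, and the population object is something made up for this example.
# Hypothetical population counts (rough placeholders for illustration only -
# look up actual Census estimates before doing this for real)
population <- c(W = 197000000, B = 42000000, H = 60000000)

# Number of people killed in each of those groups
killings <- table(shootings$race)[names(population)]

# Killings per one million people in each group
round(killings / population * 1000000, 2)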
This data provides a lot of information on people killed by the police, but even so it is insufficient to answer many of the questions on that topic. It’s important to understand the data not only to be able to answer questions about it, but to know what questions you can’t answer - and you’ll find when using criminology data that there are a lot of questions that you can’t answer.19
One annoying thing with the gender and race variables is that they don’t spell out the name. Instead of “Female”, for example, it has “F”. For our graphs we want to spell out the words so it is clear to viewers. We’ll fix this issue, and the issue of having many weapon categories, as we graph each variable.
15.2 Graphing a single numeric variable
We’ve spent some time looking at the data so now we’re ready to make the graphs. We need to load the ggplot2 package if we haven’t done so already this session (i.e. since you last closed RStudio).
library(ggplot2)
# Warning: package 'ggplot2' was built under R version 4.2.2
As a reminder, the benefit of using ggplot() is that we can start with a simple plot and build our way up to more complicated graphs. We’ll start here by building some graphs to depict a numeric variable - in this case the “age” column. We start every ggplot() the same way, by inserting the data set first and then putting our x and y variables inside of the aes() parameter. In this case we’re only going to be plotting an x variable so we don’t need to write anything for y.
ggplot(shootings, aes(x = age))
Running the above code returns a blank graph since we haven’t told ggplot() what type of graph we want yet. Below are a few different ways to display a single numeric variable. They’re essentially all variations of each other and show the data at different levels of precision. It’s hard to say which is best - you’ll need to use your best judgment and consider your audience.
15.2.1 Histogram
The histogram is a very common type of graph for a single numeric variable. Histograms group a numeric variable into categories and then plot them, with the height of each bar indicating how common the group is. We can make a histogram by adding geom_histogram() to the ggplot(). The results will print out the text “Warning: Removed 182 rows containing non-finite values (stat_bin),” which just means that 182 rows in our data have NA for age and are excluded from the graph.
ggplot(shootings, aes(x = age)) +
geom_histogram()
# `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Warning: Removed 182 rows containing non-finite values (`stat_bin()`).
The x-axis is ages, with each bar being a group of certain ages, and the y-axis is how many people are in each group. The grouping is done automatically, and we can alter it by changing the bins parameter in geom_histogram(). By default this parameter is set to 30, but we can make each group smaller (have fewer ages per group) by increasing it from 30, or make each group larger by decreasing it.
ggplot(shootings, aes(x = age)) +
geom_histogram(bins = 15)
# Warning: Removed 182 rows containing non-finite values (`stat_bin()`).
ggplot(shootings, aes(x = age)) +
geom_histogram(bins = 45)
# Warning: Removed 182 rows containing non-finite values (`stat_bin()`).
Note that while the overall trend (of most deaths being around age 25) doesn’t change when we alter bins, the data gets more or less precise. Having fewer bins means fewer, but larger, bars, which can obscure trends that more, smaller bars would show. But having too many bars may make you focus on minor variations that could occur randomly and take attention away from the overall trend. I prefer to err on the side of more precise graphs (more, smaller bars), but be careful not to over-interpret data from small groups.
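If you would rather control the width of each group than the number of groups, geom_histogram() also accepts a binwidth parameter (the message printed above suggests this). As a small sketch, a binwidth of 1 makes each bar cover a single year of age:
# One bar per year of age rather than a fixed number of bins
ggplot(shootings, aes(x = age)) +
  geom_histogram(binwidth = 1)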
These graphs show the y-axis as the number of people in each bar. If we want to show proportions instead, we can add in a parameter for y in the aes() of geom_histogram(). We add in y = (..count..)/sum(..count..), which automatically converts the counts to proportions. The (..count..)/sum(..count..) part just takes each group and divides it by the sum of all groups. You could, of course, do this yourself before making the graph, but it’s an easy helper. If you do this, make sure to relabel the y-axis so you don’t accidentally call the proportions a count.
ggplot(shootings, aes(x = age)) +
geom_histogram(aes(y = (..count..)/sum(..count..)))
# Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
# ℹ Please use `after_stat(count)` instead.
# `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Warning: Removed 182 rows containing non-finite values (`stat_bin()`).
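Note the warning: the ..count.. notation has been deprecated in recent versions of ggplot2. If you are on ggplot2 3.4.0 or later, the same graph can be written with after_stat(), which is the syntax the warning recommends:
# Same proportion histogram using the newer after_stat() syntax
ggplot(shootings, aes(x = age)) +
  geom_histogram(aes(y = after_stat(count) / sum(after_stat(count))))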
15.2.2 Density plot
Density plots are essentially smoothed versions of histograms. They’re especially useful for numeric variables that are not integers (integers are whole numbers). They’re also useful when you want to be more precise than a histogram, as they are - to simplify - histograms where each bar is very narrow. Note that the y-axis of a density plot is automatically labeled “density” and has very small numbers. The y-axis is fairly hard to explain to someone not familiar with statistics, so I’d caution against using this graph unless your audience is already familiar with it.
To interpret these kinds of graphs, I recommend looking for trends rather than trying to identify specific points. For example, in the below graph we can see that shootings rise rapidly starting around age 10, peak at around age 30 (if we were presenting this graph to other people we’d probably want more ages shown on the x-axis), and then steadily decline until about age 80 where it’s nearly flat.
ggplot(shootings, aes(x = age)) +
geom_density()
# Warning: Removed 182 rows containing non-finite values (`stat_density()`).
15.2.3 Count graph
A count graph is essentially a histogram with a bar for every value in the numeric variable - like a less-smooth density plot. Note that this won’t work well if you have too many unique values, so I’d strongly recommend rounding the data to the nearest whole number first if the variable isn’t already an integer. Our age variable is already rounded so we don’t need to do that. To make a count graph, we add stat_count() to the ggplot().
ggplot(shootings, aes(x = age)) +
stat_count()
# Warning: Removed 182 rows containing non-finite values (`stat_count()`).
Now we have a single bar for every age in the data. Like the histogram, the y-axis shows the number of people that are that age. And like the histogram, we can change this from number of people to proportion of people using the exact same code.
ggplot(shootings, aes(x = age)) +
stat_count(aes(y = (..count..)/sum(..count..)))
# Warning: Removed 182 rows containing non-finite values (`stat_count()`).
15.3 Graphing a categorical variable
15.3.1 Bar graph
To make this barplot we’ll set the x-axis variable to our “race” column and add geom_bar() to the end.
ggplot(shootings, aes(x = race)) +
geom_bar()
This gives us a barplot in alphabetical order. In most cases we want the data sorted by frequency, so we can easily see which value is the most common, second most common, and so on. There are a few ways to do this, but here we’ll do it by turning the “race” variable into a factor and ordering it by frequency. We can do that using the factor() function. The first input will be the “race” variable, and then we will need to set the levels parameter to a vector of values sorted by frequency. An easy way to see how often each value appears in a column is to use the table() function on that column, as below.
table(shootings$race)
#
# A B H N O W
# 71 1001 719 64 38 1962
It’s still alphabetical, so let’s wrap that in a sort() function.
sort(table(shootings$race))
#
# O N A H B W
# 38 64 71 719 1001 1962
It’s sorted from smallest to largest. We usually want to graph from largest to smallest, so let’s set the parameter decreasing in sort() to TRUE.
sort(table(shootings$race), decreasing = TRUE)
#
# W B H A N O
# 1962 1001 719 71 64 38
Now, we only need the names of each value, not how often they occur. So we can again wrap this whole thing in names() to get just the names.
names(sort(table(shootings$race), decreasing = TRUE))
# [1] "W" "B" "H" "A" "N" "O"
If we tie it all together, we can make the “race” column into a factor variable.
shootings$race <- factor(shootings$race,
                         levels = names(sort(table(shootings$race),
                                             decreasing = TRUE)))
Now let’s try that barplot again.
ggplot(shootings, aes(x = race)) +
geom_bar()
It works! Note that all the values that are missing in our data are still reported in the barplot, under a column called “NA.” This column is not sorted with the rest - there are more NA values than three of the other values, yet NA is still at the far right of the graph. We could change this by making all the NA values an actual character type and calling them something like “Unknown.” But leaving it as-is does draw attention to how many values are missing from this column. Like most things in graphing, this is a personal choice.
For bar graphs it is often useful to flip the graph so each value is a row in the graph rather than a column. This also makes it much easier to read the value name. If the value names are long, it’ll shrink the graph to accommodate the name. This is usually a sign that you should try to shorten the name to avoid reducing the size of the graph.
ggplot(shootings, aes(x = race)) +
geom_bar() +
coord_flip()
Since it’s flipped, it’s now sorted from smallest to largest. So we’ll need to change the factor() code to fix that by setting the decreasing parameter in sort() to FALSE.
shootings$race <- factor(shootings$race,
                         levels = names(sort(table(shootings$race),
                                             decreasing = FALSE)))
ggplot(shootings, aes(x = race)) +
geom_bar() +
coord_flip()
The NA value is now at the top, which looks fairly bad. Let’s change all NA values to the string “Unknown”. And while we’re at it, let’s change all the abbreviated race values to actual names. We can find all the NA values by using is.na(shootings$race) in a conditional statement to get all rows that meet that condition, and then assign them the value “Unknown”. Instead of trying to subset a factor variable to change the values, we should convert it back to a character type first using as.character(), and then convert it to a factor again once we’re done.
shootings$race <- as.character(shootings$race)
shootings$race[is.na(shootings$race)] <- "Unknown"
Now we can use conditional statements to change all the race letters to names. It’s not clear what races “O” and “N” are, so I checked The Washington Post’s GitHub page, which explains what they mean. Instead of is.na() we’ll use shootings$race == "", where we put the letter inside of the quotes.
shootings$race[shootings$race == "O"] <- "Other"
shootings$race[shootings$race == "N"] <- "Native American"
shootings$race[shootings$race == "A"] <- "Asian"
shootings$race[shootings$race == "H"] <- "Hispanic"
shootings$race[shootings$race == "B"] <- "Black"
shootings$race[shootings$race == "W"] <- "White"
Now let’s see how our graph looks. We’ll need to rerun the factor() code since all of the values have changed.
shootings$race <- factor(shootings$race,
                         levels = names(sort(table(shootings$race),
                                             decreasing = FALSE)))
ggplot(shootings, aes(x = race)) +
geom_bar() +
coord_flip()
As earlier, we can show proportion instead of count by adding y = (..count..)/sum(..count..) to the aes() in geom_bar().
ggplot(shootings, aes(x = race)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
coord_flip()
15.4 Graphing data over time
We went over time-series graphs in Chapter 14, but it’s such an important topic that we’ll cover it again. A lot of criminology research involves seeing whether a policy had an effect, which means we generally want to compare an outcome over time (and compare the treated group to a similar untreated group). To graph that, we look at an outcome - in this case the number of killings - over time. In our case we aren’t evaluating any policy, just seeing whether the number of police killings changes over time.
We’ll need to make a variable to indicate that each row is one shooting. We can call this variable “dummy” and assign it a value of 1. Then we can make the ggplot() and set this “dummy” column as the y-axis variable and our date variable “date” as the x-axis variable (the time variable is always on the x-axis). Then we’ll set the type of plot to geom_line() so we have a line graph showing killings over time.
shootings$dummy <- 1
ggplot(shootings, aes(x = date,
                      y = dummy)) +
  geom_line()
This graph is clearly wrong. Why? Well, our y-axis variable is always 1, so there’s no variation to plot. Every single value, even when there is more than one shooting per day, sits on the 1 line on the y-axis. And the fact that we have multiple killings per day is an issue because we only want a single line in our graph. We’ll need to aggregate our data to some time period (e.g. day, month, year) so that we have one row per time period and know how many people were killed in that period. We’ll start with monthly data and then move to yearly data. Since we’re going to be dealing with dates, let’s use the lubridate package, which is well-suited for this task.
install.packages("lubridate")
library(lubridate)
# Warning: package 'lubridate' was built under R version 4.2.2
# Loading required package: timechange
# Warning: package 'timechange' was built under R version 4.2.2
#
# Attaching package: 'lubridate'
# The following objects are masked from 'package:base':
#
# date, intersect, setdiff, union
We’ll use two functions to create variables that tell us the month and the year of each date in our data. We’ll use these new variables to aggregate our data to those time units. First, the floor_date() function is a very useful tool that essentially rounds a date down. Here we have the exact date each killing happened on, and we want to determine what month that date is from. So we’ll use the parameter unit in floor_date() and tell the function we want the “month” (for a full set of options please see the documentation for floor_date() by entering ?floor_date in the console). So we can do floor_date(shootings$date, unit = "month") to get the month - specifically, it returns the date of the first of the month - in which the killing happened. Even simpler, to get the year we use year() and put our “date” variable in the parentheses. We’ll call the new variables “month_year” and “year”, respectively.
shootings$month_year <- floor_date(shootings$date, unit = "month")
shootings$year <- year(shootings$date)
head(shootings$month_year)
# [1] "2015-01-01" "2015-01-01" "2015-01-01" "2015-01-01" "2015-01-01"
# [6] "2015-01-01"
head(shootings$year)
# [1] 2015 2015 2015 2015 2015 2015
Since the data is already sorted by date, all the values printed from head() are the same. But you can look at the data using View() to confirm that the code worked properly.
We can now aggregate the data by the “month_year” variable and assign the result to a new data set we’ll call monthly_shootings. We’ll use the group_by() and summarize() functions from dplyr that were introduced in Chapter 11 to do this. And we’ll use the pipe method of writing dplyr code that was discussed in Section 11.4. Since the “dummy” column has a value of 1 for each shooting, we’ll sum up this column to get the number of shootings each month/year.
library(dplyr)
#
# Attaching package: 'dplyr'
# The following objects are masked from 'package:stats':
#
# filter, lag
# The following objects are masked from 'package:base':
#
# intersect, setdiff, setequal, union
monthly_shootings <- shootings %>%
  group_by(month_year) %>%
  summarize(dummy = sum(dummy))
head(monthly_shootings)
# # A tibble: 6 × 2
# month_year dummy
# <date> <dbl>
# 1 2015-01-01 76
# 2 2015-02-01 77
# 3 2015-03-01 92
# 4 2015-04-01 84
# 5 2015-05-01 71
# 6 2015-06-01 65
Since we now have a variable that shows the number of people killed each month, we can graph this new data set. We’ll use the same process as earlier, but our data set is now monthly_shootings instead of shootings, and the x-axis variable is “month_year” instead of “date”.
ggplot(monthly_shootings, aes(x = month_year,
y = dummy)) +
geom_line()
The process is the same for yearly data.
yearly_shootings <- shootings %>%
  group_by(year) %>%
  summarize(dummy = sum(dummy))
ggplot(yearly_shootings, aes(x = year,
y = dummy)) +
geom_line()
Note the steep drop-off at the end of each graph. Is that due to fewer shootings occurring recently? No, it’s simply an artifact of the graph comparing whole months (or years) to part of a month (or year), since the most recent month (or year) in the data isn’t finished yet.
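If you want to avoid giving that misleading impression, one option - a sketch, assuming you are fine simply dropping the incomplete final year rather than labeling it - is to filter it out before graphing:
# Drop the incomplete final year so the last point isn't a partial year
yearly_shootings %>%
  filter(year < max(year)) %>%
  ggplot(aes(x = year,
             y = dummy)) +
  geom_line()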
15.5 Pretty graphs
What’s next for these graphs? You’ll likely want to add labels for the axes and the title. We went over how to do this in Section 14.3, so please refer to that for more info. Also, check out ggplot2’s website to see more of this very versatile package. As I’ve said all chapter, a lot of this comes down to personal taste, so please spend some time exploring the package and changing the appearance of your graphs to learn what looks right to you.
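As a quick refresher (the label text here is just an example), axis labels and a title can be added with labs():
# Axis labels and a title, as covered in Section 14.3
ggplot(yearly_shootings, aes(x = year,
                             y = dummy)) +
  geom_line() +
  labs(x = "Year",
       y = "Number of people killed",
       title = "Fatal police shootings in the United States, 2015-2019")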
15.5.1 Themes
In addition to making changes to the graph’s appearance yourself, you can use a theme that someone else made. A theme is just a collection of changes to a graph’s appearance that someone has put in a function for others to use. Each theme is different and fairly opinionated, so you should only use one that you think looks best for your graph. To use a theme, simply add the function for that theme to your ggplot using + as normal. ggplot2 comes with a series of themes that you can look at here.
Here, we’ll be looking at themes from the ggthemes package, which is a great source of different themes to modify the appearance of your graph. Check out this website to see a depiction of all of the possible themes. If you don’t have the ggthemes package installed, do so using install.packages("ggthemes").
install.packages("ggthemes")
Let’s do a few examples using the graph made above. First, we’ll need to load the ggthemes library.
library(ggthemes)
ggplot(yearly_shootings, aes(x = year,
y = dummy)) +
geom_line() +
theme_fivethirtyeight()
ggplot(yearly_shootings, aes(x = year,
y = dummy)) +
geom_line() +
theme_tufte()
ggplot(yearly_shootings, aes(x = year,
y = dummy)) +
geom_line() +
theme_few()
ggplot(yearly_shootings, aes(x = year,
y = dummy)) +
geom_line() +
theme_excel()
The Washington Post is continuing to collect this data so if you look on their site you’ll find more up-to-date data.↩︎
It is especially important not to overreach when trying to answer a question the data can’t answer well. Often, no answer is better than a wrong one - especially in a field with consequences as serious as criminology’s. This isn’t to say that you should never try to answer questions - no data is perfect, and you may be wrong. But you should develop a deep understanding of the data and be sure that you can actually answer those questions with confidence.↩︎