7 More graphing with ggplot2

In this lesson we will continue to explore graphing using ggplot(). The data we will use is a database of officer-involved shootings that result in a death in the United States since January 1st, 2015. This data has been compiled and released by the Washington Post so it will be a useful exercise in exploring data from non-government sources. This data is useful for our purposes as it has a number of variables related to the person who was shot, allowing us to practice making many types of graphs.

To explore the data on their website, see here. To examine their methodology, see here.

The data initially comes as a .csv file so we’ll use the read_csv() function from the readr package. Since it’s available on GitHub, we can download it by directing read_csv() to read the file at its URL on GitHub.

library(readr)
shootings <- read_csv("https://raw.githubusercontent.com/washingtonpost/data-police-shootings/master/fatal-police-shootings-data.csv")
#> Parsed with column specification:
#> cols(
#>   id = col_double(),
#>   name = col_character(),
#>   date = col_date(format = ""),
#>   manner_of_death = col_character(),
#>   armed = col_character(),
#>   age = col_double(),
#>   gender = col_character(),
#>   race = col_character(),
#>   city = col_character(),
#>   state = col_character(),
#>   signs_of_mental_illness = col_logical(),
#>   threat_level = col_character(),
#>   flee = col_character(),
#>   body_camera = col_logical()
#> )

Since read_csv() reads files into a tibble object, we’ll turn it into a data.frame so head() shows every single column.

shootings <- as.data.frame(shootings)

7.1 Exploring Data

Now that we have the data read in, let’s look at it.

nrow(shootings)
#> [1] 5573
ncol(shootings)
#> [1] 14

The data has 14 variables and covers 5573 shootings. Let’s check out some of the variables, first using head() then using summary() and table().

head(shootings)
#>   id               name       date  manner_of_death      armed age gender race
#> 1  3         Tim Elliot 2015-01-02             shot        gun  53      M    A
#> 2  4   Lewis Lee Lembke 2015-01-02             shot        gun  47      M    W
#> 3  5 John Paul Quintero 2015-01-03 shot and Tasered    unarmed  23      M    H
#> 4  8    Matthew Hoffman 2015-01-04             shot toy weapon  32      M    W
#> 5  9  Michael Rodriguez 2015-01-04             shot   nail gun  39      M    H
#> 6 11  Kenneth Joe Brown 2015-01-04             shot        gun  18      M    W
#>            city state signs_of_mental_illness threat_level        flee
#> 1       Shelton    WA                    TRUE       attack Not fleeing
#> 2         Aloha    OR                   FALSE       attack Not fleeing
#> 3       Wichita    KS                   FALSE        other Not fleeing
#> 4 San Francisco    CA                    TRUE       attack Not fleeing
#> 5         Evans    CO                   FALSE       attack Not fleeing
#> 6       Guthrie    OK                   FALSE       attack Not fleeing
#>   body_camera
#> 1       FALSE
#> 2       FALSE
#> 3       FALSE
#> 4       FALSE
#> 5       FALSE
#> 6       FALSE

Each row is a single shooting and it includes variables such as the victim’s name, the date of the shooting, demographic information about that person, the city and state where the shooting occurred, and some information about the incident. It is clear from these first 6 rows that most variables are categorical so we can’t use summary() on them. Let’s use summary() on the date and age columns and then use table() for the rest.

summary(shootings$date)
#>         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
#> "2015-01-02" "2016-05-29" "2017-11-03" "2017-10-27" "2019-03-30" "2020-08-18"
summary(shootings$age)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>    6.00   27.00   35.00   37.12   46.00   91.00     248

From this we can see that the data runs from January 2, 2015 through August 18, 2020, which was about a week before this chapter was written. From the age column we can see that the average age is about 37, and that half of the people killed were between 27 and 46. Now we can use table() to see how often each value appears in each variable. We don’t want to do this for city or name as there would be too many values, but it will work for the other columns. Let’s start with the “manner_of_death” column.

table(shootings$manner_of_death)
#> 
#>             shot shot and Tasered 
#>             5296              277

To turn these counts into percentages we can divide the results by the number of rows in our data and multiply by 100.

table(shootings$manner_of_death) / nrow(shootings) * 100
#> 
#>             shot shot and Tasered 
#>        95.029607         4.970393
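As an aside, base R's prop.table() does a similar calculation, with one difference worth knowing about. The sketch below uses a small made-up vector (not the real data) to show that dividing by the number of rows keeps missing values in the denominator, while prop.table() divides only by the values that table() counted.

```r
# Toy vector standing in for a column such as shootings$manner_of_death.
# The values are invented for illustration, including one missing value.
manner <- c("shot", "shot", "shot", "shot and Tasered", NA)

# Dividing by the number of rows keeps the missing value in the denominator
pct_of_rows <- table(manner) / length(manner) * 100

# prop.table() divides by the sum of the table, which leaves out NA values
pct_of_known <- prop.table(table(manner)) * 100

pct_of_rows
pct_of_known
```

When a column has no missing values, as appears to be the case for "manner_of_death", the two approaches give the same answer.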

Now it is clear to see that in about 95% of shootings officers used only a gun, and in about 5% of shootings they also used a Taser. As this is data on officer shooting deaths, this is unsurprising. Let’s take a look at whether the victim was armed.

table(shootings$armed) / nrow(shootings) * 100
#> 
#>                  air conditioner                       air pistol 
#>                       0.01794366                       0.01794366 
#>                   Airsoft pistol                               ax 
#>                       0.01794366                       0.43064777 
#>                         barstool                     baseball bat 
#>                       0.01794366                       0.32298582 
#>          baseball bat and bottle baseball bat and fireplace poker 
#>                       0.01794366                       0.01794366 
#>           baseball bat and knife                            baton 
#>                       0.01794366                       0.08971828 
#>                          bayonet                           BB gun 
#>                       0.01794366                       0.08971828 
#>               BB gun and vehicle                     bean-bag gun 
#>                       0.01794366                       0.01794366 
#>                      beer bottle                     blunt object 
#>                       0.05383097                       0.08971828 
#>                           bottle                    bow and arrow 
#>                       0.01794366                       0.01794366 
#>                       box cutter                            brick 
#>                       0.21532388                       0.03588731 
#>              car, knife and mace                          carjack 
#>                       0.01794366                       0.01794366 
#>                            chain                        chain saw 
#>                       0.05383097                       0.03588731 
#>                         chainsaw                            chair 
#>                       0.01794366                       0.07177463 
#>              claimed to be armed               contractor's level 
#>                       0.01794366                       0.01794366 
#>                   cordless drill                         crossbow 
#>                       0.01794366                       0.16149291 
#>                          crowbar                        fireworks 
#>                       0.07177463                       0.01794366 
#>                         flagpole                       flashlight 
#>                       0.01794366                       0.03588731 
#>                      garden tool                      glass shard 
#>                       0.03588731                       0.05383097 
#>                          grenade                              gun 
#>                       0.01794366                      56.64812489 
#>                      gun and car                    gun and knife 
#>                       0.19738023                       0.32298582 
#>                    gun and sword                  gun and vehicle 
#>                       0.01794366                       0.19738023 
#>              guns and explosives                           hammer 
#>                       0.05383097                       0.28709851 
#>                       hand torch                          hatchet 
#>                       0.01794366                       0.19738023 
#>                  hatchet and gun                         ice pick 
#>                       0.03588731                       0.01794366 
#>                incendiary device                            knife 
#>                       0.03588731                      14.76762964 
#>                 lawn mower blade                          machete 
#>                       0.03588731                       0.84335188 
#>                  machete and gun                     meat cleaver 
#>                       0.01794366                       0.08971828 
#>                  metal hand tool                     metal object 
#>                       0.01794366                       0.07177463 
#>                       metal pipe                       metal pole 
#>                       0.25121120                       0.05383097 
#>                       metal rake                      metal stick 
#>                       0.01794366                       0.05383097 
#>                       motorcycle                         nail gun 
#>                       0.01794366                       0.01794366 
#>                              oar                       pellet gun 
#>                       0.01794366                       0.05383097 
#>                              pen                     pepper spray 
#>                       0.01794366                       0.01794366 
#>                         pick-axe                    piece of wood 
#>                       0.07177463                       0.08971828 
#>                             pipe                        pitchfork 
#>                       0.10766194                       0.03588731 
#>                             pole                   pole and knife 
#>                       0.03588731                       0.03588731 
#>                             rock                    samurai sword 
#>                       0.10766194                       0.05383097 
#>                         scissors                      screwdriver 
#>                       0.12560560                       0.23326754 
#>                     sharp object                           shovel 
#>                       0.23326754                       0.10766194 
#>                            spear                          stapler 
#>                       0.01794366                       0.01794366 
#>              straight edge razor                            sword 
#>                       0.07177463                       0.41270411 
#>                            Taser                        tire iron 
#>                       0.46653508                       0.01794366 
#>                       toy weapon                          unarmed 
#>                       3.46312579                       6.38794186 
#>                     undetermined                   unknown weapon 
#>                       3.03247802                       1.41754890 
#>                          vehicle                  vehicle and gun 
#>                       2.85304145                       0.07177463 
#>              vehicle and machete                    walking stick 
#>                       0.01794366                       0.01794366 
#>                       wasp spray                           wrench 
#>                       0.01794366                       0.01794366

This is fairly hard to interpret as it is sorted alphabetically when we’d prefer it to be sorted by most common weapon. It also doesn’t round the numbers so many digits past the decimal point are shown. Let’s solve these two issues using sort() and round(). We could just wrap our initial code inside each of these functions but, to avoid overly complicated code, we’ll save the results in a temp object and incrementally use sort() and round() on that. We’ll set the parameter decreasing to TRUE in the sort() function so that it is in descending order of how common each value is. And we’ll round to two decimal places by setting the parameter digits to 2.

temp <- table(shootings$armed) / nrow(shootings) * 100
temp <- sort(temp, decreasing = TRUE)
temp <- round(temp, digits = 2)
temp
#> 
#>                              gun                            knife 
#>                            56.65                            14.77 
#>                          unarmed                       toy weapon 
#>                             6.39                             3.46 
#>                     undetermined                          vehicle 
#>                             3.03                             2.85 
#>                   unknown weapon                          machete 
#>                             1.42                             0.84 
#>                            Taser                               ax 
#>                             0.47                             0.43 
#>                            sword                     baseball bat 
#>                             0.41                             0.32 
#>                    gun and knife                           hammer 
#>                             0.32                             0.29 
#>                       metal pipe                      screwdriver 
#>                             0.25                             0.23 
#>                     sharp object                       box cutter 
#>                             0.23                             0.22 
#>                      gun and car                  gun and vehicle 
#>                             0.20                             0.20 
#>                          hatchet                         crossbow 
#>                             0.20                             0.16 
#>                         scissors                             pipe 
#>                             0.13                             0.11 
#>                             rock                           shovel 
#>                             0.11                             0.11 
#>                            baton                           BB gun 
#>                             0.09                             0.09 
#>                     blunt object                     meat cleaver 
#>                             0.09                             0.09 
#>                    piece of wood                            chair 
#>                             0.09                             0.07 
#>                          crowbar                     metal object 
#>                             0.07                             0.07 
#>                         pick-axe              straight edge razor 
#>                             0.07                             0.07 
#>                  vehicle and gun                      beer bottle 
#>                             0.07                             0.05 
#>                            chain                      glass shard 
#>                             0.05                             0.05 
#>              guns and explosives                       metal pole 
#>                             0.05                             0.05 
#>                      metal stick                       pellet gun 
#>                             0.05                             0.05 
#>                    samurai sword                            brick 
#>                             0.05                             0.04 
#>                        chain saw                       flashlight 
#>                             0.04                             0.04 
#>                      garden tool                  hatchet and gun 
#>                             0.04                             0.04 
#>                incendiary device                 lawn mower blade 
#>                             0.04                             0.04 
#>                        pitchfork                             pole 
#>                             0.04                             0.04 
#>                   pole and knife                  air conditioner 
#>                             0.04                             0.02 
#>                       air pistol                   Airsoft pistol 
#>                             0.02                             0.02 
#>                         barstool          baseball bat and bottle 
#>                             0.02                             0.02 
#> baseball bat and fireplace poker           baseball bat and knife 
#>                             0.02                             0.02 
#>                          bayonet               BB gun and vehicle 
#>                             0.02                             0.02 
#>                     bean-bag gun                           bottle 
#>                             0.02                             0.02 
#>                    bow and arrow              car, knife and mace 
#>                             0.02                             0.02 
#>                          carjack                         chainsaw 
#>                             0.02                             0.02 
#>              claimed to be armed               contractor's level 
#>                             0.02                             0.02 
#>                   cordless drill                        fireworks 
#>                             0.02                             0.02 
#>                         flagpole                          grenade 
#>                             0.02                             0.02 
#>                    gun and sword                       hand torch 
#>                             0.02                             0.02 
#>                         ice pick                  machete and gun 
#>                             0.02                             0.02 
#>                  metal hand tool                       metal rake 
#>                             0.02                             0.02 
#>                       motorcycle                         nail gun 
#>                             0.02                             0.02 
#>                              oar                              pen 
#>                             0.02                             0.02 
#>                     pepper spray                            spear 
#>                             0.02                             0.02 
#>                          stapler                        tire iron 
#>                             0.02                             0.02 
#>              vehicle and machete                    walking stick 
#>                             0.02                             0.02 
#>                       wasp spray                           wrench 
#>                             0.02                             0.02

Now it is a little easier to interpret. In over half of the cases the victim was carrying a gun. About 15% of the time they had a knife, and about 6% of the time they were unarmed. In about 4% of cases there is no data on any weapon. That leaves a little under 20% of cases where one of the many rarer weapons was used, including some that overlap with one of the more common categories.

Think about how you’d graph this data. There are 95 unique values in this column though fewer than ten of them are common enough to appear more than 1% of the time. Should we graph all of them? No, that would overwhelm any graph. For a useful graph we would need to combine many of these into a single category - possibly called “other weapons.” And how do we deal with values that could fit into more than one of the larger categories? There is not always a clear answer for these types of questions. It depends on what data you’re interested in, the goal of the graph, the target audience, and personal preference.
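One possible way to combine rare values is sketched below. It uses a small made-up vector rather than the real "armed" column, and both the 5% cutoff and the "other weapons" label are arbitrary choices made for illustration, not a recommendation.

```r
# Toy vector standing in for shootings$armed (values invented for illustration)
armed <- c(rep("gun", 12), rep("knife", 5), rep("unarmed", 2), "oar", "pen", NA)

# Share of all rows (missing ones included in the denominator) for each value
share <- table(armed) / length(armed)

# Hypothetical cutoff: anything below 5% of rows is folded into one category
rare_values <- names(share)[share < 0.05]
armed_grouped <- armed
armed_grouped[armed_grouped %in% rare_values] <- "other weapons"

# Missing values are left as NA so they can be handled separately
sort(table(armed_grouped, useNA = "ifany"), decreasing = TRUE)
```

A cutoff like this does not settle the harder question of values such as "gun and knife" that belong to more than one group; those would still need a decision made by hand.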

Let’s keep exploring the data by looking at gender and race.

table(shootings$gender) / nrow(shootings) * 100
#> 
#>        F        M 
#>  4.41414 95.56792

Nearly all of the shootings are of a man. Given that we saw most shootings involved a person with a weapon and that most violent crimes are committed by men, this shouldn’t be too surprising.

temp <- table(shootings$race) / nrow(shootings) * 100
temp <- sort(temp)
temp <- round(temp, digits = 2)
temp
#> 
#>     O     N     A     H     B     W 
#>  0.86  1.40  1.69 16.45 23.60 45.27

White people are the largest race group that is killed by police, followed by Black people and Hispanic people. In fact, there are about twice as many White people killed as Black people killed, and about 2.75 times as many White people killed as Hispanic people killed. Does this mean that the oft-repeated claim that Black people are killed at disproportionate rates is wrong? No. This data simply shows the number of people killed; it doesn’t give any indication on rates of death per group. You’d need to merge it with Census data to get population to determine a rate per race group. And even that would be insufficient since people are, for example, stopped by police at different rates. This data provides a lot of information on people killed by the police, but even so it is insufficient to answer many of the questions on that topic. It’s important to understand the data not only to be able to answer questions about it, but to know what questions you can’t answer - and you’ll find when using criminology data that there are a lot of questions that you can’t answer.
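To see why counts alone can mislead, here is a tiny arithmetic sketch. The groups, death counts, and populations are entirely made up for illustration and have nothing to do with the real data or with Census figures.

```r
# Entirely made-up counts and populations for two hypothetical groups
deaths     <- c(group_1 = 200, group_2 = 100)
population <- c(group_1 = 1000000, group_2 = 250000)

# Deaths per 100,000 people in each group
rate_per_100k <- deaths / population * 100000
rate_per_100k
```

In this invented example group_1 has twice as many deaths, but group_2 has twice the rate because its population is four times smaller.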

One annoying thing with the gender and race variables is that they don’t spell out the name. Instead of “Female”, for example, it has “F”. For our graphs we want to spell out the words so it is clear to viewers. We’ll fix this issue, and the issue of having many weapon categories, as we graph each variable.

7.2 Graphing a Single Numeric Variable

We’ve spent some time looking at the data so now we’re ready to make the graphs. We need to load the ggplot2 package if we haven’t done so already this session (i.e. since you last closed RStudio).

library(ggplot2)

As a reminder, the benefit of using ggplot() is we can start with a simple plot and build our way up to more complicated graphs. We’ll start here by building some graphs to depict a numeric variable - in this case the “age” column. We start every ggplot() the same way, by inserting the dataset first and then putting our x and y variables inside of aes(). In this case we’re only going to be plotting an x variable so we don’t need to write anything for y.

ggplot(shootings, aes(x = age))

Running the above code returns a blank graph since we haven’t told ggplot() what type of graph we want yet. Below are a few different types of ways to display a single numeric variable. They’re essentially all variations of each other and show the data at different levels of precision. It’s hard to say which is best - you’ll need to use your best judgment and consider your audience.

7.2.1 Histogram

The histogram is a very common type of graph for a single numeric variable. Histograms group a numeric variable into categories and then plot them, with the height of each bar indicating how common the group is. We can make a histogram by adding geom_histogram() to the ggplot().

ggplot(shootings, aes(x = age)) + 
  geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 248 rows containing non-finite values (stat_bin).

The x-axis is ages with each bar being a group of certain ages, and the y-axis is how many people are in each group. The grouping is done automatically and we can alter it by changing the bins parameter in geom_histogram(). By default this parameter is set to 30 but we can make each group smaller (have fewer ages per group) by increasing it above 30 or make each group larger by decreasing it.

ggplot(shootings, aes(x = age)) + 
  geom_histogram(bins = 15)
#> Warning: Removed 248 rows containing non-finite values (stat_bin).

ggplot(shootings, aes(x = age)) + 
  geom_histogram(bins = 45)
#> Warning: Removed 248 rows containing non-finite values (stat_bin).
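The message printed earlier suggests choosing a binwidth, which is another parameter of geom_histogram(). Instead of saying how many bars to draw, it says how wide each bar should be. The sketch below uses a small made-up data.frame (the name toy and its ages are invented for illustration), and the width of five years is an arbitrary choice.

```r
library(ggplot2)

# Toy ages invented for illustration; the real column is shootings$age
toy <- data.frame(age = c(18, 21, 22, 25, 27, 29, 31, 34, 36, 41, 47, 53, 60))

# binwidth = 5 makes every bar span five years of age
p <- ggplot(toy, aes(x = age)) +
  geom_histogram(binwidth = 5)
p
```

Setting the width directly can make the bars easier to describe to an audience, since each one covers a fixed span of ages.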

Note that while the overall trend (of most deaths being around age 25) doesn’t change when we alter bins, the data gets more or less precise. Having fewer bins means fewer, but larger, bars which can obscure trends that more, smaller, bars would show. But having too many bars may make you focus on minor variations that could occur randomly and take away attention from the overall trend. I prefer to err on the side of more precise graphs (more, smaller bars) but be careful about over-interpreting data from small groups.

These graphs show the y-axis as the number of people in each bar. If we want to show the share of people instead, we can add in a parameter for y in the aes() of the geom_histogram(). We add in y = (..count..)/sum(..count..), which automatically converts the counts to proportions (multiply by 100 if you want percentages). The “(..count..)/sum(..count..)” code is just taking the count in each group and dividing it by the sum of all groups. You could, of course, do this yourself before making the graph, but it’s an easy helper. If you do this, make sure to relabel the y-axis so you don’t accidentally call the proportion a count!

ggplot(shootings, aes(x = age)) + 
  geom_histogram(aes(y = (..count..)/sum(..count..)))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 248 rows containing non-finite values (stat_bin).
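Newer versions of ggplot2 (3.3.0 and later) offer after_stat() as a replacement for the ..count.. notation, and the older notation produces a deprecation warning starting in version 3.4.0. The sketch below shows the same idea with after_stat() on a small made-up data.frame (toy and its ages are invented for illustration) and relabels the y-axis with labs().

```r
library(ggplot2)

# Toy ages invented for illustration; the real column is shootings$age
toy <- data.frame(age = c(18, 21, 22, 25, 27, 29, 31, 34, 36, 41, 47, 53, 60))

# after_stat(count / sum(count)) turns each bar's count into a proportion
p <- ggplot(toy, aes(x = age)) +
  geom_histogram(aes(y = after_stat(count / sum(count))), bins = 5) +
  labs(y = "Proportion of people")
p
```

The bar heights now add up to 1, and the relabeled axis avoids calling a proportion a count.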

7.2.2 Density plot

Density plots are essentially smoothed versions of histograms. They’re especially useful for numeric variables which are not integers (integers are whole numbers). They’re also useful when you want to be more precise than a histogram as they are - to simplify - histograms where each bar is very narrow. Note that the y-axis of a density plot is automatically labeled “density” and has very small numbers. Interpreting the y-axis is fairly hard to explain to someone not familiar with statistics so I’d caution against using this graph unless your audience is already familiar with it. To interpret these kinds of graphs, I recommend looking for trends rather than trying to identify specific points. For example, in the below graph we can see that shootings rise rapidly starting around age 10, peak at around age 30 (if we were presenting this graph to other people we’d probably want more ages shown on the x-axis), and then steadily decline until about age 80 where it’s nearly flat.

ggplot(shootings, aes(x = age)) + 
  geom_density()
#> Warning: Removed 248 rows containing non-finite values (stat_density).
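One way to make sense of the small numbers on the y-axis is that the total area under a density curve is about 1. The sketch below is a rough illustration using base R's density() function on made-up ages rather than ggplot2 itself, so the exact curve may differ a little from what geom_density() draws.

```r
# Toy ages invented for illustration; the real column is shootings$age
ages <- c(18, 21, 22, 25, 27, 29, 31, 34, 36, 41, 47, 53, 60)

# density() returns evenly spaced x values and the curve's height at each one
d <- density(ages)

# Approximate the area under the curve as height times spacing, summed up
step <- d$x[2] - d$x[1]
area <- sum(d$y) * step
area
```

Because the whole area is roughly 1 and the x-axis spans many years of age, the height at any single age has to be a small number.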

7.2.3 Count Graph

A count graph is essentially a histogram with a bar for every value in the numeric variable - like a less-smooth density plot. Note that this won’t work well if you have too many unique values so I’d strongly recommend rounding the data to the nearest whole number first. Our age variable is already rounded so we don’t need to do that. To make a count graph, we add stat_count() to the ggplot().

ggplot(shootings, aes(x = age)) + 
  stat_count()
#> Warning: Removed 248 rows containing non-finite values (stat_count).

Now we have a single bar for every age in the data. Like the histogram, the y-axis shows the number of people that are that age. And like the histogram, we can change this from number of people to percent of people using the exact same code.

ggplot(shootings, aes(x = age)) + 
  stat_count(aes(y = (..count..)/sum(..count..)))
#> Warning: Removed 248 rows containing non-finite values (stat_count).

7.3 Graphing a Categorical Variable

7.3.1 Bar graph

To make this barplot we’ll set the x-axis variable to our “race” column and add geom_bar() to the end.

ggplot(shootings, aes(x = race)) + 
  geom_bar()

This gives us a barplot in alphabetical order. In most cases we want the data sorted by frequency, so we can easily see what value is the most common, second most common, etc. There are a few ways to do this but we’ll do this by turning the “race” variable into a factor and ordering it by frequency. We can do that using the factor() function. The first input will be the “race” variable and then we will need to set the levels parameter to a vector of values sorted by frequency. An easy way to know how often values are in a column is to use the table() function on that column, such as below.

table(shootings$race)
#> 
#>    A    B    H    N    O    W 
#>   94 1315  917   78   48 2523

It’s still alphabetical so let’s wrap that in a sort() function.

sort(table(shootings$race))
#> 
#>    O    N    A    H    B    W 
#>   48   78   94  917 1315 2523

It’s sorted from smallest to largest. We usually want to graph from largest to smallest so let’s set the parameter decreasing in sort() to TRUE.

sort(table(shootings$race), decreasing = TRUE)
#> 
#>    W    B    H    A    N    O 
#> 2523 1315  917   94   78   48

Now, we only need the names of each value, not how often they occur. So we can again wrap this whole thing in names() to get just the names.

names(sort(table(shootings$race), decreasing = TRUE))
#> [1] "W" "B" "H" "A" "N" "O"

If we tie it all together, we can make the “race” column into a factor variable.

shootings$race <- factor(shootings$race,
                         levels = names(sort(table(shootings$race), decreasing = TRUE)))
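Since this pattern of factor(), names(), sort(), and table() is easy to mistype, one option is to wrap it in a small function. The function name order_by_frequency() below is made up for this sketch (it is not part of base R or ggplot2), and the race vector is toy data.

```r
# Hypothetical helper, not part of base R or ggplot2: turns a vector into a
# factor whose levels are ordered by how often each value appears
order_by_frequency <- function(x, decreasing = TRUE) {
  factor(x, levels = names(sort(table(x), decreasing = decreasing)))
}

# Toy vector standing in for shootings$race
race <- c("W", "B", "W", "H", "W", "B", NA)
race <- order_by_frequency(race)
levels(race)
```

Setting decreasing = FALSE would reverse the order of the levels.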

Now let’s try that barplot again.

ggplot(shootings, aes(x = race)) + 
  geom_bar() 

It works! Note that all the values that are missing in our data are still reported in the barplot under a column called “NA”. This is not sorted properly since there are more NA values than three of the other values but is still at the far right of the graph. We can change this if we want to make all the NA values an actual character type and call it something like “Unknown”. But this way it does draw attention to how many values are missing from this column. Like most things in graphing, this is a personal choice as to what to do.

For bar graphs it is often useful to flip the graph so each value is a row in the graph rather than a column. This also makes it much easier to read the value name. If the value names are long, it’ll shrink the graph to accommodate the name. This is usually a sign that you should try to shorten the name to avoid reducing the size of the graph.

ggplot(shootings, aes(x = race)) + 
  geom_bar() +
  coord_flip() 

Since it’s flipped, now it’s sorted from smallest to largest. So we’ll need to change the factor() code to fix that.

shootings$race <- factor(shootings$race,
                         levels = names(sort(table(shootings$race), decreasing = FALSE)))
ggplot(shootings, aes(x = race)) + 
  geom_bar() +
  coord_flip() 

The NA value is now at the top, which looks fairly bad. Let’s change all NA values to the string “Unknown”. And while we’re at it, let’s change all the abbreviated race values to actual names. We can get all the NA values by using is.na(shootings$race) and using a conditional statement to get all rows that meet that condition, then assign them the value “Unknown”. Instead of trying to subset a factor variable to change the values, we should convert it back to a character type first using as.character(), and then convert it to a factor again once we’re done.

shootings$race <- as.character(shootings$race)
shootings$race[is.na(shootings$race)] <- "Unknown"

Now we can use conditional statements to change all the race letters to names. It’s not clear what race “O” and “N” are so I checked the Washington Post’s GitHub page, which explains them. Instead of is.na() we’ll use shootings$race == "" where we put the letter inside of the quotes.

shootings$race[shootings$race == "O"] <- "Other"
shootings$race[shootings$race == "N"] <- "Native American"
shootings$race[shootings$race == "A"] <- "Asian"
shootings$race[shootings$race == "H"] <- "Hispanic"
shootings$race[shootings$race == "B"] <- "Black"
shootings$race[shootings$race == "W"] <- "White"
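An alternative to writing one line per value is to keep the abbreviations and full names together in a named vector and look each value up. This is only a sketch on a toy vector; the object names race_labels and race_full are made up for illustration.

```r
# Toy vector standing in for shootings$race after NA was changed to "Unknown"
race <- c("W", "B", "H", "A", "N", "O", "Unknown", "W")

# Named vector: the names are the abbreviations, the values are the full labels
race_labels <- c(O = "Other", N = "Native American", A = "Asian",
                 H = "Hispanic", B = "Black", W = "White",
                 Unknown = "Unknown")

# Look up each abbreviation; unname() drops the leftover abbreviation names
race_full <- unname(race_labels[race])
race_full
```

One caution with this approach is that any value missing from the lookup vector comes back as NA, so it is worth checking the result with table() afterwards.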

Now let’s see how our graph looks. We’ll need to rerun the factor() code since now all of the values are changed.

shootings$race <- factor(shootings$race,
                         levels = names(sort(table(shootings$race), decreasing = FALSE)))
ggplot(shootings, aes(x = race)) + 
  geom_bar() +
  coord_flip() 

As earlier, we can show percentage instead of count by adding y = (..count..)/sum(..count..) to the aes() in geom_bar().

ggplot(shootings, aes(x = race)) + 
  geom_bar(aes(y = (..count..)/sum(..count..))) +
  coord_flip() 
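
In recent versions of ggplot2 (3.4.0 and later, as far as I know) the ..count.. notation is deprecated and produces a warning, and after_stat() is the suggested replacement. Here is a minimal sketch of the same percentage graph written that way. It uses a small made-up data.frame called toy rather than the real shootings data.

```r
library(ggplot2)

# Toy data standing in for the shootings data (an assumption for illustration)
toy <- data.frame(race = c("White", "White", "Black", "Hispanic"))

# after_stat() refers to values computed by the stat, here the count per bar
p <- ggplot(toy, aes(x = race)) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  coord_flip()
p
```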

7.4 Graphing Data Over Time

We went over time-series graphs in Chapter 6 but it’s such an important topic we’ll cover it again. A lot of criminology research is seeing if a policy had an effect, which means we generally want to compare an outcome over time (and compare the treated group to a similar untreated group). To graph that we look at an outcome, in this case the number of killings, over time. In our case we aren’t evaluating any policy, just seeing if the number of police killings changes over time.

We’ll need to make a variable to indicate that the row is for one shooting. We can call this “dummy” and assign it a value of 1. Then we can make the ggplot() and set this “dummy” column to the y-axis value and set our date variable “date” to the x-axis (the time variable is always on the x-axis). Then we’ll set the type of plot to geom_line() so we have a line graph showing killings over time.

shootings$dummy <- 1
ggplot(shootings, aes(x = date, y = dummy)) +
  geom_line()

This graph is clearly wrong. Why? Well, our y-axis variable is always 1 so there’s no variation to plot. Every single value, even if there is more than one shooting per day, is on the 1 line on the y-axis. And the fact that we have multiple killings per day is an issue because we only want a single line in our graph. We’ll need to aggregate our data to some time period (e.g. day, month, year) so that we have one row per time period and know how many people were killed in that period. We’ll start with monthly data and then move to yearly data. Since we’re going to be dealing with dates, let’s load the lubridate package, which is well-suited for this task.

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

We’ll use two functions to create variables that tell us the month and the year of each date in our data. We’ll use these new variables to aggregate our data to that time unit. First, the floor_date() function is a very useful tool that essentially rounds a date down. Here we have the exact date the killing happened on, and we want to determine what month that date is from. So we’ll use the parameter unit in floor_date() and tell the function we want to know the “month” (for a full set of options please see the documentation for floor_date() by entering ?floor_date in the console). So we can do floor_date(shootings$date, unit = "month") to get the month the killing happened in. Specifically, it returns the date of the first day of that month. Even simpler, to get the year, we simply use year() and put our “date” variable in the parentheses. We’ll call the new variables “month_year” and “year”, respectively.

shootings$month_year <- floor_date(shootings$date, unit = "month")
shootings$year <- year(shootings$date)

head(shootings$month_year)
#> [1] "2015-01-01" "2015-01-01" "2015-01-01" "2015-01-01" "2015-01-01"
#> [6] "2015-01-01"
head(shootings$year)
#> [1] 2015 2015 2015 2015 2015 2015

Since the data is already sorted by date, all the values printed from head() are the same. But you can look at the data using View() to confirm that the code worked properly.
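
To see the two functions working on dates that don’t all fall in the same month, here is a small sketch using three made-up dates (example_dates is a name chosen for this illustration and is not part of the shootings data).

```r
library(lubridate)

# Three made-up dates from different months and years
example_dates <- as.Date(c("2015-01-02", "2015-03-17", "2016-12-31"))

# Each date is rounded down to the first day of its month
floor_date(example_dates, unit = "month")
#> [1] "2015-01-01" "2015-03-01" "2016-12-01"

# year() keeps only the year
year(example_dates)
#> [1] 2015 2015 2016
```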

We can now aggregate the data by the “month_year” variable and save the result into a new dataset we’ll call monthly_shootings. For a refresher on aggregating, please see Section 3.3.

monthly_shootings <- aggregate(dummy ~ month_year, data = shootings, FUN = sum)
head(monthly_shootings)
#>   month_year dummy
#> 1 2015-01-01    76
#> 2 2015-02-01    77
#> 3 2015-03-01    92
#> 4 2015-04-01    84
#> 5 2015-05-01    71
#> 6 2015-06-01    65
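
Because “dummy” is always 1, summing it within each month is the same as counting rows. As a quick sanity check, the sketch below compares the result of aggregate() to table() on a made-up data.frame (toy and monthly_toy are illustrative names, not objects from this chapter).

```r
# Toy data: six rows spread over three months (an assumption for illustration)
toy <- data.frame(
  month_year = as.Date(c("2015-01-01", "2015-01-01", "2015-02-01",
                         "2015-02-01", "2015-02-01", "2015-03-01")),
  dummy = 1
)

# Sum the dummy column within each month
monthly_toy <- aggregate(dummy ~ month_year, data = toy, FUN = sum)
monthly_toy

# Counting rows per month with table() should give the same numbers
as.vector(table(toy$month_year))
```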

Since we now have a variable that shows for each month the number of people killed, we can graph this new dataset. We’ll use the same process as earlier but our dataset is now monthly_shootings instead of shootings and the x-axis variable is “month_year” instead of “date”.

ggplot(monthly_shootings, aes(x = month_year, y = dummy)) +
  geom_line()

The process is the same for yearly data.

yearly_shootings <- aggregate(dummy ~ year, data = shootings, FUN = sum)
ggplot(yearly_shootings, aes(x = year, y = dummy)) +
  geom_line()

Note the steep drop-off at the end of each graph. Is that due to fewer shootings occurring more recently? No, it’s simply an artifact of the graph comparing whole months (years) to part of a month (year), since the current month (year) isn’t finished yet (and the data has a small lag in reporting).
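
One way to avoid that misleading drop-off is to leave the incomplete final period out of the graph. The following is only a sketch of that idea using made-up numbers: monthly_toy, last_record, and the counts are all hypothetical, and in practice last_record would be something like max(shootings$date).

```r
# Toy monthly counts; the final month is only partly observed (made-up values)
monthly_toy <- data.frame(
  month_year = as.Date(c("2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01")),
  dummy      = c(80, 75, 90, 12)
)

# Hypothetical date of the most recent record in the raw data
last_record <- as.Date("2020-06-05")

# First day of the month that contains the most recent record
last_month <- as.Date(format(last_record, "%Y-%m-01"))

# Keep only months that ended before the most recent record's month
complete_months <- monthly_toy[monthly_toy$month_year < last_month, ]
complete_months
```

Graphing complete_months instead of monthly_toy would then compare only whole months to each other.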

7.5 Pretty Graphs

What’s next for these graphs? You’ll likely want to add labels for the axes and the title. We went over how to do this in Section 6.3 so please refer to that for more info. Also, check out ggplot2’s website to see more on this very versatile package. As I’ve said all chapter, a lot of this is going to be personal taste so please spend some time exploring the package and changing the appearance of the graph to learn what looks right to you.

7.5.1 Themes

In addition to making changes to the graph’s appearance yourself, you can use a theme that someone else made. A theme is just a collection of changes to the graph’s appearance that someone put in a function for others to use. Each theme is different and is fairly opinionated, so you should only use one that you think looks best for your graph. To use a theme, simply add the theme (exactly as spelled on the site) to your ggplot using the + as normal (and make sure to include the () since each theme is actually a function). ggplot2 comes with a series of themes that you can look at here. Here, we’ll be looking at themes from the ggthemes package, which is a great source of different themes to modify the appearance of your graph. Check out this website to see a depiction of all of the possible themes. If you don’t have the ggthemes package installed, do so using install.packages("ggthemes").

Let’s do a few examples using the graph made above. First, we’ll need to load the ggthemes library.

library(ggthemes)
ggplot(yearly_shootings, aes(x = year, y = dummy)) +
  geom_line() +
  theme_fivethirtyeight()

ggplot(yearly_shootings, aes(x = year, y = dummy)) +
  geom_line() +
  theme_tufte()

ggplot(yearly_shootings, aes(x = year, y = dummy)) +
  geom_line() +
  theme_few()

ggplot(yearly_shootings, aes(x = year, y = dummy)) +
  geom_line() +
  theme_excel()
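
If you want to compare several themes without retyping the whole graph, one option is to save the graph to an object and add a theme to it each time. This is a small sketch of that pattern using two of the themes that come with ggplot2 itself. The data.frame yearly_toy holds made-up values and base_plot is just a name chosen for this example.

```r
library(ggplot2)

# Made-up yearly counts, for illustration only
yearly_toy <- data.frame(year = 2015:2019, dummy = c(10, 12, 9, 11, 13))

# Build the graph once and store it
base_plot <- ggplot(yearly_toy, aes(x = year, y = dummy)) +
  geom_line()

# Add a different built-in theme each time
base_plot + theme_minimal()
base_plot + theme_classic()
```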


  1. It is especially important to not overreach when trying to answer a question when the data can’t do it well. Often, no answer is better than a wrong one - especially in a field with serious consequences like criminology. For example, using the current data we’d determine that there’s no (or not as much as people claim) racial bias in police killings. If we come to that conclusion based on the best possible evidence, that’s okay - even if we’re wrong. But coming to that conclusion based on inadequate data could lead to policies that actually cause harm. This isn’t to say that you should never try to answer questions just because no data is perfect and you may be wrong. Rather, you should develop a deep enough understanding of the data to be confident that it can actually answer those questions.