For this chapter you’ll need the following file, which is available for download here: san_francisco_active_marijuana_retailers.csv.
Several recent studies have looked at the effect of marijuana dispensaries on crime around the dispensary. For these analyses they find the coordinates of each crime in the city and see if it occurred within a certain distance of the dispensary. Many crime data sets provide the coordinates of where each crime occurred. However, sometimes the coordinates are missing, and other data, such as marijuana dispensary locations, give only the address, meaning that we need a way to find the coordinates of these locations.
25.1 Geocoding a single address
In this chapter we will cover how to geocode addresses. Geocoding is the process of taking an address (e.g. 123 Main Street, Somewhere, CA, 12345) and getting the longitude and latitude coordinates of that address. With these coordinates we can then do spatial analyses on the data ranging from simply making a map and showing where each address is to merging these coordinates with some other spatial data (such as seeing which police district the address is in) and seeing how it relates to other variables, such as crime.
To do our geocoding, we’re going to use the package tidygeocoder, which greatly simplifies the work of geocoding addresses in R. For more information about this package, please see the package’s site. If you’ve never used this package before you’ll need to install it using install.packages("tidygeocoder").
Now we need to tell R that we want to use this package by running library(tidygeocoder).
To geocode our addresses we’ll use the helpfully named geocode() function. Inside of geocode() we input an address and it returns the coordinates for that address. For our address we’ll use “750 Race St. Philadelphia, PA 19106” which is the address of the Philadelphia Police Department headquarters.
geocode("750 Race St. Philadelphia, PA 19106")
#> Error: .tbl is not a dataframe. See ?geocode
As shown above, running geocode("750 Race St. Philadelphia, PA 19106") gives us an error that tells us that “.tbl is not a dataframe.” The issue is that geocode() expects a data.frame (and .tbl is an abbreviation for tibble, which is a kind of data.frame), but we entered only the string with our one address, not a data.frame. For this function to work we need to enter two parameters into geocode(): a data.frame (or something similar such as a tibble) and the name of the column which has the addresses.[15] Since we need a data.frame, we’ll make one below. I’m calling it address_to_geocode and calling the column with the address “address,” but you can call both the data.frame and the column whatever name you want.
address_to_geocode <- data.frame(address = "750 Race St. Philadelphia, PA 19106")
Now let’s try again. We’ll enter our data.frame address_to_geocode first and then the name of our column, which is “address.”
geocode(address_to_geocode, address)
#> # A tibble: 1 x 3
#>   address                               lat  long
#>   <chr>                               <dbl> <dbl>
#> 1 750 Race St. Philadelphia, PA 19106  40.0 -75.2
It worked, returning the same data.frame but with two additional columns with the latitude and longitude of that address.
You might be wondering why we put “address” into geocode() without quotes when usually when we talk about a column we need to do so in quotes. The simple answer is that the authors of the tidygeocoder package spent the time allowing users to input the column name either with or without quotes. Trying it again and now having “address” in quotes gives us the same result.
geocode(address_to_geocode, "address")
#> # A tibble: 1 x 3
#>   address                               lat  long
#>   <chr>                               <dbl> <dbl>
#> 1 750 Race St. Philadelphia, PA 19106  40.0 -75.2
There are two additional parameters which are important to talk about for this function, especially when you encounter an address that fails to geocode.
First, there are actually multiple sources where you can enter an address and get the coordinates for that address. Just think about the big mapping apps or sites, such as Google Maps and Apple Maps. You can enter the same address into each of these sources and get different results. In most cases you’ll get extremely similar coordinates, usually differing only after a few decimal places, so they are functionally identical. But occasionally you’ll have some addresses that can be geocoded through some sources but not others. This is because some sources have a more comprehensive list of addresses than others.
At the time of this writing the tidygeocoder package can handle geocoding from 13 different sources. For 10 of these, however, you need to set up an API key and some also require paying money (usually after a set number of addresses that it’ll geocode for free each day). So here I’ll just cover the three sources of geocoding that don’t require any setup: “osm” (OpenStreetMap or OSM is similar to Google Maps), “census” (the US Census Bureau’s geocoder), and “arcgis” (ArcGIS is a clunky mapping software that nonetheless has an excellent geocoder that R can use). To select which of these to use (“osm” is the default), you add the parameter “method” and set that equal to the one you want to use. As “osm” is the default we actually don’t need to set it explicitly, but we’ll do so anyway here as an example of the three geocoding sources we want to use.
geocode(address_to_geocode, "address", method = "osm")
#> # A tibble: 1 x 3
#>   address                               lat  long
#>   <chr>                               <dbl> <dbl>
#> 1 750 Race St. Philadelphia, PA 19106  40.0 -75.2
geocode(address_to_geocode, "address", method = "census")
#> # A tibble: 1 x 3
#>   address                               lat  long
#>   <chr>                               <dbl> <dbl>
#> 1 750 Race St. Philadelphia, PA 19106  40.0 -75.2
geocode(address_to_geocode, "address", method = "arcgis")
#> # A tibble: 1 x 3
#>   address                               lat  long
#>   <chr>                               <dbl> <dbl>
#> 1 750 Race St. Philadelphia, PA 19106  40.0 -75.2
If you compare the longitudes and latitudes from these three sources you’ll notice that they’re all different but only slightly so. By default this function returns a tibble instead of a normal data.frame, so it only shows one decimal place by default, though it doesn’t actually round the number, it merely shortens what it shows us. We can change the output back into a data.frame by using the data.frame() function.
example <- geocode(address_to_geocode, "address", method = "arcgis")
example <- data.frame(example)
example
#>                               address      lat      long
#> 1 750 Race St. Philadelphia, PA 19106 39.95488 -75.15205
Given how similar the coordinates are, you really only need to set the source of the geocoder in cases where one geocoder fails to find a match for the address.
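To get a feel for how small “slightly different” is, the gap between two geocoders’ coordinates can be converted into meters. The sketch below is not part of tidygeocoder: haversine_m() is a helper name invented here, it assumes a spherical Earth, and the second coordinate pair is made up for illustration (only the first pair comes from the ArcGIS output above).

```r
# Hypothetical helper: great-circle distance in meters between two points,
# using the haversine formula and a spherical Earth (radius 6,371 km).
haversine_m <- function(lat1, long1, lat2, long2) {
  to_rad <- pi / 180
  d_lat  <- (lat2 - lat1) * to_rad
  d_long <- (long2 - long1) * to_rad
  a <- sin(d_lat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(d_long / 2)^2
  2 * 6371000 * asin(sqrt(pmin(1, a)))
}

# First pair: the ArcGIS result shown above.
# Second pair: a made-up result from another source, for illustration only.
gap_m <- haversine_m(39.95488, -75.15205, 39.95500, -75.15190)
gap_m
```

A gap of a few tens of meters usually does not matter for city-level analyses, but it could matter if you are measuring very short distances around each address.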
The second important parameter is full_results, which is by default set to FALSE. When set to TRUE it gives more columns in the returned data.frame than just the longitude and latitude of that address. These columns differ for each geocoder source so we’ll look at all three.
geocode(address_to_geocode, "address", method = "osm", full_results = TRUE)
#> # A tibble: 1 x 12
#>   address      lat  long place_id licence    osm_type osm_id
#>   <chr>      <dbl> <dbl>    <int> <chr>      <chr>     <int>
#> 1 750 Race ~  40.0 -75.2   2.88e8 Data © Op~ way      6.22e7
#> # ... with 5 more variables: boundingbox <list>,
#> #   display_name <chr>, class <chr>, type <chr>,
#> #   importance <dbl>
For OSM as a source we also get information about the address such as what type of place it is, a bounding box which is a geographic area right around this coordinate, the address for those coordinates in the OSM database, and a bunch of other variables that don’t seem very useful for our purposes such as the “importance” of the address. It’s interesting that OSM classifies this address as a “house” as the police headquarters for a major police department is quite a bit bigger than a house, so this is likely a misclassification of the type of address. The most important extra variable here is the address, called the “display_name.”
Sometimes geocoders will be quite a bit off in their geocoding because they incorrectly match the address you inputted to a different one in their database. For example, if you input “123 Main Street” and the geocoder thinks you mean “123 Maine Street” you may be quite a bit off in the resulting coordinates. When you only get coordinates returned you won’t know that the coordinates are wrong. Even if you know where an address is supposed to be it’s hard to catch errors like this. If you’re geocoding addresses in a single city and one point is in a different city (or completely different part of the world), then it’s pretty clear that there’s an error. But if the coordinates are simply in the wrong part of the city, but near other coordinates, then it’s very hard to notice a problem. So having an address to check against the one you inputted is a very useful way to validate the geocoding.
geocode(address_to_geocode, "address", method = "census", full_results = TRUE)
#> # A tibble: 1 x 18
#>   address       lat  long matchedAddress    tigerLine.tiger~
#>   <chr>       <dbl> <dbl> <chr>             <chr>
#> 1 750 Race S~  40.0 -75.2 750 RACE ST, PHI~ 131423677
#> # ... with 13 more variables: tigerLine.side <chr>,
#> #   addressComponents.fromAddress <chr>,
#> #   addressComponents.toAddress <chr>,
#> #   addressComponents.preQualifier <chr>,
#> #   addressComponents.preDirection <chr>,
#> #   addressComponents.preType <chr>,
#> #   addressComponents.streetName <chr>, ...
These results are similar to the OSM results and also have the matched address to compare your inputted address to. Most of the columns are just the address broken into different pieces (street, city, state, etc.) so are mostly repeating the address again in multiple columns.
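Comparing the inputted address to the matched address by eye gets tedious with many rows. One rough way to automate a first pass is sketched below. The helper normalize_address() is a name invented here, and both the inputted and matched addresses are made up for illustration; real geocoders return their own column names and formats.

```r
# Hypothetical helper: put an address into a rough canonical form so that
# differences in case, punctuation, and spacing do not count as mismatches.
normalize_address <- function(x) {
  x <- toupper(x)
  x <- gsub("[[:punct:]]", " ", x)   # replace commas, periods, etc. with spaces
  x <- gsub("[[:space:]]+", " ", x)  # collapse repeated whitespace
  trimws(x)
}

# Made-up inputs and made-up "matched" addresses, for illustration only.
checks <- data.frame(
  input   = c("123 Main St. Somewhere, CA 12345",
              "456 Oak St. Somewhere, CA 12345"),
  matched = c("123 MAINE ST, SOMEWHERE, CA, 12345",
              "456 OAK ST, SOMEWHERE, CA, 12345")
)

checks$same <- normalize_address(checks$input) == normalize_address(checks$matched)
checks[!checks$same, ]  # rows worth checking by hand
```

Exact equality is a strict test. Harmless differences such as “Street” versus “ST” will also be flagged, so treat the flagged rows as candidates for a manual look rather than as confirmed errors.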
geocode(address_to_geocode, "address", method = "arcgis", full_results = TRUE)
#> # A tibble: 1 x 11
#>   address       lat  long arcgis_address    score location.x
#>   <chr>       <dbl> <dbl> <chr>             <int>      <dbl>
#> 1 750 Race S~  40.0 -75.2 750 Race St, Phi~   100      -75.2
#> # ... with 5 more variables: location.y <dbl>,
#> #   extent.xmin <dbl>, extent.ymin <dbl>,
#> #   extent.xmax <dbl>, extent.ymax <dbl>
For the ArcGIS results we have the matched address again, and then an important variable called “score” which is basically a measure of how confident ArcGIS is that it matched the right address. Higher values indicate more confidence, but in my experience anything with a score under 90-95 is an incorrect address. These results also repeat the longitude and latitude columns as “location.x” and “location.y” columns, and I’m not sure why they do so.
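If you want to act on that rule of thumb, one option is to blank out the coordinates for low-scoring matches so they can be reviewed later. The data below is made up; only the column names “lat,” “long,” and “score” follow the ArcGIS output above, and the cutoff of 95 is an assumption taken from the rough range mentioned in the text.

```r
# Made-up ArcGIS-style results, for illustration only.
results <- data.frame(
  address = c("address A", "address B", "address C"),
  lat     = c(37.76, 37.78, 37.70),
  long    = c(-122.48, -122.41, -122.39),
  score   = c(100, 98, 82)
)

min_score <- 95  # assumed cutoff, based on the rule of thumb above

# Blank out coordinates for low-confidence matches instead of dropping rows,
# so the addresses that need a manual look stay in the data.
low_confidence <- results$score < min_score
results$lat[low_confidence]  <- NA
results$long[low_confidence] <- NA
results
```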
25.2 Geocoding San Francisco marijuana dispensary locations
So now that we can use the geocode() function well, we can geocode every location in our marijuana dispensary data.
Let’s read in the marijuana dispensary data which is called “san_francisco_active_marijuana_retailers.csv” and call the object marijuana. Note the “data/” part in front of the name of the .csv file. This is to tell R that the file we want is in the “data” folder of our working directory. Doing this is essentially a shortcut to changing the working directory directly.
library(readr)
marijuana <- read_csv("data/san_francisco_active_marijuana_retailers.csv")
#> Rows: 33 Columns: 11
#> -- Column specification ------------------------------------
#> Delimiter: ","
#> chr (11): License Number, License Type, Business Owner, ...
#>
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
marijuana <- as.data.frame(marijuana)
Let’s look at the top 6 rows.
head(marijuana)
#>    License Number                License Type
#> 1 C10-0000614-LIC Cannabis - Retailer License
#> 2 C10-0000586-LIC Cannabis - Retailer License
#> 3 C10-0000587-LIC Cannabis - Retailer License
#> 4 C10-0000539-LIC Cannabis - Retailer License
#> 5 C10-0000522-LIC Cannabis - Retailer License
#> 6 C10-0000523-LIC Cannabis - Retailer License
#>     Business Owner
#> 1     Terry Muller
#> 2    Jeremy Goodin
#> 3     Justin Jarin
#> 4 Ondyn Herschelle
#> 5      Ryan Hudson
#> 6      Ryan Hudson
#>   Business Contact Information
#> 1 OUTER SUNSET HOLDINGS, LLC : Barbary Coast Sunset : Email- firstname.lastname@example.org : Phone- 5107173246
#> 2 URBAN FLOWERS : Urban Pharm : Email- email@example.com : Phone- 9168335343 : Website- www.up415.com
#> 3 CCPC, INC. : The Green Door : Email- firstname.lastname@example.org : Phone- 4155419590 : Website- www.greendoorsf.com
#> 4 SEVENTY SECOND STREET : Flower Power SF : Email- email@example.com : Phone- 5103681262 : Website- flowerpowerdispensary.com
#> 5 HOWARD STREET PARTNERS, LLC : The Apothecarium : Email- Ryan@apothecarium.com : Phone- 4157469001 : Website- www.apothecarium.com
#> 6 DEEP THOUGHT, LLC : The Apothecarium : Email- firstname.lastname@example.org : Phone- 4157469001 : Website- www.Apothecarium.com
#>          Business Structure
#> 1 Limited Liability Company
#> 2               Corporation
#> 3               Corporation
#> 4               Corporation
#> 5 Limited Liability Company
#> 6 Limited Liability Company
#>   Premise Address
#> 1 2165 IRVING ST san francisco, CA 94122 County: SAN FRANCISCO
#> 2 122 10TH ST SAN FRANCISCO, CA 941032605 County: SAN FRANCISCO
#> 3 843 Howard ST SAN FRANCISCO, CA 94103 County: SAN FRANCISCO
#> 4 70 SECOND ST SAN FRANCISCO, CA 94105 County: SAN FRANCISCO
#> 5 527 Howard ST San Francisco, CA 94105 County: SAN FRANCISCO
#> 6 2414 Lombard ST San Francisco, CA 94123 County: SAN FRANCISCO
#>   Status Issue Date Expiration Date
#> 1 Active  9/13/2019       9/12/2020
#> 2 Active  8/26/2019       8/25/2020
#> 3 Active  8/26/2019       8/25/2020
#> 4 Active   8/5/2019        8/4/2020
#> 5 Active  7/29/2019       7/28/2020
#> 6 Active  7/29/2019       7/28/2020
#>                  Activities Adult-Use/Medicinal
#> 1 N/A for this license type                BOTH
#> 2 N/A for this license type                BOTH
#> 3 N/A for this license type                BOTH
#> 4 N/A for this license type                BOTH
#> 5 N/A for this license type                BOTH
#> 6 N/A for this license type                BOTH
So the column with the address is called Premise Address. Since it’s easier to deal with columns that don’t have spaces in the name, we will be using gsub() to replace the spaces in the column names with underscores. Each address also ends with “County:” followed by that address’s county, which in this case is always San Francisco. That isn’t normal in an address so it may affect our geocode. We need to gsub() that column to remove that part of the address.
names(marijuana) <- gsub(" ", "_", names(marijuana))
Since the address issue is always " County: SAN FRANCISCO" we can just gsub() out that entire string.
marijuana$Premise_Address <- gsub(" County: SAN FRANCISCO", "", marijuana$Premise_Address)
Now let’s make sure we did it right.
names(marijuana)
#>  [1] "License_Number"
#>  [2] "License_Type"
#>  [3] "Business_Owner"
#>  [4] "Business_Contact_Information"
#>  [5] "Business_Structure"
#>  [6] "Premise_Address"
#>  [7] "Status"
#>  [8] "Issue_Date"
#>  [9] "Expiration_Date"
#> [10] "Activities"
#> [11] "Adult-Use/Medicinal"
head(marijuana$Premise_Address)
#> [1] "2165 IRVING ST san francisco, CA 94122"
#> [2] "122 10TH ST SAN FRANCISCO, CA 941032605"
#> [3] "843 Howard ST SAN FRANCISCO, CA 94103"
#> [4] "70 SECOND ST SAN FRANCISCO, CA 94105"
#> [5] "527 Howard ST San Francisco, CA 94105"
#> [6] "2414 Lombard ST San Francisco, CA 94123"
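Removing the literal string works here because every row is in San Francisco. If the same kind of file covered several counties, a pattern that drops “ County: ” and everything after it would be one way to handle all rows at once. The addresses below are made up to mimic the layout of the Premise_Address column.

```r
# Made-up addresses in the same layout as the Premise_Address column.
addresses <- c(
  "100 EXAMPLE ST SAN FRANCISCO, CA 94103 County: SAN FRANCISCO",
  "200 SAMPLE AVE OAKLAND, CA 94607 County: ALAMEDA"
)

# Remove " County: " and everything after it, whatever the county is.
cleaned <- sub(" County: .*$", "", addresses)
cleaned
```

Here sub() is enough because the suffix appears only once per address; gsub() would give the same result.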
To do the geocoding we’ll just give geocode() our data.frame name and the name of the column with the addresses. We’ll save the results back into the marijuana object. As noted earlier, we don’t need to put the name of our column in quotes, but I like to do so because it is consistent with some other functions that require it. Running this code may take up to a minute because it’s geocoding 33 different addresses.
marijuana <- geocode(marijuana, "Premise_Address")
Now it appears that we have longitude and latitude for every dispensary. We should check that they all look sensible.
summary(marijuana$long)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
#>  -122.5  -122.4  -122.4  -122.4  -122.4  -122.4      10
summary(marijuana$lat)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
#>   37.71   37.75   37.78   37.77   37.78   37.80      10
The minimum and maximum are very similar to each other for both longitude and latitude so that’s a sign that it geocoded correctly. The 10 NA values mean that it didn’t find a match for 10 of the addresses. Let’s try again and now set method to “arcgis” which generally has a very high match rate. Before we do this let’s just remove the entire latitude and longitude columns from our data. The way the geocode() function works, if we keep the “long” and “lat” columns that are currently in the data from when we just geocoded, running it again will make new columns that have nearly identical names. We usually want as few columns in our data as possible, so there’s no point having the “lat” column from the last geocode run with the 10 NAs and another “lat” column (with a slightly different, automatically chosen name) from this time we run geocode().
We could also just geocode the 10 addresses that failed on the first run, but given that we’re only geocoding a small number of addresses it won’t take much extra time to have ArcGIS run it all. Running this function on just the NA rows requires a bit more work than just rerunning them all. In general, when the choice is between you spending time writing code and letting the computer do more work, let the computer do the work. And in general I’d recommend starting with ArcGIS as it is more reliable for geocoding. We’ll remove the current coordinate columns by setting them each to NULL.
marijuana$long <- NULL
marijuana$lat <- NULL
marijuana <- geocode(marijuana, "Premise_Address", method = "arcgis")
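For a much larger data set, where rerunning every address would be slow, retrying only the failed rows could look roughly like the sketch below. Everything here is illustrative: the data is a toy stand-in for a first geocoding pass, and second_pass() is a made-up placeholder that returns fixed coordinates so the example runs without an internet connection. With tidygeocoder that step would instead be a call to geocode() with a different method.

```r
# Toy data standing in for a first geocoding pass; NA marks a failed match.
dispensaries <- data.frame(
  Premise_Address = c("address A", "address B", "address C"),
  lat  = c(37.76, NA, 37.78),
  long = c(-122.48, NA, -122.41)
)

# Placeholder for a second geocoder (hypothetical, returns fixed values).
# In practice this would be something along the lines of
# geocode(to_retry, "Premise_Address", method = "arcgis").
second_pass <- function(df) {
  df$lat  <- 37.77
  df$long <- -122.42
  df
}

# Pick out the rows that failed and send only their addresses back.
failed   <- is.na(dispensaries$lat) | is.na(dispensaries$long)
to_retry <- dispensaries[failed, "Premise_Address", drop = FALSE]
retried  <- second_pass(to_retry)

# Write the new coordinates back into the rows that failed.
dispensaries$lat[failed]  <- retried$lat
dispensaries$long[failed] <- retried$long
dispensaries
```

Passing only the address column to the second pass avoids the duplicate “lat” and “long” column problem described above.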
And let’s do the summary() check again.
summary(marijuana$long)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>  -122.5  -122.4  -122.4  -122.4  -122.4  -122.4
summary(marijuana$lat)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>   37.71   37.76   37.77   37.77   37.78   37.80
No more NAs which means that we successfully geocoded our addresses. Another check is to make a simple scatterplot of the data. Since all the data is from San Francisco, they should be relatively close to each other. If there are dots far from the rest, that is probably a geocoding issue.
Most points are within a very narrow range so it appears that our geocoding worked properly.
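A scatterplot check like this can be paired with a simple bounding-box test that flags points outside the area you expect. The sketch below uses made-up coordinates, with the last row deliberately placed far from the others, and a rough box around San Francisco that is an assumption for illustration, not an official boundary.

```r
# Made-up coordinates; the last row is deliberately far from the others.
coords <- data.frame(
  long = c(-122.48, -122.41, -122.39, -75.15),
  lat  = c(37.76, 37.78, 37.72, 39.95)
)

# Rough bounding box around San Francisco (an assumption for this sketch).
in_box <- coords$long >= -122.55 & coords$long <= -122.35 &
  coords$lat >= 37.70 & coords$lat <= 37.85

# Scatterplot of the points, with anything outside the box drawn in red.
plot(coords$long, coords$lat,
     col = ifelse(in_box, "black", "red"),
     xlab = "Longitude", ylab = "Latitude")

coords[!in_box, ]  # points to investigate
```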
[15] We can look at all of the parameters for this function by running the code ?geocode() to look at the function’s Help page.