5 Reading and Writing Data
So far in these lessons we’ve used data from a number of sources but which all came as .rda files which is the standard R data format. Many data sets, particularly older government data, will not come as .rda file but rather as Excel, Stata, SAS, SPSS, or fixed-width ASCII files. In this brief lesson we’ll cover how to read these formats into R as well as how to save data into these formats. Since many criminologists do not use R, it is important to be able to save the data in the language they use to be able to collaborate with them.
Fixed-width ASCII files are not very common and require a bit more effort than the other formats so we’ll leave those until later to discuss.
In this lesson we’ll use data about officer-involved shootings.
5.1 Reading Data into R
As we’ve seen earlier, to read in data with a .rda or .rdata extension you use the function
load() with the file name (including the extension) in quotation marks inside of the parentheses. This loads the data into R and calls the object the name it was when it was saved. Therefore we do not need to give it a name ourselves.
For each of the other types of data we’ll need to assign a name to the data we’re reading in so it has a name. Whereas we’ve done
x <- 2 to say x gets the value of 2, now we’d do
x <- DATA where DATA is the way to load in the data and x will get the entire data.frame that is read in.
To read in Excel files, those ending in .csv, we can use the function
read_csv() from the package
readr (the function
read.csv() is included in R by default so it doesn’t require any packages but is far slower than
read_csv() so we will not use it).
The input in the () is the file name ending in “.csv”. As it is telling R to read a file that is stored on your on the computer, the whole name must be in quotes. Unlike loading an .rda file using
load(), there is no name for the object that gets read in so we must assign the data a name. We can use the name shootings as it’s relatively descriptive and easy for us to write.
shootings <- read_csv("data/fatal-police-shootings-data.csv") #> Parsed with column specification: #> cols( #> id = col_double(), #> name = col_character(), #> date = col_date(format = ""), #> manner_of_death = col_character(), #> armed = col_character(), #> age = col_double(), #> gender = col_character(), #> race = col_character(), #> city = col_character(), #> state = col_character(), #> signs_of_mental_illness = col_logical(), #> threat_level = col_character(), #> flee = col_character(), #> body_camera = col_logical() #> )
read_csv() also reads in data to an object called a
tibble which is very similar to a data.frame but has some differences in displaying the data. If we run
head() on the data it doesn’t show all columns. This is useful to avoid accidentally printing out a massive amounts of columns.
head(shootings) #> # A tibble: 6 x 14 #> id name date manner_of_death armed age gender race city state #> <dbl> <chr> <date> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> #> 1 3 Tim ~ 2015-01-02 shot gun 53 M A Shel~ WA #> 2 4 Lewi~ 2015-01-02 shot gun 47 M W Aloha OR #> 3 5 John~ 2015-01-03 shot and Taser~ unar~ 23 M H Wich~ KS #> 4 8 Matt~ 2015-01-04 shot toy ~ 32 M W San ~ CA #> 5 9 Mich~ 2015-01-04 shot nail~ 39 M H Evans CO #> 6 11 Kenn~ 2015-01-04 shot gun 18 M W Guth~ OK #> # ... with 4 more variables: signs_of_mental_illness <lgl>, threat_level <chr>, #> # flee <chr>, body_camera <lgl>
We can convert it to a data.frame using the function
as.data.frame() though that isn’t strictly necessary since tibbles and data.frames operate so similarly.
For the remaining three data types we’ll use the package
haven follows the same syntax for each data type and is the same as with
read_csv() - for each data type we simply include the file name (in quotes, with the extension) and designate an name to be assigned the data.
read_csv() the functions to read data through
haven all start with
read_ and end with the extension you’re reading in.
read_dta()- Stata file, extension “dta”
read_sas()- SAS file, extension “sas”
read_sav()- SPSS file, extension “sav”
To read the data as a .dta format we can copy the code to read it as a .csv but change .csv to .dta.
Since we called this new data shootings, R overwrote that object (without warning us!). This is useful because we often want to subset or aggregate data and call it by the same name to avoid making too many objects to keep track of, but watch out for accidentally overwriting an object without noticing!
5.2 Writing Data
When we’re done with a project (or an important part of a project) or when we need to send data to someone, we need to save the data we’ve worked on in a suitable format. For each format, we are saving the data in we will follow the same syntax of
As usual we start with the function name. Then inside the parentheses we have the name of the object we are saving (as it refers to an object in R we do not use quotations) and then the file name, in quotes, ending with the extension you want.
For saving an .rda file we use the
save() function, otherwise we follow the syntax of
write_ ending with the file extension.
write_csv()- Excel file, extension “csv”
write_dta()- Stata file, extension “dta”
write_sas()- SAS file, extension “sas”
write_sav()- SPSS file, extension “sav”
As with reading the data,
write_csv() comes from the
readr package while the other formats are from the
For saving an .rda file we must set the parameter
file to be the name we’re saving. For the other types of data they use the parameter
path rather than
file but it is not necessary to call them explicitly.