6.1 Code review
When you collaborate with other people, you will probably each be working on a separate (though related) part of the project and then will combine each part when you are done. Combining your code could be through emailing each other R scripts - and having one person combine everything - or something more formalized such as using Git, which we discuss in Chapter 9. However you decide to do this, it is important to use a process to review your collaborator’s code (and have them review yours) to check for mistakes.5 This is a similar process to having a colleague read a paper draft before submitting it.
Code review is a useful technique for reducing the number of mistakes as it is a check on the work before using the code for real. Code review generally involves having one person who writes the code send it to another person who checks the code for any potential mistakes or issues. This check involves ensuring that the code meets the specified style (this is discussed further in Section 6.1.1) and that there are no bugs (which are errors in the code). For the person having their code reviewed, having comments explaining the what and why of the code (discussed more in Section 6.2.1) will help the reviewer quickly go through the code. The code should also be relatively short, comprised of a specific R script (or related scripts) and no more than a few hundred lines of code. This is because as code gets more interrelated and complex, it is harder for someone unfamiliar with the code to understand it and see any issues. That means that a reviewer for long code is more likely to miss issues and take longer to review. Reviewing shorter code, even if that means reviewing more often, is often far more efficient for both the reviewer and reviewee and catches more issues.
In cases where you have unit tests (which are discussed in Chapter 8) written for the code, these tests are an automated form of code review as they too check for mistakes. To save people’s time, you should avoid sending the code for review until it passes all unit tests. However, if you’re stuck and can’t get certain tests to pass, working with someone else to solve the problem is often faster than doing so yourself because then you have an outside perspective who may see something that you missed.
For code review to be most efficient, I recommend developing some rules with your collaborators to specify how and when code review is done. For example, you should determine who reviews certain people’s code (ideally with senior people reviewing junior people’s code) and how often it is done. I believe that doing code reviews relatively frequently (i.e. after a working draft of some code is ready) is useful as you can catch issues early and not waste anyone’s (especially the person writing the code) time. However, having hard time limits is probably ill-advised as sometimes writing certain code takes far longer than expected and reviewing an unfinished (and potentially far from finished) bunch of code is not efficient for anyone.
When someone is reading a draft of your research paper, they are generally looking for whether it is correct (i.e. your methods are right, the lit review is thorough, etc.) and how well it flows. Code review is the same. While the primary goal is finding errors, an important aspect is to ensure that it is readable (i.e. proper spacing, how names are written) and consistent across everyone’s code in that group. More formally, ensuring that everyone’s code is readable and consistent is having people follow a style guideline.
6.1.1 Style guidelines
An important part of reviewing people’s code is ensuring that everyone is following the same style guidelines when it comes to writing code. Style guidelines are the grammar rules of writing code. They dictate (or encourage) certain style choices, such as whether object names are lowercase, whether they include punctuation, and even when to put long code on a new line. This is equivalent to making sure that people writing in plain language put punctuation and capitalization in the expected place. While you can read !SomEThiNg WrITen. LiKE thIs, it is easier to understand when it follows adopted and accepted rules.
The important thing here is to be consistent. Consistency makes code much easier to read and helps make code written by multiple people more interchangeable. This book follows the tidyverse style guide, which is one that many R programmers follow, but the exact style you choose is relatively unimportant (choosing more common styles helps when your code may be used by people out of your organization). Feel free to adopt an already-made style guide, make any modifications to suit your preferences, or to create an entirely new one yourself. As long as people follow the same format, you’ll be able to spend more time on the code, and less time trying to understand it.
An important, though occasionally tedious, part of writing code is documenting your work. We’ll talk about documentation in two ways, through comments, which focus on specific parts of code, and vignettes, which document the project more broadly.
Vignettes are essentially a document that explains how to do something with the code you have written. This is common when someone has written an R package and they want to explain in detail important functions from the package. You can think of chapters of this book as vignettes covering particular topics - PDF scraping, webscraping, regular expressions, etc. To make a vignette, you can make an R Markdown file (for more information on R Markdown please see Chapter 7) detailing that topic. Since the text you write is included in the document, these files are basically normal R scripts with extensive comments written in plain language. Often, these comments are more formal than what you’d write in an R script as they are written as complete sentences or paragraphs and walk through comprehensive ideas rather than focus on discrete chunks of code.
One increasingly prominent method of using R for research is to do everything in an R Markdown file. This allows you to explain your approach - including context on why you did something - and each step you took in plain language in the text of the R Markdown file while still including the code directly in the file. Whether you include the code in the output (e.g. a PDF or Word Document), or just the result of the code (e.g. a graph or table), depends on your audience and how far along you are in the project.
If this is for a presentation to update collaborators, for example, it is useful to include the code as they may notice an issue or give advice based on the code. Including code can also teach your audience something new (I’ve certainly learned a lot by watching people present using code I wasn’t familiar with). If the document is for an audience unfamiliar with R (or programming more generally), or where time to present is limited, you probably won’t want to include code.
Whether you do your work in an R script or in an R Markdown file is up to you. If you intend to write up a report anyway, having everything written up in the R Markdown file as you write your code can save you time as you’re merging the code and the writing process. However, this loses some nice features in R such as unit tests, which we will discuss in detail in Chapter 8. It also depends on how complex your project is. If you have code that is hundreds of lines long and spans multiple R scripts, putting it all into a single R Markdown file is unfeasible. In this case it’d be better to run the code in the R scripts and use the R Markdown file just to present results.
If your collaborator does not know R, they should read this book.↩︎
I recently worked on a follow-up paper to one I had done a year ago. For some reason, past me decided to name some functions based on the authors of a paper that created that particular method, and didn’t leave comments explaining what the code did or why. Past me caused a lot of problems for current me. Please comment your code!↩︎
This is one of the main reasons I wrote this book. After a few years of helping Penn students with the same questions, I decided to write out guides to those topics.↩︎
If you’re like me and on your seventh rejection for a particular paper, three to six months may be optimistic.↩︎
In Section 2.1 we introduced comments, which are essentially notes about the code that you include in an R script (by starting a line with the pound key #) that isn’t run. They are just “comments” to yourself or anyone else reading the code to explain what that code does and why it is there. As is often repeated in explaining the benefit of comments, the main collaborator you will have is yourself in the future.6 You don’t need to comment on every single line of code - and doing so would just make it hard to read - but you should comment on important things or chunks of code (i.e. several lines of code that all are for the same purpose). If you write a function, you’ll want at least a brief comment explaining what it does and what the inputs and parameters do.
Writing comments is not as fun as writing code. Stopping to write a comment on something that seems obvious at the time (after-all, you figured out how to do something you wanted to do and likely were focusing on) interrupts the flow of writing code and slows down your work. And when you have looming deadlines and multiple projects that you’re working on, spending the time writing good comments may seem like a bad use of time as the payoff is only in the future. However, the benefits far outweigh the cost. This is true for two reasons. First, when you’re collaborating with others, it is much quicker to have text explaining the code than to walk through the code with them (or to have them try to figure it out themselves).7 As you work with more people, comments become increasingly important. Writing good comments is also time-efficient when considering that in many cases when you do research you will have to return to the project in the future.
This is best shown when considering a research project that leads to a journal article. For many papers, even if you are fantastically productive and can work nonstop at it without forgetting any decisions, at a certain point you’ll need to finish and submit it to a journal. Journal reviews can often take three to six months so at that point you’ll likely have forgotten many of the (seemingly obvious) decisions you made in the course of the project.8 Having comments explaining why you made a certain decision (such as including or excluding certain crime types from your analysis) can be a huge time-saver when addressing reviewer concerns - you will know why each decision was made and won’t have to try to figure out the why. This is particularly important when you have to defend a decision in which there is no obvious choice and you want to know your thought process at the time you wrote the code and were immersed in the issues of the data. A lot of data decisions are reasonable at the time based on the quirks of the data but can appear to make no sense if you aren’t familiar with the data - comments can remind you of the quirkiness and how you handled it.