21.1 What is Git and why do I need it?
As you write R code you will - I hope! - save your R script from time to time (preferably using RStudio’s auto-save feature) to avoid losing any code you’ve written if you close R or shut down your computer. This is important as it’ll save everything you’ve done locally but if your computer crashes, you’ll want your work to be backed up elsewhere. While you should have something like Dropbox or Google Drive that keeps backups of your work, here we’ll talk about Git which is a version control software that gives you much more control (but requires more work) of the saved work than from something like Dropbox.10 Before getting into exactly how to use Git, we’ll talk first on what it is and how it’ll help your work. Git is also a very powerful and complex tool so this guide is going to be touching just a small - but useful to most researchers and R programmers - part of it.
With backup software such as Dropbox, it’ll save your work very frequently - so frequently in fact that I turn off Dropbox when I write R since it keeps interrupting me by saving at the moment I’m typing, which stops the typing. Below is the Dropbox page for some R code that I’ve been working on to scrape COVID data. Notice the timestamps - 4/5 of them are within one minute, showing how often Dropbox is saving changes. This is useful if I need the most recent update - or to share the most recent version with a collaborator. Here’s the big issue - and the one that Git solves - I have four versions within a minute of each other, what’s the difference between them? Dropbox is saving automatically and doesn’t indicate how they’re different (clicking on the file shows the complete file, not differences relative to some previous version) which means if I mess up some code a while ago, I can’t easily see which version is the one that works. With Git you can essentially wait until you’ve made enough changes to decide that these changes merit a new “version” of your work (One way to think about this is )
If you’re ever used the track changes feature on a Word Document(or Google Doc or Overleaf, etc.), the concept is similar. When you have this setting in a Word or Google Doc every time you (or anyone else) makes changes in that document, those changes, who made them, and when they occurred, is tracked. This makes it easy to see exactly what part of the file was changed and to undo that change if necessary. Below is an example of this feature on one of my drafts on Overleaf (basically a way to collaborate using LaTeX which is similar to RMarkdown). You can see each change that my co-author Aaron Chalfin made in the draft in the purpose changes the main part of the photo. The parts that were rewritten or added are highlighted in purple while the parts that were deleted have a purple . What is shown in purple isn’t all of the history of changes for this paper. If you look at the part on the right, highlighted in green, it shows what files were edited, by whom, and at what time. If you don’t like a change - or in R’s case more commonly, broke some code by accident - you can go back in the history of changes and return to an older version.
The way that R - and many other programming languages (and technically you can use this for any file or folder) does this “version control” is through Git.
The times you see are ones that Overleaf automatically set each change to.
You make changes to your code or RMarkdown file and the computer will track these changes.
21.2 Git basics
There are four main processes you need to know for a basic understanding of Git: checkout, add and commit, push, and pull. We’ll use the example of getting a book from the library to walk through using Git. The steps for this are simple, we go to the library, pick a book we want, check it out from the librarian, read it, and eventually return it. Using Git adds one wrinkle to this, we will want to write in the book and see what other people write too. Of course, when the book is checked out, no one else could write in our version, and no one can see what we write. So anything we write has to be done before we return the book to the library, then we check-out the book again to see what other people have written. When we want another book, we simply redo these steps.
|Library Steps||Git steps||Git code|
|Go to library|
|Find book and check-out book||Clone (usually will just be done once per project). RStudio helps with this. The code you see in the next cell is a little more complicated than normal since we’re using RStudio Server.||git clone path to repo, can be GitHub link|
|Read or write in book||This is done in R, not in Git||No git code, this is going to be whatever code we write in R. Also includes any outputs such as making a graph that is saved, RMarkdown outputs like a PDF, or even new R files.|
|Return book||Add & commit
|git add .
git commit –m “message indicating what we wrote”
|Check-out book again (to see what other people have written in it)||Pull||git pull|
Another way to think about commit vs push is that of writing an email. When you write an email, you’re essentially editing a blank document by adding the words of the email. When you save (but don’t send) the email, you are making a commit (essentially “committing” or promising to make a change). When you send the email you are making a push (taking something that you have written and changed and sending it to the main repository). While emails let you correspond directly between two or more people, how Git works is like sending the email to a central server (or a Post Office) and anyone who wants to read it has to go there. And when someone reads it and responds their email also goes to this central server. You have to go there to get their response (called a “pull” in Git terms) which is essentially an addition to your initial email.
21.3 When to commit
21.4 Code review
This came in handy for me as somehow one of my dissertation papers written in RMarkdown became empty a couple of months before my defense and I couldn’t undo that change. My Dropbox backup was older than my Git backup so having Git was a real time saver↩︎