Introduction

Background/Motivation

The popgenInfo project was designed to be an educational tool that can move at the same pace as method development by allowing people skilled in population genetics analysis to submit workflows that demonstrate the strengths of any given analysis in R. We use git and GitHub to accept contributions via pull requests using GitHub flow.

What is git?

Git is a version control program. It allows you to keep track of all changes that happen when you are writing, analyzing data, or coding. You can think of it as a sort of electronic lab notebook. Just like you would record any weights or measures carefully in your bench lab notebook, so should you record any analyses you do on the computer. With git, you “add” and “commit” changes that you make. These are analogous to writing down the values of your measurements and noting how you took them.

The basic steps of the process are:

  i. Fork the popgenInfo repository.*
  ii. Clone your fork to your local machine.*
  1. Check out your master branch, fetch the changes from NESCent, and merge them into your fork.^
  2. Create a new branch and then add and commit new changes or content.
  3. Push your changes to your fork.
  4. Open a pull request from your fork to NESCent.

* These should only be done once.
^ Start here if you already have cloned your fork.

This particular tutorial will show you the basics of contributing to the popgenInfo website using the R package git2r, which was developed by the rOpenSci project. This package allows for a nearly seamless and uniform interface for all contributors regardless of operating system. A companion set of videos to go along with this tutorial can be found at https://www.youtube.com/playlist?list=PLSFzyC3wp8-doiOcjDLlryIojK29qX2NL.

Requirements

We assume that you have already done a few things:

  1. Sign up for a GitHub account
  2. Install R
  3. Install Rstudio

If you have also installed git, you can set up Rstudio to work with git to help you commit changes, pull, and push. For the purposes of this tutorial, these things are not necessary.

Step i: Forking the repository (DO ONCE)

If you have already forked and cloned the repository, go to Step 1.

Go to https://github.com/NESCent/popgenInfo and click on the “Fork” button. This will create a copy of the NESCent popgenInfo repository in your account. After you’ve forked the repository, you never have to fork it again!

Step ii: Clone your fork to your computer (DO ONCE)

Now that you have your repository forked to your account, you will need to clone it to your computer. Just like forking, you only have to clone your fork to your computer once. It will live there until you decide to remove it. We will need two things to clone our repository:

  1. a place you want to store this repository (e.g. a folder called forks in your Documents)
  2. git2r
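Before going further, it can help to confirm that git2r is actually installed. The following is a minimal sketch, not part of the original workflow; the variable name has_git2r is only illustrative, and installation is left as a message rather than run automatically.

```r
# Check whether git2r is available without attaching it.
has_git2r <- requireNamespace("git2r", quietly = TRUE)

if (has_git2r) {
  message("git2r ", as.character(utils::packageVersion("git2r")), " is installed")
} else {
  # Installation is left to the reader so nothing is installed by surprise.
  message('git2r is not installed; run install.packages("git2r") first')
}
```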

Create your forks folder

First, we need to make sure that we are working in the folder where we want to set up our fork. In this tutorial, we will use setwd() to do this via the R console, but it is possible to do this via Rstudio.

If you are using Windows:

fork_dir <- "~/forks"

If you are using OSX:

fork_dir <- "~/Documents/forks"

Now, we create the directory and then move there:

dir.create(fork_dir) # create the folder
setwd(fork_dir)      # move into it
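If you run the tutorial more than once, dir.create() will warn that the folder already exists. A small helper can avoid that. This is only a sketch: ensure_dir() is a hypothetical name, and the example uses a folder inside tempdir() so that it does not touch your real Documents.

```r
# Hypothetical helper: create a folder only when it is missing,
# then return its full path.
ensure_dir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  normalizePath(path)
}

# Demonstration in a temporary location (an assumption for this example).
demo_dir <- ensure_dir(file.path(tempdir(), "forks"))
list.files(demo_dir)
```

Calling ensure_dir() a second time on the same path simply returns the path again.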

We can use the function list.files() to see what’s inside the directory:

list.files()
## character(0)

Currently, there’s nothing here. We will use the clone() function from git2r to make a copy of our repository. If you type help("clone", "git2r") in your R console, you can see documentation about the function. We need two things from this function:

  1. The URL of the git repository
  2. The path to our folder where we want to store it

The URL for the git repository is simply the URL for your fork with .git at the end of it. For mine, it’s https://github.com/zkamvar/popgenInfo.git. To keep things simple, we will name the folder we want to put the repository in “popgenInfo”.

library("git2r")
clone(url = "https://github.com/zkamvar/popgenInfo.git",
      local_path = "popgenInfo")
## Loading required package: methods
## cloning into '/tmp/Rtmp0tQOP9/popgenInfo'...
## Receiving objects:   1% (25/2446),    9 kb
## Receiving objects:  11% (270/2446),   56 kb
## Receiving objects:  21% (514/2446),  104 kb
## Receiving objects:  31% (759/2446), 1949 kb
## Receiving objects:  41% (1003/2446), 7413 kb
## Receiving objects:  51% (1248/2446), 11678 kb
## Receiving objects:  61% (1493/2446), 11719 kb
## Receiving objects:  71% (1737/2446), 11743 kb
## Receiving objects:  81% (1982/2446), 11799 kb
## Receiving objects:  91% (2226/2446), 11999 kb
## Receiving objects: 100% (2446/2446), 12005 kb, done.
## Local:    master /tmp/Rtmp0tQOP9/popgenInfo/
## Remote:   master @ origin (https://github.com/zkamvar/popgenInfo.git)
## Head:     [0f515c1] 2016-07-18: Add publication and citation to README and contribution guidelines (#192)
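The URL passed to clone() above follows a fixed pattern, so it can be built from your GitHub user name. The helper below is a sketch; fork_url() is a hypothetical name and is not part of git2r.

```r
# Hypothetical helper: build the clone URL for a fork from a GitHub user name.
fork_url <- function(user, repo = "popgenInfo") {
  sprintf("https://github.com/%s/%s.git", user, repo)
}

fork_url("zkamvar")
```

Replace "zkamvar" with your own user name to get the URL for your fork.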

Now when we look at our directory, we can see that we have one folder called “popgenInfo”:

list.files()
## [1] "popgenInfo"

We can move into the folder with setwd() and look at all the files in there.

setwd("popgenInfo")
list.files()
##  [1] "build"                       "circle.yml"                 
##  [3] "CONDUCT.md"                  "CONTRIBUTING_WITH_GIT2R.Rmd"
##  [5] "CONTRIBUTING.md"             "data"                       
##  [7] "DATAFORMATS.md"              "DESCRIPTION"                
##  [9] "develop"                     "developer_example.Rmd"      
## [11] "GETSTARTED.md"               "images"                     
## [13] "index.md"                    "install"                    
## [15] "LEARN.md"                    "LICENSE"                    
## [17] "Makefile"                    "PACKAGES.md"                
## [19] "popgenInfo.Rproj"            "R_MARKDOWN.md"              
## [21] "README.md"                   "SNP.md"                     
## [23] "SSR.md"                      "TEMPLATE.Rmd"               
## [25] "use"                         "USINGR.md"                  
## [27] "WORKFLOWS.md"

Now we have successfully cloned our repository onto our computer using git2r. Next, we will set up our credentials and keep our master branch up to date.

Setting up your clone

To access the repository, all you have to do is open it by double-clicking on the “popgenInfo.Rproj” file. This will open Rstudio to this folder and set it as the working directory. After that, in your R console, type:

library("git2r") # in case you're starting from here
repo <- repository()
repo
## Local:    master /tmp/Rtmp0tQOP9/popgenInfo/
## Remote:   master @ origin (https://github.com/zkamvar/popgenInfo.git)
## Head:     [0f515c1] 2016-07-18: Add publication and citation to README and contribution guidelines (#192)

This tells R that you have a repository in this folder and that you are referring to it as repo. The output of repo is a summary of your repository:

  • Local: This is the branch (master) and the path where the copy of your repository is.
  • Remote: This shows you the branch and the URL of the remote repository.
  • Head: This is a summary of the last commit containing the date and a description.

In order to be able to sync your clone with your fork, GitHub needs to know that you are really who you say you are. This means that you need to associate your name, email, and a secret token with this repository. The first two items are easily done within R. You can use the function config() to set and confirm your name and email:

config(repo, 
       user.name = "Zhian Kamvar",
       user.email = "kamvarz@science.oregonstate.edu")
config(repo)
## system:
##         credential.helper=cache --timeout=3600
##         push.default=simple
## local:
##         branch.master.merge=refs/heads/master
##         branch.master.remote=origin
##         core.bare=false
##         core.filemode=true
##         core.logallrefupdates=true
##         core.repositoryformatversion=0
##         remote.origin.fetch=+refs/heads/*:refs/remotes/origin/*
##         remote.origin.url=https://github.com/zkamvar/popgenInfo.git
##         user.email=kamvarz@science.oregonstate.edu
##         user.name=Zhian Kamvar

Next, we need to set our secret token. Known as a Personal Access Token (PAT), this is a 40-character cryptographic token that works like a long and complicated password. Both a PAT and a password allow access to your GitHub repository, but a PAT can be stored as text on your computer (which you should NEVER do with passwords) because it is easy to generate and (most importantly) easy to revoke. We will create a PAT on GitHub and store it in a file called .Renviron, which R reads every time you restart your session, so the token is available whenever you need to push changes to your fork on GitHub. Instructions for generating a PAT can be found in GitHub’s documentation on personal access tokens.

Here’s an example of what a .Renviron file with a PAT will look like on the inside:

GITHUB_PAT="97cc6bf86c31a42fca2de32884cd1f1c4b1102ba"

Once you have created your PAT, make a text file called .Renviron inside of the repository and paste your PAT in that file as shown above. This will not be tracked by git, and will only exist on your computer.

The next time you open R from popgenInfo.Rproj, this token will be available for GitHub to ensure that you are who you say you are. It is important to note that, if you change or delete that PAT, you must replace it here.
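To check that R can see a token, you can query the environment variable. The sketch below is self-contained: it writes a throwaway file with a dummy value under the made-up name DEMO_GITHUB_PAT and loads it with readRenviron(), so it never touches your real token. The helper has_token() is hypothetical.

```r
# Write a throwaway environment file with a dummy token (not a real PAT).
demo_env <- tempfile("Renviron-demo")
writeLines('DEMO_GITHUB_PAT="0000000000000000000000000000000000000000"', demo_env)

# Load the file the same way R loads .Renviron at startup.
readRenviron(demo_env)

# Hypothetical helper: TRUE when the variable is set and not empty.
has_token <- function(name = "GITHUB_PAT") {
  nzchar(Sys.getenv(name))
}

has_token("DEMO_GITHUB_PAT")
```

In a real session you would call has_token() with no arguments after restarting R from popgenInfo.Rproj.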

In the next step, we will ensure that our fork is up to date with NESCent. In order to do that, we need to tell our local copy that it can take updates from NESCent. We do this by adding NESCent as a remote called ‘upstream’ using the function remote_add() and checking to make sure it worked with the function remotes().

remote_add(repo, 
           name = "upstream", 
           url = "https://github.com/NESCent/popgenInfo.git")

# Check our remotes
remotes(repo)
## [1] "origin"   "upstream"
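Running remote_add() a second time with the same name gives an error, so it can be convenient to add the remote only when it is missing. This is a sketch on an empty demonstration repository created in a temporary folder; add_upstream() is a hypothetical helper, and nothing is downloaded.

```r
library("git2r")

# Create an empty demonstration repository in a temporary folder.
demo_path <- tempfile("remote-demo")
dir.create(demo_path)
demo_repo <- init(demo_path)

# Hypothetical helper: add the 'upstream' remote only if it is not there yet.
add_upstream <- function(repo, url) {
  if (!"upstream" %in% remotes(repo)) {
    remote_add(repo, name = "upstream", url = url)
  }
  remote_url(repo, "upstream")
}

nescent <- "https://github.com/NESCent/popgenInfo.git"
add_upstream(demo_repo, nescent)
add_upstream(demo_repo, nescent) # safe to repeat
remotes(demo_repo)
```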

With all of this information set, we can begin working on our repository.

Step 1: Keeping your fork up to date

The first thing we need to do is make sure that our master branch is up to date. Since other people are also contributing to this project, it is possible for changes to be made even within seconds of you forking the repository. In this section we will show you how you can keep your master branch up to date in your local repository and your fork on GitHub.

Let’s assume that you’ve just opened a fresh Rstudio session with the popgenInfo.Rproj file. First, let’s load git2r and access our repository.

library("git2r")
repo <- repository()

Now that we have our repository loaded, we should make sure that we are on the master branch. We can do this using the checkout() function from git2r.

checkout(repo, "master")
repo
## Local:    master /tmp/Rtmp0tQOP9/popgenInfo/
## Remote:   master @ origin (https://github.com/zkamvar/popgenInfo.git)
## Head:     [0f515c1] 2016-07-18: Add publication and citation to README and contribution guidelines (#192)

Now, whether or not we suspect changes from NESCent, it’s always good form to update your local repository before creating any new branches. We can do this by first fetching ALL changes that have happened on the NESCent popgenInfo repository, and then merging the NESCent master branch into our master branch. We will use the git2r functions fetch() and merge() to accomplish this.

First, we fetch all of the changes that have happened. If there are no changes, nothing will appear, but if there is new content on the master branch or new branches on the repository, they will be downloaded and each branch will be printed to your screen. For our purposes, we are only concerned with the changes on the master branch, which will look something like this: ## [new] b62a78c1fcd384a3ea23 refs/remotes/upstream/master.

fetch(repo, name = "upstream")
## [new]     b400fbaf7217875334ac refs/remotes/upstream/add-citation
## [new]     de9b2fd75306e0ae3128 refs/remotes/upstream/gh-pages
## [new]     2cb0623835daad482dc0 refs/remotes/upstream/master
## [new]     b0cb86880469dce8cabb refs/remotes/upstream/pr/194
## [new]     baa1358367cedf7a7fda refs/remotes/upstream/site-images
## [new]     16b52608de7f5ed41d04 refs/remotes/upstream/summaryseqstats
## [new]     22c85ea7575d25be7df2 refs/remotes/upstream/update-description
## [new]     8840ff6e3782db926b03 refs/remotes/upstream/update-makefile-doc
## [new]     c7e4171b3786e398e39b refs/remotes/upstream/web-devel
## [new]     021b4de353dff06f0c44 refs/tags/v1.0.0

It’s important to note that our local repository is still unchanged at this point as we have just downloaded the updates from NESCent. To update our local repository, we should merge the changes from NESCent’s master branch into our master branch:

merge(repo, "upstream/master")
## Merge: Fast-forward
repo
## Local:    master /tmp/Rtmp0tQOP9/popgenInfo/
## Remote:   master @ origin (https://github.com/zkamvar/popgenInfo.git)
## Head:     [2cb0623] 2017-01-03: update DESCRIPTION

If there were any changes, the Head field will have a more current commit than when we initialized it above.

Since we will never work directly on the master branch, there should be no conflicts from the NESCent branch and we now know that our branch is up to date. Assuming everything worked, we can push this to our fork on GitHub. We will use git2r’s push() function to do this. We also need to supply our PAT in order for this to work. To do so, we will use the git2r function cred_token():

cred <- cred_token()
push(repo, credentials = cred)

Now, when you check your fork, it should be updated with the most recent changes. From here, we can checkout a new branch and then proceed with adding our changes or new workflow.

Step 2: Creating a New Branch

When you add contributions, the best practice is to create a new branch off of the master. A branch is a term that indicates a sort of sandbox for a repository that can become permanent. By default, git repositories all have a “master” branch. When you want to experiment with something else, add new content, or simply fix a typo, but you don’t want to disturb the original copy, you create a new branch. Good practice is to name the branch in a way that succinctly describes what you are doing with the branch. In this step, we will create a branch, push it to our fork, and track it.

SETUP

Most importantly we need to make sure that we are on the master branch and it is up to date. To double plus make sure that you are on the master branch, check it out:

checkout(repo, "master")
repo
## Local:    master /tmp/Rtmp0tQOP9/popgenInfo/
## Remote:   master @ origin (https://github.com/zkamvar/popgenInfo.git)
## Head:     [2cb0623] 2017-01-03: update DESCRIPTION

For reasons that will reveal themselves later, please run these function definitions:

my_branch <- function(x) paste0("refs/heads/", x)
my_origin <- function(x) paste0("origin/", x)
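To see what these two helpers produce, you can call them on a branch name. The definitions are repeated here so the example runs on its own.

```r
# Same helper definitions as above, repeated so this example is self-contained.
my_branch <- function(x) paste0("refs/heads/", x)
my_origin <- function(x) paste0("origin/", x)

# The full reference name of a local branch, used when pushing.
my_branch("2015-12-16-spatial-stats")
## [1] "refs/heads/2015-12-16-spatial-stats"

# The name of the matching branch on the fork, used when tracking.
my_origin("2015-12-16-spatial-stats")
## [1] "origin/2015-12-16-spatial-stats"
```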

And, of course, don’t forget your credentials:

cred <- cred_token()

Creation

Assuming you’ve updated your master branch as shown above, we can create a branch from the master using the function checkout(). Let’s assume that we want to create a vignette for analyzing spatial statistics. I’m going to lay out a few steps here all at once because these are the steps you want to take when you create a brand new branch and make sure it exists on your computer and on your fork:

  1. save the branch name as a variable
  2. checkout the branch
  3. push the branch to the fork
  4. set your local repository to track the fork

In the first step, we want to name the branch. Since we are creating a new vignette, it’s ideal for the branch name to be the same as the vignette. As shown in the Best Practices guidelines, you should name this vignette with the date and the subject. Since we are contributing a vignette on spatial stats and committing for the first time on 2015-12-16, the new branch and vignette will be called “2015-12-16-spatial-stats”.
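The date-plus-subject convention can be written as a small function. This is a sketch only; branch_name() is a hypothetical helper and the exact rules for cleaning up the subject are an assumption.

```r
# Hypothetical helper: build a branch name from a subject and a date.
branch_name <- function(subject, date = Sys.Date()) {
  # Lower-case the subject and turn runs of other characters into hyphens.
  slug <- gsub("[^a-z0-9]+", "-", tolower(subject))
  # Drop any hyphen left at the start or end.
  slug <- gsub("^-|-$", "", slug)
  paste(format(as.Date(date), "%Y-%m-%d"), slug, sep = "-")
}

branch_name("Spatial stats", "2015-12-16")
## [1] "2015-12-16-spatial-stats"
```

Leaving out the date uses today’s date, which is usually what you want for a new vignette.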

In code, the steps above would look like this:

BRANCH <- "2015-12-16-spatial-stats"                                         # 1
checkout(repo, BRANCH, create = TRUE)                                        # 2
push(repo, name = "origin", refspec = my_branch(BRANCH), credentials = cred) # 3
branch_set_upstream(head(repo), my_origin(BRANCH))                           # 4

Let’s examine these one by one. First, you create a variable that will hold the name of the branch you want to create. We are using this convention to allow us to use this name multiple times. After that, we create that branch from our master branch. We can take a look at what our branch looks like at that point.

BRANCH <- "2015-12-16-spatial-stats"  # 1
checkout(repo, BRANCH, create = TRUE) # 2
repo
## Local:    2015-12-16-spatial-stats /tmp/Rtmp0tQOP9/popgenInfo/
## Head:     [2cb0623] 2017-01-03: update DESCRIPTION

Notice here that you now only have two lines in the output, Local and Head. This is because the branch doesn’t exist on your GitHub fork. This is where the next two steps come in. We will push that branch and then make sure that our branch is tracking the copy on the fork.

push(repo, name = "origin", refspec = my_branch(BRANCH), credentials = cred) # 3
branch_set_upstream(head(repo), my_origin(BRANCH))                           # 4
repo
## Local:    2015-12-16-spatial-stats /tmp/Rtmp0tQOP9/popgenInfo/
## Remote:   2015-12-16-spatial-stats @ origin (https://github.com/zkamvar/popgenInfo.git)
## Head:     [2cb0623] 2017-01-03: update DESCRIPTION

We can now see from the output that we have a Remote set up. Once we have this, we can start making changes! If you are adding a new vignette, please copy TEMPLATE.Rmd to the use/ directory and give it a new name that is meaningful to your contribution (good practice is to name it the same as your branch). If you are including any data, place it in the data/ directory. Once you have placed these files in the repository, you need to add them and then commit, which we will show you how to do below.

Step 3: Adding content, committing, and pushing

If you are using Rstudio, you can use it to integrate with git. This allows you to do things like commit, push, and pull. Hadley Wickham, chief scientist at Rstudio, has put up a helpful tutorial on using git through Rstudio. If you aren’t using Rstudio, but you don’t want to access the terminal (command line), this section will show you basic git commands with git2r that will allow you to work on your vignette and keep it up to date.

Recall that using git is analogous to keeping a good, detailed lab notebook. When you make any changes, you record those changes (git add) and then say why you made the changes (git commit).

Let’s say you’ve already copied the TEMPLATE.Rmd file to the use/ directory and renamed it to 2015-12-16-spatial-stats.Rmd and added some data called spatial_data.csv. When you place a file in your repository, git will not pay attention to any changes you do until you specifically add it. Until then it is “Untracked”. You can see this by running status():

status(repo)
## Untracked files:
##  Untracked:  data/spatial_data.csv
##  Untracked:  use/2015-12-16-spatial-stats.Rmd

Since we’ve just added the files, we can use add() to tell git we want to stage them for committing.

add(repo, "*") # adding all new or changed files
status(repo)
## Staged changes:
##  New:        data/spatial_data.csv
##  New:        use/2015-12-16-spatial-stats.Rmd

When we have done this, we can commit our changes by using commit(). A commit is a record of what you did and, importantly, why you did it. Your commit message should capture both. The first line of a commit message should be a short summary that fits on a single line. Optionally, if you want to add more detail, you can enter more lines below the summary, separated from it by a blank line:

msg <- " started spatial stats vignette and added data

I copied the template to the use folder and placed
spatial data in the data folder.
"
commit(repo, msg, session = TRUE)
## [0d26046] 2017-01-04:  started spatial stats vignette and added data
status(repo) # should return nothing
## working directory clean
repo
## Local:    2015-12-16-spatial-stats /tmp/Rtmp0tQOP9/popgenInfo/
## Remote:   2015-12-16-spatial-stats @ origin (https://github.com/zkamvar/popgenInfo.git)
## Head:     [0d26046] 2017-01-04:  started spatial stats vignette and added data

Note that we put session = TRUE in the commit function. This automatically puts your R session information in the commit, which gives more information about package versions you were working with when you were editing the vignette.
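If you would like to practice the add and commit cycle without touching your fork, you can try it on a throwaway repository. The following is a sketch under that assumption: the repository lives in a temporary folder, and the user name, email, and file contents are made up for the example.

```r
library("git2r")

# A throwaway repository in a temporary folder.
demo_path <- tempfile("commit-demo")
dir.create(demo_path)
demo_repo <- init(demo_path)
config(demo_repo, user.name = "Demo User", user.email = "demo@example.org")

# A new file starts out untracked.
writeLines(c("x,y", "1,2"), file.path(demo_path, "spatial_data.csv"))
before <- status(demo_repo)

# Staging the file moves it from 'untracked' to 'staged'.
add(demo_repo, "spatial_data.csv")
staged <- status(demo_repo)

# Committing records the staged change with a message.
commit(demo_repo, "Add demo spatial data")
after <- status(demo_repo)

length(commits(demo_repo)) # number of commits recorded so far
```

The three status objects (before, staged, after) show the file moving from untracked, to staged, to a clean working directory.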

You can see a summary of what you just did by looking at the summary of your commit:

summary(commits(repo)[[1]])
## Commit:  0d26046ed825d9dcab8086a88d86be808f0814ff
## Author:  Zhian Kamvar <kamvarz@science.oregonstate.edu>
## When:    2017-01-04 14:47:27
## 
##       started spatial stats vignette and added data
##      
##      I copied the template to the use folder and placed
##      spatial data in the data folder.
##      
##      
##      sessionInfo:
##      R version 3.3.2 (2016-10-31)
##      Platform: x86_64-pc-linux-gnu (64-bit)
##      Running under: Debian GNU/Linux 8 (jessie)
##      
##      locale:
##       [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##       [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##       [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##       [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##       [9] LC_ADDRESS=C               LC_TELEPHONE=C            
##      [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
##      
##      attached base packages:
##      [1] methods   stats     graphics  grDevices utils     datasets  base     
##      
##      other attached packages:
##      [1] git2r_0.16.0
##      
##      loaded via a namespace (and not attached):
##       [1] backports_1.0.4 magrittr_1.5    rprojroot_1.1   tools_3.3.2    
##       [5] htmltools_0.3.5 yaml_2.1.14     Rcpp_0.12.8     stringi_1.1.2  
##       [9] rmarkdown_1.3   knitr_1.15.1    stringr_1.1.0   digest_0.6.10  
##      [13] evaluate_0.10  
## 2 files changed, 127 insertions, 0 deletions
## data/spatial_data.csv            | -0 +  6  in 1 hunk
## use/2015-12-16-spatial-stats.Rmd | -0 +121  in 1 hunk

Now you can see that your commit message was recorded in the repository. Since the Remote field shows that this branch is tracking our fork, we can simply use push() to push our changes up to the fork.

push(repo, credentials = cred_token())

Step 4: Pull Request and Peer Review

Now that you’ve successfully set up git, created a new branch, and created a new vignette, it’s time to fill it with content. You can work in your vignette like you would write in any other Rmarkdown document and keep track of your changes like we showed above by adding and committing your changes followed by pushing these up to your fork. Once you are finished with all of your changes, you can create a pull request from your GitHub fork to NESCent.

Creating a pull request

Through R

You can tell R to open a browser for you to create your pull request. The URL for this is in the form of https://github.com/USER/popgenInfo/pull/new/BRANCH. We can create this URL by using the function paste() to glue the pieces together and then use the utils::browseURL() function to open up a window. Note we are using the variable called BRANCH that we named above.

(my_account <- dirname(remote_url(repo, "origin")))
## [1] "https://github.com/zkamvar"
(pull_request_url <- paste(my_account, "popgenInfo/pull/new", BRANCH, sep = "/"))
## [1] "https://github.com/zkamvar/popgenInfo/pull/new/2015-12-16-spatial-stats"

Now that you have the URL, you can browse to it.

utils::browseURL(pull_request_url)
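The same URL can also be built directly from the fork’s clone URL, without hard-coding the repository name. This is a sketch; pr_url() is a hypothetical helper that only manipulates text and does not contact GitHub.

```r
# Hypothetical helper: turn a fork's clone URL and a branch name
# into the URL for opening a new pull request.
pr_url <- function(fork_url, branch) {
  base <- sub("\\.git$", "", fork_url) # drop the trailing ".git"
  paste(base, "pull/new", branch, sep = "/")
}

pr_url("https://github.com/zkamvar/popgenInfo.git", "2015-12-16-spatial-stats")
## [1] "https://github.com/zkamvar/popgenInfo/pull/new/2015-12-16-spatial-stats"
```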

Manually

If you don’t want to do this from R, you can do it through your web browser.

  1. Navigate to your fork on GitHub
  2. Click on the dropdown menu on the left that says “Branch: master” and switch to the branch you want to create the pull request from.
  3. Click on the green button that says “New Pull Request”.

Peer Review

Your pull request begins a process of open peer review where content, accuracy, and style are considered. If changes are suggested, you should address them by making changes on your branch and repeating the process in Step 3. Your pull request will be automatically updated once you push the changes to your fork.

Finally, when all questions and concerns have been addressed, the pull request may be merged into the NESCent repository as long as two of the maintainers have signaled their approval.

Conclusions

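From this tutorial, you should have learned how to:

  • fork the popgenInfo repository and clone your fork to your computer
  • keep your master branch up to date with NESCent
  • create a new branch for your contribution
  • add, commit, and push your changes to your fork
  • open a pull request from your fork to NESCent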

The skills demonstrated in this tutorial are not exclusive for popgenInfo, but they can be used when writing up your own reports or analyses. Using git may not feel comfortable at this moment, but just like it takes practice to learn how to keep a lab notebook, with practice, you will become comfortable using git to keep track of your workflows, making your science more reproducible.

What’s Next?

Once your pull request passes peer review and is published, you can update your fork by starting from Step 1: checkout your master branch, fetch the changes from NESCent, merge those changes, and push them up to the master branch on your fork to make it even with NESCent.
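Because you will repeat this update every time you start new work, it may be convenient to wrap the four calls from Step 1 in one function. This is only a sketch: sync_master() is a hypothetical name, it assumes the ‘upstream’ remote and your PAT are already set up as described above, and it is defined here but not run.

```r
# Hypothetical wrapper around the Step 1 commands.
# Assumes 'upstream' points at NESCent and 'cred' comes from cred_token().
sync_master <- function(repo, cred) {
  git2r::checkout(repo, "master")       # switch to the master branch
  git2r::fetch(repo, name = "upstream") # download changes from NESCent
  merge(repo, "upstream/master")        # bring them into local master
  git2r::push(repo, credentials = cred) # update master on your fork
  invisible(repo)
}

# Example call (not run here, since it needs your clone and a valid PAT):
# repo <- git2r::repository()
# sync_master(repo, git2r::cred_token())
```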

Contributors

Session Information

This shows us useful information for reproducibility. Of particular importance are the versions of R and the packages used to create this workflow. It is considered good practice to record this information with every analysis.

options(width = 100)
devtools::session_info()
## Session info ---------------------------------------------------------------------------------------
##  setting  value                       
##  version  R version 3.3.2 (2016-10-31)
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       <NA>                        
##  date     2017-01-04
## Packages -------------------------------------------------------------------------------------------
##  package   * version date       source        
##  backports   1.0.4   2016-10-24 CRAN (R 3.3.2)
##  devtools    1.12.0  2016-12-05 CRAN (R 3.3.2)
##  digest      0.6.10  2016-08-02 CRAN (R 3.3.2)
##  evaluate    0.10    2016-10-11 CRAN (R 3.3.2)
##  git2r     * 0.16.0  2016-11-20 CRAN (R 3.3.2)
##  htmltools   0.3.5   2016-03-21 CRAN (R 3.3.2)
##  knitr       1.15.1  2016-11-22 CRAN (R 3.3.2)
##  magrittr    1.5     2014-11-22 CRAN (R 3.3.2)
##  memoise     1.0.0   2016-01-29 CRAN (R 3.3.2)
##  Rcpp        0.12.8  2016-11-17 CRAN (R 3.3.2)
##  rmarkdown   1.3     2016-12-21 CRAN (R 3.3.2)
##  rprojroot   1.1     2016-10-29 CRAN (R 3.3.2)
##  stringi     1.1.2   2016-10-01 CRAN (R 3.3.2)
##  stringr     1.1.0   2016-08-19 CRAN (R 3.3.2)
##  withr       1.0.2   2016-06-20 CRAN (R 3.3.2)
##  yaml        2.1.14  2016-11-12 CRAN (R 3.3.2)
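If devtools is not installed, base R can record similar information. The sketch below saves the output of sessionInfo() to a text file; the file name and the temporary location are assumptions for the example.

```r
# Save the session information next to an analysis (here: a temporary file).
session_file <- file.path(tempdir(), "session-info.txt")
writeLines(capture.output(utils::sessionInfo()), session_file)

# Peek at the first few lines of the saved record.
head(readLines(session_file), 3)
```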