## Sunday, May 10, 2015

### Using Git and GitHub to Archive Data

This blog post is for those of you who have never used Git or GitHub. I use Git and GitHub to archive my behavioral data. These data are uploaded to GitHub, an open web repository where they may be viewed by anyone at any time without any restrictions. This upload occurs nightly, that is, the data are available within 24 hours of their creation. The upload is automatic; no lab personnel are needed to start it or approve it. The upload is comprehensive in that all data files from all experiments are uploaded, even those that correspond to aborted experimental runs or pilot experiments. The data are uploaded with time stamps and with an automatically generated log. The system is versioned so that any changes to data files are logged, and both the new and old versions are saved. In summary, if we collect it, it is there, and it is transparent. I call data generated this way Born Open Data.

Since setting up the born-open-data system, I have gotten a few queries about Git and GitHub, the heart of the system.  Git is the versioning software; GitHub is a place on the web (github.com) where the data are stored.  They work hand in hand.

In this post, I walk through a few steps of setting up GitHub for archiving.  I take the perspective of Kirby, my dog, who wishes to archive the following four photos of himself:

Here are Kirby's steps:

1. The first step is to create a repository on the GitHub server.

1a.  Kirby goes to GitHub (github.com)  and signs up for a free account (last option).  Once the account is set up (with user name KirbyHerby) he is given a screen with a lot of options for exploring GitHub.  He ignores these as they are not relevant for his task.

1b.  To create his first repository on the server, Kirby presses the green button that says "+ New repository" on the bottom left.

1c. Kirby now has to make some choices about the repository.  He names it "data," enters a description of the repository, makes it public, initializes it with a README, and does not specify which files to ignore or a license.  He then presses the green "Create repository" button on the bottom, and is given his first view of the repository.

Kirby's repository is now at github.com/KirbyHerby/data, and he will bark out this URL to anyone interested.  The repository contains only the README.md file at this point.

2.  The next step is getting a linked copy of this repository on Kirby's local computer.

2a. Kirby downloads the GitHub application for his operating system (mac.github.com or windows.github.com), and on installation, chooses to install the command-line tools (trust me, you will use these some day).

2b.  Kirby enters his GitHub username ("KirbyHerby") and password.

2c. He next has to create a local repository and link it to the one on the server.  To do so, he chooses "Add repository" and is given a choice to "Add," "Create," or "Clone."  Since the repository already exists at GitHub, he presses "Clone."  A list of his repositories shows up, and in this case, it is a short list of one repository, "data."  Kirby then selects "data" and presses the bottom button, "Clone repository."  The repository now exists on the local computer under the folder "data."  There are now two separate copies of the same repository: one on the GitHub server and one on Kirby's local machine.
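
For readers who end up using the command-line tools from step 2a, the clone step comes down to a single `git clone` call. The Python below is only a minimal sketch of that idea, not what the GitHub application runs internally. The helper names `clone_command` and `clone` are hypothetical, and the HTTPS form of the repository URL is an assumption, since this post only gives the address as github.com/KirbyHerby/data.

```python
import subprocess


def clone_command(user, repo, dest=None):
    """Build the `git clone` command for a GitHub repository.

    Assumption: the repository is reachable at the usual HTTPS address,
    https://github.com/<user>/<repo>.git.
    """
    url = f"https://github.com/{user}/{repo}.git"
    cmd = ["git", "clone", url]
    if dest is not None:
        cmd.append(dest)  # optional name for the local folder
    return cmd


def clone(user, repo, dest=None):
    """Run the clone. Requires git to be installed and network access."""
    subprocess.run(clone_command(user, repo, dest), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    print(" ".join(clone_command("KirbyHerby", "data")))
```

Without a destination argument, `git clone` creates a folder named after the repository, which matches the "data" folder Kirby ends up with.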

3. Kirby wishes to add files to the server repository so others may see them.

3a. Kirby first adds the photo files to the local repository. He copies the photos into the repository folder in the usual way, which on Mac OS X is with the Finder.  The following screen shot shows the Finder window in the foreground and the GitHub client window in the background.  As can be seen, Kirby has added three files, and these show up in both applications.  Kirby has no more need for the Finder and closes it to get a better view of the local repository in the GitHub client window.

3b.  Kirby is now going to save the updated state of the local repository, which is called committing it.  Committing is a local action, and a commit can be thought of as a snapshot of the repository at this point in time.  Kirby turns his attention to the bottom part of the screen.  To commit, Kirby must add a log entry, which in this case is "Added three great photos."  The log will contain not only this message, but also a description of what files were added, when, and by whom.  The log message is enforced: one cannot make a commit without it.  Finally, Kirby presses "Commit to master."

3c.  Kirby now has to push his changes from the local repository to the GitHub server so everyone may see them.  He can do so by pressing the "sync" button.

That's it.  Kirby's additions are now available to everyone at github.com/KirbyHerby/data.

Suppose Kirby realizes that he has forgotten his absolute favorite photo, the one of him hugging his favorite toy, Panda.  So he copies the photo over in Finder, commits a new version of the repository with a new message, and syncs the local version with the one on the GitHub server.
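
The copy, commit, and sync cycle maps onto three Git commands: `git add`, `git commit`, and `git push`. The Python below is a rough sketch of how that cycle could be scripted, which is also the general shape an unattended nightly upload might take. It is not the script behind my own system, which this post does not show. The function names and the wording of the time-stamped message are hypothetical; `origin` is the name Git gives the server copy after a clone, and `master` is the branch named on the commit button above. Note also that the "sync" button may fetch changes from the server as well, whereas this sketch only pushes.

```python
import subprocess
from datetime import datetime


def sync_commands(message, remote="origin", branch="master"):
    """Return the git commands for: stage everything, commit, push."""
    return [
        ["git", "add", "--all"],           # stage new, changed, and deleted files
        ["git", "commit", "-m", message],  # local snapshot with a log message
        ["git", "push", remote, branch],   # send the new commit to the server
    ]


def timestamped_message(now=None):
    """Build a log message for an unattended run (wording is hypothetical)."""
    now = now or datetime.now()
    return "Automatic upload " + now.strftime("%Y-%m-%d %H:%M")


def sync(repo_dir, message):
    """Run the three commands inside repo_dir.

    Requires git and push access. `git commit` exits with an error when
    there is nothing new to commit, in which case check=True raises
    subprocess.CalledProcessError and the push is skipped.
    """
    for cmd in sync_commands(message):
        subprocess.run(cmd, cwd=repo_dir, check=True)


if __name__ == "__main__":
    # Show the commands rather than run them, so the sketch is safe to try.
    for cmd in sync_commands(timestamped_message()):
        print(" ".join(cmd))
```

Kirby's first commit would correspond to `sync("data", "Added three great photos.")` run from the folder that contains his local repository.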

There is a lot more to Git and GitHub than this.  Git and GitHub are very powerful, so much so that they are the default for open-source software development worldwide.  Multiple people may work on multiple parts of the same project.  Git and GitHub have support for branches, tagging versions, merging files, and resolving conflicts.  More about the system may be learned by studying the wonderful Git Book at git-scm.com/book/en/v2.

Finally, you may wonder why Kirby wanted to post these photos.  Well, Kirby doesn't know anything about Bayesian statistics, but he is loyal.  He knows I advocate Bayes factors.  He also knows that others who advocate ROPEs and credible intervals sell their wares with photos of dogs.  Kirby happens to believe that by posting these, he is contributing to my Bayes-factor cause.  After all, he is cuter than Kruschke's puppies and perhaps he is more talented.  He does know Git and GitHub and has his own repository to prove it.