Software Version Control with git


The ability to save previous versions of code and to work collaboratively within a team is essential to data science. Professional programmers and data scientists alike use version control software to track changes to their code.

The most widely used version control tool is git, which models the history of your changes as a graph of commits and has been popularized by github.com. There are three major steps to using git:

  1. Cloning a repository from the cloud to your own computer,
  2. Synchronizing changes on the cloud to your local cloned copy,
  3. Pushing changes you made to the files locally back to the cloud.
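To make the idea of git modeling changes as a graph concrete, here is a minimal sketch you can save as a script and run offline. It builds a throwaway repository in a temporary directory, makes commits on two branches, merges them, and prints the resulting graph. The branch name side, the file names, and the "Demo User" identity are placeholders chosen for illustration only.

```shell
set -eu
# Placeholder identity so the demo commits work on any machine.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"

echo "a" > a.txt
git add a.txt
git commit -q -m "First commit"
start=$(git rev-parse --abbrev-ref HEAD)   # name of the default branch

# One commit on a side branch...
git checkout -q -b side
echo "b" > b.txt
git add b.txt
git commit -q -m "Work on a side branch"

# ...and one on the starting branch, so the history forks.
git checkout -q "$start"
echo "c" > c.txt
git add c.txt
git commit -q -m "Work on the starting branch"

# Joining the two lines of work creates a commit with two parents.
git merge -q --no-edit side

# Draw the commit graph as text.
git log --graph --oneline --all
```

Each line of the output is one commit, and the drawn edges show which commits it was built on. The merge commit at the top has two parents, which is what makes the history a graph rather than a simple list.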

Cloning a Repository

Any time you begin working on a project with any existing code, you will need to clone a copy of the existing source code repository. This requires running a git clone command:

git clone <URL>

You'll usually only run this command once per project.
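If you want to try cloning without a network connection, the following sketch (meant to be saved and run as a script) creates a small stand-in for the cloud repository in a temporary directory and clones it. A local path takes the place of <URL> here. The names cloud.git, seed, and my-project are hypothetical.

```shell
set -eu
# Placeholder identity so the demo commit works on any machine.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

tmp=$(mktemp -d)

# Stand-in for the cloud: a bare repository seeded with one commit.
git init -q --bare "$tmp/cloud.git"
git clone -q "$tmp/cloud.git" "$tmp/seed" 2>/dev/null
cd "$tmp/seed"
echo "hello" > README.txt
git add README.txt
git commit -q -m "Add README"
git push -q origin HEAD

# The step from the text: clone the repository into a new folder.
cd "$tmp"
git clone -q "$tmp/cloud.git" my-project
cd my-project

git remote -v       # shows the address that "origin" points to
git log --oneline   # the history arrives together with the files
```

The clone remembers where it came from under the name origin, which is why the later pull and push commands do not need the URL again.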

Synchronizing Changes

Each time before you begin working on a project, it is best practice to make certain you have the latest code. Synchronizing your local clone with the (possibly) updated code in the cloud repository is quick and is done with a git pull:

git pull

You'll usually run this command each time you start working.
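As an offline illustration of what a pull does, the sketch below (run it as a script) simulates a teammate who pushes a change after you have cloned, and then pulls that change into your copy. The folders teammate and mine and the file notes.txt are hypothetical, and a local bare repository stands in for the cloud.

```shell
set -eu
# Placeholder identity so the demo commits work on any machine.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

tmp=$(mktemp -d)
git init -q --bare "$tmp/cloud.git"

# A teammate creates the first version and pushes it.
git clone -q "$tmp/cloud.git" "$tmp/teammate" 2>/dev/null
cd "$tmp/teammate"
echo "v1" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
git push -q origin HEAD

# You clone the project at this point.
git clone -q "$tmp/cloud.git" "$tmp/mine"

# Later, the teammate pushes an update you do not have yet.
echo "v2" >> notes.txt
git commit -q -am "Update notes"
git push -q origin HEAD

# The step from the text: bring your clone up to date.
cd "$tmp/mine"
git pull -q
cat notes.txt   # now contains both the v1 and v2 lines
```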

Merging New Code

When someone else's work involves many changes, it is common to collect all of the related changes on a branch. In DISCOVERY, we have a branch for each lab and project. Before starting to work on a new lab or project, you will need to merge in the starting files from the branch we've prepared. This will be done with a git merge:

git merge <branch_name>

You'll run this command every time you are adding a new major component (e.g., a new assignment) to your repository.
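The sketch below shows the effect of a merge in a throwaway local repository, run as a script. The branch name lab_demo and the file names are made up for illustration and are not the actual DISCOVERY branch names. In the course setup the prepared branch may live on a remote, so you may need to download it before merging; follow the course instructions for that part.

```shell
set -eu
# Placeholder identity so the demo commits work on any machine.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"

# Your repository so far.
echo "my notes" > notes.txt
git add notes.txt
git commit -q -m "Initial commit"
start=$(git rev-parse --abbrev-ref HEAD)   # name of the branch you work on

# Someone prepares starter files on a separate branch.
git checkout -q -b lab_demo
echo "starter code" > lab.txt
git add lab.txt
git commit -q -m "Add lab starter files"

# Back on your own branch, the starter files are not there yet.
git checkout -q "$start"
ls

# The step from the text: merge the prepared branch into yours.
git merge -q lab_demo
ls   # lab.txt now sits next to notes.txt
```

Because your branch had no new commits of its own in this demo, git simply moves it forward to include the starter files. When both sides have new commits, git creates a merge commit instead.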

Pushing Changes

When you have finished working, you need to "push" your changes to the cloud. For complex projects, you may push files to feature-specific branches or have rules on the style and frequency of your commits. However, for simple projects and when starting out, you'll often want to commit all your changed files. This requires a three-step process and a "commit message" that describes what changes you have made:

# Step 1 - Stage all changed files:
git add -A
# Step 2 - Commit the staged changes with a message describing them:
git commit -m "your commit message"
# Step 3 - Push the commit to the cloud:
git push
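To see what each of the three commands changes, it can help to run git status between them. The sketch below (run it as a script) does this against a local stand-in for the cloud, then checks that the cloud copy ends up on the same commit as your local copy. The names cloud.git, seed, work, and analysis.py are hypothetical.

```shell
set -eu
# Placeholder identity so the demo commits work on any machine.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

tmp=$(mktemp -d)

# Stand-in for the cloud, seeded with one commit, then cloned as "work".
git init -q --bare "$tmp/cloud.git"
git clone -q "$tmp/cloud.git" "$tmp/seed" 2>/dev/null
cd "$tmp/seed"
echo "demo project" > README.txt
git add README.txt
git commit -q -m "Add README"
git push -q origin HEAD
git clone -q "$tmp/cloud.git" "$tmp/work"
cd "$tmp/work"

# Do some work: create a new file.
echo "print('hello')" > analysis.py
git status --short   # prints "?? analysis.py" (new, not yet tracked)

# Step 1: stage everything that changed.
git add -A
git status --short   # prints "A  analysis.py" (staged for the next commit)

# Step 2: record the staged changes with a message.
git commit -q -m "Add analysis script"

# Step 3: send the new commit to the cloud.
git push -q

# Compare the latest commit locally and on the cloud copy.
branch=$(git rev-parse --abbrev-ref HEAD)
local_head=$(git rev-parse HEAD)
cloud_head=$(git -C "$tmp/cloud.git" rev-parse "refs/heads/$branch")
echo "local: $local_head"
echo "cloud: $cloud_head"
```

If the two printed identifiers match, the push succeeded and the cloud has your latest commit.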