Git Essentials for a Data Scientist

Version Control 101

Version control is all about managing changes to files and directories by one or many contributors. Git is an incredibly popular system for version control and the one we will be running through for this course.

Version control in general, and Git specifically, brings several benefits: a view of the historical changes made to your project, automatic notification of conflicting work (where two individuals write conflicting changes to the same lines of code), and support for collaboration across many individuals, which allows teams to grow.

What is a Repository?

You’ve likely heard the term many times… repository.

All of your data science projects managed with Git will have two main components. The first is all of the work you’re doing in your files and directories: your scripts, your models, and where and how they’re stored. The other piece is the information that Git holds onto to maintain a record of all of the changes that have been made to your project over time.

When you add those pieces together, you have yourself a repository, or as the cool kids call it… a repo 😉
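To make the two pieces concrete, here is a minimal sketch that builds a tiny repo from scratch. The folder is a throwaway temporary directory and train.py is just a placeholder file name, not anything from a real project.

```shell
# Create a throwaway project folder (hypothetical location)
repo_dir="$(mktemp -d)"
cd "$repo_dir"

# Piece one: your actual work, here a single placeholder script
echo "print('training model')" > train.py

# Piece two: Git's own record keeping, stored in a hidden .git directory
git init -q

# List everything, including hidden entries: train.py and .git
ls -A
```

Everything Git knows about your project's history lives inside that .git directory, which is why you rarely need to touch it by hand.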

Basic Commands For You to Know

Git status

Git status lets you know the current state of your project: which files have changes in the “staging area”, which files have changed but are not yet staged, and which files Git is not tracking at all.

The staging area is where you put the changes you want to include in your next commit. It’s effectively you prepping a variety of letters and putting them in a box ready to send. Whether you want to remove things from here or add more is up to you, but once you hand the box to the mailman, those changes are recorded in your project’s history together. Git status will tell you which file(s) are in the box ready to go, along with anything you’ve changed that hasn’t made it into the box yet.

git status
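Here is a small sketch of what that looks like in a brand new repo. The temporary folder and the params.txt file are made up for illustration.

```shell
# Set up a throwaway repo (hypothetical location and file name)
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init -q

# Create a file that Git has never seen before
echo "alpha = 0.1" > params.txt

# Full, human-readable report: params.txt is listed as untracked
git status

# Compact report: "??" marks an untracked file
status_short="$(git status --short)"
echo "$status_short"
```

The compact form is handy once you know what the two-character codes mean; the full form is friendlier while you are learning.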

Git add

If you ran git status and found there was nothing in your staging area, not to worry! You first need to add files to the staging area. You can do so with git add filename. Whatever filename you add here will be moved to the staging area. That means the changes in that file, as they stand at the moment you run git add, are ready to be committed to the repo.
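As a quick sketch, staging a single file might look like the following. Again, the temporary folder and params.txt are placeholders.

```shell
# Set up a throwaway repo (hypothetical location and file name)
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init -q
echo "alpha = 0.1" > params.txt

# Put the file in the "box": move its changes into the staging area
git add params.txt

# Confirm what is staged; git status now lists it under changes to be committed
git status
staged_files="$(git diff --cached --name-only)"
echo "$staged_files"
```

If you change params.txt again after this point, you would need to run git add once more to stage the newer version.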

Git diff

Now we can see which files are in the staging area with git status, but what if you want to see what has actually changed? You can use what’s called git diff. Run on its own, git diff shows the changes in your working files that have not yet been staged. The output labels the old version of each file as a and the new version as b, with removed lines prefixed by - and added lines prefixed by +.

When running git diff, you might actually run git diff -r HEAD. HEAD refers to the most recent commit, so this compares your current files against that commit, covering both staged and unstaged changes. The -r flag is often shown alongside it but is optional here; git diff HEAD gives the same comparison. If you want to see the changes to one file in particular, you can include the file path after HEAD. Something to the effect of git diff -r HEAD filepath.
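The difference between a plain git diff and a diff against HEAD is easiest to see side by side. This is a sketch in a throwaway repo; the user name, email, and params.txt are all placeholders.

```shell
# Set up a throwaway repo with one commit (all names are hypothetical)
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "alpha = 0.1" > params.txt
git add params.txt
git commit -q -m "Add initial parameters"

# Edit the file without staging it
echo "alpha = 0.5" > params.txt

# Shows the edit: working file vs. staging area
git diff

# Stage the edit; a plain git diff now has nothing left to show
git add params.txt
unstaged_after_add="$(git diff)"

# Comparing against HEAD still shows the change for this one file
vs_head="$(git diff HEAD params.txt)"
echo "$vs_head"
```

So if git diff comes back empty right after a git add, nothing is wrong; the changes have simply moved into the staging area.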

Git commit

Once you’ve added files to your staging area, you can put them in the mailbox with git commit. Keep in mind that anything in the ‘box’ gets shipped together as one unit. So if you want to undo anything about a given commit, you would have to roll back the entire commit.

A good practice is to commit often, in small, logical units of work, so that each commit is easy to describe and easy to roll back.

One thing to keep in mind is that you usually won’t run git commit on its own, since that opens a text editor and asks you for a message. Your command will more often look like this: git commit -m "model updates". The -m flag supplies your log message. Best practice here is to be specific and descriptive about the changes you’ve made to your project. You’ll thank yourself later!
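Putting add and commit together, a minimal sketch could look like this. The repo location, the user details, the file name, and the message are all invented for the example.

```shell
# Set up a throwaway repo (all names are hypothetical)
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Stage a file, then ship the whole box with a descriptive message
echo "alpha = 0.1" > params.txt
git add params.txt
git commit -q -m "Add learning-rate parameter file"

# The message is stored with the commit
last_message="$(git log -1 --format=%s)"
echo "$last_message"

# Nothing left to stage or commit
remaining="$(git status --short)"
```

A message like "Add learning-rate parameter file" tells future you far more than "model updates" ever will.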

Git log

Now the last command I’ll talk about for now is git log.

git log is where you can pull up your repository’s history of commits, most recent first. It provides a handful of pieces of information for each one, like the commit hash, author, commit date, and log message.
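To see that history in action, here is a sketch that makes two commits and then reads them back. As before, the repo, user details, and file contents are placeholders.

```shell
# Set up a throwaway repo (all names are hypothetical)
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# First commit
echo "alpha = 0.1" > params.txt
git add params.txt
git commit -q -m "Add initial parameters"

# Second commit
echo "alpha = 0.5" > params.txt
git add params.txt
git commit -q -m "Raise alpha to 0.5"

# Full history: hash, author, date, and message for each commit
git log

# Condensed history: one line per commit
git log --oneline

commit_count="$(git rev-list --count HEAD)"
latest_subject="$(git log -1 --format=%s)"
```

The one-line view is a nice way to skim a long history before digging into any single commit.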


I hope this proves a useful crash-course on git! Git your hands dirty with those commands to get yourself and your data science teams using Git more effectively!

Happy data science-ing!

