I, like many other data scientists, love the Jupyter Notebook. It’s pretty, it’s convenient, and it lets you share your analyses with other people. By putting together code, output, and documentation, it can be the very embodiment of literate programming.
I also love git. I mean, who wouldn’t, right? A VCS has obvious advantages for programming in general, but in scientific programming in particular it has even more. It ties into the ideals of transparency and reproducibility by allowing you to specify precisely which version of a piece of code produced which results. And it allows for asynchronous collaboration on text documents, which fixes the draft_v2_final_definitive_thistimeforreal.docx problem (any biologist reading me? we are so behind other fields in this…). Like a superhero, it even has a pretty cool origin story!
So it is a disappointment that my two favorite children do not play well together. Whenever you run a notebook cell, that will generate a diff due to the execution number changing. Often, you don’t even need to run it to create changes: just opening a notebook with a different version of Python will change the metadata in it. That’s annoying. However, an even bigger problem is version controlling the outputs: that’ll generate many, many lines of diff every time you minimally modify any graph, for example.
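To make the problem concrete, here is a small sketch that fakes what running a cell does to the file on disk. The notebook below is a hand-written, minimal dictionary in the nbformat 4 layout (an assumption for illustration, not a real notebook of mine), and the "execution" is simulated by filling in the fields Jupyter would fill in.

```python
import copy
import difflib
import json

# A hand-written, minimal notebook in the nbformat 4 layout (illustrative only).
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["1 + 1"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}


def dump(nb):
    # Notebooks are stored as indented JSON; mimic that so the diff is line-based.
    return json.dumps(nb, indent=1, sort_keys=True).splitlines()


# Simulate running the cell: the source stays the same, only bookkeeping changes.
executed = copy.deepcopy(notebook)
cell = executed["cells"][0]
cell["execution_count"] = 1
cell["outputs"] = [
    {
        "output_type": "execute_result",
        "execution_count": 1,
        "metadata": {},
        "data": {"text/plain": ["2"]},
    }
]

diff = list(difflib.unified_diff(dump(notebook), dump(executed), lineterm=""))
# Count added/removed lines, skipping the "+++" / "---" file headers.
changed = [
    line
    for line in diff
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]

print("\n".join(diff))
print(f"{len(changed)} changed lines, and the source did not change at all")
```

Even for a one-line cell with a one-line result, every changed line is noise from git’s point of view. An image output would add a long base64 blob on top of that.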
Does that mean we have to choose between the beauty and clarity of literate programming on the one hand, and the freedom to experiment and collaborate that git provides on the other?
Of course not! There is a solution. I’m writing this up because the easiest-to-find solution, by Pascal Bugnion, no longer works: it relies on a gist that was never updated and broke with some update to the notebook format. His solution is in turn based on this Stack Overflow answer.
I’m basically going to quickly rehash what he wrote and update it. An aside: although the notebook is saved as JSON, if I were to write something like that gist I’d use a library designed specifically for handling notebooks. I have no idea whether one was available when he wrote that, though.
Our strategy to version control notebooks comprises two parts: git filters and nbstripout.
A git filter will process files when they are added to git from the working directory (a clean filter) or vice versa, when they are checked out from git into the working directory (a smudge filter). If you want more details, go to Pascal’s post; it’s very good. Here we will use a clean filter, in order to never commit outputs.
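If the idea of a clean filter feels abstract, this minimal sketch may help. A clean filter is just a program that receives the file contents on standard input and writes the version to be stored on standard output. The code below is my own illustration, not the code of any real tool; the function names are made up, and I use plain `json` only to keep it dependency-free (a dedicated notebook library would be the sturdier choice).

```python
import io
import json


def strip_outputs(notebook: dict) -> dict:
    """Remove outputs and execution counts from code cells (modifies in place)."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clean(stdin, stdout) -> None:
    """Clean-filter contract: file contents in on stdin, filtered contents out on stdout."""
    notebook = json.load(stdin)
    json.dump(strip_outputs(notebook), stdout, indent=1, ensure_ascii=False)
    stdout.write("\n")


# Demo with in-memory streams. As a real filter you would instead call
# clean(sys.stdin, sys.stdout) from a script that git runs.
raw = json.dumps(
    {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": 7,
                "metadata": {},
                "outputs": [{"output_type": "stream", "name": "stdout", "text": ["2\n"]}],
                "source": ["print(1 + 1)"],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }
)
out = io.StringIO()
clean(io.StringIO(raw), out)
print(out.getvalue())
```

The working copy keeps its outputs; only what git stores goes through the filter.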
The other component is nbstripout. It’s pretty much the same as the script that Pascal put together, but it’s maintained, so it shouldn’t break. It will take a .ipynb file and, surprise surprise, strip its output out. You can install it with:
pip install --upgrade nbstripout
If you are using Anaconda, which you definitely should be doing, install it with this instead:
conda install -c conda-forge nbstripout
Note that if you do it this way, the conda environment where you have nbstripout installed will need to be active before you can commit cleanly.
Once you have nbstripout installed, you’ll need to tell git to run it every time you add or diff, that is, to use it as a clean filter. In Pascal’s post, he explains how to do this by touching the .gitattributes file. Fortunately, the creators of nbstripout have automated the process so that you only need to do:
nbstripout --install --attributes .gitattributes
Easy, simple, and for the whole family, as they say in my country! Now every time you git add a notebook, only the cells themselves will be added to the staging area. As simple as that. Remember, though, that nbstripout will need to be accessible when you do that! However, since you’ve been making changes, I’m guessing you have the appropriate environment activated, so it shouldn’t be a problem.
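If you want to convince yourself that the filter is doing its job, one option is to look at what actually landed in the staging area. The sketch below is a small checker of my own (the function name and the `analysis.ipynb` filename are hypothetical) that reports which code cells still carry outputs or execution counts. It assumes the filter resets execution counts as well as outputs.

```python
import json


def cells_with_outputs(notebook_text: str) -> list:
    """Return the indices of code cells that still have outputs or an execution count."""
    notebook = json.loads(notebook_text)
    dirty = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        if cell.get("outputs") or cell.get("execution_count") is not None:
            dirty.append(index)
    return dirty


# Hypothetical staged content, e.g. what `git show :analysis.ipynb` would print.
staged = json.dumps(
    {
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                "source": ["1 + 1"],
            },
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }
)
print(cells_with_outputs(staged))  # → []

# The same check on an unfiltered notebook flags the executed cell.
unfiltered = json.dumps(
    {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": 3,
                "metadata": {},
                "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}],
                "source": ["print('hi')"],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }
)
print(cells_with_outputs(unfiltered))  # → [0]
```

An empty list for the staged version, while the file in your working directory still shows its plots, means the clean filter ran.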
There are a lot of things that you can do with git filters, and even more that you can do with git hooks. Maybe I’ll post something about that some other time.