This post describes the start of a journey to get pandas' documentation running on Binder. The end result is this nice button:
For a while now I've been jealous of Dask's examples repository. That's a repository containing a collection of Jupyter notebooks demonstrating Dask in action. It stitches together some tools to present a set of documentation that is viewable both as a static site at examples.dask.org and as executable notebooks on mybinder.
A bit of background on binder: it's a tool for creating a shareable computing environment. This is perfect for introductory documentation. A prospective user may want to just try out a library to get a feel for it before they commit to installing. BinderHub is a tool for deploying binder services. You point a binderhub deployment (like mybinder) at a git repository with a collection of notebooks and an environment specification, and out comes your executable documentation.
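To make "pointing a deployment at a repository" a bit more concrete, here's a minimal sketch of how a launch link for a GitHub repository is put together on mybinder. The helper name `binder_url` and the example organization and repository names are made up for illustration; only the `/v2/gh/<org>/<repo>/<ref>` layout comes from mybinder itself.

```python
from urllib.parse import quote


def binder_url(org, repo, ref, host="https://mybinder.org"):
    """Build a launch URL for a GitHub repository on a BinderHub deployment.

    ``ref`` is a branch, tag, or commit hash.
    """
    # Percent-encode each piece so a ref like "feature/docs" stays one path segment.
    parts = [quote(part, safe="") for part in (org, repo, ref)]
    return f"{host}/v2/gh/" + "/".join(parts)


# Hypothetical repository, purely for illustration.
print(binder_url("example-org", "example-notebooks", "master"))
```

Visiting a link of that shape is what kicks off the image build (or loads the cached image) for that repository and ref.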
Thanks to a lot of hard work by contributors and maintainers, the code examples in pandas' documentation are already runnable (and this is verified on each commit). We use the IPython Sphinx Extension to execute examples and include their output. We write documentation like
.. ipython:: python

   import pandas as pd
   s = pd.Series([1, 2, 3])
   s
Which is then executed and rendered in the HTML docs as
    In [1]: import pandas as pd

    In [2]: s = pd.Series([1, 2, 3])

    In [3]: s
    Out[3]:
    0    1
    1    2
    2    3
    dtype: int64
So we have the most important thing: a rich source of documentation that's already runnable.
There were a couple barriers to just pointing binder at
https://github.com/pandas-dev/pandas, however. First, binder builds on top of
a tool called repo2docker. This
is what takes your Git repository and turns it into a Docker image that users
will be dropped into. When someone visits the URL, binder will first check to
see if it's built a docker image. If it's already cached, then that will just be
loaded. If not, binder will have to clone the repository and build it from
scratch, a time-consuming process. Pandas receives 5-10 commits per day, meaning
many users would visit the site and be stuck waiting for a 5-10 minute docker
build.

Second, pandas uses Sphinx and ReST for its documentation. Binder needs a collection
of Notebooks. Fortunately, the fine folks at QuantEcon
(a fellow NumFOCUS project) wrote
sphinxcontrib-jupyter, a tool
for turning ReST files to Jupyter notebooks. Just what we needed.
So we had some great documentation that already runs, and a tool for converting ReST files to Jupyter notebooks. All the pieces were falling into place!
Unfortunately, my first attempt failed.
sphinxcontrib-jupyter looks for directives like
.. code:: python
while pandas uses
.. ipython:: python
I started slogging down a path to teach
sphinxcontrib-jupyter how to recognize
the IPython directive pandas uses when my kid woke up from his nap. Feeling
dejected, I gave up.
But later in the day, I had the (obvious in hindsight) realization that we have
plenty of tools for substituting lines of text. A few (non-obvious) lines
later, we were ready to go. All the
.. ipython:: python directives were now
.. code:: python directives. Moral of the story: take breaks.
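For a rough idea of what that kind of substitution can look like, here's a minimal Python sketch. It is not the script used in the repository, and the name `ipython_to_code` is made up for illustration. It only rewrites the line that opens the directive and assumes the indented body can stay as it is, so it would not handle IPython-specific decorators or options inside a block.

```python
import re

# A line that opens an IPython directive, possibly indented.
# [ \t] is used instead of \s so the match never swallows a newline.
IPYTHON_DIRECTIVE = re.compile(
    r"^([ \t]*)\.\. ipython:: python[ \t]*$", flags=re.MULTILINE
)


def ipython_to_code(rst_text):
    """Rewrite ``.. ipython:: python`` openers as ``.. code:: python``.

    Indentation in front of the directive is preserved; the body of the
    block is left untouched.
    """
    return IPYTHON_DIRECTIVE.sub(r"\1.. code:: python", rst_text)


example = (
    "Some prose.\n"
    "\n"
    ".. ipython:: python\n"
    "\n"
    "   import pandas as pd\n"
    "   s = pd.Series([1, 2, 3])\n"
)
print(ipython_to_code(example))
```

Anchoring the pattern to whole lines keeps prose that merely mentions the directive from being rewritten by accident.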
- We include github.com/pandas-dev/pandas as a submodule (which repo2docker supports just fine)
- We patch pandas' Sphinx config to include sphinxcontrib-jupyter and its configuration
- We patch pandas' source docs to change the ipython directives to be
.. code:: python directives.
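The last step in that list amounts to rewriting every ReST file under the docs directory. Here's a hedged sketch of one way to do that in Python. The function name `patch_docs` and the submodule path in the usage comment are assumptions for illustration, and a plain string replacement stands in for whatever the repository actually does.

```python
from pathlib import Path


def patch_docs(source_dir):
    """Rewrite IPython directives in every .rst file under ``source_dir``.

    Files are modified in place. Returns the paths that were changed.
    """
    changed = []
    for path in sorted(Path(source_dir).rglob("*.rst")):
        original = path.read_text(encoding="utf-8")
        patched = original.replace(".. ipython:: python", ".. code:: python")
        if patched != original:
            path.write_text(patched, encoding="utf-8")
            changed.append(path)
    return changed


# Hypothetical usage, assuming the pandas submodule is checked out at ./pandas:
# for path in patch_docs("pandas/doc/source"):
#     print("patched", path)
```

Since the files live in a submodule, the edits stay local to the checkout and never have to be committed back to pandas itself.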
I'm reasonably happy with how things are shaping up. I plan to migrate my repository to the pandas organization and propose a few changes to the pandas documentation (like a small header pointing from the rendered HTML docs to the binder). If you'd like to follow along, subscribe to this pandas issue.
I'm also hopeful that other projects can apply a similar approach to their documentation too.
I realize now that binder can target a specific branch or commit. I'm not sure if additional commits to that repository will trigger a rebuild, but I suspect not. We still needed to solve problem 2 though.