Recently, I started a new project using Scala as the primary language to build Spark jobs. The project runs on the AWS EMR infrastructure, and data science investigations are performed in Zeppelin notebooks hosted on S3. We peer review all our data science deliverables, and it quickly became clear that reviewing notebooks is not as easy as reviewing regular code. To give an example, this is what a Zeppelin notebook looks like in a BitBucket pull request:
The same applies to Jupyter notebooks, since their raw format is also JSON.
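To make the problem concrete, here is a minimal sketch of what sits behind a one-cell Jupyter notebook. The dictionary below is a simplified, hand-written example that follows the general layout of the Jupyter notebook format (version 4); real notebooks carry more metadata, and Zeppelin uses a different JSON layout.

```python
import json

# A simplified, hand-written notebook with a single code cell.
# Real notebooks contain additional metadata fields.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [
                {"name": "stdout", "output_type": "stream", "text": ["42\n"]}
            ],
            "source": ["x = 6 * 7\n", "print(x)"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

# This is roughly what a reviewer sees in a raw diff.
raw = json.dumps(notebook, indent=1)

code_lines = sum(len(cell["source"]) for cell in notebook["cells"])
raw_lines = len(raw.splitlines())
print(f"{code_lines} lines of code become {raw_lines} lines of raw JSON")
```

The two lines a reviewer actually cares about are buried in quoting, escape sequences and bookkeeping fields, which is what makes a raw notebook diff hard to read.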
Further, since notebooks are primarily code, I would prefer to handle them like any other piece of code in the organisation, to keep our processes as few and as lean as possible. E.g., “all source code and notebooks are peer-reviewed and versioned on BitBucket”. In recent years, I have seen notebooks attached as comments to JIRA tickets, sent around by email, or even worse, lost somewhere in chaotic S3 buckets. This is definitely something you don’t want to experience, trust me!
So, how can we handle notebooks as close as possible to regular source code?
Markdown documents are nicely rendered by web-based versioning tools, and they are also easy to review since they are plain, readable text, similar to LaTeX (but waaay simpler). I built a small command-line tool in Python 3 that takes advantage of this: it converts both Zeppelin and Jupyter notebooks to readable and reviewer-friendly Markdown documents. An example of a Zeppelin notebook converted to Markdown and being reviewed on BitBucket looks like this:
As you can see, cell code and outputs are rendered. Visual outputs are not rendered, but the reviewer can still comment on the corresponding code cells while rendering them in another tab. After the review process, the notebook remains nicely accessible as a Markdown document. E.g.:
The nb2md tool can read both Zeppelin and Jupyter notebooks from S3, HTTP and local paths. Try it out!
pip install nb2md (requires Python 3).
The official documentation and code are available at https://github.com/elehcimd/nb2md/. It is fewer than 300 lines of code (less than the code required for packaging), so it is also easy to modify or extend.
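If you are curious about the general idea behind such a conversion, the sketch below illustrates it for Jupyter notebooks only. It is not the nb2md implementation: the function name notebook_to_markdown, its language parameter and the placeholder text for non-text outputs are all made up for this example, and it assumes the version 4 Jupyter layout with "cells", "source" and "outputs" fields.

```python
FENCE = "`" * 3  # Markdown code fence, built here to keep this listing readable


def notebook_to_markdown(nb, language="python"):
    """Render a Jupyter (format version 4) notebook dict as Markdown text.

    Hypothetical helper for illustration; not part of nb2md.
    """
    parts = []
    for cell in nb.get("cells", []):
        # "source" may be a list of lines or a single string; join handles both.
        source = "".join(cell.get("source", []))
        cell_type = cell.get("cell_type")
        if cell_type == "markdown":
            # Markdown cells are copied through unchanged.
            parts.append(source)
        elif cell_type == "code":
            parts.append(f"{FENCE}{language}\n{source}\n{FENCE}")
            for output in cell.get("outputs", []):
                if output.get("output_type") == "stream":
                    # Plain text output goes into an unlabelled fenced block.
                    text = "".join(output.get("text", [])).rstrip()
                    parts.append(f"{FENCE}\n{text}\n{FENCE}")
                else:
                    # Images, HTML and other rich outputs are not rendered.
                    parts.append("*(non-text output omitted)*")
    return "\n\n".join(parts) + "\n"


# Usage with a small hand-written notebook.
example = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Analysis"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [
                {"name": "stdout", "output_type": "stream", "text": ["42\n"]}
            ],
            "source": ["print(6 * 7)"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

markdown = notebook_to_markdown(example)
print(markdown)
```

The result is plain text with one fenced block per code cell and per text output, so a reviewer can comment on individual lines in a pull request just as with any other source file.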