Software engineers have found ways to collaborate and work in parallel since the dawn of the industry. Engineers with similar backgrounds can improve their collaboration by using established practices, such as version control, waterfall or agile, and dedicated frameworks.
With the proliferation of big data and the need to build sophisticated data products, a new need for collaboration has emerged in the past few years: one between the more classic software engineer and a new type of scientist, a domain expert with a strong background in mathematics who has picked up programming.
Introducing the data scientist.
Building data products involves exploring data without clear specifications, which is a significant difference from building web apps or desktop software. In addition, software engineers and data scientists differ in ways that make communication more difficult, as summarized in this table:
If both parties are enabled to play to their strengths and complement each other’s weaknesses, magic starts to happen. People can build things greater than the sum of their individual abilities. In my experience working in many data-focused companies and projects, I have found three main patterns to reach this goal.
When sharing production data with data scientists, one of the following two inconvenient situations can emerge:
- Data scientists have too little access: they need to explicitly request and wait for data exports, or they are sent small exports via email with fields they might have needed for their exploration 'cleaned out',
- Data scientists have too much access and run queries that affect the production database.
The way to solve this is to define a way to share all the raw data with the data scientists in an environment separate from production. This pattern is known as the data lake. The key idea behind it is that, because we don't know what data might be needed in the future, we store everything in a flat way in a location easily accessible by the data scientists. Hence the metaphor of a clear lake with vast depths of storage.
Then, it's up to the data scientists to build a more hierarchical way of storing the data on top of that at a later time if it is needed. This new way of storing will be optimized for the queries they wish to run in order to explore new ideas if and when these ideas come. This means that the same data may end up being stored in the data lake multiple times, in various stages of being transformed.
Assuming the data comes from a main application that the software engineers are working on, it is important for data scientists to work together with them to define a way for all this data to flow into the lake in its raw form.
The following things need to be defined:
- Whether streaming data or daily/weekly dumps are needed for each of the data sources,
- The data format (for example, JSON), the schema, and whether compression will be used,
- Where each team's responsibility ends, how to monitor data flow, and who is responsible for that.
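As an illustration of the format decision, here is a minimal sketch of a daily dump written as gzip-compressed JSON Lines (one JSON object per line). The event fields and the function names `dump_records` and `load_records` are hypothetical; only the Python standard library is used, and an in-memory buffer stands in for a real file or object store.

```python
import gzip
import io
import json


def dump_records(records, fileobj):
    """Write records as gzip-compressed JSON Lines (one object per line)."""
    with gzip.GzipFile(fileobj=fileobj, mode="wb") as gz:
        for record in records:
            line = json.dumps(record, sort_keys=True) + "\n"
            gz.write(line.encode("utf-8"))


def load_records(fileobj):
    """Read records back from a gzip-compressed JSON Lines dump."""
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
        text = gz.read().decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


# Hypothetical raw events; note the second one carries an extra field.
events = [
    {"event": "signup", "user_id": 1, "ts": "2017-01-01T10:00:00Z"},
    {"event": "purchase", "user_id": 1, "ts": "2017-01-01T10:05:00Z", "amount": 9.99},
]

buffer = io.BytesIO()  # stands in for a file in the data lake
dump_records(events, buffer)
buffer.seek(0)
restored = load_records(buffer)
```

One object per line keeps the schema flexible: a new field in one record does not require changing the others.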
It's better if the engineers have minimal involvement in how the data ends up being stored on disk, and if the schema for the data is flexible, as they are not the experts in querying the data. The data scientists are.
One concrete way I have used to connect streaming data from a web app to the data lake is to have the web app push data into a Kinesis stream, which is an Amazon Web Services component. This is also where the involvement of a software engineer ends. The data science team is then responsible for parsing the data from the stream (which has some required and some optional fields) and storing it in whatever way they see fit, after augmenting or cleaning the data. Data scientists also have dashboards to monitor whether there are any anomalies in the data flow.
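The parsing step on the data science side could look roughly like the sketch below. It assumes each stream record arrives as UTF-8 encoded JSON bytes; the field names in `REQUIRED_FIELDS` and `OPTIONAL_DEFAULTS` and the function `parse_record` are hypothetical, and reading from the stream itself is left out.

```python
import json

# Hypothetical contract agreed between the two teams.
REQUIRED_FIELDS = ("event", "user_id", "timestamp")
OPTIONAL_DEFAULTS = {"country": None, "referrer": None}


def parse_record(raw_bytes):
    """Parse one raw stream record and return a (record, error) pair."""
    try:
        payload = json.loads(raw_bytes.decode("utf-8"))
    except (UnicodeDecodeError, ValueError) as exc:
        return None, "invalid JSON: {}".format(exc)
    if not isinstance(payload, dict):
        return None, "expected a JSON object"
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        return None, "missing required fields: " + ", ".join(missing)
    # Fill in optional fields, and keep any unknown extra fields as they are.
    record = dict(OPTIONAL_DEFAULTS)
    record.update(payload)
    return record, None


good, good_err = parse_record(
    b'{"event": "click", "user_id": 7, "timestamp": 1500000000}'
)
bad, bad_err = parse_record(b'{"event": "click"}')
```

Returning the error instead of raising makes it easy to count rejected records, which is the kind of number a data flow dashboard can track.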
Data scientists tend to work with one-off scripts that contain SQL queries or pandas code, for example. For their next assignment, they might copy and paste bits from a previous script into another. Copying and pasting raw code to modify it slightly, or rewriting things that one has written before, clutters the brain for no good reason and can prevent new ideas from emerging. Higher-level abstractions are needed: for example, stored procedures in SQL instead of raw SQL code, or a toolbox library of common transformations instead of inline code that only lives in Jupyter notebooks.
A way to build this kind of library is to allocate time each week to work on it, as data scientists will gradually understand what kind of transformations they need to do often.
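To make the toolbox idea concrete, here is a minimal sketch of what the first few shared functions might look like. It works on plain lists of dictionaries to stay dependency-free; in practice the same functions would more likely take and return pandas DataFrames. All names (`drop_missing`, `deduplicate`, `pipeline`) are hypothetical.

```python
from functools import reduce


def drop_missing(rows, field):
    """Keep only rows where `field` is present and not None."""
    return [row for row in rows if row.get(field) is not None]


def deduplicate(rows, key):
    """Keep the first row seen for each value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def pipeline(rows, *steps):
    """Apply each transformation step in order and return the result."""
    return reduce(lambda acc, step: step(acc), steps, rows)


rows = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": 2, "email": None},
]

clean = pipeline(
    rows,
    lambda r: drop_missing(r, "email"),
    lambda r: deduplicate(r, "user_id"),
)
```

Small, named steps like these can be reviewed and tested once, then reused from any notebook instead of being pasted into each one.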
Some data companies embed a software engineer in the data science team. The software engineer can review the new code being written and spot opportunities for adding new functionality to the data science toolbox, using their talent for writing modular software to support the team. A 'wrong' approach is to try to define all the requirements for the toolbox upfront. It’s important that this is done over time. The data science team will gain a better understanding of what tools they need as they explore the data, and of what kind of analytics brings the most business value, which brings me to the third pattern.
The output of a data science team is algorithms that yield information out of the raw data. The process is not complete when we've reached the information stage, however. We want to go all the way from raw data to information to value. We want to see if today’s algorithm is better than the one we had yesterday, according to a metric that makes sense for the business. This is hard to do if there’s no process, as the following situations can occur:
- Data scientists produce a report on the business metrics, but upon inspection, it is hard to find out which input/commit combination produced it,
- A lot of time is spent on finding out why results that were generated “the same way” are different,
- Data scientists spend most of their time producing reports for different combinations of inputs/commits manually.
There is an underlying need for a process to continuously evaluate data science algorithms. This process needs to be built into the fundamentals of the product, and not just added as an afterthought. All successful companies that rely on data have built that process into their product in some way. It may be done offline with historical data or online by showing different algorithms to different users. It can also be done as a combination of the two, where all algorithms are run on historical data and the most promising ones are selected for the more expensive online experiments.
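A minimal sketch of the offline half of such a process is shown below. It runs several algorithm versions on the same historical data, records which input and which version produced each score, and ranks them. The metric, the dataset shape (`x` and `y` keys), the version labels, and all function names are assumptions made for illustration.

```python
import hashlib
import json


def fingerprint(dataset):
    """Short stable hash of the input, so a result can be traced to its data."""
    encoded = json.dumps(dataset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def accuracy(predictions, labels):
    """Hypothetical business metric: share of correct predictions."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def evaluate_offline(algorithms, dataset):
    """Score every algorithm version on the same historical data.

    `algorithms` maps a version label (for example a commit hash) to a
    function that takes one input and returns a prediction.
    """
    inputs = [row["x"] for row in dataset]
    labels = [row["y"] for row in dataset]
    data_id = fingerprint(dataset)
    results = []
    for version, predict in algorithms.items():
        score = accuracy([predict(x) for x in inputs], labels)
        results.append({"version": version, "data": data_id, "score": score})
    # Best first; the top entries are candidates for online experiments.
    return sorted(results, key=lambda r: r["score"], reverse=True)


history = [{"x": 1, "y": 0}, {"x": 5, "y": 1}, {"x": 7, "y": 1}, {"x": 2, "y": 0}]
candidates = {
    "commit-a": lambda x: int(x > 6),  # hypothetical algorithm versions
    "commit-b": lambda x: int(x > 3),
}
report = evaluate_offline(candidates, history)
```

Because every row of the report carries both the data fingerprint and the version label, two results that were generated "the same way" can be checked for whether they really share the same input and code.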
Another idea that is gaining traction is using Jupyter notebooks to evaluate algorithms offline. These notebooks, originally created by data scientists as one-off scripts, can be used as parameterizable 'jobs' to be run on historical data, or as report templates, if you will. The analytics platform Databricks also supports running Python and Scala notebooks as regular jobs.
The goal is to embed the work of the data scientist in a larger data pipeline tool that combines datasets and different versions of algorithms to generate results. The plumbing and infrastructure work needed to set this up creates a good opportunity for collaboration. Specifically, it is an opportunity to build a system of continuous evaluation by combining the software engineers' skill at building large systems with the data scientists' skill at asking the right data questions.
The patterns presented here can be implemented in different ways and, optionally, with the support of frameworks and platform-as-a-service providers. It is important to remember the underlying spirit of this practice, which is to enable people to work on what they are best at, to have an environment reasonably free of restrictions where creativity can emerge, and to build trust over time as new knowledge is acquired and invested back into the product.
Building software is not only about the new feature requirements; it is also about the diverse requirements of the people working on it. Next time there's a requirements meeting, it's worth sitting down to discuss what each person needs in order to work in a sustainable way and to hear each other's pain points. And then, to be open-minded about what the solutions can be.
Something like that.
This article is built on a talk I gave at PyData Berlin last year.