I recently changed industries and joined a startup company where I’m responsible for building up a data science discipline. While we already had a solid data pipeline in place when I joined, we didn’t have processes in place for reproducible analysis, scaling up models, and performing experiments. The goal of this series of blog posts is to provide an overview of how to build a data science platform from scratch for a startup, providing real examples using Google Cloud Platform (GCP) that readers can try out themselves.
This series is intended for data scientists and analysts that want to move beyond the model training stage, and build data pipelines and data products that can be impactful for an organization. However, it could also be useful for other disciplines that want a better understanding of how to work with data scientists to run experiments and build data products. It is intended for readers with programming experience, and will include code examples primarily in R and Java.
One of the first questions to ask when hiring a data scientist for your startup is
how will data science improve our product? At Windfall Data, our product is data, and therefore the goal of data science aligns well with the goal of the company, to build the most accurate model for estimating net worth. At other organizations, such as a mobile gaming company, the answer may not be so direct, and data science may be more useful for understanding how to run the business rather than improve products. However, in these early stages it’s usually beneficial to start collecting data about customer behavior, so that you can improve products in the future.
Some of the benefits of using data science at a start up are:
- Identifying key business metrics to track and forecast
- Building predictive models of customer behavior
- Running experiments to test product changes
- Building data products that enable new product features
Many organizations get stuck on the first two or three steps, and do not utilize the full potential of data science. The goal of this series of blog posts is to show how managed services can be used for small teams to move beyond data pipelines for just calculating run-the-business metrics, and transition to an organization where data science provides key input for product development.
Here are the topics I am planning to cover for this blog series. As I write new sections, I may add or move around sections. Please provide comments at the end of this posts if there are other topics that you feel should be covered.
- Introduction (this post): Provides motivation for using data science at a startup and provides an overview of the content covered in this series of posts. Similar posts include functions of data science, scaling data science and my FinTech journey.
- Tracking Data: Discusses the motivation for capturing data from applications and web pages, proposes different methods for collecting tracking data, introduces concerns such as privacy and fraud, and presents an example with Google PubSub.
- Data pipelines: Presents different approaches for collecting data for use by an analytics and data science team, discusses approaches with flat files, databases, and data lakes, and presents an implementation using PubSub, DataFlow, and BigQuery. Similar posts include a scalable analytics pipeline and the evolution of game analytics platforms.
- Business Intelligence: Identifies common practices for ETLs, automated reports/dashboards and calculating run-the-business metrics and KPIs. Presents an example with R Shiny and Data Studio.
- Exploratory Analysis: Covers common analyses used for digging into data such as building histograms and cumulative distribution functions, correlation analysis, and feature importance for linear models. Presents an example analysis with the Natality public data set. Similar posts include clustering the top 1% and 10 years of data science visualizations.
- Predictive Modeling: Discusses approaches for supervised and unsupervised learning, and presents churn and cross-promotion predictive models, and methods for evaluating offline model performance.
- Model Production: Shows how to scale up offline models to score millions of records, and discusses batch and online approaches for model deployment. Similar posts include Productizing Data Science at Twitch, and Producizting Models with DataFlow.
- Experimentation: Provides an introduction to A/B testing for products, discusses how to set up an experimentation framework for running experiments, and presents an example analysis with R and bootstrapping. Similar posts include A/B testing with staged rollouts.
- Recommendation Systems: Introduces the basics of recommendation systems and provides an example of scaling up a recommender for a production system. Similar posts include prototyping a recommender.
- Deep Learning: Provides a light introduction to data science problems that are best addressed with deep learning, such as flagging chat messages as offensive. Provides examples of prototyping models with the R interface to Keras, and productizing with the R interface to CloudML.
Throughout the series, I’ll be presenting code examples built on Google Cloud Platform. I choose this cloud option, because GCP provides a number of managed services that make it possible for small teams to build data pipelines, productize predictive models, and utilize deep learning. It’s also possible to sign up for a free trial with GCP and get $300 in credits. This should cover most of the topics presented in this series, but it will quickly expire if your goal is to dive into deep learning on the cloud.
For programming languages, I’ll be using R for scripting and Java for production, as well as SQL for working with data in BigQuery. I’ll also present other tools such as Shiny. Some experience with R and Java is recommended, since I won’t be covering the basics of these languages.