How to Do Data Science Using SQL on Raw JSON


Data scientists and analysts deal with complex data. Much of what they analyze could be third-party data, over which there is little control. In order to make use of this data, significant effort is spent in data engineering. Data engineering transforms and normalizes high-cardinality, nested data into relational databases or into an output format that can then be loaded into data science notebooks to derive insights. At many organizations, data scientists or, more commonly, data engineers implement data pipelines to transform their raw data into something usable.

Data pipelines, however, regularly get in the way of data scientists and analysts getting to insights with their data. They are time-consuming to write and maintain, especially as the number of pipelines grows with each new data source added. They are often brittle, don't handle schema changes well, and add complexity to the data science process. Data scientists are typically dependent on others—data engineering teams—to build these pipelines as well, reducing their speed to value with their data.

data pipelines

Analyzing Third-Party Data to Support Investment Decisions

We've had the opportunity, at Rockset, to work with a number of data scientists and analysts in investment management firms, who are analyzing complex data sets in order to support investment decisions. They increasingly bring in alternative, third-party data—app usage, website visits, people hired, and fundraising—to enhance their research. And they use this data to evaluate their existing portfolio and source new investment opportunities. The typical pipeline for these data sets includes scripts and Apache Spark jobs to transform data, relational databases like PostgreSQL to store the transformed data, and finally, dashboards that serve information from the relational database.

In this blog, we take a specific example where a data scientist may combine two data sets—an App Annie data set that has statistics of mobile app usage and engagement, and a Crunchbase data set that tracks public and private companies globally. We will use SQL to transform the two data sets and then join them together to derive some interesting insights, all without any prior preparation or transformation.

Understanding the Shape of the JSON Data Set

We begin by loading the App Annie dataset into a Rockset collection named app_annie_monthly. App Annie data is in the form of JSON, and has up to 3 levels of nested arrays in it. It has descriptions of fields in columns, including statistics of Monthly Active Users (MAU) that we'll be using later. The rows contain the data corresponding to those columns in the description.

app annie json

Following this, we can set up our Jupyter notebook configured to use our Rockset account. Immediately after setup, we can run some basic queries on the data set that we have loaded.

DS without pipelines 1

Querying the Data

Once we have understood the overall structure of the data set, we can start unpacking the parts we're interested in using the UNNEST command in SQL. In our case, we care about the app name, the percentage increase in MAU month over month, and the company that makes the app.

DS without pipelines 2

Once we have gotten to this table, we can do some basic statistical calculations by exporting the data to dataframes. Dataframes can be used to visualize the percentage growth in MAU over the data set for a particular month.

DS without pipelines 3

Joining with an Additional Data Set

Now we can create the crunchbase_funding_rounds collection in Rockset from CSV files stored in Amazon S3. This is a fairly simple CSV file with many fields. We are particularly interested in some fields: company_name, country_code, investment_type, investor_names, and last_funding. These fields provide us additional information about the companies. We can join these on the company_name field, and apply a few additional filters to arrive at the final list of prospects for investment, ranked from maximum to minimum increase in MAU.

%%time
%%sql WITH -- # compute application statistics, MAU and percent change in MAU.
appStats AS ( SELECT rows.r[2][1]."name" AS app, rows.r[2][1]."company_name" AS company, rows.r[4][1] AS mau, rows.r[4][4] AS mau_percent_change FROM app_annie_monthly a, unnest(a."data"."table"."rows" AS r) AS rows WHERE a._meta.s3.path LIKE 'app\_annie/monthly/2018-05/01/data/all\_users\_top\_usage\_US\_iphone\_100\_%' ), -- # Get list of crunchbase orgs to join with.
crunchbaseOrgs AS ( SELECT founded_on AS founded_on, uuid AS company_uuid, short_description AS short_description, company_name as company_name FROM "crunchbase_organizations" ), -- # Get the JOINED relation from the above steps.
appStatsWithCrunchbaseOrgs as ( SELECT appStats.app as App, appStats.mau as mau, appStats.mau_percent_change as mau_percent_change, crunchbaseOrgs.company_uuid as company_uuid, crunchbaseOrgs.company_name as company_name, crunchbaseOrgs.founded_on as founded_on, crunchbaseOrgs.short_description as short_description FROM appStats INNER JOIN crunchbaseOrgs ON appStats.company = crunchbaseOrgs.company_name ), -- # Compute companyStatus = (IPO|ACQUIRED|CLOSED|OPERATING)
-- # There may be more than one status associated with a company, so, we do the Group By and Min.
companyStatus as ( SELECT company_name, min( case status when 'ipo' then 1 when 'acquired' then 2 when 'closed' then 3 when 'operating' then 4 end
) as status FROM "crunchbase_organizations" GROUP BY company_name ), -- # JOIN with companyStatus == (OPERATING), call it ventureFunded
ventureFunded as (SELECT appStatsWithCrunchbaseOrgs.App, appStatsWithCrunchbaseOrgs.company_name, appStatsWithCrunchbaseOrgs.mau_percent_change, appStatsWithCrunchbaseOrgs.mau, appStatsWithCrunchbaseOrgs.company_uuid, appStatsWithCrunchbaseOrgs.founded_on, appStatsWithCrunchbaseOrgs.short_description
FROM appStatsWithCrunchbaseOrgs INNER JOIN companyStatus ON appStatsWithCrunchbaseOrgs.company_name = companyStatus.company_name AND companyStatus.status = 4), -- # Find the latest round that each company raised, grouped by company UUID
latestRound AS ( SELECT company_uuid as cuid, max(announced_on) as announced_on, max(raised_amount_usd) as raised_amount_usd FROM "crunchbase_funding_rounds" GROUP BY company_uuid ), -- # Join it back with crunchbase_funding_rounds to get other details about that company
fundingRounds AS ( SELECT cfr.company_uuid as company_uuid, cfr.announced_on as announced_on, cfr.funding_round_uuid as funding_round_uuid, cfr.company_name as company_name, cfr.investment_type as investment_type, cfr.raised_amount_usd as raised_amount_usd, cfr.country_code as country_code, cfr.state_code as state_code, cfr.investor_names as investor_names FROM "crunchbase_funding_rounds" cfr JOIN latestRound ON latestRound.company_uuid = cfr.company_uuid AND latestRound.announced_on = cfr.announced_on
), -- # Finally, select the dataset with all the fields that are interesting to us. ventureFundedAllRegions
ventureFundedAllRegions AS ( SELECT ventureFunded.App as App, ventureFunded.company_name as company_name, ventureFunded.mau as mau, ventureFunded.mau_percent_change as mau_percent_change, ventureFunded.short_description as short_description, fundingRounds.announced_on as last_funding, fundingRounds.raised_amount_usd as raised_amount_usd, fundingRounds.country_code as country_code, fundingRounds.state_code as state_code, fundingRounds.investor_names as investor_names, fundingRounds.investment_type as investment_type FROM ventureFunded JOIN fundingRounds ON fundingRounds.company_uuid = ventureFunded.company_uuid) SELECT * FROM ventureFundedAllRegions
ORDER BY mau_percent_change DESC LIMIT 10

This final large query does several operations one after another. In order, the operations that it performs and the intermediate SQL query names are:

  • appStats: UNNEST operation on the App Annie dataset that extracts the interesting fields into a format resembling a flat table.
  • crunchbaseOrgs: Extracts relevant fields from the crunchbase collection.
  • appStatsWithCrunchbaseOrgs: Joins the App Annie and Crunchbase data on the company name.
  • companyStatus: Sets up filtering for companies based on their current status - IPO/Acquired/Closed/Operating. Each company may have multiple records but the ordering ensures that the latest status is captured.
  • ventureFunded: Uses the above metric to filter out organizations that are not currently privately held and operating.
  • latestRound: Finds the latest funding round—in total sum invested (USD) and the date when it was announced.
  • fundingRounds & ventureFundedAllRegions: Wrap it all together and extract other details of relevance that we can use.

Getting Insights on Prospective Investments

We can run one final query on the named query we have, ventureFundedAllRegions to generate the best prospective investments for the investment management firm.

DS without pipelines 4

As we see above, we get data that can help with decision making from an investment perspective. We started with applications that have posted significant growth in active users month over month. Then we performed some filtering to impose some constraints to improve the relevance of our list. Then we also extracted other details about the companies that created those applications and came up with a final list of prospects above. In this entire process, we did not make use of any ETL processes that transform the data from one format to another or wrangle it. The last query which was the longest took less than 4 seconds to run, due to Rockset's indexing of all fields and using those indexes to speed up the individual queries.