Woman holding a balance, Vermeer 1664

What do you think of when you read the phrase ‘data science’? It’s probably some combination of keywords like statistics, machine learning, deep learning, and ‘sexiest job of the 21st century’. Or maybe it’s an image of a data scientist, sitting at her computer, putting together stunning visuals from well-run A/B tests. Either way, it’s glamorous, smart, and sophisticated. This is the narrative that data science has been selling since I entered the field almost ten years ago.

I started out as a data analyst.

While I was still mostly waiting for SQL queries to finish and cleaning extremely dirty and sad Excel files, I was reading Hacker News posts about mining massive datasets, Facebook’s hot new data science team, and Hal Varian, and dreaming.

In 2012, I lucked out by being put on an analytics/engineering team that was transitioning some of its ETL processes from Oracle to Hadoop to keep up with data throughput.

I volunteered to be the first of the analysts to work with Pig and Hive, mainly because I was too impatient to wait for engineering work to be completed before having access to my data. But also, I was starstruck by this glowy, mysterious aura around data scientists - people who performed cool experiments, presented cool analyses, and got to have a MacBook for work.

I wanted to be one of those people! So, I learned Python online, brushed up on all of the statistics I’d taken in undergrad, and made a lot of command line mistakes in working with HDFS. In those early years, there was no real formalized way to learn “data science,” other than to see what everyone else was doing, go to meetups, and try to read the tea leaves from HR job descriptions.

After fumbling on my own for a very long time, I’ve now been established in “data science” for the past 6 years, and, to serve as the mentor that I didn’t have, I’ve been answering emails and having coffee meetings with people looking for advice to get into data science.

Since 2012, the data science industry has moved extremely quickly. It’s gone through almost every stage in the Gartner hype cycle.

We’ve been through the early adoption phase, the negative press around AI and bias, the second and third rounds of venture capital for companies like Facebook, and are now at the point of high-growth adoption: where banks, healthcare companies, and other Fortune 100 companies that move five years behind the market are also hiring for data science in machine learning.

A lot has changed. Big Data (remember Hadoop? and Pig?) is out. R has seen a meteoric rise in adoption. Python was written up in The Economist. Then the cloud changed everything all over again.

Unfortunately, what has not changed is the mass media hype around the field of data science, which has trumpeted data scientist as the ‘sexiest career of the 21st century’ so many times that we now have what I believe is an important problem that we as a community need to talk about. That problem is an oversupply of junior data scientists hoping to enter the industry, and mismatched expectations about what they can hope to find once they do get that coveted title of “data scientist.”

Glut of new data scientists

First, let’s talk about the oversupply of junior data scientists. The continuing media hype cycle around data science has enormously increased the amount of junior talent available on the market over the past five years.

This is purely anecdotal evidence, so take it with a large grain of salt. But, based on my own participation as a resume screener, mentor to data scientists leaving boot camps, interviewer, and interviewee, and from conversations with friends and colleagues in similar positions, I’ve developed an intuition that the number of candidates for any given data science position, particularly at the entry level, has grown from 20 or so per slot to 100 or more. I was talking to a friend recently who had to go through 500 resumes for a single opening.

This is not abnormal. More anecdotal evidence comes from job openings like this one, from machine learning’s godfather, Andrew Ng, whose AI startup demanded 70-80 hours a week. He was flooded with applications, after blithely noting that previously many people had tried to volunteer for free. As of this latest writing, they ran out of space in their current office.

It’s very, very hard to estimate the true gap between market demand and supply, but here’s a starting point.

A study of job ads from April found more than 10,000 vacancies in the US for people with AI or machine-learning skills.

The article goes on to note,

More than 100,000 people have started a deep learning course offered by Fast.ai, a startup focused on widening use of AI.

Assuming an average MOOC completion rate of around 7%, that would mean 7,000 people are available to fill those 10,000 jobs, for a single year. But what about next year? Are we assuming a steady rate of data science job creation? If anything, the data science job market as such looks set to shrink, in line with my personal expectations.

Looking at a larger study, LinkedIn says there are 151,717 people with data science skills missing in the market. Although it’s unclear whether this directly means data scientists or just people with some subset of those skills, let’s assume that it’s the former. So, there are roughly 150,000 vacancies for data scientists in the country.

Given that 100,000 people have started a data science course, let’s assume again that 7,000 of them finish.
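The two comparisons above boil down to a few lines of arithmetic. Here is a minimal sketch in Python, using only the figures quoted in this post. The 7% completion rate is my assumed average rather than a measured number for Fast.ai, and the function and variable names are made up for illustration.

```python
# Back-of-envelope check of the supply-versus-demand figures quoted above.
# Every input is an estimate taken from the text, not measured data.

COURSE_STARTERS = 100_000        # people who started the Fast.ai course
ASSUMED_COMPLETION_RATE = 0.07   # assumed average MOOC completion rate

# Vacancy estimates from the two sources cited in the text.
VACANCY_ESTIMATES = {
    "job-ad study (AI/ML skills)": 10_000,
    "LinkedIn (data science skills)": 151_717,
}


def estimated_completers(starters: int, completion_rate: float) -> int:
    """Estimate how many starters finish, rounded to a whole person."""
    return round(starters * completion_rate)


def candidates_per_vacancy(completers: int, vacancies: int) -> float:
    """Ratio of estimated course completers to open positions."""
    return completers / vacancies


completers = estimated_completers(COURSE_STARTERS, ASSUMED_COMPLETION_RATE)
print(f"Estimated completers: {completers:,}")

for source, vacancies in VACANCY_ESTIMATES.items():
    ratio = candidates_per_vacancy(completers, vacancies)
    print(f"{source}: {ratio:.2f} completers per vacancy")
```

Taken alone, one course's completers fall short of either vacancy estimate. The point of writing it out is that the supply side is a single input here, and it is the one that is about to get much bigger.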

But neither of those numbers takes into account all of the programs and avenues for creating new data science candidates: MOOCs outside of Fast.ai like Coursera, over 10 nationwide bootcamps like Metis and General Assembly that have cohorts of 25 people every 12 weeks, remote degrees from places like UCLA, on-site undergraduate degrees in analytics and data science, YouTube, and more. There are also a large number of PhDs who, unable to find jobs in an extremely tight job market, are migrating from academia to data science.
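To put a rough floor under just the bootcamp figure, here is a small sketch. It assumes exactly 10 bootcamps (the text says "over 10"), one cohort at a time per bootcamp, and cohorts running back to back all year. Those are simplifying assumptions of mine, and the function name is hypothetical.

```python
# Rough lower bound on yearly bootcamp graduates from the figures in the text.
# Assumptions (not from any bootcamp's published data): exactly 10 bootcamps,
# one 25-person cohort at a time each, cohorts run back to back.

BOOTCAMPS = 10          # "over 10", so this is a floor
COHORT_SIZE = 25        # people per cohort
WEEKS_PER_COHORT = 12   # length of one cohort
WEEKS_PER_YEAR = 52


def yearly_bootcamp_graduates(
    bootcamps: int,
    cohort_size: int,
    weeks_per_cohort: int,
    weeks_per_year: int = WEEKS_PER_YEAR,
) -> int:
    """Count graduates per year, counting only cohorts that finish in the year."""
    cohorts_per_year = weeks_per_year // weeks_per_cohort
    return bootcamps * cohort_size * cohorts_per_year


graduates = yearly_bootcamp_graduates(BOOTCAMPS, COHORT_SIZE, WEEKS_PER_COHORT)
print(f"Bootcamp graduates per year (lower bound): {graduates:,}")
```

With these inputs that is four full cohorts a year per bootcamp, or 1,000 graduates a year at minimum, before counting multiple campuses, degrees, other MOOCs, or PhDs leaving academia.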

Here’s a third corroborating account, which noted that, in 2015, there were 40k job openings for data scientists. It estimated in general that the market supply for analytics skills (again, a much larger swath than data science, but still a point of comparison) would overcrowd the market by 2018.

Consider the amount of junior talent entering data science programs. Combine this with the hundreds of bootcamps putting on data science curricula, and, as someone looking for an industry to enter, you’re looking at a perfect storm.

On top of the gut feel that I have from working in the industry and talking to 100+ people who also do, these two tweets finally convinced me that there is a true data science supply bubble. First, this intro class tweet:

and UVA starting up a data science school.

Since academia is typically a lagging indicator in adopting new trends in the workplace, it’s been long enough that it’s truly worrying for junior data scientists, all of whom are hoping to find data science positions. It can be very hard for someone with a new degree in data science to find a data science position, given how many new people they’re competing with in the market.

This wasn’t the case even three or four years ago. But now that data science has changed from a buzzword to something even larger companies outside of the Silicon Valley bubble hire for, positions have not only become more codified but also come with more rigorous entry requirements that favor people with previous data science experience every time. Data science interviews are still very hard to get right, and still a complete mismatch for the jobs themselves.

As many blog posts point out, you won’t necessarily land your dream job on the first try. As a result, the market can be very hard, and very discouraging for the flood of beginners.

Data science as a misleading job req

The second issue is that once these junior people get to the market, they come in with an unrealistic set of expectations about what data science work will look like. Everyone thinks they’re going to be doing machine learning, deep learning, and Bayesian simulations.

This is not their fault; this is what data science curriculums and the tech media emphasize. Not much has changed since I first glanced, starry-eyed, at Hacker News logistic regression posts many, many moons ago.

The reality is that “data science” has never been as much about machine learning as it has been about cleaning and shaping data, and moving it from place to place.

A recent, extremely non-scientific survey I did confirms this: