Data analysis can bring a competitive advantage to your business, assisting in a better understanding of the product, customers, and competitors. An integral part of data analysis is data visualization. It can provide valuable information and help with its comprehension and correct interpretation.
Today we will perform exploratory data analysis that gives an interesting insight into a small, simple dataset. The purpose of this article is to find states with the most active startup ecosystems.
For this analysis, we used data from Angel.co. Angel.co is a platform where job-seekers can look for jobs at startups, and where investors and companies can find each other for partnerships. Using data from Angel.co for this kind of task has several advantages over other sources, in particular:
- It is one of the biggest platforms for startups, where you can find information about a large number of startups.
- It is a free platform. Information is publicly available, and you don’t need to pay to get access to it.
- It has a good hierarchy, meaning that you can go deeper and get more granular information on particular states that interest you.
So, to start off, we created an Excel file with data that we will later analyze in Tableau. We tested many hypotheses, and only one of them showed a good result. Below, we describe our best hypothesis with explanations of all the steps.
Angel.co provides 4 measures for each state, namely:
- Companies - the number of startups in the state.
- Investors - the number of investors that invest in companies.
- Followers - the number of followers who are interested in companies.
- Jobs - the number of jobs that are offered by companies in the state.
Exploratory data analysis
We are looking for the states with the best conditions for the development of startups, so we are most interested in the Companies and Jobs measures.
To begin with, let’s build a bar chart. As you can see, only two states have a significantly larger number of companies than the others: California and New York. The number of jobs is also high there.
As a matter of fact, California, home to Silicon Valley, is one of the world's leading IT regions, and New York is a leading business hub in the USA. So, these two states will be our target states for now.
However, these views are not representative: it is hard to see the difference between states with a small number of companies. So, we plot the absolute values of Companies and Jobs as a scatter plot, which helps to see the correlation between the variables and any clustering effects.
All of the states lie near the diagonal line, and the further they are from the origin, the more interesting they are for us. Our two target states are outliers in this scatter plot, and because of them it is hard to see the other states.
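To put a number on how closely points follow a diagonal, one option is the Pearson correlation coefficient. The snippet below is only a minimal sketch in Python, not part of our Tableau workflow: the `pearson` helper and the sample values are illustrative assumptions, and the numbers are placeholders rather than real Angel.co figures.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for four hypothetical states, NOT real Angel.co data.
companies = [100, 200, 400, 800]
jobs = [50, 110, 190, 420]

r = pearson(companies, jobs)
print(f"Correlation between Companies and Jobs: {r:.3f}")
```

A value close to 1 would match what the scatter plot suggests visually. Keep in mind that a couple of large outliers can dominate this coefficient, just as they dominate the plot.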
However, the picture still has some issues. We can’t tell whether there are states with a small number of companies that could still suit our goals. So we decided to weight the measures by the number of companies. This way, we can spot states with a small number of companies but a large number of jobs per company. That means that, despite the small number of companies in the state, those companies are very active and therefore interesting for us.
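The weighting itself is simple division: each measure is divided by the number of companies in the state. Here is a minimal sketch of that calculation in Python. We did the actual work in Excel and Tableau, so the `StateStats` class, the `weight_by_companies` function, and the sample values are hypothetical and for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateStats:
    """The four measures listed for a state (hypothetical container)."""
    name: str
    companies: int
    investors: int
    followers: int
    jobs: int


def weight_by_companies(stats):
    """Divide each measure by the number of companies in the state."""
    if stats.companies <= 0:
        raise ValueError(f"{stats.name}: cannot weight by zero companies")
    return {
        "investors_per_company": stats.investors / stats.companies,
        "followers_per_company": stats.followers / stats.companies,
        "jobs_per_company": stats.jobs / stats.companies,
    }


# Placeholder values, NOT real Angel.co data.
example = StateStats("Example State", companies=50, investors=20, followers=400, jobs=25)
weighted = weight_by_companies(example)
print(weighted["jobs_per_company"])  # 0.5
```

The explicit check for zero companies is a design choice of this sketch: a state without companies has no meaningful per-company value, so it is better to fail loudly than to plot a misleading point.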
Below, you can see the results of those calculations. Bar charts represent absolute values and dots represent weighted values.
This time, we again use a scatter plot, but with weighted values: jobs per company on the Y-axis and investors per company on the X-axis. Now everything falls into place. Our target states are in the top left corner, almost on the Y-axis, so all states that are close to the Y-axis also interest us.
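We picked the states close to the Y-axis visually in Tableau. If you wanted to make that selection reproducible, one possible approach is to keep states whose investors-per-company value is below some cutoff and sort them by jobs per company. The sketch below assumes exactly that; the `near_y_axis` function, the cutoff of 0.5, and the sample points are all hypothetical and are not the rule or the data behind our chart.

```python
def near_y_axis(points, max_investors_per_company):
    """Return (state, jobs_per_company) pairs close to the Y-axis.

    `points` maps a state name to a tuple
    (investors_per_company, jobs_per_company).
    The result is sorted by jobs per company, highest first.
    """
    selected = [
        (name, jobs_pc)
        for name, (investors_pc, jobs_pc) in points.items()
        if investors_pc <= max_investors_per_company
    ]
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Placeholder weighted values, NOT real Angel.co data.
points = {
    "State A": (0.2, 0.9),
    "State B": (1.5, 0.4),
    "State C": (0.3, 0.6),
}

# The cutoff is an arbitrary assumption; tune it to your own data.
shortlist = near_y_axis(points, max_investors_per_company=0.5)
for name, jobs_pc in shortlist:
    print(name, jobs_pc)
```

With these placeholder points, State A and State C pass the cutoff and State B does not.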
As a result, the most interesting states are New York, California, Massachusetts, District of Columbia, Washington, Illinois, New Jersey, Colorado and others, highlighted in blue.
In this post, we performed an exploratory data analysis and found the states with the best startup ecosystems. We showed how to make states that are inconspicuous at first glance stand out against the dominant California and New York. With the help of basic visualizations, we retrieved useful information from a simple dataset that contains only a small amount of data for every state.