Why Python for Data Science is the future

By Caroline Alexiou

Jump-start your career as a data scientist with Python

When it comes to data science, Python is a great language to master. First of all, it’s a language friendly to beginners — there are certain features in the language that make it easy to get started and develop prototypes quickly.

Moreover, Python has mature support for data science and countless other disciplines. That gives the data scientist the flexibility to build data processing solutions on well-designed libraries, solutions that interface easily with the rest of the increasingly complex infrastructure of modern companies.

And lastly, it has a large, supportive and growing community and a great future — both in terms of jobs being created and libraries being constantly improved. More on these points follows below:

Python has been designed to read as close to a natural language as possible, which is one of the reasons beginners find it easy to grasp. It is quite expressive and succinct, meaning it does not require a lot of lines of code to communicate an idea. Python requires virtually no boilerplate to get started, and can be quite forgiving, not requiring the programmer to do upkeep such as memory management. With those hurdles out of the way, the data scientist can focus on expressing their data-related thinking with code.
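
As a small illustration of that succinctness, here is a minimal sketch that counts the most frequent words in a sentence using only the standard library. The function name `top_words` and the sample sentence are made up for this example; they do not come from any particular project.

```python
from collections import Counter


def top_words(text, n=2):
    """Return the n most common lowercase words in text with their counts."""
    words = text.lower().split()
    return Counter(words).most_common(n)


# Hypothetical sample sentence, chosen so the counts are unambiguous.
sample = "data beats opinion and clean data beats messy data"
print(top_words(sample))  # [('data', 3), ('beats', 2)]
```

There is no class to declare, no types to spell out and no memory to free: the whole idea fits in a handful of lines.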

Python also has a REPL (literally, a read-eval-print loop shell). That means the programmer can use it as a sandbox to run and inspect code quickly, without always needing to recompile and rerun everything, which allows for rapid prototyping. The most popular version of the REPL used by data scientists right now is IPython.

In many ways, Python is the Swiss Army Knife of programming languages. It provides excellent libraries in a variety of disciplines directly related to data science, such as statistics, machine learning, data processing and data visualization, and many more in adjacent fields, such as scientific programming, image processing and web development.

Python: All the tools you need, in one place

For data science in particular, there is the pandas library, which offers powerful abstractions for dealing with time series and other types of tabular data. This library and many others (such as seaborn and scikit-learn) give the data scientist effective, tried and tested tools, so that they can focus on exploring the data.
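
To give a flavour of those time series abstractions, the following sketch assumes pandas is installed and uses made-up daily values (the numbers 1 to 14) in place of real measurements. It averages the daily readings into 7-day bins with `resample`.

```python
import pandas as pd

# Hypothetical data: 14 daily readings with the values 1..14.
days = pd.date_range("2024-01-01", periods=14, freq="D")
daily = pd.Series(range(1, 15), index=days)

# Group the readings into consecutive 7-day bins and average each bin.
weekly = daily.resample("7D").mean()

print(weekly)
```

The first bin covers the values 1 to 7 and the second covers 8 to 14, so the two averages are 4.0 and 11.0. The date handling and the grouping are done by the library rather than by hand-written loops.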

The fact that Python is multi-purpose means that it is easier to build data science tools that fit well into the larger infrastructure. As an example, a data scientist who is proficient in Python can more easily make the leap to a web server library such as Flask and provide an online version of their analysis tool.

The need to integrate data science with the web was recognized early on in the Python community. This demand resulted in a web version of the aforementioned IPython shell. The Jupyter Notebook, formerly known as the IPython Notebook, allows data scientists to share code, graphs and data tables in a single page, and to go back and forth trying out new ideas, changing the code and the resulting graphs. Since its first release in late 2011, it has helped data scientists do exploratory data analysis and share key results both internally and externally. To get a direct idea of how the Jupyter Notebook works, you can check this page of great examples.

As mentioned in the introduction, the Python community is large and beginner-friendly. In practice, that means it is relatively easy to search or post on Stack Overflow or similar websites and get help. It is also relatively simple for others to look at and understand your code, because the community has evolved to create the tools to make it so, and that makes getting help and feedback quite straightforward.

The demand for data scientists who are proficient in Python is already large and still growing. Google uses Python for a variety of tasks, and Airbnb, Pinterest and Spotify use it for their data pipelines. Countless startups (including some in Zürich) are using it for both prototyping and the final product. At one of the startups I worked at, we started from a Jupyter Notebook and ended up with a Python-orchestrated data pipeline for our analysis. And finally, according to the O’Reilly Data Science Survey, using Python as a data scientist is one of the two major factors that boost one’s salary.

The future of data science and Python is pretty exciting. Data science is a new field that is evolving rapidly. More and more abstractions are being developed to make it easier for the data scientist to focus on what really matters, which is to look at data critically, automating away as much of the work as possible.

Providing an easy-to-use and uncluttered interface to the programmer is baked into Python’s philosophy as a language, so I expect this trend to continue, with more great libraries and more ways of looking at data and sharing insights being developed. Python is here to stay in the data ecosystem: with the momentum the language has, along with the available tools and courses, there’s no better time to start learning it than now.