Why Everyone Should #DeletePython


Disclaimer: this article reflects my personal opinion only. Reader discretion is advised.

Figure: How Beemo changes batteries (Adventure Time). Cute but dangerous, just like working with Python 2 and Python 3 simultaneously.

OK. There, I said it. Everyone should consider #DeletePython.

Recently, people have raised significant concerns about Facebook’s practices around protecting user data, with discussions on Twitter under #DeleteFacebook. This inspired me to say something about another spicy topic: #DeletePython.

Well, I have to admit that “everyone” might be a bit of an exaggeration. The more accurate central message would be: everyone who wants to do decent, sustainable, modern statistics should consider R, not Python, as their primary language.

I know this may offend some Python lovers, but I do have my reasons, explained below.

For almost ten years, I have done almost everything that involves programming in R, without major bottlenecks. That includes frequentist/Bayesian statistics, (statistical) machine learning, recommender systems, bioinformatics, cheminformatics, data visualization, creating RESTful APIs, and building web applications. That should give you an idea of how few limits there are on what you can achieve with R.

As the challenger, Python is, I believe, a solid general-purpose programming language that could also help people do these things well. Notably, for “data science” tasks, many people from computer science backgrounds probably see Python as a nice, free replacement for the good old MATLAB (may it rest in peace) for doing machine learning. People coming from pure software engineering backgrounds might also want to try some statistical modeling in a language they are already familiar with.

However, all things being equal, I have to point out that Python has seemingly trivial but very real issues that block me from contributing to its ecosystem, or even from considering it a serious language, at least for “data science” purposes.

Every time I think about using some code written in Python, the first question that pops into my head is not about performance but: which Python version should I use, 2 or 3? As most people know, Python has had this ridiculous issue of two parallel versions/ecosystems with significant incompatibilities for a relatively long time.

In some Linux distributions, python means Python 3 and python2 means Python 2; in other distributions, python means Python 2 and you need python3 to call Python 3. Sometimes it is even more complex: with Homebrew Python, for example, you essentially deal with python, python2, and python3, along with pip, pip2, and pip3.

Recently, there have been genuine improvements as Python 3 adoption has progressed, yet mixing legacy code written for Python 2 with more recent modules written for Python 3 can still be a pain point. In R, by contrast, there is one and only one latest stable version, and installing that version is always the recommendation.

One side effect of the two interpreter versions is the chaotic version/package/environment management. Tools like pyenv, virtualenv, and conda were created to alleviate such issues, but I think they eventually made things worse for me. Most of the time we only need the latest working interpreter with the latest packages to run the code; it is that simple. We do not want to learn someone else’s configuration or fire up a completely different package manager just to reuse other people’s code.

The version incompatibility and package management issues will almost surely create technical, even political, problems within large organizations. Since many statistical procedures were designed to minimize unnecessary communication and reduce error-prone human factors, such issues should be considered harmful to statistical practice.

One of the most common operations when you code with data is copying one object to a new object and then modifying the new one. To do this in R, we merely need b = a. When we modify b, a will not be affected. Life is beautiful.
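A minimal sketch of what this looks like in an R session (the values are made up for illustration):

a <- c(1, 2, 3)
b <- a        # plain assignment; b = a works the same way
b[1] <- 99    # modifying b triggers a copy behind the scenes (copy-on-modify)
a             # a is untouched: 1 2 3
b             # b is now: 99 2 3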

In contrast, Python assignment merely binds a new name to the same mutable object. Every time, you need to remember special incantations to get an actual copy, for example b = a[:] for a shallow copy or b = copy.deepcopy(a) for a deep one. If not done correctly, you could end up with unexpected bugs in your code. Believe me, when we say copy something, we mean copy it, and we certainly do not want modifications to the new object to affect the original one, especially if we do this every day.

On a related note, R is largely a functional programming language, and its built-in copy-on-modify semantics make it much, much safer to modify values and improve the maintainability of the code. As long as you are not too careless and do not avoid vectorized code entirely, there will not be significant performance issues. In case you want to optimize, you will be rewarded for delicate mathematical and vectorized thinking, which is pretty cool. Eventually, such functional designs save human time, the more significant bottleneck in the long run.
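To give a hedged, toy illustration of that vectorized thinking (the data are randomly generated, and this is not a benchmark):

x <- rnorm(1e6)

s <- 0
for (xi in x) s <- s + xi^2   # loop style: verbose and slower

s2 <- sum(x^2)                # vectorized style: one expression, closer to the math
all.equal(s, s2)              # TRUE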

Python does not have built-in data structures suitable for doing statistics. You will almost surely need third-party data structures such as the NumPy array or the pandas DataFrame. Setting aside their interoperability and performance issues, once you use them, other people need to use the same libraries (sometimes the same versions…) to reuse and understand your code. That is not particularly convenient.

In contrast, I have trusted the vanilla data frame in R and all the robust base functions for a long time, not to mention the powerful tidyverse extensions (dplyr, pipes). In fact, the abstraction of vector, matrix, data frame, and list is brilliant. These data structures are designed and forged in such a way that they eventually become a natural part of the solution when people think about and solve statistical problems, without worrying much about exactly how to store the data or compute on it.
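As a small sketch, here is the same summary written with base R and with dplyr, using the built-in mtcars data frame (the dplyr phrasing is just one possible way to write it):

head(mtcars)                                      # a vanilla data frame, nothing to install
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)   # base R: mean mpg per number of cylinders

library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))                 # the same summary, tidyverse style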

Beyond that, I also love the vector-oriented design and thinking in R. Everything is a vector: factors are special vectors; matrices and arrays are vectors with dimension attributes; lists are recursive vectors; data frames are special lists, and thus special vectors. After a while, you will realize that treating the vector as the atomic data structure and building everything on top of it fits the paradigm of statistical thinking perfectly.
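A quick sketch of that abstraction in an interactive session (the objects are made up for illustration):

v <- c(1, 2, 3, 4, 5, 6)
f <- factor(c("a", "b", "a"))      # a factor is an integer vector plus a levels attribute
typeof(f)                          # "integer"
m <- v; dim(m) <- c(2, 3)          # a matrix is just a vector with a dim attribute
is.matrix(m)                       # TRUE
l <- list(x = v, y = f)            # a list is a recursive (generic) vector
is.vector(l)                       # TRUE
d <- data.frame(id = 1:3, g = f)   # a data frame is a list of equal-length vectors
typeof(d)                          # "list"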

I have personally tried almost every mainstream Python IDE, as well as editors with extensions. The conclusion is sad: none of them is good enough. Maybe my bar is too high, but they all seem to share some common issues, such as lacking essential features for working with data (think of an object inspector). The most critical problem is that most of them simply have bad taste and need better aesthetics in design.

For R, I do not think there is much need to explain: RStudio is perfect in almost all aspects. It holds up well even compared with the many IDEs for other languages. Even before RStudio, we had RKWard, a full-featured, beautiful R IDE from the KDE project.