Reposurgeon’s Excellent Journey and the Waning of Python

By Eric Raymond

Time to make it public and official. The entire reposurgeon suite (not just repocutter and repomapper, which have already been ported) is changing implementation languages from Python to Go. Reposurgeon itself is about 37% translated, with pretty good unit-test coverage. Three of my collaborators on the project (Daniel Brooks, Eric Sunshine, and Edward Cree) have stepped up to help with code and reviews.

I’m posting about this because the pressures driving this move are by no means unique to the reposurgeon suite. Python, my favorite working language for twenty years, can no longer cut it at the scale I now need to operate – it can’t handle large enough working sets, and it’s crippled in a world of multi-CPU computers. I’m certain I’m not alone in seeing these problems; if I were, Google, which used to invest heavily in Python (they had Guido on staff there for a while) wouldn’t have funded Go.

Some of Python’s issues can be fixed. Some may be unfixable. I love Guido and the gang and I am vastly grateful for all the use and pleasure I have gotten out of Python, but, guys, this is a wake-up call. I don’t think you have a lot of time to get it together before Python gets left behind.

I’ll first describe the specific context of this port, then I’ll delve into the larger issues about Python, how it seems to be falling behind, and what can be done to remedy the situation.

The proximate cause of the move is that reposurgeon hit a performance wall on the GCC Subversion repository. 259K commits, bigger than anything else reposurgeon has seen by almost an order of magnitude; Emacs, the runner-up, was somewhere a bit north of 33K commits when I converted it.

The sheer size of the GCC repository brings the Python reposurgeon implementation to its knees. Test conversions take more than nine hours each, which is insupportable when you’re trying to troubleshoot possible bugs in what reposurgeon is doing with the metadata. I say “possible” because we’re in a zone where defining correct behavior is rather murky; it can be difficult to distinguish the effects of defects in reposurgeon from those of malformations in the metadata, especially around the scar tissue from CVS-to-SVN conversion and near particularly perverse sequences of branch copy operations.

I was seeing OOM crashes, too – on a machine with 64GB of RAM. Alex, I’ll take “How do you know you have a serious memory-pressure problem?” for $400, please. I was able to head these off by not running a browser during my tests, but that still told me the working set is so large that cache misses are a serious performance problem even on a PC design specifically optimized for low memory-access latency.

I had tried everything else. The semi-custom architecture of the Great Beast, designed for this job load, wasn’t enough. Nor were accelerated Python implementations like cython (passable) or pypy (pretty good). Julien Rivaud and I did a rather thorough job, back around 2013, of hunting down and squashing O(n^2) operations; that wasn’t good enough either. Evidence was mounting that Python is just too slow and fat for work on really large datasets made of actual objects.

That “actual objects” qualifier is important because there’s a substantial scientific-Python community working with very large numeric data sets. They can do this because their Python code is mostly a soft layer over C extensions that crunch streams of numbers at machine speed. When, on the other hand, you do reposurgeon-like things (lots of graph theory and text-bashing) you eventually come nose to nose with the fact that every object in Python has a pretty high fixed minimum overhead.

Try running this program:

    from __future__ import print_function

    import sys
    print(sys.version)

    d = {
        "int": 0,
        "float": 0.0,
        "dict": dict(),
        "set": set(),
        "tuple": tuple(),
        "list": list(),
        "str": "",
        "unicode": u"",
        "object": object(),
    }
    for k, v in sorted(d.items()):
        print(k, sys.getsizeof(v))

Here’s what I get when I run it under the latest greatest Python 3 on my system:

    3.6.6 (default, Sep 12 2018, 18:26:19)
    [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
    dict 240
    float 24
    int 24
    list 64
    object 16
    set 224
    str 49
    tuple 48
    unicode 49

There’s a price to be paid for all that dynamicity and duck-typing that the scientific-Python people have evaded by burying their hot loops in C extensions, and the 49-byte per-string overhead is just the beginning of it. The object() size in that table is actually misleadingly low; an object instance is a dictionary with its own hash table, not a nice tight C-like struct with fields at fixed offsets. Field lookup costs some serious time.

Those sizes may not look like a big deal, and they aren’t – not in glue scripts. But if you’re instantiating 359K objects containing actual data the overhead starts to pile up fast.

Alas, I can’t emulate the scientific-Python strategy. If you try to push complex graph-theory computations into C your life will become a defect-riddled hell, for reasons I’ve previously described as greenspunity. This is not something you want to do, ever, in a language without automatic memory management.

Trying to break the GCC conversion problem into manageable smaller pieces won’t work either. This is a suggestion I’m used to hearing from smart people when I explain the problem. To understand why this won’t work, think of a Subversion repository as an annotated graph in which the nodes are (mainly) things like commit representations and the main link type is “is a parent of”. A git repository is a graph like that too, but with different annotations tied to a different model of revisioning.

The job of reposurgeon is to mutate a Subversion-style graph into a git-style graph in a way that preserves parent relationships, node metadata, and some other relations I won’t go into just now. The reason you can’t partition the problem is that the ancestor relationships in these graphs have terrible locality. Revisions can have parents arbitrarily far back in the history, arbitrarily close to the zero point. There aren’t any natural cut points where you can partition the problem. This is why the Great Beast has to deal with huge datasets in memory all at once.
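
To make the shape of the problem concrete, here is a minimal sketch in Go of that kind of node-and-parent-link structure. The Commit type and its fields are hypothetical stand-ins, not reposurgeon’s actual data model.

    package main

    import "fmt"

    // Commit is a hypothetical, stripped-down commit node; real repository
    // events carry far more metadata than this.
    type Commit struct {
        Mark    string    // unique identifier for the node
        Comment string    // commit message text
        Parents []*Commit // ancestry links, which can reach arbitrarily far back
    }

    func main() {
        root := &Commit{Mark: ":1", Comment: "initial revision"}
        trunk := &Commit{Mark: ":2", Comment: "trunk work", Parents: []*Commit{root}}
        // A late merge can point all the way back toward the root; nothing
        // bounds how far a parent link reaches, hence no natural cut point.
        merge := &Commit{Mark: ":100000", Comment: "late merge", Parents: []*Commit{trunk, root}}
        fmt.Println(merge.Mark, "has", len(merge.Parents), "parents")
    }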

My problem points at a larger Python issue: while there probably isn’t much work on large datasets using data structures quite as complex and poorly localized as reposurgeon’s, it’s probably less of an outlier in the direction of high overhead than scientific computation is in the direction of low. Or, to put it in a time-focused way, as data volumes scale up the kinds of headaches we’ll have will probably look more like reposurgeon’s than like a huge matrix-inversion or simulated-annealing problem. Python is poorly equipped to compete at this scale.

That’s a general problem in Python’s future. There are others, which I’ll get to. Before that, I want to note that settling on a new implementation language was not a quick or easy process. After the last siege of serious algorithmic tuning in 2013 I experimented with Common LISP, but that effort ran aground because it was missing enough crucial features that the gap from Python looked impractical to bridge. A few years later I looked, even more briefly, at OCaml; same problem, actually even worse.

I didn’t make a really serious effort to move sooner than 2018 because, until the GCC repository, I was always able to come up with some new tweak of reposurgeon or the toolchain underneath it that would make it just fast enough to cope with the current problem. But the problems kept getting larger and nastier (I’ve noted the adverse selection problem here). The GCC repo was the breaking point.

While this was going on, pre-GCC, I was also growing somewhat discontented with Python for other reasons. The most notable one at the time was the Python team’s failure to solve the notorious GIL (Global Interpreter Lock) problem. The GIL problem effectively blocks any use of concurrency on programs that aren’t interrupted by I/O waits. What it meant, functionally, was that I couldn’t use multithreading in Python to speed up operations like comment-text searches; those never hit the disk or network. Annoying…here I am with a 16-core hot-rod and reposurgeon can only use one (1) of those processors.

It turns out the GIL problem isn’t limited to non-I/O-bound workloads like mine, either, and it’s worse than most Python developers know. There’s a rather terrifying talk by David Beazley showing that the GIL introduces a huge amount of contention overhead when you try to thread across multiple processors – so much so that you can actually speed up your multi-threaded programs by disabling all but one of your processors!
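
For contrast, here is a minimal sketch of how a comment-text search fans out across cores in Go, with no interpreter lock in the way. The function and variable names are invented for illustration; this is not reposurgeon code.

    package main

    import (
        "fmt"
        "runtime"
        "strings"
        "sync"
    )

    // grepComments scans comment texts for a substring, splitting the work
    // across all available CPUs. Purely illustrative.
    func grepComments(comments []string, pattern string) []int {
        nworkers := runtime.NumCPU()
        chunk := (len(comments) + nworkers - 1) / nworkers
        var mu sync.Mutex
        var wg sync.WaitGroup
        var hits []int
        for lo := 0; lo < len(comments); lo += chunk {
            hi := lo + chunk
            if hi > len(comments) {
                hi = len(comments)
            }
            wg.Add(1)
            go func(lo, hi int) {
                defer wg.Done()
                for i := lo; i < hi; i++ {
                    if strings.Contains(comments[i], pattern) {
                        mu.Lock()
                        hits = append(hits, i)
                        mu.Unlock()
                    }
                }
            }(lo, hi)
        }
        wg.Wait()
        return hits
    }

    func main() {
        comments := []string{"fix build", "work around the GIL", "merge branch"}
        fmt.Println(grepComments(comments, "GIL"))
    }

On a 16-core machine those goroutines genuinely run in parallel; there is no interpreter lock serializing them.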

This of course isn’t just a reposurgeon problem. Who’s going to deploy Python for anything serious if it means that 15/16ths of your machine becomes nothing more than a space heater? And yet the Python devs have shown no sign of making a commitment to fix this. They seem to put a higher priority on not breaking their C extension API. This…is not a forward-looking choice.

Another issue is the Python 2 to 3 transition. Having done my bit to make it as smooth as possible by co-authoring Practical Python porting for systems programmers with reposurgeon collaborator Peter Donis, I think I have the standing to say that the language transition was fairly badly botched. A major symptom of the botchery is that the Python devs unnecessarily broke syntactic compatibility with 2.x in 3.0 and didn’t restore it until 3.2. That gap should never have opened at all, and the elaborateness of the kluges Peter and I had to develop to write polyglot Python even after 3.2 is an indictment as well.

It is even open to question whether Python 3 is a better language than Python 2. I could certainly point out a significant number of functional improvements, but they are all overshadowed by the – in my opinion – extremely ill-advised decision to turn strings into Unicode code-point sequences rather than byte sequences.

I felt like this was a bad idea when 3.0 shipped; my spider-sense said “wrong, wrong, wrong” at the time. It then caused no end of complications and backward-incompatibilities which Peter Donis and I later had to paper over. But lacking any demonstration of how to do better I didn’t criticize in public.

Now I know what “Do better” looks like. Strings are still bytes. A few well-defined parts of your toolchain construe them as UTF-8 – notably, the compiler and your local equivalent of printf(3). In your programs, you choose whether you want to treat string payloads as uninterpreted bytes (implicitly ASCII in the low half) or as Unicode code points encoded in UTF-8 by using either the “strings” or “unicode” libraries. If you want any other character encoding, you use codecs that run to and from UTF-8.

This is how Go does it. It works, it’s dead simple, it confines encoding dependencies to the narrowest possible bounds – and by doing so it demonstrates that Python 3 code-point sequences were a really, really bad idea.
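
A minimal illustration in standard Go (nothing reposurgeon-specific about it):

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        s := "naïve"                           // a Go string is just bytes; source literals are UTF-8
        fmt.Println(len(s))                    // 6: byte length, since "ï" encodes as two bytes
        fmt.Println(utf8.RuneCountInString(s)) // 5: code points, only when you ask for them
        for i, r := range s {                  // range decodes UTF-8 on demand
            fmt.Printf("byte offset %d: %c\n", i, r)
        }
        raw := []byte(s) // dropping to uninterpreted bytes is always available
        fmt.Println(raw[0], raw[1])
    }

The encoding dependency is confined to the range loop and the utf8 call; everything else treats the string as plain bytes.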

The final entry in our trio of tribulations is the dumpster fire that is Python library paths. This has actually been a continuing problem since GPSD and has bitten NTPSec pretty hard – it’s a running sore on our issue tracker, so bad that we’re seriously considering moving our entire suite of Python client tools to Go just to get shut of it.

The problem is that where on your system you need to put a Python library module so that a Python main program (or another library) can see and load it varies in only semi-predictable ways. By version, yes, but there’s also an obscure distinction between site-packages, dist-packages, and what for want of any better term I’ll call root-level modules (no subdirectory under the version directory) that different distributions and even different application packages seem to interpret in different and incompatible ways. The root of the problem seems to be that good practice is under-specified by the Python dev team.

This is particular hell on project packagers. You don’t know what version of Python your users will be running, and you don’t know what the contents of their sys.path (the library load-path variable) will be. You can’t know where your install production should put things so the Python pieces of your code will be able to see each other. About all you can do is shotgun multiple copies of your library to different plausible locations and hope one of them intersects with your user’s load path. And I shall draw a kindly veil over the even greater complications if you’re shipping C extension modules…

Paralysis around the GIL, the Python 3 strings botch, the library-path dumpster fire – these are signs of a language that is aging, grubby, and overgrown. It pains me to say this, because I was a happy Python fan and advocate for a long time. But the process of learning Go has shed a harsh light on these deficiencies.

I’ve already noted that Go’s Unicode handling implicitly throws a lot of shade. So does its brute-force practice of building a single self-contained binary from source every time. Library paths? What are those?

But the real reason that reposurgeon is moving to Go – rather than some other language I might reasonably think I could extract high performance from – is not either of these demonstrations. Go did not get this design win by being right about Unicode or build protocols.

Go got this win because (a) comparative benchmarks on non-I/O-limited code predict a speedup of around 40x, which is good enough and competitive with Rust or C++, and (b) the semantic gap between Python and Go seemed surprisingly narrow, reducing the expected translation time below what I could hope for from any other language on my radar.

Yes, Go’s static typing vs. Python’s dynamic typing seems like it ought to be a big deal. But there are several features that converge these languages enough to almost swamp that difference. One is garbage collection; the second is the presence of maps/dictionaries; and the third is strong similarities in low-level syntax.

In fact, the similarities are so strong that I was able to write a mechanical Python-to-Go translator’s assistant – pytogo – that produces what its second user described as “a good first draft” of a Go translation. I described this work in more detail in Rule-swarm attacks can outdo deep reasoning.
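
To give a feel for how narrow the gap is, here is a made-up example: a common Python idiom shown as a comment, with the kind of Go a pytogo-style first draft leaves you to polish underneath. The names are invented for illustration.

    package main

    import (
        "fmt"
        "sort"
    )

    func main() {
        // Python original:
        //   sizes = {"dict": 240, "set": 224, "list": 64}
        //   for k in sorted(sizes):
        //       print(k, sizes[k])
        sizes := map[string]int{"dict": 240, "set": 224, "list": 64}
        keys := make([]string, 0, len(sizes))
        for k := range sizes {
            keys = append(keys, k)
        }
        sort.Strings(keys)
        for _, k := range keys {
            fmt.Println(k, sizes[k]) // garbage-collected map, dict-like syntax
        }
    }

The Go version is longer, but structurally it is the same program, which is what makes a rule-based translator’s assistant feasible at all.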

I wrote pytogo at roughly the 22% mark of the translation (just short of 4,800 lines out of 14,000) and am now at 37% of about 16,000. The length of the Go plus commented-out untranslated Python has been creeping up because Go is less dense – all those explicit close brackets add up. I am now reasonably confident of success, though there is lots of translation left to do and one remaining serious technical challenge that I may discuss in a future post.

For now, though, I want to return to the question of what Python can do to right its ship. For this project the Python devs have certainly lost me; I can’t afford to wait on them getting their act together before finishing the GCC conversion. The question is what they can do to stanch more defections to Go, a particular threat because the translation gap is so narrow.

Python is never going to beat Go on performance. The fumbling of the 2/3 transition is water over the dam at this point, and I don’t think it’s realistically possible to reverse the Python 3 strings mistake.

But that GIL problem? That’s got to get solved. Soon. In a world where a single-core machine is a vanishing oddity outside of low-power firmware deployments, the GIL is a millstone around Python’s neck. Otherwise I fear the Python language will slide into shabby-genteel retirement the way Perl has, largely relegated to its original role of writing smallish glue scripts.

Smothering that dumpster fire would be a good thing, too. A tighter, more normative specification about library paths and which things go where might do a lot.

Of course there’s also a positioning issue. Having lost the performance-chasers to Go, Python needs to decide what constituency it wants to serve and can hold onto. That problem I can’t solve; all I can do is point out which technical problems are both seriously embarrassing and fixable. That’s what I’ve tried to do.

As I said at the beginning of this rant, I don’t think there’s a big window of time in which to act, either. I judge the Python devs do not have a year left to do something convincing about the GIL before Go completely eats their lunch, and I’m not sure they have even six months. They’d best get cracking.