A little over three years ago, I picked up a boxed set of the first three volumes of Donald Knuth's "The Art of Computer Programming" (TAOCP). I knew them by reputation before I ever even considered reading them — I knew, for example, that they had a reputation for being packed with dense mathematicalese along with a lot of brain-bendingly difficult problems. For whatever reason, that's the sort of odd challenge that appeals to me, so I went into the series with the intent to work every single problem. That turned out to be unrealistically ambitious - for one thing, there are a few problems that are marked as "research problems" that Knuth himself admits should take a motivated graduate student a semester to complete satisfactorily. There are even a few problems that he lists as "unsolved research problems" that may not even have a solution. I ended up settling for making at least a realistic attempt at every problem; even so, I was able to work far fewer than I expected I should have been able to. This is some seriously advanced material. I finally finished volume 3 last month, after three years of working as many problems as I was smart enough to work (which turned out to be fewer than I'd care to admit - but don't judge me until you try it yourself!)

I know that a lot of programmer types talk about these books — more than once I've seen them referred to as the books that everybody recommends but nobody actually reads. A lot of people are put off by the academic packaging or the cost; I can't disagree with either of those concerns. The books are heavily academic — to the point that they can and have been used as college textbooks. They're also expensive but, I think, worth the cost.

Another reason people sometimes suggest staying away from TAOCP is that the material is irrelevant in the 21st century. This is sort of a more difficult concern to counter, honestly, especially now that I've finished all three books. On the one hand, I know that the Sun developers who put together the Java programming language leaned heavily on Knuth's discussion of random number generation and large-number mathematical operations in volume 2. On the other hand, though, most of us won't be undertaking any projects like the Java programming language in our careers. It's neat to read and entertaining to know, but you can probably get along your whole life just using the PRNG and mathematical libraries built into your programming environment without ever really understanding what makes them tick.

Volume 3 actually contains both the most relevant along with the least relevant material. The most
relevant includes Knuth's discussion of sorting and searching routines. Of course, if you've studied
computer science at any level, you already have a decent handle on sorting algorithms; you probably
know that Quicksort is O(log n) and Bubblesort is O(n^{2}) and even if you can't quite remember
what that terminology means, you know that Quicksort is better than bubble sort. I remember studying
sorting algorithms as an undergraduate from Sedgewick; Sedgewick presented the worst-case behavior of
each of the sorting algorithms and sort of punted on the "average" case by presenting them but
explaining that their derivation was "beyond the scope of the book". As far as I know, besides papers published in inaccessible academic journals, the derivation of the average case of most
sorting routines is beyond the scope of *every* book except for TAOCP where he patiently builds
the concepts up from scratch. He does the same with many standard CS concepts like AVL trees and
hash codes as well.

However, the least relevant material to a modern programmer appears in the middle of volume 3, spanning about 100 pages: a detailed description on how to most efficiently use external tape drives to sort data that's too large to fit into main memory. As with most things Knuth, the presentation is fascinating and the theory is amazing. In short, the idea is to sort as much data as you can onto as many tapes as you have, and then merge the data from the tapes onto a final sorted tape. He considers the time needed to rewind the tapes as well as taking advantage of tape drives that can read and write both forward and backward, even accounting for things like rotational speed and the time it takes a tape to get rotating at full speed. As amazing as it is that anybody was ever able to work this out to such efficiency, I can't imagine how any of these algorithms could be applied outside of their original use case and if you really actually need to sort more data than you can fit into RAM, there are almost definitely better solutions than a bank of external tape drives. As carefully as I read all the other sections, I did find myself skimming through the more difficult exercises in this section.

Still, even setting aside the irrelevance of the middle section, it's as questionable how useful the rest of the content is to a modern software developer. It's almost as unlikely that any of us will be implementing our own sorting or searching routines as it is that we'll be developing the sorts of random number generators or unlimited precision math libraries that were the focus of volume 2. As interesting as the material is, these books are a huge time commitment, and if you don't have a lot of time to spare, you can probably spend it more "efficiently" on the study of machine learning algorithms or something else "enterprisey". If you do have the time, though, whether you actually apply the lessons from these books or not, it's fascinating to work through.

The last concern I hear (or read, anyway) from people is that the books not only use assembler for code examples, but use an imaginary assembler language that Knuth made up just for these books. Well, if you're on the fence about taking the plunge and diving into TAOCP and you're concerned about the cost, content, or relevance of these books, I can't say that I can reasonably set your mind at ease on any of these counts. However, if you're put off by the use of assembler, I can assure you that it's actually a relatively small part of the books. Further, all of the algorithms are presented first in what Knuth calls "pseudocode" but is nearly a compilable language. In fact, it's close enough to Python that translating it almost verbatim is practically a mechanical process.

Consider listing 1, below. This is Knuth's description of one of the more complex algorithms from volume 3, the AVL tree algorithm which keeps a binary tree balanced dynamically as new nodes are added to it. As you can see, Knuth has variables, structures, and looping constructs (although he doesn't seem to have had a chance to confer with Djikstra with regards to the use of GOTOs).

```
```The nodes of the tree are assumed to contain KEY, LLINK, and RLINK fields as in
Algorithm 6.2.2T. We also have a new field B(P) = balance factor of NODE(P) the height of the right subtree minus the height of the left subtree; this field
always contains either +1, 0, or -1. A special header node also appears at the top
of the tree, in location HEAD; the value of RLINK(HEAD) is a pointer to the root
of the tree, and LLINK(HEAD) is used to keep track of the overall height of the tree.
(Knowledge of the height is not really necessary for this algorithm, but it is useful
in the concatenation procedure discussed below.) We assume that the tree is nonempty,
namely that RLINK(HEAD) != ^. For convenience in description, the algorithm uses the notation LINK(a,P) as a synonym
for LLINK(P) if a = -1, and for RLINK(P) if a = +1. A1. [Initialize.] Set T <-> KEY(P), go to A4; and if K = KEY(P), the
search terminates successfully. A3. [Move left.] Set Q <-> KEY(P), set B(P) <->

Converting this description into working Python code doesn't even require a deep understanding of
the underlying algorithm. Knuth uses ^ to represent "NIL" (Python's `NONE`

), <->B(P) to represent the `B`

field of `P`

which most
languages indicate with `P.B`

. I've removed some of the exposition but kept the
essentials of the description in comments in the Python translation in listing 2, below. All of the
comments are Knuth's words, not mine.

```
```# The nodes of the tree are assumed to contain KEY, LLINK, and RLINK fields.
# We also have a new field
#
# B(P) = balance factor of NODE(P)
#
# the height of the right subtree minus the height of the left subtree; this field
# always contains either +1, 0, or -1. A special header node also appears at the top
# of the tree, in location HEAD; the value of RLINK(HEAD) is a pointer to the root
# of the tree, and LLINK(HEAD) is used to keep track of the overall height of the tree.
# We assume that the tree is nonempty, namely that RLINK(HEAD) != ^. class node: def __init__(self): self.KEY = None self.LLINK = None self.RLINK = None self.B = 0 HEAD = node()
HEAD.RLINK = node()
HEAD.RLINK.KEY = 'A' # For convenience in description, the algorithm uses the notation LINK(a,P) as a synonym
# for LLINK(P) if a = -1, and for RLINK(P) if a = +1. def LINK(a, P): if a == -1: return P.LLINK else: return P.RLINK def setLINK(a, P, val): if a == -1: P.LLINK = val else: P.RLINK = val def avl(K):
# A1. [Initialize.] Set T <-> KEY(P), go to A4; and if K = KEY(P), the
# search terminates successfully. while True: if K == P.KEY: return P elif K < P.KEY:
# A3. [Move left.] Set Q <-> P.KEY: Q = P.RLINK if Q is None: Q = node() P.RLINK = Q break else: if Q.B != 0: T = P S = Q P = Q
# A5. [Insert.] Set KEY(Q) <-> KEY(P), set B(P) <-> P.KEY: P.B = 1 P = P.RLINK # A7. [Balancing act.] Several cases now arise:
#
# i) If B(S) = 0, set B(S) <->

As you can see, I changed very little; I kept the same variable names (even with their sort of dated single-capital letter characteristics). The only really tricky part was uncovering the implicit loops defined by his GOTO statements since Python thankfully doesn't actually support the GOTO statement, but even that was relatively straightforward (and remember that this is one of the most complex algorithms described in the series). Otherwise, a handful of modifications:

- I replaced
`Q <=> in steps A3 & A4 with an actual dynamic allocation.`

- Since the implied "convenience function"
`LINK(a, P)`

appears on both the right and left side of assignments, I just went ahead and implemented it with two separate functions; once to function as a getter and another to function as a setter. - Knuth sidestepped the creation of the first node in an empty tree, so I did too, although this would have been trivial to address.
- In step A7, I skipped the step that kept track of the height of the tree, since it wouldn't work in Python, and isn't necessary to the algorithm; it would again be trivial to keep track of this somewhere else.

So there you have it - if you're still skeptical about tackling TAOCP and the only thing keeping you away is the use of Knuth's MIX Assembler, rest assured that you can skip all of the MIX and still understand everything that's presented. Still, you might start to change your mind about investing the time to learn MIX once you get started; it's actually a lot of fun to work through.