How not to design a wire protocol

By Eric Raymond

A wire protocol is a way to pass data structures or aggregates over a serial channel between different computing environments. At the very lowest level of networking there are bit-level wire protocols to pass around data structures called “bytes”; further up the stack streams of bytes are used to serialize more complex things, starting with numbers and working up to aggregates more conventionally thought of as data structures. The one thing you generally cannot successfully pass over a wire is a memory address, so no pointers.

Designing wire protocols is, like other kinds of engineering, an art that responds to cost gradients. It’s often gotten badly wrong, partly because of clumsy technique but mostly because people have poor intuitions about those cost gradients and optimize for the wrong things. In this post I’m going to write about those cost gradients and how they push towards different regions of the protocol design space.

My authority for writing about this is that I’ve implemented endpoints for nearly two dozen widely varying wire protocols, and designed at least one wire protocol that has to be considered widely deployed and successful by about anybody’s standards. That is the JSON profile used by many location-aware applications to communicate with GPSD and thus deployed on a dizzying number of smartphones and other embedded devices.

I’m writing about this now because I’m contemplating two wire-protocol redesigns. One is of NTPv4, the packet format used to exchange timestamps among cooperating time-service programs. The other is an unnamed new protocol in IETF draft, deployed in prototype in NTPsec and intended to be used for key exchange among NTP daemons authenticating to each other.

Here’s how not to do it…

NTPv4 is a really classic example of one extreme in wire protocol design. A base NTP packet is 48 bytes of packed binary blob that looks like this:

 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |LI | VN |Mode | Stratum | Poll | Precision | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Root Delay | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Root Dispersion | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Reference ID | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | + Reference Timestamp (64) + | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | + Origin Timestamp (64) + | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | + Receive Timestamp (64) + | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | + Transmit Timestamp (64) + | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 

The numbers are bit widths. If I showed you an actual packet dump it would be a random-looking blob of characters with no significance at the character level; only the bits matter.

It’s not very relevant to this episode what the detailed semantics of those fields are, though you can make some guesses from the names and probably be right; just think of it as a clock sample being passed around. The only two we’re going to care about here is VN, which is a three-bit protocol version field normally set to 0b100 = 4, and mode – three more bits of packet type. Most of the others are interpreted as binary numbers except for “Reference ID”, which is either an IPv4 address or a 4-character string.

Here’s a GPSD report that exemplifies the opposite extreme in wire-protocol design. This is an actual if somewhat old Time-Position-Velocity packet capture from the GPSD documentation:

 {"class":"TPV","time":"2010-04-30T11:48:20.10Z","ept":0.005, "lat":46.498204497,"lon":7.568061439,"alt":1327.689, "epx":15.319,"epy":17.054,"epv":124.484,"track":10.3797, "speed":0.091,"climb":-0.085,"eps":34.11,"mode":3} 

Those of you with a web-services background will recognize this as a JSON profile.

You don’t have to guess what the principal fields in this report mean; they have tags that tell you. I’ll end the suspense by telling you that “track” is a course bearing and the fields beginning with “e” are 95%-confidence error bars for some of the others. But again, the detailed field semantics don’t matter much to this episode; what we want to do here is focus on the properties of the GPSD protocol itself and how they contrast with NTPv4.

The most obvious difference is discoverability. Unless you know you’re looking at an NTP packet in advance, seeing the data gives you no clue what it means. On the other hand, a GPSD packet is full of meaning to the naked eye even if you’ve never seen one before, and the rest is pretty transparent once you know what the field tags mean.

Another big difference is bit density. Every bit in an NTPv4 packet is significant; you’re squeezing the information into as short a packet as is even theoretically possible. The GPSD packet, on the other hand, has syntactic framing and tags that tell you about itself, not its payload.

These two qualities are diametrically opposed. The bits you spend on making a wire protocol discoverable are bits you’re not spending on payload. That both extremes exist in the world is a clue: it means there’s no one right way to do things, and the cost gradients around wire protocols differ wildly in different deployments.

Before I get to a direct examination of those cost gradients I’m going to point out a couple of other contrasting properties. One is that the base NTPv4 packet has a fixed length locked in; it’s 48 bytes, it’s never going to be anything but 48 bytes, and the 32- or 64-bit precision of the numeric fields can never change. The GPSD packet embodies the opposite choice; on the one hand it is variable-length as the number of decimal digits in the data items change, on the other hand it is quite easy to see how to ship more precision in the GPSD packet if and when it’s available.

Hardware independence is another important difference. A decimal digit string is a decimal digit string; there’s no real ambiguity about how to interpret it, certainly not if you’ve ever seen a JSON-based protocol before. The binary words in an NTPv4 packet, on the other hand, may need to be byte-swapped to turn into local machine words, and the packet itself does not imply what the right decoding is. You need to have prior knowledge that they’re big-endian…and getting this kind of detail wrong (byte-swapping when you shouldn’t, or failing to when you should) is a really notorious defect attractor.

More generally, these protocols differ greatly in two related qualities; extensibility is one. The other doesn’t have a term of art; it’s whether data encoded in the protocol can mix gracefully with other payload types traveling on the same wire. I’ll call it “sociability”.

(And why does sociability matter? One reason is because the friction cost of poking new holes for new protocols in network firewalls is considerable; it triggers security concerns. This is why so much stuff is multiplexed on HTTP port 80 these days; it isn’t only for convenience with browsers.)

Adding a new field to a JSON datagram, or more generally any other kind of self-describing protocol), is not difficult. Even if you’ve never seen JSON before, it’s pretty easy to see how a new field named (say) “acceleration” with a numeric value would fit in. Having different kinds of datagrams on the wire is also no problem because there’s a class field. GPSD actually ships several other reports besides TPV over the same service port.

It’s trickier to see how to do the analogous things with an NTPv4 packet. It is possible, and I’m now going to walk you through some fairly painful details not because they’re so important in themselves but because they illustrate some systematic problems with packed binary protocols in general. There will be no quiz afterwards and you can forget them once you’ve absorbed the general implications.

In fact NTPv4 has an extension-field mechanism, but it depends on a quirk of the transmission path: NTPv4 packets are UDP datagrams and arrive with a length. This gives you a dodge; if you see a length longer than 48 bytes, you can assume the rest is a sequence of extension fields. Here’s what those look like:

 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Type field | Payload length | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | | Payload (variable) | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 

Good luck eyeballing that! It’s simple in concept, but it’s more binary blob. As with the base packet, you need a special tool like wireshark and a detailed spec in front of you just to interpret the type fields, let alone whatever wacky private encodings get invented for the payload parts.

Actually, this last section was partly a fib. Detecting NTPv4 extension fields is tricky because it interacts with a different, older extension – an optional cryptographic signature which can itself have two different lengths:

 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Key Identifier | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | | dgst (128 or 160) | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 

It is possible to work out whether one or both kinds of extension are present by doing some tricky modular arithmetic, but I’ve tortured you enough without getting into exactly how. The thing to take away is that gymnastics are required compared to what it takes to add extensions to a JSON-based protocol, and this isn’t any accident or evidence that NTPv4 is especially ill-designed. This kind of complexity is generic to packed binary protocols, and that has implications we’ll focus in on when we got to cost gradients.

In fact NTPv4 was not badly designed for its time – the Internet protocol design tradition is pretty healthy. I’ve seen (and been forced by standards to implement) much worse. For please-make-it-stop awfulness not much beats, for example, the binary packet protocol used in Marine AIS (Automatic Identification system). One of its packet types, 22 (Channel Management), even has a critical mode bit controlling the interpretation of an address field located after the address field rather than before. That is wrap-a-tire-iron-around-somebody’s-head stupid; it complicates writing a streaming decoder and is certain to attract bugs. By comparison the NTPv4 design is, with all its quirks, quite graceful.

It is also worth noting that we had a narrow escape here. UDP protocols are now unusual, because they have no retransmission guarantees. Under TCP, you don’t get a whole datagram and a length when you read off the network. A TCP equivalent of the NTPv4 packet protocol would either have been fixed at 48 bits no extension forever or have needed to give you some way to compute the expected packet length from data that’s within a minimum-size distance of the start of packet.

JSON evades this whole set of complications by having an unambiguous end delimiter. In general under TCP your packets need to have either that or an early length field. Computing a length from some constellation of mode bits is also available in principle, but it’s asking for trouble. It is…say it with me now…a defect attractor. In fact it took six years after the NTPv4 RFC to issue a correction that clarified the edge cases in combination of crypto-signature and extensions.

What about sociability? The key to it is those version and mode fields. They’re at fixed locations in the packet’s first 32-bit word. We could use them to dispatch among different ways of interpreting everything part those first 8 bits, allowing the field structure and packet length to vary.

NTPv4 does in fact do this. You might actually see two different kinds of packet structure on the wire. The diagram above shows a mode 2 or 3 packet; there’s a mode 6 packet used for control and monitoring that (leaving out an optional authentication trailer) looks like this instead:

 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |LI | VN | 6 |R|E|M| Opcode | Sequence | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Status | Association ID | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Offset | Count | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | . . . Payload (variable) . . . | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 

The count field tells you the length of the variable part. Self-description!

Two packet structures, seven potential mode values. You might be wondering what happened to the other five – and in fact this illustrates one of the problems with the small fixed-length fields in packed-binary formats. Here’s the relevant table from RFC5905:

 +-------+--------------------------+ | Value | Meaning | +-------+--------------------------+ | 0 | reserved | | 1 | symmetric active | | 2 | symmetric passive | | 3 | client | | 4 | server | | 5 | broadcast | | 6 | NTP control message | | 7 | reserved for private use | +-------+--------------------------+ 

You don’t have to know the detailed meanings of all of these to get that the mode field mixes information about the packet structure with other control bits. In fact values 1 through 5 all have the same structure, mode 6 has the one I just diagrammed, and all bets are off with mode 7.

When you’re optimizing for highest bit tendency – which is what they were doing in 1985 when this protocol was originally designed – the temptation to do this sort of thing is pretty near irresistible. The result, 34 years later, is that all the bits are taken! We can’t hardly get any more multivalent with this field without committing a backward incompatibility – not a really safe thing to do when there are lots of big-iron legacy implementations still out there, pinned in place by certification requirements and sheer bureaucratic inertia.

OK, in theory we could claim mode 0. But I’ve seen several of the decoders out there and I would not warrant that a mode 0 packet won’t ever slip past anyone’s sanity checks to be misinterpreted as something else. On the other hand, decoders do check the version field; they have to, because versions 0 to 3 have existed and there could be ancient time servers out there still speaking them. So the version field gives us a way out; as long as the version field reads 5, 6, or 7, the rest of the packet part that first byte could look like anything we like and can write an RFC for.

I’ve walked you through this maze to illustrate an important point: packed binary formats are extremely brittle under the pressure of changing requirements. They’re unsociable, difficult to extend without defect-attracting trickery, and eventually you run into hard limits due to the fixed field sizes.

NTP has a serious functional problem that stems directly from this. Its timestamps are 64-bit but only half of that is whole seconds; those counters are going to wrap around in 2036, a couple of years before the more widely anticipated Unix-timestamp turnover in 2038. In theory the existing implementations will cope with this smoothly using more clever modular arithmetic. In practice, anybody who knows enough to have gamed out the possible failure scenarios is nervous…and the more we know the more nervous-making it gets.

This is why I’m thinking about NTPv5 now. 2019 is not too soon. Closing the circle, all this would have been avoided if NTP timestamps had looked like “2010-04-30T11:48:20.10Z”, with variable-length integer and decimal parts, from the beginning. So why wasn’t it done that way?

To address that question let’s start by looking at where the advantages in self-describing textual formats vs. packed binary ones stack up. For self-describing: auditability, hardware independence, extensibility, and sociability. For packed binary: highest possible bit density.

A lot of people would adder “faster, simpler decoding” to the list of advantages for packed binary. But this (at least in the “simpler” part) is exactly where peoples’ design intuitions often start to go wrong, and the history of NTPv4 demonstrates why. Packed protocols start out with “simpler”, but they don’t stay that way. In the general case you always end up doing things like tricky modular arithmetic to get around those fixed limits. You always exhaust your mode-bit space eventually. The “faster” advantage does tend to be stable over time, but the “simpler” does not.

(And oh, boy, will you ever have this lesson pounded home to you if, as I did, you spend a decade on GPSD implementing decoders for at least nineteen different GPS wire protocols.)

If you are a typically arrogant software engineer or EE, you may be thinking at this point “But I’m smarter! And I’ve learned from the past! I can design an optimum-bit-density wire-format that avoids these problems!”

And this is my forty years of field experience, with specific and proven domain expertise in wire-protocol design, telling you: This. Never. Happens. The limitations of that style of protocol are inherent, and they are more binding than you understand. You aren’t smart enough to evade them, I’m not, and nobody else is either.

Which brings us back to the question of why NTPv4 was designed the way it was. And when it is still appropriate to design wire protocols in that packed-binary style. Which means that now it’s time to look at cost gradients and deployment environments.

One clue is that the NTP wire protocol was designed decades ago when computing cycles and bits-per-second on the wire were vastly more expensive than they are now. We can put numbers on that. NTP was designed under the cost profile of early ARPANET to operate well with connection speeds not much higher than 50Kbps. Today (2019) average U.S broadband speeds are 64Mps. That’s a factor of 10^3 difference. Over the same period processor speeds have gone up by about 10^3-10^4. There’s room for argument there based on different performance measures, but assuming the low end of that range we’re still looking at about the same cost change as bits on the wire.

Now let me throw an interesting number at you that I hope brings home the implications of that change. A few weeks ago we at NTPsec had an email conversation with a guy who is running time service out of the National Metrology Institute in Germany. This is undoubtedly Deutschland’s most heavily loaded Stratum 1 NTP provider.

We were able to get his requests per-second figure, do a bit of back-of-the-envelope calculation, and work out that the production NTP load on a a national time authority for the most populous nation in Europe (excluding transcontinental Russia) wouldn’t come even close to maxing out my home broadband or even just one of the Raspberry Pi 3s on the windowsill above my desk. With all six of them and only a modest bandwidth increase I could probably come pretty close to servicing the Stratum 2 sites of the entire planet in a pinch, if only because time service demand per head is so much lower outside North America/Europe/Japan.

Now that I have your attention, let’s look at the fundamentals behind this. That 10^3 drop tracks the change in one kind of protocol cost that is basically thermodynamic. How much power do you have to use, what kind of waste heat do you generate, if you throw enough hardware at your application to handle its expected transaction load? Most of what are normally thought of as infrastructure costs (equipping your data center, etc.) are derivative of that thermodynamic cost. And that is the cost you are minimizing with a packed binary format.

In the case of NTP, we’ve just seen that cost is trivial. The reason for this is instructive and not difficult to work out. It’s because NTP transaction loads per user are exceptionally low. This ain’t streaming video, folks – what it takes to keep two machines synchronized is a 48-byte call and a 48-byte response at intervals which (as I look at a live peers display just now) average about 38 minutes.

There’s just a whisper of a nuance of a hint there that just mmmaybe, three decades after it was first deployed, optimizing NTP for bit density on the wire might not be the most productive use of our effort!

Maybe, in another application with 10^3 more transaction volume per user, or with a 10^3 increase in userbase numbers, we’d incur as much thermodynamic cost as landed on a typical NTP server in 1981, and a packed binary format would make the kind of optimization sense it did then. But that was then, this is now, and peoples’ intuitions about this tend to be grossly out of whack. It’s almost as though a lot of software engineers and EEs who really ought to know better are still living in 1981 even when they weren’t born yet.

OK, so what should we be optimizing NTP for instead? Take a moment to think about this before you keep reading, because the answer is really stupid obvious.

We should be designing to minimize the cost of human attention. Like thermodynamic cost, attention cost unifies a lot of things we normally think of as separate line items. Initial development. Test. Debugging. Coping with the long-term downstream defect rate. Configuring working setups. Troubleshooting not-quite-working setups. And – this is where you should start to hear triumphant music – dealing gracefully with changes in requirements.

It is also a relevant fact that the cost of human attention has not dropped by 10^3 along with thermodynamic cost per unit of information output since 1981. To a good first approximation it has held constant. Labor-hours are labor-hours are labor-hours.

Now let’s review where the advantages of discoverable/textual formats are. Auditability. Hardware independence. Sociability. Extensibility. These are all attention-cost minimizers. They’re, very specifically, enablers of forward design. In the future you’ll always need what a Go player calls “aji” (potential to extend and maneuver). Discoverable textual wire protocols are good at that; packed binary protocols are bad at it.

But I’m absolutely not here to propose a cost model under which discoverability is in a simple linear struggle with bit-density that discoverability always wins in the end. That’s what you might think if you notice that the ratio between attention cost and thermodynamic cost keeps shifting to favor discoverability as thermodynamic cost falls while attention cost stays near constant. But there’s a third factor that our NTP estimation has already called out.

That factor is transaction volume. If you pull that low enough, your thermodynamic costs nearly vanish and packed binary formats look obviously idiotic. That’s where we are with NTP service today. Consequently, my design sketch for NTPv5 is a JSON profile.

On the other hand, suppose you’re running a Google-sized data center, the kind that’s so big you need to site it very near cheap power as though it were an aluminum smelter. Power and heat dissipation are your major running costs; it’s all about the thermodynamics, baby.

Even that kind of deployment, NTP service will still be thermodynamically cheap. But there will be lots of other wire protocols in play that have transaction volumes many orders of magnitude higher…and now you know why protocol buffers, which are sure enough packed binary, are a good idea.

The thought I want to leave you all with is this: to design wire protocols well, you need to know what your cost drivers really are, how their relative magnitudes stack up. And – I’m sorry, but this needs emphasizing – I constantly run into engineers (even very bright and capable ones) whose intuitions about this are spectacularly, ludicrously wrong.

You, dear reader, might be one of them. If it surprised you that a credit-card-sized hobby computer could supply Stratum 1 service for a major First-World country, you are one of them. Time to check your assumptions.

I think I know why people get stuck this way. It’s what Frederic Bastiat called a “things seen versus things not seen” problem in economic estimation. We over-focus on metrics we can visualize, measure, and estimate crisply; thermodynamic costs and the benefits of perfect bit density tends to be like that. Attention costs are squishier and more contingent, it’s more difficult to value options, and it’s especially easy to underestimate the attention cost of having to do a major re-engineering job in the future because the design you’re sketching today is too rigid and didn’t leave you the option to avoid a disruption.

One of the deeper meanings of the quote “Premature optimization is the root of all evil” (often misattributed to Donald Knuth but actually by Tony Hoare), is that you should constantly beware of doing that. Nassem Taleb, the “Black Swan” guy, would rightly call it fragilista behavior, over-planner’s arrogance. In the real world, aji usually beats arrogance – not every time, but that’s the way to bet.