Thread by @colmmacc: "O.k. time for a mini-tweet thread which is all about network and TCP optimization! How can we make connections fast and reliable? and what's […]"


O.k. time for a mini-tweet thread which is all about network and TCP optimization! How can we make connections fast and reliable? And what's really going on, anyway? It's all way too confusing, so let's demystify it a bit.

So I want to start at a really high level and build a useful mental model. So you've got some data that needs to get from A to B. Let's go with something pretty common, like you're trying to download a movie. The movie is out there on some server, and you want it on your FireTV.

Because *obviously* you use FireTV, because it's awesome and better than all of the other options and I'm totally unbiased about that.

Anyway, to get that movie from Amazon Video to you, we have to break it into pieces ... that we call packets. Now, this is where we start to need a mental model of how networks really work.

A common mental model is that networks are like a series of pipes (or tubes, as a Senator famously called it) connected like streams and tributaries ... and information "flows" in these pipes and it's all sort of like a liquid.

this model is kinda half right, but it's wrong in a crucial way. It suggests that if you have a 1 gigabit/sec pipe, you can have ten 100 megabit/sec flows all flowing "simultaneously", like a kind of stacking or mixing.

That's not what's really going on. The internet and modern network pipes are actually queues, and that difference is important. Imagine instead that each pipe is really like a long train, going in a loop, and so long and full of railcars that the front and end meet.

🚂🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃 🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚂

Maybe another metaphor that might be useful: each link or pipe is like its own sushi conveyor belt. 🍣🍣🍣🍣🍣

Now, let's think about the link with this model. We can move data from one end to the other by putting it on one end and letting the belt/train get it there. We just have to wait for a free "slot".

How long it takes to get there depends on two things: how soon a free slot comes up, and how far the distance is. But there's one more wrinkle: our packet might not make it at all. It might fall off the train along the way! or it might have to wait too long for a slot and give up.

O.k., so let's take this model and use it to see how TCP works and why! So TCP, the transmission control protocol, takes data like our movie and moves it from A to B in order and reliably.

It does this by breaking the movie into packets. Packets are better than sending the whole movie because if you're going to lose some data along the way, it's better to resend only a small amount of it than the whole thing.
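If you want to see why that matters, here's a quick back-of-envelope sketch. The movie size, packet size, and the idea that one loss forces a resend are all just illustrative assumptions, not real protocol behavior:

```python
# Back-of-envelope: the cost of one loss if we had to resend the whole
# movie, vs. resending just one packet. Numbers are made up.

movie_bytes = 4 * 1024**3    # a hypothetical 4 GiB movie
packet_bytes = 1500          # a typical Ethernet-sized packet payload

full_resend_cost = movie_bytes     # no packets: pay for everything again
packet_resend_cost = packet_bytes  # packets: pay only for the lost one

print(full_resend_cost // packet_resend_cost)  # ~2.8 million times cheaper
```

Same loss, wildly different cost, which is the whole point of breaking things into small pieces.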

Now we know our packets might not make it, so TCP says that packets have to be acknowledged, which just means that the other end says "yes I got your packet" when it gets a packet.

So let's say we put a packet on the sushi belt, and it trundles along, and then it gets to the other end, and the other end takes it, and puts its own acknowledgment packet on the belt headed back to us. If we don't see that reply, we know to resend. O.k. great.
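That send-wait-resend loop is easy to sketch in code. This is a toy "stop-and-wait" simulation, not real TCP: losses are simulated with a seeded random number generator, and the loss rate is a made-up number:

```python
import random

# Toy stop-and-wait sender: put one packet on the "belt", wait for the
# ack, resend if it never shows up. A sketch, not real TCP.

def send_reliably(packets, loss_rate=0.2, rng=None):
    rng = rng or random.Random(42)  # seeded so runs are repeatable
    delivered = []
    attempts = 0
    for pkt in packets:
        while True:
            attempts += 1
            if rng.random() > loss_rate:  # packet (and its ack) made it
                delivered.append(pkt)
                break
            # otherwise: no ack before the timeout, go around and resend

    return delivered, attempts

data = list(range(10))
delivered, attempts = send_reliably(data)
print(delivered == data)  # True: everything arrives, in order
print(attempts)           # more than 10, because of resends
```

Notice the cost: only one packet is ever "in flight", so the belt is almost entirely empty slots.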

That's actually enough to get some reliability, but it's really slow. We'd really like to fill the belt with packets, and get much more data across in the same amount of time. To do this, TCP has something called "windows".

The window says "hey, feel free to send me this many packets, even if I haven't acknowledged them yet". Ideally you want the window to be as big as the number of free slots between you and the recipient.
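That "number of free slots between you and the recipient" has a name: the bandwidth-delay product. Here's a quick calculation for a hypothetical link (the bandwidth and round-trip time are made-up example numbers):

```python
# The ideal window is roughly the bandwidth-delay product (BDP):
# how many bytes fit "on the belt" between sender and receiver.

link_bits_per_sec = 1_000_000_000  # a hypothetical 1 Gbit/s path
rtt_sec = 0.080                    # 80 ms round trip

bdp_bytes = link_bits_per_sec / 8 * rtt_sec
print(int(bdp_bytes))  # 10,000,000: you'd want roughly a 10 MB window
```

A 10 MB window is far bigger than many OS defaults, which is exactly why tuning matters on big fast networks.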

So TCP tries to find this value: it starts out slow, and increases until it detects a drop - a packet that didn't make it, which it assumes is because there aren't enough slots. When that happens it reduces (often by a lot) the size of the window. It sort of "homes in".
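Here's a cartoon of that homing-in behavior. Real TCP congestion control (Reno, CUBIC, etc.) is much more subtle; this just shows the shape: double the window each round-trip, then back off hard on a drop and grow gently after that. The link capacity is a made-up number:

```python
# A cartoon of TCP finding the window size: grow fast, drop, back off,
# then probe gently. Not any real congestion control algorithm.

def probe_window(capacity=100, rounds=14):
    cwnd, ssthresh = 1, float("inf")
    history = []
    for _ in range(rounds):
        history.append(cwnd)
        if cwnd > capacity:               # too many packets for the slots: drop
            ssthresh = max(2, cwnd // 2)  # remember half as the new ceiling
            cwnd = ssthresh
        elif cwnd < ssthresh:
            cwnd *= 2                     # "slow start": double each round-trip
        else:
            cwnd += 1                     # careful probing: +1 each round-trip
    return history

print(probe_window())
# [1, 2, 4, 8, 16, 32, 64, 128, 64, 65, 66, 67, 68, 69]
```

You can see the "homing in": exponential growth, one overshoot, then a sharp cut followed by cautious creeping upward.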

On modern big fast networks, like AWS's, the default for this window size on common OS's is often too small, as is the number of "slots" in other layers. This is why increasing the TCP window size, and the TCP read/write buffers, and the ethernet queue lengths can be dramatic.
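If you want to poke at the buffer part of this from an application, here's a sketch using standard socket options. The 4 MB figure is just an example; the kernel may clamp what you ask for to its configured maximums (on Linux, sysctls like net.core.rmem_max), so system-level tuning is often needed too:

```python
import socket

# Ask the OS for bigger TCP send/receive buffers on one socket.
# The kernel is free to clamp these to its configured maximums.

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4 * 1024 * 1024)

# Read back what the kernel actually granted (Linux reports a doubled
# value, since it reserves extra space for bookkeeping).
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
sock.close()
```

Modern kernels also auto-tune these buffers per-connection, so check what your OS already does before hard-coding sizes.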

O.k. the next thing our model shows is that acknowledging every packet is kind of dumb; sending back a sushi roll for every one that you receive is just wasteful. We can acknowledge what we got, and what we didn't get, in batches!

The modern form of this is selective acknowledgements (SACK) ... where we can basically just scribble a note back on the sushi belt that tells the sender "hey, here's what I got and didn't get". The sender then re-transmits only what it needs to.
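The bookkeeping behind that note is simple to sketch. Real SACK lives in TCP header options and reports byte ranges; this hypothetical helper just shows the idea at the packet level - from the set of packets that arrived, work out the holes to retransmit:

```python
# A sketch of the SACK idea: the receiver knows which packets arrived,
# and the sender retransmits only the holes. Not the real wire format.

def missing_ranges(received, total):
    """Given the set of packet numbers that arrived, list the gaps."""
    gaps, start = [], None
    for seq in range(total):
        if seq not in received and start is None:
            start = seq                  # a hole begins
        elif seq in received and start is not None:
            gaps.append((start, seq - 1))  # the hole just ended
            start = None
    if start is not None:
        gaps.append((start, total - 1))  # hole runs to the end
    return gaps

# Packets 3 and 7-8 fell off the belt:
received = {0, 1, 2, 4, 5, 6, 9}
print(missing_ranges(received, 10))  # [(3, 3), (7, 8)]
```

Two small retransmits instead of resending everything from packet 3 onward - that's the win.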

So if you do those two things: make your window size big, and make sure selective acknowledgements are on, you can make a big difference to your performance! That video can get to you more quickly.

Of course we do this kind of stuff ourselves for our own services, but if you're transferring data between your own machines or whatever, take a look!

now let's extend the model: I said that a pipe or link is like a sushi belt, but there are lots of links interconnecting! So it's like a stadium full of sushi belts, with packets hopping belts. It's like Tim Burton and the Coen brothers made a movie together.

when the belts interconnect, they might be moving at different speeds, or one might have less capacity than the other, so we have little holding areas, we call these "buffers".

Generally packets enter and leave these buffers in order, but in some networks you can have priority lanes here, giving priority to some packets over others. At AWS, our belts move so quickly and there are so many free slots that we don't need to do this, it'd be pointless.

But if there is congestion, and slots are busy, it's because senders are sending too much; it's key that they know quickly, so we make these buffers small. The problem of having these buffers too big is called buffer bloat (bufferbloat.net/projects/bloat…)
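The arithmetic behind bufferbloat is worth seeing once. A packet arriving at the back of a full buffer waits for everything ahead of it to drain through the link. The buffer size and link speed here are hypothetical, but in the right ballpark for a home router:

```python
# Why big intermediate buffers hurt: a packet at the back of a full
# buffer waits buffer_bytes / link_rate before it even leaves.

buffer_bytes = 1 * 1024 * 1024   # a hypothetical 1 MiB router buffer
link_bits_per_sec = 10_000_000   # a hypothetical 10 Mbit/s uplink

queueing_delay_sec = buffer_bytes * 8 / link_bits_per_sec
print(round(queueing_delay_sec * 1000))  # ~839 ms of added latency!
```

Nearly a second of extra delay, and worse: the sender doesn't find out it's overdoing it until all of that drains, which is exactly why small buffers between links give faster feedback.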

This can all be very confusing without the right mental model. Because we generally want one kind of buffer - the window size - to be big, but another kind of buffer - the buffers between links - to be small.

But if you see the network as train-cars or sushi on a belt, what you can see is that what we *really* want is to fill as many slots as we can when we're sending data! That's really all that's going on.

One problem with the metaphor: packets don't actually go in loops, they come off at the other end, so unlike a sushi belt, there's a kind of off-ramp at each end. Also packets only enter and exit at the ends. There's really no perfect metaphor.

I'm going to meditate on better metaphors, so that's it for now :)