Challenges of distributed messaging systems with Derek Collison from NATS (Go Time #130)


I think that’s a great question… And like I said, I took a stance for probably almost 20 years, of “Oh no, just use whatever you’re gonna use.” I really wouldn’t even engage with people, because it would be a two-hour conversation. I would still probably lose the argument at the end, so to speak.

What’s happening now is that things are kind of coming back to it. So just to level-set - one of the things that I care deeply about with messaging systems is a couple things… And it’s not about messaging, especially not about the message broker from the ’90s type stuff; we need to not think of it like that. We need to think of it as a ubiquitous technology, kind of like network elements.

I was around when there was a single cable, and if you ever kicked the terminator cap off the end, it took the whole network down… And I saw the birth of hubs, and switches, and smart switches, and top of rack, and all the crazy network elements that we have… And in the same way, a modern messaging system needs to evolve similarly. But to start out with - the first thing it does is it says “I am gonna do addressing and discovery not based on an IP and port, but on something else.” You can call it a channel, a topic, a subject - I really don’t care what you call it.
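To make that concrete, here’s a minimal sketch using the nats.go client - the subject name and the local server URL are just illustrative assumptions - showing that the two sides find each other by subject, never by each other’s IP and port:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to a local NATS server (nats://127.0.0.1:4222 by default).
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Addressing and discovery by subject: the subscriber never learns the
	// publisher's IP or port, only the subject name (made up here).
	nc.Subscribe("orders.created", func(m *nats.Msg) {
		fmt.Printf("received on %q: %s\n", m.Subject, string(m.Data))
	})

	// The publisher likewise only needs the same subject name.
	nc.Publish("orders.created", []byte("order #42"))
	nc.Flush()

	time.Sleep(100 * time.Millisecond) // give the async handler a moment
}
```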

Now, again, for a while, people were like “Why? This doesn’t make any sense.” But in today’s cloud-native world, I would argue what everyone is doing today makes no sense. We struggled so hard to change servers from pets to cattle, and yet we’re still saying “Oh, I wanna talk to Johnny’s service, so I need to know the notion of an IP and port.” Now, I know what the audience is probably thinking, and I’ll walk through an example of what this really looks like in practice.

The other thing too is that messaging systems, when they do that abstraction, a lot of people call it pub/sub. And we still call it pub/sub, but again, we’ve gone away from that, because it’s kind of got a bad rep. But what I mean by pub/sub is that if the technology can understand multiple messaging patterns - one to one, one to n, m to n, and then one to one of n, meaning I can automatically say “Hey, I wanna send a message and only one of you in this set will receive it.” That’s kind of what NATS does at the very basic levels…
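Here’s a rough sketch of that “one of n” pattern with the nats.go client - the subject “jobs” and queue group “workers” are made-up names. Run several copies of this responder and each published message is handled by exactly one of them; a plain subscription on the same subject would instead give you the one-to-n fan-out:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// "One of n": members of the same queue group share the work; each
	// message published to "jobs" is delivered to exactly one member.
	// A plain nc.Subscribe("jobs", ...) would receive every message,
	// which is the one-to-n pattern.
	if _, err := nc.QueueSubscribe("jobs", "workers", func(m *nats.Msg) {
		log.Printf("this instance picked up: %s", string(m.Data))
	}); err != nil {
		log.Fatal(err)
	}

	select {} // keep the subscriber running
}
```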

And folks always ask and they go “Oh, I just don’t need any of that stuff”, and I say “Okay, a couple things… One, decoupling is good.” Pets versus cattle was legit; let’s make sure our connected technologies follow the same path, and don’t say “No, no, this one’s special”, whether it’s an individual server, or a load balancer, or a special host… It just makes no sense in a modern world. So we push down on those things and we say “Okay, got it.”

[00:16:18.27] The last piece of advice I always give people from the ‘90s is “Never assume what a message…” - and a message could be a request, it could be data events, all kinds of stuff. But “never assume what the message is gonna be used for tomorrow.” So everyone kind of looks at me and says “What does that mean?” and I say “Okay, I’ll give you a really simple example.” What we’re seeing with NATS these days - with a very fortunate, growing user base and all kinds of crazy interest from customers and clients - is that modern architectures, distributed architectures, are built using connected patterns, and there are really only two. There are lots of nuances to it, but there are two. It’s either a stream or it’s a service.

The service is I ask a question, I get an answer, and a stream is I’m just sending a piece of data, and then I don’t care. And to level-set, distributed systems, even up to a couple of years ago, were dominated - 98%+ of everything was a service interaction. I’m not saying it has to be synchronous, but everything was “I’m asking a question and getting an answer.” HTTP is “I ask a question and I get an answer” type stuff. So I said “Fine, on day one you know who’s gonna be answering that question.” So you code it up so that I’m gonna send a question to Mat, Mat’s gonna respond back with an answer. And you’re doing it on your laptop, and you use HTTP, or gRPC, or whatever, and you’re like “That’s all I need.” I go “Great.”
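Here’s roughly what that service interaction looks like over NATS with the nats.go client - a sketch, with the subject name and payload invented for illustration. The responder answers on a subject; the requester asks on that subject and waits for the reply:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// The "service": answer questions on the (made-up) "greet" subject.
	// No load balancer, DNS entry, or knowledge of who will ask.
	nc.Subscribe("greet", func(m *nats.Msg) {
		m.Respond([]byte("hello, " + string(m.Data)))
	})

	// The "client": ask a question and wait up to a second for an answer.
	reply, err := nc.Request("greet", []byte("Mat"), time.Second)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(reply.Data))
}
```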

Let’s not even get to the point of anyone else being interested in the message. Just to start with, “Okay, let’s go to production.” Well, we need more than one Mat. Oh, crap. Now we need a load balancer. Well, now we need to put stuff in, and do DNS… And that’s fine; production can handle that. I don’t have to worry about that. So then they do health checks, and then they have to figure out rerouting, and all this stuff… And all these big companies have playbooks on exactly how they do it, and they all look very similar, but they’re all slightly different.

Now let’s say someone says “Hey, for compliance we need to be able to watch all requests and we need to record them.” Now all of a sudden you’re gonna have to put a logger in, and you don’t want the logger in line with the request/response, which is a massive anti-pattern I see being proliferated these days. It’s like “Oh, no… Put a messaging system in between it, and store it on disk, and try to retry, and stuff like that, in line with the microservice, in line with the service interaction.” I’m hyped up on Red Bull, but that’s the dumbest thing I’ve ever heard. It’s like Google saying “We’re gonna write all the log entries before we return your search results.” It’s just foolish; no one would ever do that. But there’s a need to say “Hey, I need to be able to write these things down, and someone else is gonna look at it for anomaly detection, or any type of policy enforcement, whatever it is.”

So look at that system at the end of the day, and now let’s say we actually wanna spread it out from East Coast to West Coast, to Europe, and you need Anycast, and DNS, and all kinds of crazy stuff and coordination of state on the backend. They become really complicated, and they’re trying to get around the fact that everything is just point-to-point; it is naturally a one-to-one conversation. Whereas with a messaging system, you write it, you run it against that server, let’s say, or whatever - but the NATS server is extremely lightweight. It can run on a Raspberry Pi, in your top of rack switch, you can run it in your home router… You can plug and play and put these things together in any arbitrarily complex topology that spans the globe… And that’s another discussion on distributed systems that aren’t really distributed.

So you set it up on your laptop and you run the NATS server. Its Docker image has been downloaded 150 million times… It just runs. For example, a subsystem of GE doing nuclear stuff runs our server for two years at a time, with no monitoring, no anything. And when they come in to change things inside of that nuclear reactor type stuff, they shut it down and figure out if they wanna upgrade it or not.

[00:19:59.03] So it’s lightweight, it’s always there, it’s ubiquitous, it just works. So now all of a sudden you write the same program, you’re sending a request, getting a response, doing it on your laptop… I would argue you’ll take the same amount of time, or possibly less, but it’s on the same level. You do have to run the server.

But now when you go to the next level of production - I need more Mats. Well, just run more NATS – I mean, Mats, not NATS. “Oh, well do I have to put in a deployment framework like Kubernetes and a service mesh?” No. I don’t care how you run them. Run them on bare metal, in a VM, in a container, in Kubernetes… It does not matter, run them anywhere. The system automatically reacts.

By the way, you haven’t configured anything on a NATS server yet, ever. Now, all of a sudden you’re like “Okay, well what happens if I wanna do compliance and I need to watch all the requests coming in?” Just start a listener for it. It’s got all of those patterns built in, so nothing changes between me, who’s making the request, and Mat, who’s giving the response; we don’t have to change. As a matter of fact, we don’t even have to be brought down and restarted; we’re just running along and people can come up and bring up anomaly detection, logging… All kinds of stuff.
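A sketch of what “just start a listener” could look like: a completely separate process subscribing to the same traffic, with nothing changed or restarted on the requesters and responders. The full wildcard subject here is just for illustration; in practice you’d likely scope it to the subjects you care about:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// A separate compliance/audit process: ">" is the NATS full wildcard,
	// so this sees every subject flowing through the server. The existing
	// requesters and responders keep running untouched.
	if _, err := nc.Subscribe(">", func(m *nats.Msg) {
		log.Printf("AUDIT subject=%s len=%d", m.Subject, len(m.Data))
	}); err != nil {
		log.Fatal(err)
	}

	select {} // keep listening
}
```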

So as you keep making these systems more production-ready and complex, you realize that messaging gives a huge win over the other ones. Now, the other ones are kind of known. People know how to put up load balancers, and know how to do logging, and know how to do all this stuff… But when you see something running on a service mesh and you haven’t even sent a single request yet and you’re spending $6,000 a month… And I can show you, we can do 60,000 requests a second, with all of the server’s latency tracking, all transparently, doing it for real, and it runs on a Raspberry Pi… That also translates to OpEx savings, which is a big deal.

NATS has always been known for how fast it is, but most people tell me “But we don’t need to go that fast. We don’t need to have a server doing 80 to 100 million messages a second.” And I go “I know”, but if you think about it, if you take that same thing for your workload and put it in the cloud, you can save 80% on your OpEx budget.

So do people need messaging systems to build stuff? Of course not, because everything for the most part is built on essentially HTTP, which again, is an interesting one to me - unpopular opinion… But I know why we did it that way, and we don’t have a reason to do it that way anymore, yet we’re stuck with it, right?

The notion of client-server or request/response in the old days was the requester and the responder were usually inside the same network firewall… Not firewall specifically, but essentially we’re inside the company. And everyone started to say “Hey, I want the requesters, whatever those things are, to be able to walk outside of the corporate network.” So all of a sudden people started doing this, they go “We can’t get through the firewall”, and the firewall people, or the old DB people, they go “No.” It doesn’t matter what you ask them, they say no.

So people kind of went “Wait a minute… Port 80 is always open. We can piggyback off that and circumvent this whole thing. So we can just do request-response on HTTP or HTTPS, and it works.” And it’s true, and I remember doing some of those tricks myself. But we’re not in that world anymore; it makes no sense whatsoever to build a modern distributed system off of technology that existed for something totally different and was a workaround for security and firewall systems.

Break: [00:23:23.14]