Welcome to “Serverless Superheroes”!
In this space, I chat with the toolmakers, innovators, and developers who are navigating the brave new world of “serverless” cloud applications.
In this edition, I chatted with Corey Quinn, Cloud Economist and founder of the Quinn Advisory Group. The following interview has been edited and condensed for clarity.
Forrest Brazeal: Corey, you’re known for your snarky cloud commentary that finds its way into your weekly newsletter, Last Week In AWS. How did you become what I can only describe as a “professional AWS gadfly”?
Corey Quinn: I like that description! Once upon a time, I was a grumpy old Linux sysadmin. I eventually wound up starting up a consultancy a couple years back, aimed at an expensive problem that I’d had to solve a number of times, which is fixing the horrifying AWS bill.
I started realizing pretty quickly that virtually everything Amazon does has an impact on economics, and keeping track of what’s changed week over week was an impossible task, so I built an automated system to give me the information I needed.
That got me 80% of the way to having a newsletter. The first edition of “Last Week In AWS” went out to five hundred and fifty people, and it’s been gaining pretty steadily ever since.
It’s sort of turned into its own miniature cult phenomenon. I added a podcast a few months back as well, and increasingly I’m seeing a groundswell of people who appreciate being kept informed, along with people who enjoy the sarcastic level of snark that lives inside of every issue.
That may not be two distinct sets of people. I don’t know if it’s cynicism or hope that keeps people going, but one way or another it seems to resonate.
You are a self-described “cloud economist”, but I thought the cloud was supposed to save me money? If I’m not spending money on servers and provisioning all my hardware anymore, how come my bills keep getting larger and larger?
The first issue is sprawl. People spin up an EC2 instance on demand, forget about it, and there’s no system out there to close the loop.
Look at how corporate IT has historically worked: if you requested a system to be provisioned, you might have had to fight with IT for several weeks. When the infrastructure finally got provisioned, you held onto it as tightly as you could so it wouldn’t get taken away from you.
That provisioning process now takes seconds in the cloud, but people still don’t have the impulse to shut things down — after all, who knows whether you’ll need it later? Nobody wants to go through that personal ordeal a second time.
There’s also a lack of awareness about what’s actually in the cloud account at a given time and who put it there. Is this resource critical to production, or is it just sitting there? It’s not at all clear without switching it off and seeing who screams their head off because their service just collapsed.
There’s a lack of visibility, there’s a lack of trust, and there’s also a lack of understanding. A cloud bill is incredibly complex, it’s huge once you get into the minutiae of it, and you very quickly get to a point where you need to be a subject matter expert to make sense of it.
I freely admit that my job shouldn’t exist, and I look forward to the day it doesn’t so I get to focus on more interesting problems.
If I had to boil all that down into a single word, it would be discipline. As a consultant, how do you drive financial discipline across the engineering organization?
Introducing people in engineering to people in finance is a big part of it. On top of that, there’s also an automation story around how you can do relatively simple things in an environment that wind up extending into large gains down the road. But there’s a big question that I think all of these tools tend to ignore. They fall into the engineering trap of: “Oh, you have a cost problem? I have some code that will fix it.”
When you talk to a company that spent four million dollars this month instead of three, it’s not the million dollars that bothers them so much as the fact that they didn’t predict it — they didn’t see it coming. There was no model of unit economics that dictated this, or can say what it means for their projections 18 months out.
Do you think there’s more need in the cloud for engineers to think about the financial impact of their work? Or has this always been a problem that the cloud is now exacerbating?
I think it’s a different model. When you’re doing things in the datacenter, the capital expense is generally a sunk cost up front. Then you have a bunch of systems sitting there. You can talk about efficiency, but when you get right down to it you have spent X dollars and received a bunch of hardware, and how you will use it is really the only open question.
The cloud, on the other hand, will scale faster than your budget will. And there’s no forcing function to say “Oh wow, we’re out of servers already. Time to buy more!” The cost grows incrementally. And as the complexity of systems increases, the potential for waste gets worse.
So what are some common mistakes that people make when they’re dealing with cloud spend?
People tend to believe early-stage narratives from their own company long after they stop being accurate. I talk to companies who are convinced that their developer environments are incredibly expensive, that they need to find a way to turn them off at night.
Then we do an analysis and we learn that the developer environments are something on the order of three percent of their bill. Production is what has scaled up. That’s where the low-hanging fruit is.
So I do listen to what my clients tell me, but I ignore almost all of it when I start looking at the account, because a lot of times some of the most valuable things you think you can apply turn out to be irrelevant. I think our shared stories are important, but only on a second pass.
What’s the most horrifying cloud spend situation you’ve seen?
Well, a lot of them seem to involve reserved instances. It’s such a big check to write up front that people get stuck in analysis paralysis for eight months trying to figure out what to buy. But one-year RIs generally break even in seven months, so what have you gained by sitting there and not doing anything?
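That back-of-the-envelope break-even math can be sketched in a few lines; the prices below are illustrative assumptions, not real AWS rates:

```python
# Break-even point for an all-upfront one-year reserved instance.
# Both prices here are assumed, illustrative numbers.
on_demand_hourly = 0.10      # on-demand price per hour (assumed)
ri_upfront = 511.00          # all-upfront cost of a 1-year RI (assumed)
hours_per_month = 730        # average hours in a month

# Each month on the RI "earns back" what on-demand would have cost.
break_even_months = ri_upfront / (on_demand_hourly * hours_per_month)
print(f"Break-even after roughly {break_even_months:.1f} months")
```

With numbers like these, any month of analysis paralysis past the seventh is pure on-demand spend you never get back.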
At least twice I wound up finding that some double digit percentage of a client’s bill was managed NAT Gateway data processing. All they had to do was move that workload into a public subnet, or run their own NAT instances, and that entire category went away with no impact to their architecture or their environment. We’re talking millions of dollars in savings.
I understand that the LastWeekInAWS.com stack is fully serverless at this point. Can you walk us through the architecture at a high level, and explain what happens between when people write articles that you want to share and when those articles wind up in my email inbox?
There are about four different serverless workflows involved there. I have a bunch of RSS feeds that gather content for me. Anything I see that I like, I throw into one of three categories: the community group, the AWS blog which is generally auto-populated, and the tools section. The curated links live in Pinboard, a bookmarking service.
From there I have a Lambda function that fires off once an hour thanks to a scheduled CloudWatch event. That dumps everything from Pinboard to DynamoDB, where it sits until it’s time to write the newsletter. Then I populate a Jinja template in Lambda, I fire off a series of validation checks, Lambda builds a rendered HTML blob, and … well, then like some kind of ancient caribou I copy and paste that HTML into a web form and hit submit.
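A minimal sketch of that hourly Pinboard-to-DynamoDB sync might look like the following. The table name, field mapping, and environment variable names are assumptions for illustration, not Corey’s actual code:

```python
import json
import os
import urllib.request

def to_items(bookmarks):
    # Map Pinboard's JSON fields onto an assumed DynamoDB schema.
    return [
        {"url": b["href"], "title": b["description"], "tags": b["tags"]}
        for b in bookmarks
    ]

def handler(event, context):
    # Fired once an hour by a scheduled CloudWatch event.
    import boto3  # provided by the AWS Lambda Python runtime

    token = os.environ["PINBOARD_TOKEN"]
    url = f"https://api.pinboard.in/v1/posts/all?format=json&auth_token={token}"
    with urllib.request.urlopen(url) as resp:
        bookmarks = json.load(resp)

    table = boto3.resource("dynamodb").Table(
        os.environ.get("LINKS_TABLE", "newsletter-links")
    )
    with table.batch_writer() as batch:
        for item in to_items(bookmarks):
            batch.put_item(Item=item)
    return {"synced": len(bookmarks)}
```

The rendered-HTML copy-and-paste step at the end is, alas, not automatable in this sketch either.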
There are also serverless workflows that do simple things like add headers to every GET request because CloudFront is “special.” I also built a confirmed opt-in system via Lambda that I’ll get around to open sourcing one of these days. There’s remarkably little competitive advantage here.
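The CloudFront header shim he mentions could be as small as a Lambda@Edge request handler like this sketch; which header actually gets added is an assumption here:

```python
# Hypothetical Lambda@Edge origin-request handler that adds a header to
# every GET request before CloudFront forwards it to the origin. The
# specific header (X-Forwarded-Host) is an illustrative choice.
def handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    if request["method"] == "GET":
        host = request["headers"]["host"][0]["value"]
        request["headers"]["x-forwarded-host"] = [
            {"key": "X-Forwarded-Host", "value": host}
        ]
    return request
```

A few lines of glue like this is exactly the kind of workflow with, as he says, remarkably little competitive advantage in it.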
What lessons have you learned building out this stack? What would you do differently now?
Now, I would throw the entire thing away and use a site called curated.co, which I recently discovered does almost exactly what I need for newsletter creation. Live and learn.
But there are two silver linings. One, this architecture has turned into several conference talks, so that tends to be handy. And it’s kept me hands-on in a way that I appreciate.
The ability to sit here and toy with AWS announcements is great, but remember, I used to be something like an engineer. If I do analysis but don’t ever implement anything myself, it doesn’t take too terribly long before I lose the hands-on experience, and as a result of that a lot of authenticity.
Engineers tend to know when you aren’t actually down in the trenches with them, don’t they?
Exactly. There are some services I almost never talk about — historically, data stores have never been a strong area of mine. So if you’re expecting me to come up with a nuanced critique of Neptune, I’m sorry, the closest I can get is: “graph DB? that sounds like a giraffe, what other zoo animals have their own databases?”
You are pretty outspoken in your distaste for “multi-cloud” approaches. Is there ever a reason to build a system on multiple cloud providers?
Yes, but it’s always for a very specific use case. For example, PagerDuty is very publicly multi-cloud, or at least the thing that wakes you up when something breaks is multi-cloud. There’s a great reason for that: if there were to be a complete single cloud outage, and PagerDuty was hosted on that cloud, nobody would get paged!
But that is a single, special-case workload. I haven’t checked with them, but I would bet you a lot that their login page, a lot of their front-end work, their marketing stuff is not multi-cloud. The cost/benefit tradeoff isn’t there.
So if you’re building a thing that is aimed at being able to withstand an entire system-wide AWS outage — incidentally, something we have never seen because they lack a shared control plane for all regions — then okay, great, I’ll have that conversation with you. What is the service you’re providing, and how critical is it on a day when some ridiculously high percentage of the internet is currently on fire?
I feel like I’ve known executives who would shake their heads at that question and say “Sorry, we need to be always online.” How do you convince them otherwise?
The same way I used to convince management when I was working as a systems administrator. I started my career in email systems and you’d have this conversation a lot. What is the SLA required for email — how much downtime is acceptable? The answer was always “no downtime is acceptable.”
Okay. So to be clear, 80% of the world is destroyed by nuclear fire, but your corporate email server still needs to be up? And a few people say yes. Great! To start, I’m gonna need twenty billion dollars. I’ll let you know when that runs out and we’ll go back for more.
Suddenly we’re having a business conversation. By the time that we go back and forth a few rounds, when they say “no downtime is acceptable” they mean they need to be able to check their email from 9:00 to 5:00 and occasionally outside of business hours.
Last year when there was a four-hour outage to the standard (US East) region of S3, we saw a lot of knee-jerk reactions where everyone was moving all their things over to multi-region buckets, doing replications, setting up multi-homed things. You’re going to double or triple your infrastructure costs by doing that.
Is it important to your business to be able to withstand the once-every-seven-years, black swan outage of an S3 region? There are cases where the answer is yes. There are many times, however, where if some of the icons on your front page don’t render, maybe that’s not necessarily worth the effort you’re going to put into architecting a multi-cloud solution.
And it’s not just about the infrastructure cost. There’s a complexity cost tied to this: it complicates your architecture diagrams, makes it more difficult for people working on the system to understand what it’s doing, and none of that is insurmountable — but it all adds up.
Remember, we’re not hobbyists in our day jobs. Our time isn’t free anymore; it’s incredibly expensive. If you’re “saving” $200K a year by running databases on top of EC2 instances instead of using RDS, but employing six people to run the database environment, have you really come out ahead?
And yet we see a lot of vendors out there who are trying to push abstractions and solutions that are higher than any specific cloud. Do you feel that those folks are putting their emphasis in the wrong place?
It depends. Take, for example, Terraform by HashiCorp. It’s one of my favorite tools; I think it does a lot of things very right.
Mitchell Hashimoto just had a big piece, on a Reddit comment of all places, talking about some of these things: the idea that even if you’re using a single cloud, something like Terraform has value because it can speak to virtually anything with an API. So if you’re using a DNS provider that isn’t Route 53, or using a CDN that isn’t CloudFront, you can still integrate it in the same stack.
That said, there are no silver bullets, and a tool that tries to do everything certainly can end up like the multifunction printer, which is a product and an analogy everyone can hate.
Finally, Corey, I may regret asking this, but … what’s the most controversial cloud opinion you hold?
Hmm. How about this: 90 to 95% of all of the data people are shoving busily into data warehouses is solving exactly one problem, and that is that people need to waste money so I can provide relief from their problems.
It’s dead data that’s not going to turn into this wonderful bonanza that everyone thinks it is. Big data, machine learning, AI — there are narrow use cases where these things make sense as tools, but by and large people are trying to pour them on everything just like they are with blockchain.
But hey — combine all those words together into a startup and you just raised eight million dollars in a seed round!