Threat modeling is one of the most important parts of the everyday practice of security, at companies large and small. It’s also one of the most commonly misunderstood. Whole books have been written about threat modeling, and there are many different methodologies for doing it, but I’ve seen few of them used in practice. These methodologies are usually slow and time-consuming, and they require a lot of expertise.

This complexity obscures a simple truth: Threat modeling is just the process of answering a few straightforward questions about any system you’re trying to build or extend.

  • What is the system, and who cares about it?
  • What does it need to do?
  • What bad things can happen to it through bad luck, or be done to it by bad people?
  • What must be true about the system so that it will still accomplish what it needs to accomplish, safely, even if those bad things happen to it?

For the sake of brevity, I’ll refer to these questions as Principals, Goals, Adversities, and Invariants. (And, in fact, that’s the name of the rubric I’m about to present.)
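
One way to make the "write down our answers" habit concrete is to treat the four questions as fields in a small record. The sketch below is only an illustration, not part of the rubric itself: the `ThreatModel` class, its `unanswered` helper, and the photo-sharing example are all hypothetical names invented here.

```python
from dataclasses import dataclass, field


@dataclass
class ThreatModel:
    """One record per system, with one field per question in the rubric."""

    system: str
    principals: list[str] = field(default_factory=list)   # who cares about the system
    goals: list[str] = field(default_factory=list)        # what it needs to do
    adversities: list[str] = field(default_factory=list)  # bad luck and bad people
    invariants: list[str] = field(default_factory=list)   # what must stay true

    def unanswered(self) -> list[str]:
        """Return the names of the questions that have no answers yet."""
        sections = {
            "principals": self.principals,
            "goals": self.goals,
            "adversities": self.adversities,
            "invariants": self.invariants,
        }
        return [name for name, answers in sections.items() if not answers]


# Hypothetical example system, invented for illustration only.
model = ThreatModel(
    system="photo-sharing app",
    principals=["users who upload photos"],
    goals=["store and display each user's photos"],
)
remaining = model.unanswered()
```

Here `remaining` lists the questions still to be discussed, which is a cheap way to see at a glance how complete a given threat model is.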

A good threat model also includes a system diagram, but we leave that out of the rubric name; it’s long enough already.

When we make a practice of asking these questions, try to answer them at least somewhat rigorously, and write down our answers somewhere other people can find them, threat modeling is truly revolutionary.

None of this requires specialized training or knowledge, nor does it require you to be a “security person.” All that’s required is curiosity and an interest in learning what kinds of bad luck and bad people have happened to other systems.

It also doesn’t take all that long or all that many people to answer these questions. I’ll usually give it an hour to be thorough, but even a 15-minute conversation one-on-one can produce something actionable.

At Akamai, where Brian Sniffen and Michael Stone initially developed this rubric (and many others, including myself, extended it), we used it every day to collect knowledge and communicate with each other, the rest of the engineering org, and the broader company, so that we could build better and safer products for the benefit of the company, our users, and ultimately the world.

If you want to truly understand a system, study how it fails.

Consider my hypothetical friend Alícia, who was just hired as an engineer at a small software-as-a-service (SaaS) company, Kumquat.

Shortly after she joined, the company’s app stopped sending out email verification and password reset emails. By noon, there were a number of frustrated user comments on the company’s and the CEO’s social media feeds. The CEO flagged the issue to the engineering team, and Alícia, along with Shruti, a longer-tenured engineer on the team, sat down to investigate.

First, they checked the company’s third-party email gateway service. Fortunately (or perhaps unfortunately), it appeared to be operating normally, and there were no known service interruptions listed on its status page. When they logged into its dashboard, they saw that the send queue was empty, its connection to the Kumquat backend was fine, and mail was being processed normally. But they were sending hundreds upon hundreds of messages with the subject line “ACT NOW: Reactivate your Kumquat account.”

Acting on a hunch, Alícia IMed her friend Jayla on the business team, and quickly discovered that the account reactivation emails were part of a dunning campaign Jayla had organized.

Previously, the company had only tracked active user growth and revenue in aggregate. Jayla, the company’s first dedicated business analyst, had noticed that while active user growth kept increasing, revenue wasn’t tracking linearly.

Analyzing the company’s user data, she discovered that a number of customers’ subscriptions had lapsed, most commonly because their credit cards couldn’t be charged, but they were still being allowed to use the service. She had conceived and gotten buy-in to run a dunning campaign to encourage these users to update their payment information.

The emails were overloading something, but Shruti and Alícia weren’t sure where they were coming from or what was being overloaded. Jayla said an engineer named Hana had done the work on the backend. They soon learned that Hana had written a job for the cron service to send the emails and scheduled it to run at 6 a.m.

Alícia realized that she didn’t understand how the cron service fit into the system as a whole. When she joined the company, Shruti had given a talk on the overall architecture, and Alícia remembered seeing a diagram, but she hadn’t retained much of it.

The engineers got together in a room with a whiteboard so they could talk through what was happening. Shruti drew the diagram:

Then they began to talk through it.

“The job running on the cron service is pulling a list from the database of the users whose cards haven’t been charged for at least two months. Then it’s creating an email job for each one,” Hana explained.

“And the cron service sends the email jobs to the job queue service,” Shruti said. “The job queue service hands them out to workers, which use the email gateway provider’s API to send the email.”

Alícia added a simplified sketch of the bit of the system they cared about next to Shruti’s diagram. “What have we been using the job queue service for until now?” she asked.

“Just sending emails,” Shruti replied. “Email verifications and password resets.”
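
To make the flow Hana and Shruti describe a little more concrete, here is a rough sketch of a single shared queue that receives a large batch of dunning jobs just before one password reset job arrives. It is not Kumquat’s code, and the story has not yet established which component is actually overloaded. The first-in, first-out ordering, the job counts, the worker rate, and every name in the block are assumptions made up for illustration.

```python
from collections import deque

# All numbers below are invented for illustration.
DUNNING_JOBS = 500          # email jobs enqueued by the cron job
WORKER_RATE_PER_MINUTE = 5  # email jobs the workers can send per minute


def minutes_until_sent(queue: deque, target: str, rate: int) -> int:
    """Drain a FIFO queue at `rate` jobs per minute.

    Returns the minute (counting from 1) in which `target` is sent.
    """
    minute = 0
    while queue:
        minute += 1
        for _ in range(min(rate, len(queue))):
            if queue.popleft() == target:
                return minute
    raise ValueError(f"{target!r} was never enqueued")


# The cron job fills the queue first; a user's password reset arrives after.
queue = deque(f"dunning-{i}" for i in range(DUNNING_JOBS))
queue.append("password-reset")

wait = minutes_until_sent(queue, "password-reset", WORKER_RATE_PER_MINUTE)
```

Under these made-up numbers, the password reset is not sent until the entire dunning batch ahead of it has drained, which is the kind of interaction between jobs that the whiteboard conversation is trying to surface.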

Alícia wrote “Principals” on the whiteboard and underlined it. “So this is the system.” She gestured at the simplified diagram. “And its users are verifying their email addresses, resetting their passwords, or updating their credit card information.”

Users who want to…

  – Verify their email address
  – Reset their password
  – Update credit card information