In the last few years, there has been much discussion about whether some IT operations professionals are facing obsolescence and which skills an operations engineer working on modern software should have. Some operations engineers are helping build large, complex distributed systems, along with self-service tools and application platforms. Others' roles have stayed relatively static.
Whether your organization separates developers (those who build the software) from operations (those who run the software) or works toward cross-functional competency, there are practical ways for most operations professionals to avoid obsolescence. Even organizations that are trying to imbue developers with all the necessary operations skills still need operations engineers to pave the way.
This resource list includes articles to help you understand the current state of IT Ops and provides a knowledge base for several key skills that are becoming more desirable in operations professionals, including distributed systems engineering and next-generation operational tooling.
Reviewing the state of ops
Susan Fowler is a key voice in the operations engineering community who has worked at Uber and is now at Stripe. In addition to being an engineer, she's the author of Production-Ready Microservices and a contributor to Increment magazine. This post gives some insights into how ops looks in some of the best software organizations. According to Fowler, top-tier modern operations engineers can't get by on low-level systems expertise and strong Linux/Windows chops alone. They also need high-level distributed systems knowledge, monitoring expertise, and programming skills.
They must know how to deploy and scale systems, too. This article gives some great insight into how those skills cross over with software engineers' skills in various organizations. Be sure to check out her post "Who's On Call?" as well. Fowler believes that, as is the case at many top engineering organizations, developers, not operations, should be on the hook (on call, in other words) for responding to problems in their software.
"Ops is where beautiful theory meets stubborn reality." That's what Charity Majors, a founder and engineer at Honeycomb, says when giving a definition for operations. When DevOps caught on, developers had to learn more operations skills, and ops professionals had to learn more development skills. In this article, Majors shows why developers need to learn operations engineering to properly ship their software. One of my favorite things about this article is its distilled overview of all the changes in the discipline of operations in the past 10 to 15 years.
"We are all distributed systems engineers."
In what seems like a rebuttal to the previous article, Cindy Sridharan, a systems engineer at Apple, says that most developers aren't going to want—or be able—to become specialists in operations engineering in addition to software engineering. She agrees that there definitely needs to be more on developers' plates. However, she says, operations engineering involves a litany of skills (see the list below). It's a lot to ask of developers to handle all of that in a mid-size or large organization. It's also a lot for operations, and it's more efficient and logical in many cases—such as debugging, being on call, and some forms of monitoring—to have developers handle some of those tasks. Take a look at Sridharan's list of common tasks for modern operations engineers and see where you might have gaps in your skill set:
- Deploying applications
- Running edge proxies such as nginx
- Operating databases such as MySQL and MongoDB
- Operating caches such as memcached and Varnish
- Running message brokers such as Kafka
- Running systems such as ZooKeeper and Consul
- Running a scheduler such as Kubernetes
- Configuring servers with Chef or Puppet
- Managing DNS setups across various hosted providers
- Renewing SSL certs
- Configuring firewalls, subnets, or VLANs
This article from Forrester Research looks at operations engineering trends from a more strategic level. It comes to the conclusion that many system administrators are already well on their way to becoming more like developers, and it refers to these modern operations professionals as "infrastructure engineers."
An excellent roundup article from Michael Stahnke, a site reliability engineer (SRE) at Puppet Labs, this piece looks at why the 2017 State of DevOps survey saw a sharp drop in the number of people who identified as system administrators. ("DevOps engineer" was the most common title.) Stahnke summarizes the thoughts of several experts who discuss the evolving roles of IT operations engineers. Not all people who call themselves "sysadmins" are stuck in the past—only writing bash scripts and manually managing servers—but there is a tendency for many sysadmins to take on new titles as they learn programming and distributed systems management.
Even though Majors shared her views on modern operations in the article summarized above, this podcast from Intercom is also useful because it looks at operations engineering from a startup's perspective. It answers the question: Do you need an ops team when your company is small and just starting out?
Distributed systems engineering: High-level knowledge for ops
Yes, technically you can say that two nodes and a network (or two cores inside a CPU) are a distributed system. But that's often not helpful. In this Twitter thread, Charity Majors and other IT professionals talk about what people in the software industry usually mean when they talk about distributed systems.
'The Evolution of Distributed Systems Management'
IT operations pros have significantly changed how they manage distributed systems over the years. This examination of eight different management strategies goes from the basic level of manual deployment and configuration all the way to a prediction of how operations engineers will manage distributed systems in the future. It also functions as a maturity model in some ways, helping your organization determine how modern its distributed systems management practices are (although cutting-edge isn't always necessary). The best part about this article is that it shows you common examples of the tools engineers use at each stage. It's not an exhaustive list, but it gives you a more practical illustration than over-generalized, tool-agnostic advice.
In this article, which is aimed at inexperienced engineers, Jeff Hodges, the CEO of Darkish Green, shares key lessons he's learned from distributed systems engineering. He also recommends that new distributed systems engineers study two foundational references on the topic: the "Fallacies of Distributed Computing" and the CAP theorem. This article and the next few resources will start to explore some of the deeper, computer science-focused aspects of distributed systems.
Kyle Kingsbury, the author of the "Call Me Maybe" Jepsen database testing blog series, doesn't just torture databases; he also does training on distributed systems fundamentals. While you'd have to pay for his training, he does have a useful outline of the course on GitHub, which can serve as a research map for the topic. You should check out his Jepsen testing series too.
A bunch of people have asked Caitie McCaffrey, a distributed systems engineer at Microsoft Research, how to get started in her field. In response, she authored this post, which includes books, research papers, blogs, and talks she found useful when she was learning the ropes. McCaffrey also suggests you read postmortem outage analyses from large tech companies, including Amazon, Netflix, Google, and Microsoft. You can keep up with interesting outages by subscribing to the SRE Weekly newsletter. For even more resources, check out the "Awesome distributed systems" repository on GitHub.
This "awesome" list on GitHub has a huge number of resources about system scalability, stability, performance, and availability. The list includes subsections on distributed caching, tracing, tracking, logging, messaging, event streaming, security, storage, and many other distributed computing topics.
Here's a more advanced reading list for learning about distributed systems engineering. It includes many of Google's groundbreaking papers on systems such as Bigtable, MapReduce, Dremel, and Spanner.
This is an older—but still useful—introduction to distributed systems from the time when Google Code was still around. It covers the common communication strategies and several distributed design principles.
Distributed systems in the wild
Why didn't I share a resource on this topic sooner? It turns out that this is a big question that a lot of people in IT have differing opinions about. This post by Matthew O'Riordan, the co-founder and CTO of Ably, is just one recent opinion on the skill set a distributed systems engineer should have. It has some good ideas for distributed systems topics to research. For more examples of desirable distributed systems engineering skills, search for a few job postings.
ACM Queue always has great deep dives and explanations for complex IT topics. This piece by Mark Cavage, a vice president of engineering at Salesforce, explores two common use cases that often require a distributed solution. It walks through the two solutions in significant detail but doesn't give tooling-specific tutorials. The article's main purpose is to point out the key difficulties and hard realities of building a distributed system.
While operations engineers and software developers should definitely start building up their distributed systems skills, that doesn't mean the end goal is to always build your own distributed systems. The author, Jesse Anderson—the managing director of the Big Data Institute—shares his own horror story from when he first tried to build a distributed system, with the hope that you won't make the same mistakes.
Ever heard of evolutionary architecture? It's generally about building software architecture that can easily change or "evolve" over time—meaning the code and structure should be maintainable and not hard to change down the road. This article by Oliver Gierke, a project lead at Pivotal, offers some similar ideas around distributed systems—specifically, building just enough specification and communication and not encouraging the proliferation of either.
Kubernetes is quickly becoming a key tool for distributed systems engineers to learn, since it's the most popular container orchestration tool. "Basically Kubernetes is a distributed system that runs programs (well, containers) on computers. You tell it what to run, and it schedules it onto your machines." That's a quick definition from this article's author, Julia Evans, an engineer at Stripe. The post is a great introduction to Kubernetes from a learner's perspective. You should also check out many of the other posts on Evans' blog, since they are often helpful topic introductions for operations engineering subjects. Networking, for example, is another big topic that operations engineers need to understand well.
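To make the quoted definition concrete, here is a toy scheduler in Python. It is only a sketch of the idea "you tell it what to run, and it schedules it onto your machines." It is not how Kubernetes is implemented, and the node names, CPU figures, and "most free CPU wins" rule are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A machine with some spare CPU (hypothetical, simplified model)."""
    name: str
    cpu_free: float
    containers: list = field(default_factory=list)


def schedule(requests, nodes):
    """Place each (container, cpu) request on the node with the most free CPU.

    Returns a dict mapping container name to node name, or to None when
    no node has enough spare capacity.
    """
    placements = {}
    for container, cpu in requests:
        candidates = [n for n in nodes if n.cpu_free >= cpu]
        if not candidates:
            placements[container] = None  # nothing fits: left unscheduled
            continue
        target = max(candidates, key=lambda n: n.cpu_free)
        target.cpu_free -= cpu
        target.containers.append(container)
        placements[container] = target.name
    return placements


if __name__ == "__main__":
    cluster = [Node("node-a", 2.0), Node("node-b", 4.0)]
    wanted = [("db", 3.0), ("web", 1.5), ("cache", 1.0), ("batch", 2.0)]
    print(schedule(wanted, cluster))
```

A real scheduler weighs far more than CPU (memory, affinity rules, taints, priorities), but the shape is the same: you declare what should run, and the system picks where.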
GO-JEK, an Uber-like company that's fighting a lucrative commercial battle in Southeast Asia, has some great engineers such as Rajeev Bharshetty, who shares some useful knowledge about how the company is making its microservices architecture more resilient to failures. In addition to defining some terminology and context, this article explores four well-known design patterns and how they provide resiliency in this use case.
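As a flavor of what a resiliency pattern looks like in code, here is a minimal circuit breaker in Python. The circuit breaker is a widely used pattern for microservices, but this summary does not say which four patterns the article covers, so treat the choice as an assumption. The class name, thresholds, and injectable clock are likewise hypothetical and not taken from GO-JEK's implementation.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Fail fast after repeated errors, then retry after a cool-down."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each call to a downstream service in `breaker.call(...)` means that once the service has failed `max_failures` times in a row, callers get an immediate `CircuitOpenError` instead of piling more load on a dependency that is already struggling.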
Finally, we have a resource summarizing a research paper that takes a more theoretical look at the future of distributed systems engineering. In this blog post, Adrian Colyer, a venture capital partner and former CTO of SpringSource, explores Ph.D. candidate Christopher Meiklejohn's research paper. It's about a new type of programming specifically meant to solve the challenges of consensus in concurrent programming to build correct distributed systems. It will be interesting to see what sorts of tools and techniques continue to emerge to make distributed systems programming and management less error-prone.
Operations engineers as enablers
If you want to see what the future of operations engineering looks like, Netflix is one really good indicator. The engineers are divided into service teams, and each person supports the services and tools they write, meaning they are the ones who get called when their service goes down. In this podcast, Dianne Marsh, the director of engineering for engineering tools at Netflix, describes how many of the engineers at Netflix focus on building great tooling so that others can develop and deploy software effectively. This is one of the key areas that modern operations engineers should focus on.
Modern system administrators need to know a lot. This list of categorized, curated open-source tools can help sysadmins find what they need to do their job better and help developers work more effectively, too.
Get detailed resources for various types of open-source and commercial monitoring tools, including Nagios, Zabbix, and Ganglia.
This amazing resource from the engineers at Etsy is a full crash course in system administration and operations engineering. If you're new to the field and looking to learn the basics, or if you have some gaps in your knowledge, this should be your first stop.
New open-source tools are created for the DevOps community every day. The best way to keep up with the most interesting new tools is to read the tools section of this DevOps Weekly newsletter by Gareth Rushgrove, a product manager at Docker.
The keys to modern ops
As you read through these resources, you'll start to see a pattern that points to the key competencies that are going to be necessary for modern operations engineering:
- Knowing your sysadmin skills cold
- Low-level systems expertise that includes instrumenting and debugging those systems
- High-level distributed systems knowledge
- Knowledge of complex deployment processes
- Monitoring expertise (maybe with specialization in SRE)
- Networking and firewall skills
- Programming skills
Of course, what you need to know for any current or prospective job is usually pretty clear and specific, so learning all of this might be overkill in many contexts. But as a collection of broad skills that operations engineers are being asked to have, it's a pretty good list.
Got any other helpful resources to share about learning modern ops skills, or opinion pieces about what the future of IT Ops looks like? Share them in the comments below.