RLP's Blog - Bootstrapping Your Way To Being An SRE



I’ve had a lot of people ask me how to break into the SRE end of the tech pool (these things will happen when you send out a blast email about your work to 3400 eager new college graduates).

The career path for people wanting to be developers is fairly well understood these days, but the ops side of things is extremely opaque. Also, no-one teaches it formally in any serious way that I’ve ever heard of (feel free to email me if you have counter-examples, I’d love to hear about them!).

It’s basically an apprenticeship-driven field, which leaves total newbies out in the cold. And yet, I do in fact interview people who have never had an SRE job whom I would be pleased to hire and take through that apprenticeship process. This document attempts to describe the experiences and mindset that those people have, and how you can get that for yourself, if you actually want to do this sort of work.

Please note that people who enjoy this sort of work are rare, and you really don’t want to do this sort of work unless you enjoy it; if you don’t, you will eventually come to hate it, and then you’ll start being bad at it, and that’s no good for anyone. You can read my essay on tech worker natures for a long-form discussion of this (note that as of Oct 2018 it’s not actually done yet, but there’s a decent amount there).

Terminology

“Systems Administration” / “Sysadmin”, “DevOps”, and “SRE” all refer to the same basic thing, which is being responsible for the maintenance and operation of computers running a service of some kind for users. It may or may not (usually not) include responsibility for the service software itself. Where those terms vary is in how much time is spent directly manipulating systems (i.e. typing commands to change a system’s state) vs. writing automation and scripting to manage system state for you, vs. writing software that manages your automation and makes automation decisions for you. Sysadminning is more of the former and SRE is more of the latter, typically, but it varies wildly between teams, from “little or no time spent writing real, complex software” to “writing software is basically all we do”. Normal is somewhere in the middle, with an overall industry shift towards more software writing in the last few years.

Manipulation of systems is often referred to as “operations” or “operational work”, and a team that does more of it as “ops-heavy”. A team that does more code writing is often referred to as “dev-heavy”.

I will use the term “SRE” throughout this document, even when referring to operations work, mostly because it’s shorter to type.

It’s worth noting that even on teams with a heavy software focus, building software as an SRE isn’t typically a multi-month process the way it is on pure development teams. Again, there are certainly exceptions (the most notable being Google, which treats “SREs are developers” as something of a religion, or at least that’s how it appears from the outside).

My Basic Suggestion

Play with systems. Build your own. Hook them together. Draw pictures of the structures you build (one of the resumes that most impressed me was the one that came with a Gliffy (or something like it) diagram of a 3 tier systems architecture she’d been playing around with).

What systems to play with and how to do it is a bigger topic. Let me be lazy and point you to a few other people’s ideas, and then I’ll add some of my own.

Other People’s Writings

The best discussion I’ve found on the topic of how to play with systems to level up your SREing from scratch is Getting Started in Systems Administration and Automation in a “DevOps” world. This is a great list of steps to take to get used to both the tools and the mindset.

It’s important to note that for someone who should be doing SRE work, more-or-less all the things suggested there are fun. Like, I personally don’t get a big kick out of playing with monitoring, but everything else on that list I’d play with for hours for fun on a weekend. If that’s not you, consider whether this is what you want to be doing.

Interested in becoming a Site Reliability Engineer? is a great post on the mindset behind dev-heavy SRE work. The first video, in particular, summarizes the SRE mindset really well. For the practical side, all the digitalocean tutorial links are fantastic.

Here are some books people have recommended in this space; I’ve not read these myself. My only strong book recommendation is Limoncelli, which has its own section below.

Another book in this space I have read is Site Reliability Engineering: How Google Runs Production Systems. It was pretty good, but as a guide it’s only really helpful if you’re already a programmer, and frankly people who are fundamentally programmers are not what I want in an SRE anyway, although that’s certainly how Google does things.

More Of My Thoughts On What To Play With

Honestly, I don’t have a lot to add to the two blog posts I linked above on this topic, but I will say that I recommend building ye olde LAMP stack for reasons that’ll become clear in the next section.

By this I mean a webserver, an application server to run an app you have chosen or written, and some kind of database back end. Then you can expand it out; add another web server, load balancing of some kind,

  • in your house

  • web service
  • rebuild it on AWS
  • get a sysadmin to break it for you
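As a rough illustration of the three tiers described above, here is a minimal sketch that fits in one Python file using only the standard library. Everything in it is an assumption made for illustration: the `guestbook.db` file name, the `visits` table, and the function names are all hypothetical, SQLite stands in for a real database server, and `wsgiref` stands in for a real web server such as Apache or nginx. The point is only to show where the boundaries between the tiers sit, so that each one can later be swapped for the real thing and moved to its own host.

```python
import sqlite3
from contextlib import closing
from wsgiref.simple_server import make_server

DB_PATH = "guestbook.db"  # hypothetical file name; SQLite stands in for a real DB


def init_db(db_path=DB_PATH):
    """Database tier: create the one table the app needs, if it is missing."""
    with closing(sqlite3.connect(db_path)) as conn, conn:
        conn.execute("CREATE TABLE IF NOT EXISTS visits (path TEXT NOT NULL)")


def app(environ, start_response, db_path=DB_PATH):
    """Application tier: record the request path and report the running total."""
    path = environ.get("PATH_INFO", "/")
    with closing(sqlite3.connect(db_path)) as conn, conn:
        conn.execute("INSERT INTO visits (path) VALUES (?)", (path,))
        (count,) = conn.execute("SELECT COUNT(*) FROM visits").fetchone()
    body = f"visit number {count} (this one was for {path})\n".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def serve(port=8000):
    """Web tier: a toy server, to be replaced by a real one once this works."""
    init_db()
    with make_server("127.0.0.1", port, app) as httpd:
        httpd.serve_forever()
```

Calling `serve()` starts the toy stack on port 8000. The interesting exercises start after that: put a real web server in front, move the database to another machine, add a second application server, and see what breaks.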

How You Prove You Can Do The Work

  • Server Fault
  • OSS projects
  • run a thing on AWS free and link people to it

I’m not a member of any of these myself, but I’ve heard good things about them:

Hi!

I’m representing a group of hiring managers at COMPANY. We found your resume via the Grace Hopper Conference resume database, and we’re looking for candidates to interview for full-time positions.

The work we do doesn’t fit very well into the questions that the Grace Hopper system asked you when you uploaded your resume, so we’re hoping you can help us!

We’ve included an abbreviated job description below, as well as a description of what day-to-day work looks like for the jobs we’re hiring for. Most of our available positions are in Sunnyvale, California.

Please review the descriptions and reply to this email as follows:

▪ If you are interested in jobs at COMPANY, but not exactly this sort of job, please send email to ghc-blast-2018@group.COMPANY.com with the sort of job you’re interested in, your resume in PDF format, and the date you’re available for full-time work in the US. I make no promises about what will happen in this case, but I will pass these responses on.
▪ If you are interested in internships at COMPANY, please send email to ghc-blast-2018@group.COMPANY.com with the sort of internship you’re interested in, your resume in PDF format, and the dates you’d be available for internships. I make no promises about what will happen in this case, but I will pass these responses on.
▪ If you are definitely interested in the sort of work described by the job descriptions, but you don’t think you’re qualified for it yet, please send email to ghc-blast-2018@group.COMPANY.com saying so and I’ll send you some suggestions on how to get up to speed in this sort of work.
▪ If you are definitely interested in the sort of work described by the job descriptions, and you're available for full-time work in the US before June 2019, please send email to ghc-blast-2018@group.COMPANY.com saying so! Please include your resume in PDF format, the date you’re available, and a brief note about what you’ve done in the past (whether work experience or not) that matches well with the job descriptions, especially as it pertains to Linux operations work. We’ll make sure a real, actual human reviews your response!

Some important things you should be aware of:

▪ This email is going out to many people, so if you don’t follow the instructions, it might take us a long time to review your response!
▪ Most SRE positions do not involve working on any customer-visible COMPANY products. We run the machines that run the web services; we don’t work on the web services ourselves, typically.
▪ Most SRE positions do not, typically, involve writing new software. The coding we do is more about modifying and debugging extant software. There is definitely some programming involved (you do need to know how to write shell scripts and be able to script in at least one of Ruby, Python or Perl), but these are not developer positions.
⁃ In particular, despite the mention of Java below, my group is not hiring for Java developer positions. Some teams are supporting Java-based software, so familiarity with Java and JVM tuning is useful, but that’s not the same thing.
▪ Almost all of our work is on Linux systems, and Linux experience is a requirement.
▪ We love to see support experience (IT, helpdesk, that sort of thing); please mention if you have any.
▪ If you have any direct experience with Linux systems administration, DevOps, or SRE work, please mention that.
▪ There seems to have been some confusion when I used the word “customers” in a previous mail. On most teams, your peers at COMPANY are the users of the systems you support. Very few of these jobs involve interacting directly with COMPANY customers.
▪ Many of you mentioned your love of COMPANY and, in particular, how much you enjoy hacking iOS apps. You should be aware that part of onboarding at COMPANY includes signing a document that says that you will not publish or maintain any iOS apps while you work for us.
▪ An on-call rotation, or on-call schedule, means that, when you are on call, you will be paged in the middle of the night if the systems your team works on are having problems. It has nothing to do with time spent working on different teams.
▪ If we set up an interview with you, it will probably not be at the Grace Hopper Conference, so if you can’t go that’s fine.
▪ Most of you do not live in the SF Bay Area, so you should be aware that it is *absurdly* expensive to live here. COMPANY definitely pays well enough to do OK around here, but still, this is something you should actively research before proceeding.

Thank you for your time!

Robin Powell, SRE Manager @ COMPANY

Job Description:

The Internet Services Operations team is looking for Site Reliability Engineers to build and run the services that hundreds of millions of customers use every day. We are hiring high quality engineers with a diverse set of experiences and skill sets for positions on COMPANY’s public facing web properties & internal services. The best candidates will have both strong Linux / Systems expertise and demonstrated Software Development skills. Our customers count on us to provide extraordinary availability, scalability and security for services.

In this job you will be expected to do things like:

  • Collaborate effectively, in a friendly and professional fashion. On most teams, your peers at COMPANY are the users of the systems you support.
  • Manage system security, keeping security issues top of mind in all aspects of your work.
  • Automate repetitive systems management tasks (such as package installations, writing of configuration files, user management, that sort of thing) using any of a number of internal tools and configuration management systems. Most teams use Puppet in one form or another, although there are also teams that use Chef and Ansible. A drive to automate is at the core of what makes a great SRE!
  • Debug and modify tools and systems written in languages like Ruby, Python, Bash (shell scripting), Go or Perl.
    • Many (but not all) teams are also looking for Java experience, and experience performance-tuning JVM applications.
    • On some teams you might write entirely new pieces of software to make your team’s job better, although for many teams new software is not a core part of the job.
  • Debug problems with Linux systems, or with the connections between them. Many of our teams do not use much virtualization, so knowledge of Linux internals is essential. Some knowledge of networking is also important.
  • Participate in an on-call schedule, the severity of which varies wildly by team.
  • Bring your best to internal technical discussions, trying to both learn and teach whenever you can so that we can make the best decisions in every situation. Great collaboration is a huge part of our work.

Day-To-Day For SREs:

(If you’re interested in the long-form version of this section, I recommend the two volumes of “The Practice of System and Network Administration (3rd Edition)” by Thomas Limoncelli.)

“Systems Administration” / “Sysadmin”, “DevOps”, and “SRE” all refer to the same basic thing, which is being responsible for the maintenance and operation of computers running a service of some kind for users. It may or may not include responsibility for the service software itself. Where those terms vary is in how much time is spent directly manipulating systems (i.e. typing commands to change a system’s state) vs. writing automation and scripting to manage system state for you, vs. writing software that manages your automation and makes automation decisions for you. Sysadminning is more of the former and SRE is more of the latter, typically, but it varies wildly between teams, from “little or no time spent writing real, complex software” to “writing software is basically all we do”. Normal is somewhere in the middle, with an overall industry shift towards more software writing in the last few years.

The main things a sysadmin needs are the ability to learn quickly, a desperate need to solve problems, and an intense desire to automate (that is: to make computers, instead of humans, do repetitive or well-understood work). Note that “automation” in this sense means things like automatically installing packages, writing config files, setting up users, that sort of thing. Test automation is a different, but related, set of skills. Both are necessary for most SRE roles, but there is usually a lot more systems automation than test automation.
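To make “writing config files” concrete, here is a minimal sketch of the core idea behind systems automation: describe the desired state, check whether the system is already in it, and change something only if it is not. Configuration management tools such as Puppet do this in a far more general way; the `ensure_file` helper below is a hypothetical, hand-rolled illustration and not how any particular tool is implemented.

```python
import os
import tempfile


def ensure_file(path, content, mode=0o644):
    """Make `path` hold exactly `content`; return True if a change was made."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            if f.read() == content:
                return False  # already in the desired state, so do nothing
    except FileNotFoundError:
        pass  # the file does not exist yet, so it needs to be created

    # Write to a temporary file in the same directory and rename it over the
    # target, so that nothing ever reads a half-written config file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(content)
        os.chmod(tmp_path, mode)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)  # clean up the temporary file on any failure
        raise
    return True
```

Because the function only acts when the file differs from what is wanted, it is safe to run over and over on every host, which is the property that lets this sort of automation scale from one machine to thousands.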

The skills listed below aren’t necessarily skills we expect you to start with, but they are skills you need to be interested in developing.

On most teams, a significant chunk of work will be a longer term project (typically “longer term” is weeks to small numbers of months). This typically takes from a third to 80% of your time. Examples of projects:

  • Upgrading an entire fleet to a new version of some piece of software (Skills: Linux, Shell Scripting, Configuration Management (Puppet))
  • Building out systems in a new datacenter (Skills: Linux, Shell Scripting, Configuration Management (Puppet), Networking)
  • Creating a new internal tool, or adding a significant feature to an existing one (Skills: Automated Testing, Programming (typically Python, Ruby, Java or Perl))

The rest of your time will be smaller tasks performed as they come up. Note that for most SRE groups, your customers are other people within the company. Some examples:

  • Answering an internal support request (i.e. a customer needs help using a tool your team supports) (Skills: Team Knowledge, Friendliness, Teaching)
  • Adding a minor feature to a tool, typically by customer request (Skills: Programming (typically Python, Ruby, Java or Perl))
  • Debugging of various kinds:
    • systems-level problems, like a host that can’t boot or a network connection that can’t be made (Skills: Linux, Shell Scripting, Networking)
    • software problems, like a tool that crashes on valid input (Skills: Debugging/Problem Solving, Programming (typically Python, Ruby, Java or Perl))
    • performance problems, like a database that is unexpectedly slow under normal load (Skills: Debugging/Problem Solving, Linux, Knowledge of the software in question, Networking)
    • security problems, like a host that is running software it shouldn’t, or a tool that is allowing something unexpected (Skills: Debugging/Problem Solving, Linux, Security Analysis, Networking)
  • Teaching a customer about what tools you support and/or how to use them properly (Skills: Team Knowledge, Friendliness, Teaching)

Almost all SRE jobs have an on-call component. This means you’ll be part of a schedule, typically 1 week on and 3 or more weeks off, during which if something goes horribly wrong you will get alerted. The frequency with which you get alerted will vary wildly between teams, from “2-3 per week” to “5+ per night, plus a bunch during the day as well”. In the latter case, you will typically not be expected to produce other useful work when you are on call. What happens when you get alerted might be things like:

  • The alert is a known false-positive, so you acknowledge it and go back to sleep. In a good SRE position, this will be followed by real work to make the false positive stop happening.
  • The alert has a known procedure associated with it, which you execute. In a good SRE position, all such alerts will eventually be automated away.
  • The alert is new and surprising. You will use your knowledge of the system in question and its software, as well as general Linux and Networking knowledge typically, to attempt to debug and resolve the problem. If you can’t, you’ll call other people to help. Such issues are usually resolved in an hour or so, but most long-time sysadmins can tell you of that one call that lasted 12+ hours.
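The three alert outcomes described above amount to a small decision procedure, and writing it down as code shows why the first two cases are candidates for automation. The sketch below is purely illustrative: the `Alert` type, the alert names, and the shape of the `runbooks` mapping are all hypothetical and do not describe any real alerting system.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str  # hypothetical alert identifier, e.g. "disk_full"
    host: str


def triage(alert, known_false_positives, runbooks):
    """Decide what to do with one alert; return a short description."""
    if alert.name in known_false_positives:
        # Case 1: known noise. Acknowledge now, fix the alert itself later.
        return "acknowledge"
    if alert.name in runbooks:
        # Case 2: a known procedure exists, so a computer can run it.
        runbooks[alert.name](alert)
        return "ran runbook"
    # Case 3: new and surprising. A human has to debug, and maybe escalate.
    return "investigate"


# Example wiring, with made-up alert names and a stand-in remediation step.
cleaned = []
runbooks = {"disk_full": lambda alert: cleaned.append(alert.host)}
false_positives = {"flaky_ping_check"}

print(triage(Alert("flaky_ping_check", "web01"), false_positives, runbooks))
print(triage(Alert("disk_full", "db02"), false_positives, runbooks))
print(triage(Alert("kernel_panic", "app03"), false_positives, runbooks))
```

Once the first two branches are handled by software rather than by a person at 3 a.m., only the third branch needs to page anyone.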

Seeking A Foundation-Building SRE/Systems Administrator

Do you want to influence the tools and processes of hundreds of SREs who collectively manage tens of thousands of machines?

My team builds, maintains, and manages the back-end systems that are behind the back-end systems that are behind the biggest COMPANY services, such as iCloud, iTunes, the App Store, and Maps. That wasn’t a typo; we build the systems that build the machines that run those services. We have direct influence over more than a hundred thousand bare metal boxes scattered all over the world, almost all running Linux (and almost all are OEL specifically). Our customers are back-end SREs, both for the services mentioned and several others. Our work typically starts when the system is racked and on the network, and ends when it’s ready to run user-facing software.

(Hi! I’m the actual hiring manager for this position. You can reach me at robin_powell+jpaug2018@COMPANY.com , and you should feel free to do so. Please feel free to share this job description with qualified people you know. This is my personal work email address, so I ask that it not be shared on the general internet or with recruiters.)

Since a lot of people aren’t familiar with SRE work, you’ll find a description of what day-to-day work looks like for us further down.

Exactly what you’ll end up doing in this role is extremely flexible, but here are some things you should feel confident you could accomplish in your first year:

▪ You will have taken charge of one of our major systems, becoming the subject matter expert. These systems include, but are not limited to:
⁃ Our machine provisioning system (homegrown on top of Kickstart) and attendant low-level systems (i.e. DHCP and PXE troubleshooting)
⁃ Our software packaging and package distribution system (homegrown on top of mrepo and Jenkins)
⁃ Our login/directory system (aka LDAP)
⁃ Our monitoring and alerting system (entirely homegrown; alerting has a mild resemblance to Nagios) and making sure that all alerts are actionable
⁃ Our DNS system, which pulls data from our homegrown CMDB in extremely complex ways
⁃ Front-line support (we all do this, but having someone with that as their primary job is helpful)
▪ You will have become known to our customers, and to your peers, for your expertise in your chosen area.
▪ You will have kept a keen eye on security issues in every project you work on, and you will have contributed to improving security in the systems that were already in place.
▪ You will have contributed great code to our configuration management system (Puppet on top of a homegrown CMDB) or any of our many automation systems (which are mostly Perl and Ruby, but also some Python and a bit of Go, and of course tons of shell scripting (Bash) to hold it together).
▪ You will have become known to our customers, and to your peers, for your helpfulness, ability and willingness to teach and mentor others, and friendly demeanor.
▪ You will have become known to our customers, and to your peers, for your expertise at debugging and fixing operational issues, such as problems with system configuration, system provisioning, and user access.
▪ You will have successfully influenced people on other teams to adopt our tools, or improve their automation, or any number of other persuasions that will make their lives better and reduce toil.
▪ You will have participated in our on-call schedule (which, honestly, is pretty light as these things go; maybe 3 off-hours events per week) and will have contributed to making it better and reducing the toil associated with it.
▪ You will have actively participated in many, many discussions inside our team and with other teams designed to identify and pursue the best solutions to our automation and systems management problems. You will have brought to these discussions your strong opinions and respectful, collaborative attitude.
▪ You might have also established a particularly strong relationship with a single other team, such as the monitoring team, the security design team, or one of the property SRE teams. This will have allowed you both to influence them more effectively in their pursuit of automation and toil reduction, and to keep the rest of our team apprised of upcoming initiatives that we need to know about.
▪ You might have also taken on the challenge of writing an entirely new piece of automation, including customer-facing documentation, operational documentation, extensive automated testing, operational design, release and deployment. There’s not a lot of room in our environment for entirely new pieces of automation, though, so this doesn’t happen often.

Day-To-Day For SREs:

The best book I’ve found for the general philosophy and mindset of this field is the two volumes of “The Practice of System and Network Administration” by Thomas Limoncelli (the third edition), so if you’d like the long-form version of this section, read that. (In fact, I buy copies for all of my subordinates.)

“Systems Administration” / “Sysadmin”, “DevOps”, and “SRE” all refer to the same basic thing, which is being responsible for the maintenance and operation of computers running a service of some kind for users. It may or may not include responsibility for the service software itself. Where those terms vary is in how much time is spent directly manipulating systems (i.e. typing commands to change a system’s state) vs. writing automation and scripting to manage system state for you, vs. writing software that manages your automation and makes automation decisions for you. Sysadminning is more of the former and SRE is more of the latter, typically, but it varies wildly between teams, from “little or no time spent writing real, complex software” to “writing software is basically all we do”. Normal is somewhere in the middle, with an overall industry shift towards more software writing in the last few years.

The main things a sysadmin needs are the ability to learn quickly, a desperate need to solve problems, and an intense desire to automate (that is: to make computers, instead of humans, do repetitive or well-understood work). Note that “automation” in this sense means things like automatically installing packages, writing config files, setting up users, that sort of thing. Test automation is a different, but related, set of skills. Both are necessary for most SRE roles, but there is usually a lot more systems automation than test automation.

My team is a bit ops-heavy; we do very little writing of serious software (i.e. software that takes more than a couple of weeks to write). We do do a lot of automation, however.

The skills listed below aren’t necessarily skills we expect you to start with, but they are skills you need to be interested in developing.

On most teams, a significant chunk of work will be a longer term project (typically “longer term” is weeks to small numbers of months). This typically takes from a third to 80% of your time. On my team, it’s typically about 50/50, but it depends on individual preference. Examples of projects:

▪ Upgrading an entire fleet to a new version of some piece of software (Skills: Linux, Shell Scripting, Configuration Management (Puppet))
▪ Building out systems in a new datacenter (Skills: Linux, Shell Scripting, Configuration Management (Puppet), Networking)
▪ Creating a new internal tool, or adding a significant feature to an existing one (Skills: Automated Testing, Programming (typically Python, Ruby, Java or Perl))

The rest of your time will be smaller tasks performed as they come up. Note that for most SRE groups, your customers are other people within the company. Some examples:

▪ Answering an internal support request (i.e. a customer needs help using a tool your team supports) (Skills: Team Knowledge, Friendliness, Teaching)
▪ Adding a minor feature to a tool, typically by customer request (Skills: Programming (typically Python, Ruby, Java or Perl))
▪ Debugging of various kinds:
⁃ systems-level problems, like a host that can’t boot or a network connection that can’t be made (Skills: Linux, Shell Scripting, Networking)
⁃ software problems, like a tool that crashes on valid input (Skills: Debugging/Problem Solving, Programming (typically Python, Ruby, Java or Perl))
⁃ performance problems, like a database that is unexpectedly slow under normal load (Skills: Debugging/Problem Solving, Linux, Knowledge of the software in question, Networking)
⁃ security problems, like a host that is running software it shouldn’t, or a tool that is allowing something unexpected (Skills: Debugging/Problem Solving, Linux, Security Analysis, Networking)
▪ Teaching a customer about what tools you support and/or how to use them properly (Skills: Team Knowledge, Friendliness, Teaching)

Almost all SRE jobs have an on-call component. This means you’ll be part of a schedule, typically 1 week on and 3 or more weeks off, during which if something goes horribly wrong you will get alerted. The frequency with which you get alerted will vary wildly between teams, from “2-3 per week” to “5+ per night, plus a bunch during the day as well”. In the latter case, you will typically not be expected to produce other useful work when you are on call. On my team normal is about 1 after-hours page each day. What happens when you get alerted might be things like:

▪ The alert is a known false-positive, so you acknowledge it and go back to sleep. In a good SRE position, this will be followed by real work to make the false positive stop happening.
▪ The alert has a known procedure associated with it, which you execute. In a good SRE position, all such alerts will eventually be automated away.
▪ The alert is new and surprising. You will use your knowledge of the system in question and its software, as well as general Linux and Networking knowledge typically, to attempt to debug and resolve the problem. If you can’t, you’ll call other people to help. Such issues are usually resolved in an hour or so, but most long-time sysadmins can tell you of that one call that lasted 12+ hours.

So I’ve got a bunch of people who replied to a “looking for SREs” mail blast I sent out. Some of them are … not ready. I would like to be able to reply with “Here’s some things you can do if you want to re-apply some day”. I can easily write this myself (it’d basically be “Go read Limoncelli, and then construct multi-tiered architectures on AWS and play with them”), but it occurs to me that someone else must have already written this thing. Does anyone know of a “here’s how to learn SRE/Ops/DevOps for newbies” doc?

go.erik [21 days ago] As a manager of an “SRE” team who stumbled into founding this team out of need rather than strategy, could you tell me what Limoncelli book you are referring to? Sounds useful.

rlpowell [21 days ago] What I mean by Limoncelli is https://www.amazon.com/gp/product/0321919165 and https://www.amazon.com/gp/product/032194318X . I buy a copy of both for every member of my team (although the former is way more important). I also have a cheat sheet of which chapters are most important, because they are tomes; I can share that list if people want.

go.erik [21 days ago] Excellent, I will check those out. Thanks!

rlpowell [21 days ago] YW!

Steve [21 days ago] Agree on those. I began my learning with an earlier version of https://www.amazon.com/dp/0131480057 and it was excellent as well.

Glenn Stone [21 days ago] List please?

Sarah Zelechoski [21 days ago] https://medium.com/@tammybutow/graduating-from-bootcamp-and-interested-in-becoming-a-site-reliability-engineer-b69a38ce858b

Sarah Zelechoski [21 days ago] this is my favorite resource :point_up:

rlpowell [21 days ago] My capsule review of Limoncelli, with chapter suggestions. limoncelli_capsule_review.txt

rlpowell [21 days ago] Thanks Steve and Sarah, super helpful.

tomg [21 days ago] That medium post is amazing

So I just finished both The Practice of System and Network Administration (3rd edition, just came out) and The Practice of Cloud System Administration, and thought I’d share.

The short version is: these books are absolutely essential for anyone in a sysadmin or DevOps role, and very important for any customer-facing developer that wants their stuff to not break (in that latter case, especially the Cloud book).

What surprises me most about them is that no-one made me read them. People recommended them at various points, but no-one actually said “You must read this it is a requirement of your job”. Having read them, this seems an absurd mistake, and I am in fact requiring my subordinates to read at least some of them.

This is by far the most comprehensive reading I’ve ever heard of on why we do what we do, and why it matters that we do it in particular ways.

The problem is that these books, especially the first one, are Giant Tomes(tm). Really very large. I have some advantages when it comes to reading giant books, so I thought I’d share my thoughts on which parts are especially important and worth reading:

Essential Parts

Entirety of Part 1 (i.e. chap 1-4)

“Even Mirrored Disks Need Backups” — just before Section 15.2

9.1.3 Leveraging a CMDB

10.3 Configuration Management Database

Ch 22: We’re trying to do DR work as an org right now, so this is pretty essential.

Chapters 29-31. These are more important for philosophy, rather than specific steps. You might already know all of it, but maybe not, and it’s important to find out what you might not know.

Ch 47 is important.

Ch 49-52 are extremely important. 53 and 54 are important for managers.

Ch 55 and 56 are for when you are doing basically OK as a team and you want to go for being awesome.

Appendix A is a great index into tasks (i.e. “I’ve just found out I need to upgrade the OS on every host in 5 datacenters; now what?”)

Volume 2

All of part 1 except Ch 3 is directly applicable, and basically boils down to “Here’s what modern service-oriented architectures look like”. If you’re simply being handed a solution and you don’t care why the architecture is the way it is, you can skip them, but for next-level sysadminning you’ll need to know this stuff.

Ch 7 & 8 are absolutely essential.

Ch 12 is essential.

Appendix B is a great history lesson

Volume 1

6.1.3 Product Line Selection (just a page or so)

Chap 13 - Server Hardware Strategies

Chap 14 - Server Hardware Features (less important)

15.1 [Server] Models and Product Lines

Chapter 7 is not all directly applicable, but several parts of it are foundational and it’s not easy to point to the good parts, so, the whole thing.

Chapter 8: Intro, 8.1, you can skip 8.2 as it boils down to “automate your OS installation”, 8.3, 8.4

Managers should probably read Chapter 12

Ch 16: Intro, 16.1, 16.6; 16.7 is a fun read but not important for us

16 and 17 are needed if you’re building a new service

17.4.3 Dependency Alignment

17.5 Decoupling Hostname from Service Name

17.6 Support

Ch 18: All, although I ignored the details of the math myself

Ch 19: Only if launching a new service or rebuilding an old service pretty much completely. This definitely had influence on the Puppet 4 process.

Everything in Part 4: Services is worth reading, I’m just calling out the most important bits.

Ch 21: Again, this had a considerable influence on the Puppet 4 process

Ch 23 and 24 are great if you need to brush up on your networking.

Ch 25 and 26 are great if you want insight into what GDCS needs to put up with. :)

Ch 27 & 28: Worth reading if you do a lot of customer-facing work. It’s more useful for the attitude than for direct applicability because it’s about helpdesks, but still worth it.

Ch 33 is good for how to identify all the parts of something you’re going to upgrade and be sure you hit everything

Ch 36 is good for deciding how centralized a particular service should be

Ch 38: All except 38.2, 38.4.1 (SNMP)

Ch 43 is generally useful if you’re directly managing storage. 43.3.3 is important.

Ch 44 is important if you directly manage non-trivial backups, especially to tape.

Ch 45 is essential for my team; not so much if you’re not managing software repos.

Ch 46 is important and not very long, although it may be a bit basic for some of you.

Volume 2

Ch 3 if you’re trying to choose between managing your own systems or using something like PIE or other internal cloud-like solutions.

Ch 9 and 10 are important if you are part of a release management team (i.e. if you deploy in-house software, which my team is unusual in not doing)

Ch 11 is important if you need to know how to upgrade live services in place. The concept of a flag flip is especially interesting.

Ch 14 is important if you’re not happy with your oncall process.

Ch 15 is important if you’d like your systems to fail less.

Ch 16 is somewhat redundant with the monitoring bits in vol 1, but more formalized.

Ch 17 is only relevant if you run your own monitoring.

Ch 18 is important for everyone given that ISO now has an official lead time of 6 months on new systems.

Ch 19 is about how to measure your awesome.

Ch 20 is about how to increase your awesome; it’s redundant with 55 and 56 from vol 1

So I’ve got a bunch of people who replied to a “looking for SREs” mail blast I sent out. Some of them are … not ready. I would like to be able to reply with “Here’s some things you can do if you want to re-apply some day”. I can easily write this myself (it’d basically be “Go read Limoncelli, and then construct multi-tiered architectures on AWS and play with them”), but it occurs to me that someone else must have already written this thing. Does anyone know of a “here’s how to learn SRE/Ops/DevOps for newbies” doc?

cquinn [17:11] @rlpowell Are you looking for a bunch of internet fights with strangers?

Because a non-trivial portion of the folks to whom you send that will view it as an invitation to negotiate their way into the job. Badly.

rlpowell [17:14] On a directly related note, anybody have any experience with https://linuxacademy.com/ ?

rlpowell [17:16] @cquinn I feel very strongly about SRE as an apprenticeship discipline; if I help one person find the perfect career for them, that’s worth 100 clumsy attempts at lies that I just ignore. (I reviewed probably 2-300 clumsy attempts this weekend, so I’m allowed to say that. :smile: )

I particularly enjoyed the people who were literally saying things like “I am the best possible person for this job” that as far as I can tell had no Linux background. -__-

tomg [17:32] I “read” all the limoncelli I could and you’re right, I feel qualified to do everything! Pasted image at 2018-09-26, 10:32 AM

@rlpowell I would read that if you wrote it. (edited)

rlpowell [17:37] Don’t say that; now I’m going to try to do it well. -_-

tomg [17:40] The lack of a… defined “operations” learning path is tricky.

I have a junior dev here who wants to sit with me to ‘learn operations’ and I don’t have the foggiest idea of how to start that with him.

jason [17:42] Start with Infra as Code. When the next firestorm hits, tell them to buckle up.

aphill70 [17:43] My boss wants me to start mentoring ppl to do what I do… I am not sure what I do… So figuring out how to help people get there is daunting. I fill a very sre devops role for his org (edited)

jason [18:00] What I do, and I guess my gift is seeing a much broader swath of the system than most developers seem to. The “dev” will be intimately familiar with the code they’ve written. I’ll have a whiteboard box view of what their code does, but I’ve also got that diagram, and the other thirteen whiteboard diagrams in my head, and how that box connects to the next 47 pieces of the world, and why it’s important (or not). With that “state” diagram, I can (sometimes) make leaps of judgement from faint tremor in sub-system 17a to the real cause which might be packet per second limits on the AWS instances the message bus happens to be running on, or premature LRU cache aging in Varnish (or Redis), to Cassandra tombstoning not keeping up with the delete frequency, etc.

DevOps is the amalgamation of:

1) Old school operations know how

2) A deep focus on automating away toil (first and biggest step is getting 99% of your infra under code control)

3) An obsession with metrics, and a drive to understand which metrics are important, and when they get out of bounds (aka when to (ideally) take an action to self-repair or (sub-optimal) wake someone up)

4) Not required, but highly pervasive is a touch of humor somewhere between “dark” and “gallows”.

5) A desire for better communications in all things. That state diagram doesn’t get detailed out in a void. (edited)

^ My opinions. Others will have their own definitions of DevOps
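Jason's third point, knowing when a metric is out of bounds and whether to self-repair or wake someone up, can be sketched in a few lines of Python. This is only an illustration: the `Bounds` thresholds, the `decide` function, and the retry limit are all hypothetical names and numbers made up for this example, not anything from the chat or from a real monitoring tool.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    SELF_REPAIR = "self-repair"
    PAGE = "page"


@dataclass(frozen=True)
class Bounds:
    # Hypothetical thresholds for a single metric (e.g. disk usage as a fraction).
    warn: float  # at or above this, try an automated fix
    page: float  # at or above this, wake a human


def decide(value: float, bounds: Bounds,
           repair_attempts: int = 0, max_repairs: int = 3) -> Action:
    """Pick the least disruptive action for the current metric value."""
    if value >= bounds.page:
        return Action.PAGE
    if value >= bounds.warn:
        # Prefer self-repair, but escalate once automation has failed repeatedly.
        if repair_attempts >= max_repairs:
            return Action.PAGE
        return Action.SELF_REPAIR
    return Action.NONE
```

The design choice worth noticing is the escalation path: automation gets the first few tries, and a human only gets paged when the metric is badly out of bounds or the automated fix keeps failing.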

tomg [18:13] I don’t use the word devops anymore. I try and use a descriptive word for groups of task that someone would perform. Eg I just watched an internal session on “intro to devops” which should have been titled “intro to our build and release process”

jason [18:14] That’s fair. And I tend to refer to myself as an Infrastructure Architect when people ask.

(or Infrastructure Nerd)

tomg [18:14] “1) Old school operations know how”: I don’t know how I would teach this or what I’d suggest a young person do to get it.

jason [18:16] @cquinn has been talking about that a fair bit in Screaming in the Clouds recently. The old path of “IT Helpdesk” -> “Junior Windows Admin” -> “Linux Admin or Senior Windows Admin” is closed.

cquinn [18:17] Recorded an episode today with an intern.

jason [18:17] Some people just fall into it (I did… but I’m also a second-generation Sysadmin… which is borderline unheard of)

tomg [18:18] I think a lot of people just fall in.

jason [18:18] They fall into it via raspberry Pi, or a second boot partition, or now docker.

tomg [18:23] This is a bit #careers, but I have nowhere left to ‘fall’. So climbing from middling sysadmin to something like @rlpowell’s SRE or “Infrastructure Nerd” is super desirable. It at least feels like there are several ways to get there.

But to start from scratch? yeesh.

aphill70 [18:48] I really like that description @jason I fell into this from the dev side… Signed on to a team and had to learn scaling and ops the hard way

Rafael Fonseca [18:52] @jason you just described exactly my approach

Like, in order, even

Steve [20:30] “1) Old school operations know how”: Go bang your head against all sorts of weird problems for at least a decade. I’m only half joking. Dev is bounded by rules, but requires creativity. Ops appears to be bounded by rules, but the role has a large dose of solving the problems that are (unintentionally) caused by people, who are infinitely creative in their (unintentional) chaos, so it requires persistence and calm. There is a reason these concepts were/are separated: it’s very hard to find someone who can see both perspectives and also has interest to do both things well. (edited)

Rafael Fonseca [20:32] I wish there was a better way to expose newcomers to the hard problems that shaped most seniors in these roles

In a “here’s an intensive 16-week course on trash fires” way (edited)

tomg [20:32] @Rafael Fonseca make the newbies delete a database at 5pm Friday and check back in on them on the Monday. :+1:

Rafael Fonseca [20:33] Heh that’s one option :joy:

Steve [20:55] A lot of that formative work has been outsourced to AWS, GCP, and other cloud providers. The Ops folks behind the scenes there must still be dealing with inexplicable hardware issues, blinkenlights, kernels, etc. Or work for an ISP - similar issues. The problem is that once you’ve gone through that, you have to heavily shift to the other areas, and that can be a rough job transition, because those skills aren’t valued in the same way anymore. I wonder if there will be a shift back to respecting those skills in another decade or two, once everyone who came up through that pain is out of the job market. Anyway, more helpfully, https://lopsa.org may have some material that could be of use for this component. lopsa.org LOPSA - Home The League of Professional System Administrators (LOPSA) is a nonprofit corporation with members throughout the world. Our mission is to advance the practice of system administration; to support, recognize, educate, and encourage its practitioners; and to serve the public through education and outreach on system administration issues.

law [21:02] +1 for LOPSA. We do have a mentoring program, which helps a bit

masonoise [21:03] The other problem is that people think because they’re on AWS/GCP/etc they no longer need to care about “old school operations know-how”. Which is sadly not the case if you’re doing anything more complicated than serving web pages from an S3 bucket.

Steve [21:07] The people who actually get that aren’t writing job descriptions.

proffalken [23:42] @rlpowell I’ve written one, hope it helps. It focuses on learning and understanding how the Operating System works, then how to manage that OS and how to monitor things properly, and finally how to automate the configuration and deployment of said systems. What it doesn’t focus on is the cultural side of things, but that’s mainly because every organisation I’ve worked for/consulted into has had a slightly different definition of what “being Agile” and “doing DevOps” means, so this was intended to get the basics of Systems Administration and Automation down, then they can learn how your particular org does the rest…

https://doics.co/2017/01/26/getting-started-in-systems-administration-and-automation-in-a-devops-world/ DevOps Is Common Sense…Matt Macdonald-Wallace Getting Started in Systems Administration and Automation in a “DevOps” world I saw a post recently on a forum from a developer asking “How do I learn DevOps?”. I started to reply on the forum but ran past the character limit so I thought I’d publish a post on my thoughts in the hope that it helps developers move into either a greater understanding of the systems administration side of things or get started with managing their systems.

rlpowell [23:43] Yay, thank you.

proffalken [23:43] (feedback is appreciated, if I’ve missed anything, please do let me know, I wrote it a while back… :smile: )

tomg [00:04] I read that a while back and found that its target audience was a developer. Is that accurate?

I read it again and: mostly

tomg [00:19] I read it again and no it’s actually pretty broad.

proffalken [00:20] hopes that @tomg keeps reading it over and over - I’m interested to see who the final audience really is… :p (edited)

tomg [00:22] I am in prime “Type before I think” mode.

proffalken [00:29] Heh, if it helps, it was originally aimed at the developers I encounter who say “Meh, being a SysAdmin isn’t that hard, all you need to do is install dockerd and you’re done! It’s nowhere near as difficult as programming in ”, and inspired by a LinkedIn forum question where someone asked “How do I learn DevOps?” which was full of answers that were “learn docker/rancher/k8s, that’s all you need to do, then use AWS ECS/GKS/etc.”

I’m sure we’ve all found that once you start to help developers understand that in order for them to serve their Angular frontend at sub-millisecond speeds it takes a massive amount of infrastructure and fine-tuning, they start to want to work with you and not just blame you for their problems, at which point the product improves, and the business does better, so you work closer together, and the product improves, and the business makes more money, and it repeats ad infinitum. At that point, you’re working collaboratively, and then (and only then!) you’re doing DevOps! :smile:

tomg [03:11] On the subject of blogs, is there anything out there that will define what SRE means at the moment? I have a rough idea but I want more info and I don’t want to read the whole Google SRE book just yet

jason [04:38] That’s the canonical text at the moment.

kms [05:26] There is a followup: http://shop.oreilly.com/product/0636920063964.do shop.oreilly.com Seeking SRE SRE is a large and rich topic to discuss. Google led the way with Site Reliability Engineering, the wildly successful O’Reilly book that described Google’s creation of the discipline and the implemen…

rlpowell [06:17] Other than being a bit self-congratulatory, it’s a pretty good book. FWIW.

But, I mean, I still buy all of my subordinates Limoncelli and not that book, which I suppose indicates which I think more captures the spirit of what we do. What we do (my team specifically) is pretty ops-heavy, though.

Steve [09:46] One of the things that is hard to teach is the myriad of unexpected ways in which things can go wrong. It’s relatively easy to teach the right way to do a thing, but when an exception arises, how do you get back to that state?

One of the best classes I have ever taken was way back in the Sun Micro days, maybe circa Solaris 8 or so. They offered a class on fault detection. I think it was focused on E10K/E15K, but there were occasional gems that arose, like what happens if someone does this? rm -f /dev/null ; reboot (edited)

I haven’t seen another class of that sort, focused on intentionally breaking systems in odd ways, since.
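In the spirit of that class, one cheap sanity check is to confirm that a device node like /dev/null is still a character device, rather than missing or replaced by something else. The sketch below is a minimal illustration, assuming a Unix-like system; the helper name `is_char_device` is made up for this example and is not from the class Steve describes.

```python
import os
import stat


def is_char_device(path: str) -> bool:
    """Return True only if path exists and is a character device."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        # The node has been deleted outright.
        return False
    # A regular file or directory at this path fails the check too.
    return stat.S_ISCHR(mode)


if __name__ == "__main__":
    print(is_char_device("/dev/null"))
```

A check like this is the kind of thing you only think to write after someone has broken a system in exactly that way.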

masonoise [09:51] Two things come to mind for me when thinking about our product engineers. The first is that it feels like engineers aren’t taught to think about failure modes, like the above – what could happen to cause their code to fail, and what can/should they do about the possibility? The second is they tend not to think outside their code’s surface area. Systems thinking is not common with younger developers these days: the code runs in an environment with other things, and everything from communications to cpu and memory is important. So those are areas where SRE partnerships are particularly important, at least for us.

rlpowell [10:39] replied to a thread: What I mean by Limoncelli is https://www.amazon.com/gp/product/0321919165 and https://www.amazon.com/gp/product/032194318X . I buy a copy of both for every member of my team (although the former is way more important). I also have a cheat sheet of which chapters are most important, because they are tomes; I can share that list if people want. (edited)

William O’Neill [11:05] Agreed that Limoncelli’s books are great references for Ops folks. Even seasoned folks that I’ve suggested those books to found something worthwhile in them.

rlpowell [11:32] I read them after 15+ years as a sysadmin type person and was blown away, so +1, for sure.