Site Reliability Engineering

SRE is what you get when you treat operations as if it’s a software problem. Our mission is to protect, provide for, and progress the software and systems behind all of Google’s public services — Google Search, Ads, Gmail, Android, YouTube, and App Engine, to name just a few — with an ever-watchful eye on their availability, latency, performance, and capacity.

Our job is a combination not found elsewhere in the industry. Like traditional operations groups, we keep important, revenue-critical systems up and running despite hurricanes, bandwidth outages, and configuration errors. Unlike traditional operations groups, we view software as the primary tool through which our systems are managed, maintained, and minded; to that end, we have the source-level access and moral authority required to fix, extend, and scale code to keep it working, harden it against the vagaries of the Internet, and develop our own planet-scale platforms. We hire people from both systems and software backgrounds, and an informed mix is even better.

Just as what we do is unique, where we do it is unique too. At Google, we have the good fortune to have developed many large systems, ranging from planet-spanning databases to near real-time scalable data warehousing to fault-tolerant datastream joining. In SRE, we flip between the fine-grained detail of disk driver IO scheduling and the big picture of continental-level service capacity, across a range of systems and a user population measured in billions. We own those products in production. We drive reliability and performance at massive scale by mastering the full depth of the stack. We learn something new every day, usually something surprising, and (for algorithm fans) there isn’t a small N anywhere in our job.