ooo-yay.com

Despite there being two books written on the subject of Site Reliability Engineering, there are often varying accepted definitions of how to implement the practice. In fact, depending on who you ask, DevOps and Platform Engineering are basically the same thing as SRE. That's to say, you can be upset that the definitions mean nothing or you can embrace it. In this post I'm going to try to distill what businesses actually value in what we call SRE, DevOps, and Platform Engineering and translate that into a set of objectives.

Compute Scheduling Platform

This is a fancy term for the way you intend application teams to launch services. Services in this case can be internal or external, and all the other requirements are likely to be custom to the business's needs. You have three paths here:

1. Free-for-all
2. An industry accepted solution wrapped to fit your business
3. A completely custom solution

If life exists on a series of spectrums, then options 1 and 3 put the result somewhere near the "complete hellscape" end of the spectrum. Option 2 is often the most economical, the most straightforward, and the one your users can adopt most quickly if they're familiar with the underlying technologies and principles at play.

Some examples

A Kubernetes cluster with a wrapper around kubectl.
Cloud provider accounts with Terraform modules users can consume.
A Docker Swarm cluster with a wrapper around docker compose and Docker context managers.
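To make the first example a little more concrete, here is a minimal sketch of what a wrapper around kubectl could look like. Everything specific in it is an assumption for illustration: the team names, the context names in `TEAM_CONTEXTS`, and the convention that each team gets a namespace named after itself. The only kubectl pieces it relies on are `apply -f`, `--context`, and `--namespace`.

```python
from __future__ import annotations

import shlex
import subprocess
from dataclasses import dataclass

# Hypothetical mapping of teams to kubeconfig contexts. A real platform
# would load this from configuration rather than hard-coding it.
TEAM_CONTEXTS = {
    "payments": "prod-us-east",
    "search": "prod-us-west",
}


@dataclass(frozen=True)
class Deploy:
    team: str      # owning team; doubles as the namespace in this sketch
    manifest: str  # path to a Kubernetes manifest file


def build_kubectl_command(deploy: Deploy) -> list[str]:
    """Translate a team-level request into a fully qualified kubectl call."""
    if deploy.team not in TEAM_CONTEXTS:
        raise ValueError(f"unknown team: {deploy.team}")
    return [
        "kubectl", "apply",
        "-f", deploy.manifest,
        "--context", TEAM_CONTEXTS[deploy.team],
        "--namespace", deploy.team,
    ]


def run(deploy: Deploy, dry_run: bool = True) -> str:
    """Return the command line; only execute it when dry_run is False."""
    cmd = build_kubectl_command(deploy)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return shlex.join(cmd)


if __name__ == "__main__":
    print(run(Deploy(team="payments", manifest="service.yaml")))
```

The point of the wrapper is that application teams only say who they are and what they want to ship; which cluster and namespace that lands in is the platform's decision.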

You could have branches from this of course; there's secrets management, networking, monitoring, and all sorts of other platforms that could be created and must integrate nicely with a compute scheduling platform. Your choice in compute scheduling will inform all the rest of those solutions, though.

The goal

The goal of a compute platform is to simplify the experience of delivering software reliably and to limit the lower-level knowledge of systems and networking required to do so. Engineers writing applications already have many hats to wear, from project management to quality assurance, before they even have code to run, so we try to simplify all the minutiae in between.

Service ownership model

Once you give people a place to run code you need to tell them how to run it, because anyone given the Wild West will assuredly do ~~foolish~~ creative things.

Definitions that are useful:

Core competencies. Every service has intended uses and things it always does. Know them, understand how they align to business objectives, and report on them whenever possible.
Core functionality. Core competencies can be broken down into one or more functions (or features). Functions or features that align to core competencies are core functionality.

Concepts that are useful:

Indicators. These are service level metrics that indicate how a service is delivering on individual pieces of its core competencies.
Objectives. Goals for an indicator or group of indicators. Most of these should map to core functionality.
Agreements. Agreements map to core competencies. They're agreements between the team and the business to provide a core competency. Whether it's part of a product or a capability of a platform - they're all the same.

Note: You can probably tell these are lifted and distilled (simplified?) definitions of SLIs, SLOs, and SLAs. People get really hung up on the "service level" part so I find it more fitting to map them in a way most businesses and engineers can understand and relate to.
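As a rough sketch of how these definitions hang together, the data model below maps indicators to objectives and objectives to agreements. The class names, the example service, and the numbers are all made up for illustration; they are not a standard schema.

```python
from dataclasses import dataclass, field


@dataclass
class Indicator:
    """A metric describing one piece of a core competency."""
    name: str
    good_events: int
    total_events: int

    def ratio(self) -> float:
        # Treat "no traffic" as healthy so empty windows do not fail objectives.
        if self.total_events == 0:
            return 1.0
        return self.good_events / self.total_events


@dataclass
class Objective:
    """A goal for an indicator, tied to a piece of core functionality."""
    functionality: str
    indicator: Indicator
    target: float  # e.g. 0.999 means 99.9% of events must be good

    def met(self) -> bool:
        return self.indicator.ratio() >= self.target


@dataclass
class Agreement:
    """A promise between the team and the business about a core competency."""
    competency: str
    objectives: list = field(default_factory=list)

    def upheld(self) -> bool:
        return all(objective.met() for objective in self.objectives)


# Hypothetical example: a checkout service with two pieces of core functionality.
checkout = Agreement(
    competency="Customers can pay for their cart",
    objectives=[
        Objective("submit payment", Indicator("payment success", 9_995, 10_000), 0.999),
        Objective("apply coupon", Indicator("coupon success", 970, 1_000), 0.99),
    ],
)

if __name__ == "__main__":
    for objective in checkout.objectives:
        print(objective.functionality, "met" if objective.met() else "missed")
    print("agreement upheld:", checkout.upheld())
```

In this made-up data the payment objective is met and the coupon objective is missed, so the agreement as a whole is not upheld.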

Writing a policy is one thing, but there also needs to be standardization around how indicators, objectives, and agreements are documented, measured, and reported on. Ultimately, your selected compute platform will inform how you solve this, but any sort of time series database and a visualization tool is a good start for measuring and reporting. Documentation of agreements, however, should live in a centralized place that every stakeholder can audit and be aware of.
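For a feel of what "measuring and reporting" can boil down to, here is a small sketch that takes time series points, measures an indicator over a window, and prints one report line. The point format `(timestamp, good, total)`, the function names, and the sample numbers are assumptions for illustration; a real setup would query whatever time series database you picked.

```python
from datetime import datetime, timedelta


def measure(points, start, end):
    """Ratio of good to total events for points with start <= timestamp < end."""
    good = total = 0
    for timestamp, good_count, total_count in points:
        if start <= timestamp < end:
            good += good_count
            total += total_count
    # None signals "no data" so the report does not claim a false success.
    return good / total if total else None


def report_line(name, ratio, target):
    """Format a single human-readable line for a recurring report."""
    if ratio is None:
        return f"{name}: no data"
    status = "OK" if ratio >= target else "MISSED"
    return f"{name}: {ratio:.2%} (target {target:.2%}) {status}"


# Made-up per-minute samples standing in for a time series query result.
t0 = datetime(2024, 1, 1, 12, 0)
points = [
    (t0, 100, 100),
    (t0 + timedelta(minutes=1), 95, 100),
    (t0 + timedelta(minutes=2), 100, 100),  # outside the window used below
]

if __name__ == "__main__":
    ratio = measure(points, t0, t0 + timedelta(minutes=2))
    print(report_line("checkout success", ratio, 0.99))
```

Whatever tooling you use, the useful property is that every team's indicators are measured and reported the same way, so the report reads the same regardless of who owns the service.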

Why service ownership?

There are other methods and configurations you could apply to your strategy. However, any method that does not leverage the existing human capital of a service team will ultimately run into scaling problems, because the people who know a service best are the ones already building it.

So when do I need an SRE team?

Let's first disambiguate SRE the job from SRE the practice. The practice of SRE more or less equates to the above material plus cultural things like blameless post-mortems. With that out of the way, let's talk about SRE the job.

What makes a good SRE skillset?

There are generally three major flavors of SRE, but you may find more in the wild. Let's start off by talking about what every SRE should have in common.

All SREs should be proficient in writing application code. Application code in this sense would be a web application, command line tool, system daemon, or what have you. Someone who solely understands scripting or configuration DSLs will be limited in their approaches to problem solving.

All SREs should understand the SDLC and build systems. After all, if you make an application to solve a problem then you'll probably need to deploy it.

All SREs should have a basic understanding of networking, systems, and hypervisor technologies. Our world is increasingly entrenched in all of these things.

All SREs should understand a basic set of application and system architectures. Understanding how an application is built and then deployed as part of a wider set of applications is critical to systems design and competency when solving systemic issues.

All SREs should have a moderate understanding of statistics and asymptotics. Making sense of the data you see and doing some napkin math on performance, especially with large workloads, is pretty important for the job.
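Here is the flavor of napkin math I mean, as a small sketch. The latency samples are made up, and the throughput of 1e8 simple operations per second is a loud assumption that will vary wildly by machine and workload. The first half shows why a mean alone can mislead; the second shows how quickly a quadratic approach falls over at scale.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def estimated_seconds(operations, ops_per_second=1e8):
    """Napkin estimate; ops_per_second is an assumed figure, not a benchmark."""
    return operations / ops_per_second


# Made-up request latencies in milliseconds, with two slow outliers.
latencies_ms = [12, 14, 15, 15, 16, 18, 20, 22, 250, 900]

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p90_ms = percentile(latencies_ms, 90)

# Asymptotics: the same million records under two growth rates.
n = 1_000_000
sort_like_seconds = estimated_seconds(n * math.log2(n))  # O(n log n)
all_pairs_seconds = estimated_seconds(n * n)             # O(n^2)

if __name__ == "__main__":
    print(f"mean={mean_ms} ms, p50={p50_ms} ms, p90={p90_ms} ms")
    print(f"n log n: ~{sort_like_seconds:.2f} s, n^2: ~{all_pairs_seconds / 3600:.1f} h")
```

With these samples the mean is 128.2 ms while the median is only 16 ms, so the "average" describes almost no real request. Under the assumed throughput, the n log n approach takes a fraction of a second while comparing every pair takes hours.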

Now let's talk about the niches.

The first is a systems engineer. Generally these folks have a deeper background in Linux, building system tools, daemons, and so on. In my experience, they often end up solving system-level performance bottlenecks and compliance issues. These folks also tend to be very knowledgeable in networking.

The second is a software engineer. They'll have a plethora of experience with both compiled and interpreted languages.

The third I've encountered is a database engineer. Generally these folks specialize in data storage services: RDBMS, NoSQL, key/value stores, and so on.

Now, for my final point, let's circle back to service ownership. One of the key things for successful teams is a clear understanding of the ownership of their problem domain. If you're making a team of people comprised of the above skillsets then you should first ask yourself, "What will they own?" If what they own is not something that they directly control then their mission is much too vague.

Some examples of non-ownership

The tiger team. Tiger teams were made famous by NASA in the '70s. They ran around fixing and course correcting all of the things around a space shuttle before launch. They were generally highly skilled, tenured engineers who had broad understandings of the systems at play and how they worked in combination. While this all sounds wonderful, it's often politically fraught in an enterprise. We are not building space shuttles. A mandate from the top that a team like this should be welcomed with open arms into every service team's repository to make changes at will creates untold amounts of organizational friction.

Frankly, if you really need this, that team is probably best hand-picked from your product development teams. Each of those engineers needs not only the technical acumen to deliver, but also the kind of tenured political acumen to maneuver and align other teams on their own.

Central organizations. This is really a creative way of saying "operations". In case someone has told you otherwise, it is absolutely okay that, in the year of our lord 2024, you have an operations team that runs incidents as incident commanders, watches and creates dashboards, and creates reports for internal dissemination based on input from subject matter experts. Those people do not need any of the skillsets mentioned earlier in order to do that, though.

Why you don't need SRE

If you subtract the anti-patterns from what you can build with those skillsets, you start to land on what I described earlier as "platform engineering" or, alternatively, a "tools" team. These teams have clearly defined internal products that they are responsible for, and those products can be clearly tied to the company's goals, capabilities, and competencies that the company then goes on to sell to customers.

SRE is more the practice of reliability engineering, which is something that a company and service teams do in a service ownership model. There are also some important cultural components that I'm glossing over like blameless post-mortems.

When companies outside of FAANG say "SRE" they often have a team with nebulous ownership that's just researching "what's wrong" and constantly trying to turn some wrenches to make it less wrong. It makes much more sense, however, to let the people already familiar with their problems solve them, and to focus on giving them a lens through which to view those problems, powered by tooling and definition.