
Site Reliability Engineering

The notes

Foreword

Tools were only components in processes, working alongside chains of software, people, and data. Nothing here tells us how to solve problems universally, but that is the point. Stories like these are far more valuable than the code or designs they resulted in. Implementations are ephemeral, but the documented reasoning is priceless. Rarely do we have access to this kind of insight.

Preface

We apply the principles of computer science and engineering to the design and development of computing systems: generally, large distributed ones. Sometimes, our task is writing the software for those systems alongside our product development counterparts; sometimes, our task is building all the additional pieces those systems need, like backups or load balancing, ideally so they can be reused across systems; and sometimes, our task is figuring out how to apply existing solutions to new problems.

a system isn’t very useful if nobody can use it! Because reliability is so critical, SREs are focused on finding ways to improve the design and operation of systems to make them more scalable, more reliable, and more efficient.

it’s still worth putting lightweight reliability support in place early on, because it’s less costly to expand a structure later on than it is to introduce one that is not present.

Part I - Introduction

Chapter 1 - Introduction

The Sysadmin Approach to Service Management

Direct costs are neither subtle nor ambiguous. Running a service with a team that relies on manual intervention for both change management and event handling becomes expensive as the service and/or traffic to the service grows, because the size of the team necessarily scales with the load generated by the system.

At their core, the development teams want to launch new features and see them adopted by users. At their core, the ops teams want to make sure the service doesn’t break while they are holding the pager. Because most outages are caused by some kind of change—a new configuration, a new feature launch, or a new type of user traffic—the two teams’ goals are fundamentally in tension.

The ops team attempts to safeguard the running system against the risk of change by introducing launch and change gates.

Google’s Approach to Service Management: Site Reliability Engineering

Site Reliability Engineering teams focus on hiring software engineers to run our products and to create systems to accomplish the work that would otherwise be performed, often manually, by sysadmins.

SRE is what happens when you ask a software engineer to design an operations team.

Common to all SREs is the belief in and aptitude for developing software systems to solve complex problems.

The result of our approach to hiring for SRE is that we end up with a team of people who (a) will quickly become bored by performing tasks by hand, and (b) have the skill set necessary to write software to replace their previously manual work, even when the solution is complicated.

By design, it is crucial that SRE teams are focused on engineering. Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.

Google places a 50% cap on the aggregate "ops" work for all SREs—tickets, on-call, manual tasks, etc.

Because SREs are directly modifying code in their pursuit of making Google’s systems run themselves, SRE teams are characterized by both rapid innovation and a large acceptance of change. Such teams are relatively inexpensive—supporting the same service with an ops-oriented team would require a significantly larger number of people.

DevOps or SRE?

One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions.

Tenets of SRE

SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).

Ensuring a Durable Focus on Engineering

by monitoring the amount of operational work being done by SREs, and redirecting excess operational work to the product development teams: reassigning bugs and tickets to development managers, [re]integrating developers into on-call pager rotations, and so on.

Postmortems should be written for all significant incidents, regardless of whether or not they paged

This investigation should establish what happened in detail, find all root causes of the event, and assign actions to correct the problem or improve how it is addressed next time

Google operates under a blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.

Pursuing Maximum Change Velocity Without Violating a Service’s SLO

thus, the marginal difference between 99.999% and 100% gets lost in the noise of other unavailability, and the user receives no benefit from the enormous effort required to add that last 0.001% of availability.

  • What level of availability will the users be happy with, given how they use the product?
  • What alternatives are available to users who are dissatisfied with the product’s availability?
  • What happens to users’ usage of the product at different availability levels?

Once that target is established, the error budget is one minus the availability target.
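
As a quick worked example (the 99.9% target and the 90-day quarter below are assumed numbers, not figures from the book), the budget translates directly into an allowance of downtime:

availability_target = 0.999
error_budget = 1 - availability_target        # 0.1% of the quarter may be "down"
quarter_minutes = 90 * 24 * 60                # a 90-day quarter
print(f"{error_budget * quarter_minutes:.0f} minutes of allowed downtime")  # ~130 minutes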

An outage is no longer a "bad" thing—it is an expected part of the process of innovation, and an occurrence that both development and SRE teams manage rather than fear.

Monitoring

Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.

Alerts

Signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation.

Tickets

Signify that a human needs to take action, but not immediately. The system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result.

Logging

No one needs to look at this information, but it is recorded for diagnostic or forensic purposes. The expectation is that no one reads logs unless something else prompts them to do so.
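
A minimal sketch of this three-way split; the signal fields, handler names, and example messages are assumptions for illustration, not a real monitoring API:

def page_oncall(msg): print("ALERT:", msg)    # Alert: a human must act immediately
def file_ticket(msg): print("TICKET:", msg)   # Ticket: a human must act within days
def write_log(msg):   print("LOG:", msg)      # Logging: recorded for later diagnosis only

def route(needs_action: bool, urgent: bool, msg: str) -> None:
    if needs_action and urgent:
        page_oncall(msg)
    elif needs_action:
        file_ticket(msg)
    else:
        write_log(msg)

route(True, True, "error ratio above SLO for 10 minutes")
route(True, False, "disk will fill in ~4 days at current growth")
route(False, False, "request 1234 served in 87 ms")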

Emergency Response

Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR). The most relevant metric in evaluating the effectiveness of emergency response is how quickly the response team can bring the system back to health—that is, the MTTR.
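
The book does not spell the relationship out as a formula here, but a common steady-state approximation is availability ≈ MTTF / (MTTF + MTTR); the numbers below are assumptions chosen only to show the shape of the relationship:

mttf_hours = 720.0   # hypothetical: one failure every 30 days
mttr_hours = 1.0     # hypothetical: one hour to restore service
availability = mttf_hours / (mttf_hours + mttr_hours)   # standard approximation, not quoted from the book
print(f"{availability:.4%}")  # ≈ 99.8613%; halving MTTR roughly halves the downtime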

Humans add latency. Even if a given system experiences more actual failures, a system that can avoid emergencies that require human intervention will have higher availability than a system that requires hands-on intervention

When humans are necessary, we have found that thinking through and recording the best practices ahead of time in a "playbook" produces roughly a 3x improvement in MTTR as compared to the strategy of "winging it."

Change Management

SRE has found that roughly 70% of outages are due to changes in a live system

  • Implementing progressive rollouts
  • Quickly and accurately detecting problems
  • Rolling back changes safely when problems arise
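
A minimal sketch of how these three practices fit together in an automated release pipeline; the step sizes, error threshold, and function names are assumptions, not Google's actual tooling:

ROLLOUT_STEPS = [0.01, 0.05, 0.25, 1.00]        # fraction of traffic on the new version

def progressive_rollout(deploy, error_rate, rollback, max_error_rate=0.001):
    for fraction in ROLLOUT_STEPS:
        deploy(fraction)                         # implement progressive rollouts
        if error_rate() > max_error_rate:        # quickly and accurately detect problems
            rollback()                           # roll back safely when problems arise
            return False
    return True

# Toy usage with stubbed-out operations:
ok = progressive_rollout(lambda f: print(f"deploying to {f:.0%}"),
                         lambda: 0.0002,
                         lambda: print("rolling back"))
print("rollout succeeded" if ok else "rollout aborted")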

Demand Forecasting and Capacity Planning

Capacity planning should take both organic growth (which stems from natural product adoption and usage by customers) and inorganic growth (which results from events like feature launches, marketing campaigns, or other business-driven changes) into account.
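
A toy forecast to make the organic/inorganic distinction concrete; every number below is an assumption, not data from the book:

current_qps = 10_000
organic_growth = 0.05        # natural adoption: +5% over the planning period
launch_spike = 2_500         # inorganic: expected extra QPS from a planned feature launch
redundancy_factor = 1.3      # headroom for failures, maintenance, and load spikes

needed = (current_qps * (1 + organic_growth) + launch_spike) * redundancy_factor
print(f"provision for ~{needed:,.0f} QPS")   # ~16,900 QPS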

Provisioning

Adding new capacity often involves spinning up a new instance or location, making significant modification to existing systems (configuration files, load balancers, networking), and validating that the new capacity performs and delivers correct results. Thus, it is a riskier operation than load shifting, which is often done multiple times per hour, and must be treated with a corresponding degree of extra caution.

Efficiency and Performance

Resource use is a function of demand (load), capacity, and software efficiency.

My Summary

Historically, companies have used the sysadmin model, which splits work between Developers and Operations. At their core, the two groups have different goals: developers want to ship code to production as fast as possible, while operations wants a reliable system. Since changes are what most often cause outages, the two goals are in tension. The model is also expensive, because the team must grow as the service and its traffic grow.

Google's SRE teams are composed of software engineers hired to do operations. These teams have the skills to automate processes and quickly become bored by repetitive tasks, which makes a great combination for continuous improvement.

SRE skills are very similar to developer skills, with UNIX internals and networking knowledge being the main differentiators.

It's important that these teams stay focused on engineering; otherwise, headcount has to grow just to keep up with the operational workload. To protect this focus, Google caps operational work at 50% of an SRE's time. If the operational load exceeds this cap, the excess (bugs, tickets, on-call, and so on) is redirected to the product development team.

SRE can be considered Google's implementation of DevOps.

The SRE team is responsible for:

  • Availability

    There is no gain in having a 100% reliable system, because other components with lower reliability (WiFi, ISP, power grid, ...) sit between the service and the user, so the extra effort goes unnoticed.

    The availability target is a product decision. The error budget is the amount of permitted unavailability. Developers may spend this budget to speed up releases. The aim is not zero outages, but staying within the error budget.

    Postmortems are important for exposing faults and improving systems: a culture of fixing errors instead of hiding or minimizing them.

  • Latency
  • Efficiency and Performance
  • Change Management:

    • Implementing progressive rollouts
    • Quickly and accurately detecting problems
    • Rolling back changes safely when problems arise
  • Monitoring:

    Keep track of a system’s health and availability. Software does the monitoring and interpretation; humans are notified only when they need to take action.

    • Alerts: Human needs to take an action immediately in response to something that is either happening or about to happen.
    • Tickets: a human needs to take action, but not immediately. If a human takes action in a few days, no damage will result
    • Logging: Recorded for diagnostic or forensic purposes. People read the logs when something prompts them to do so.
  • Emergency Response:

    How quickly the response team can bring the system back to health. Preparing a "playbook" in advance produces roughly a 3x improvement in MTTR.

    • MTTR (Repair)
    • MTTF (Failure)
  • Capacity Planning:

    Ensure there is sufficient capacity and redundancy to serve projected future demand with the required availability.

    • Organic growth (natural adoption)
    • Inorganic growth (feature launches, marketing campaign, ...)

Chapter 2 - The Production Environment at Google, from the Viewpoint of an SRE

System Software That "Organizes" the Hardware

Given the large number of hardware components in a cluster, hardware failures occur quite frequently. In a single cluster in a typical year, thousands of machines fail and thousands of hard disks break; when multiplied by the number of clusters we operate globally, these numbers become somewhat breathtaking.

Networking

In order to minimize latency for globally distributed services, we want to direct users to the closest datacenter with available capacity.

Our Development Environment

When software is built, the build request is sent to build servers in a datacenter. Even large builds are executed quickly, as many build servers can compile in parallel. This infrastructure is also used for continuous testing. Each time a CL is submitted, tests run on all software that may depend on that CL, either directly or indirectly. If the framework determines that the change likely broke other parts in the system, it notifies the owner of the submitted change. Some projects use a push-on-green system, where a new version is automatically pushed to production after passing tests.

My Summary

In this chapter, the authors describe Google's production environment.

The cluster operating system, Borg, handles resource allocation. Borg is the predecessor of Kubernetes.

Part II - Principles

Chapter 3 - Embracing Risk

It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the numbers of features a team can afford to offer

Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness—with features, service, and performance—is optimized.

Managing Risks

We strive to make a service reliable enough, but no more reliable than it needs to be.

That is, when we set an availability target of 99.99%, we want to exceed it, but not by much: that would waste opportunities to add features to the system, clean up technical debt, or reduce its operational costs.

Measuring Service Risk

As standard practice at Google, we are often best served by identifying an objective metric to represent the property of a system we want to optimize. By setting a target, we can assess our current performance and track improvements or degradations over time.

For most services, the most straightforward way of representing risk tolerance is in terms of the acceptable level of unplanned downtime.

Time-based availability

Availability = uptime / (uptime + downtime)

Aggregate availability

Availability = successful requests / total requests
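
A small sketch showing both definitions side by side; the uptime and request counts are made-up numbers:

# Time-based availability = uptime / (uptime + downtime)
uptime_min, downtime_min = 129_460, 140
time_based = uptime_min / (uptime_min + downtime_min)

# Aggregate availability = successful requests / total requests
ok_requests, total_requests = 9_993_000, 10_000_000
aggregate = ok_requests / total_requests

print(f"time-based: {time_based:.4%}")   # ≈ 99.8920%
print(f"aggregate:  {aggregate:.4%}")    # 99.9300%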

Most often, we set quarterly availability targets for a service and track our performance against those targets on a weekly, or even daily, basis. This strategy lets us manage the service to a high-level availability objective by looking for, tracking down, and fixing meaningful deviations as they inevitably arise

Risk Tolerance of Services

SREs must work with the product owners to turn a set of business goals into explicit objectives to which we can engineer. In this case, the business goals we’re concerned about have a direct impact on the performance and reliability of the service offered.

Identifying the Risk Tolerance of Consumer Services

  • What level of availability is required?
  • Do different types of failures have different effects on the service?
  • How can we use the service cost to help locate a service on the risk continuum?
  • What other service metrics are important to take into account?

Target level of availability

  • What level of service will the users expect?
  • Does this service tie directly to revenue (either our revenue, or our customers’ revenue)?
  • Is this a paid service, or is it free?
  • If there are competitors in the marketplace, what level of service do those competitors provide?
  • Is this service targeted at consumers, or at enterprises?

Types of failures

Which is worse for the service: a constant low rate of failures, or an occasional full-site outage? Both types of failure may result in the same absolute number of errors, but may have vastly different impacts on the business.

Because most of this work takes place during normal business hours, we determined that occasional, regular, scheduled outages in the form of maintenance windows would be acceptable, and we counted these scheduled outages as planned downtime, not unplanned downtime.

Cost

It may be harder to set these targets when we do not have a simple translation function between reliability and revenue.

Identifying the Risk Tolerance of Infrastructure Services

A fundamental difference is that, by definition, infrastructure components have multiple clients, often with varying needs.

Motivation for Error Budgets

The product developers have more visibility into the time and effort involved in writing and releasing their code, while the SREs have more visibility into the service’s reliability (and the state of production in general).

Instead, our goal is to define an objective metric, agreed upon by both sides, that can be used to guide the negotiations in a reproducible way.

Forming Your Error Budget

The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter.

  • Product Management defines an SLO, which sets an expectation of how much uptime the service should have per quarter.
  • The actual uptime is measured by a neutral third party: our monitoring system.
  • The difference between these two numbers is the "budget" of how much "unreliability" is remaining for the quarter.
  • As long as the uptime measured is above the SLO—in other words, as long as there is error budget remaining—new releases can be pushed.
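
A minimal sketch of that decision rule; the SLO value and the measured availability are assumptions:

slo = 0.999                      # quarterly availability SLO set by product management
measured = 0.9994                # actual uptime reported by the monitoring system

error_budget = 1 - slo           # allowed unreliability for the quarter: 0.1%
budget_spent = 1 - measured      # unreliability consumed so far: 0.06%

if budget_spent < error_budget:
    print("Error budget remaining: new releases can be pushed.")
else:
    print("Error budget exhausted: halt launches and invest in reliability work.")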

Benefits

The main benefit of an error budget is that it provides a common incentive that allows both product development and SRE to focus on finding the right balance between innovation and reliability.

If SLO violations occur frequently enough to expend the error budget, releases are temporarily halted while additional resources are invested in system testing and development to make the system more resilient, improve its performance, and so on.

If the team is having trouble launching new features, they may elect to loosen the SLO (thus increasing the error budget) in order to increase innovation.

My Summary

The costs of increased reliability:

  • The redundant compute resources
  • The opportunity cost

Typical tensions between product development and SRE:

  • Software fault tolerance
  • Testing
  • Push frequency
  • Canary duration and size

Chapter 4 - Service Level Objectives

We use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), objectives (SLOs), and agreements (SLAs). These measurements describe basic properties of metrics that matter, what values we want those metrics to have, and how we’ll react if we can’t provide the expected service.

Indicators

An SLI is a service level indicator—a carefully defined quantitative measure of some aspect of the level of service that is provided.

Most services consider request latency—how long it takes to return a response to a request—as a key SLI. Other common SLIs include the error rate, often expressed as a fraction of all requests received, and system throughput, typically measured in requests per second.

Another kind of SLI important to SREs is availability, or the fraction of the time that a service is usable. It is often defined in terms of the fraction of well-formed requests that succeed, sometimes called yield.

Although 100% availability is impossible, near-100% availability is often readily achievable, and the industry commonly expresses high-availability values in terms of the number of "nines" in the availability percentage.

Objectives

An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI.

SLI ≤ target, or lower bound ≤ SLI ≤ upper bound.
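
As a toy instance of the first form (the metric, bound, and numbers are assumptions), an SLO saying "99th-percentile latency stays at or below 100 ms" is just a bound check on the measured SLI:

def slo_met(sli_ms: float, upper_bound_ms: float = 100.0) -> bool:
    # "SLI <= target" form; a range SLO would also check a lower bound.
    return sli_ms <= upper_bound_ms

print(slo_met(87.5))    # True: within the objective
print(slo_met(120.0))   # False: objective missed, consuming error budget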

Choosing and publishing SLOs to users sets expectations about how a service will perform. This strategy can reduce unfounded complaints to service owners about, for example, the service being slow.

Agreements

Finally, SLAs are service level agreements: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.

SRE does, however, get involved in helping to avoid triggering the consequences of missed SLOs

Indicators in Practice

What Do You and Your Users Care About?

Choosing too many indicators makes it hard to pay the right level of attention to the indicators that matter, while choosing too few may leave significant behaviors of your system unexamined

Collecting Indicators

However, some systems should be instrumented with client-side collection, because not measuring behavior at the client can miss a range of problems that affect users but don’t affect server-side metrics.

Aggregation

Most metrics are better thought of as distributions rather than averages.

Using percentiles for indicators allows you to consider the shape of the distribution and its differing attributes: a high-order percentile, such as the 99th or 99.9th, shows you a plausible worst-case value, while using the 50th percentile (also known as the median) emphasizes the typical case.

User studies have shown that people typically prefer a slightly slower system to one with high variance in response time, so some SRE teams focus only on high percentile values, on the grounds that if the 99.9th percentile behavior is good, then the typical experience is certainly going to be.
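
A short sketch of computing such percentiles from raw latency samples, using toy log-normal data and a simple nearest-rank method (both are assumptions for illustration):

import random

random.seed(1)
# Simulated latencies in ms: a log-normal gives the long tail typical of real services.
latencies = sorted(random.lognormvariate(3.0, 0.6) for _ in range(100_000))

def percentile(sorted_data, p):
    # Nearest-rank percentile on pre-sorted data.
    idx = min(len(sorted_data) - 1, int(p / 100 * len(sorted_data)))
    return sorted_data[idx]

for p in (50, 99, 99.9):
    print(f"p{p}: {percentile(latencies, p):.1f} ms")   # median vs. plausible worst case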

We recommend that you standardize on common definitions for SLIs so that you don’t have to reason about them from first principles each time.

Objectives in Practice

Start by thinking about (or finding out!) what your users care about, not what you can measure. Often, what your users care about is difficult or impossible to measure, so you’ll end up approximating users’ needs in some way.

Defining Objectives

SLOs should specify how they’re measured and the conditions under which they’re valid.

It’s both unrealistic and undesirable to insist that SLOs will be met 100% of the time: doing so can reduce the rate of innovation and deployment, require expensive, overly conservative solutions, or both. Instead, it is better to allow an error budget—a rate at which the SLOs can be missed—and track that on a daily or weekly basis. Upper management will probably want a monthly or quarterly assessment, too.

Have as few SLOs as possible

Choose just enough SLOs to provide good coverage of your system’s attributes.

Perfection can wait

You can always refine SLO definitions and targets over time as you learn about a system’s behavior

SLOs can—and should—be a major driver in prioritizing work for SREs and product developers, because they reflect what users care about.

Control Measures

  1. Monitor and measure the system’s SLIs.
  2. Compare the SLIs to the SLOs, and decide whether or not action is needed.
  3. If action is needed, figure out what needs to happen in order to meet the target.
  4. Take that action.
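
Reduced to a schematic, the loop looks roughly like this; the helper and the corrective action are placeholders, not a real monitoring API:

def measure_sli() -> float:
    return 0.9987                  # e.g., availability as reported by monitoring (stub value)

def control_step(slo: float = 0.999) -> None:
    sli = measure_sli()                                # 1. monitor and measure the SLI
    if sli >= slo:                                     # 2. compare the SLI to the SLO
        return                                         #    no action needed
    action = "roll back the last release"              # 3. decide what must happen (placeholder)
    print(f"SLI {sli:.4f} below SLO {slo}: {action}")  # 4. take that action

control_step()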

Keep a safety margin

Using a tighter internal SLO than the SLO advertised to users gives you room to respond to chronic problems before they become visible externally.

My Summary

  • User-facing serving systems: availability, latency, and throughput.
  • Storage systems: latency, availability, and durability.
  • Big data systems: throughput and end-to-end latency.

Chapter 5 - Eliminating Toil

In SRE, we want to spend time on long-term engineering project work instead of operational work.

Toil Defined

Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.

If you’re solving a novel problem or inventing a new solution, this work is not toil.

Toil is interrupt-driven and reactive, rather than strategy-driven and proactive.

Why Less Toil Is Better

At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features.

The work of reducing toil and scaling up services is the "Engineering" in Site Reliability Engineering.

What Qualifies as Engineering?

It produces a permanent improvement in your service, and is guided by a strategy. It is frequently creative and innovative, taking a design-driven approach to solving a problem—the more generalized, the better.

Is Toil Always Bad?

Toil doesn’t make everyone unhappy all the time, especially in small amounts.

Toil becomes toxic when experienced in large quantities.

My Summary

  • Overhead: Administrative tasks such as team meetings, setting and grading goals, and HR paperwork.
  • Grungy work: For example, cleaning up the entire alert configuration (manual, and most likely boring), yet it adds enduring value.
  • Toil: Work tied to running a production service that is manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as the service grows.

    • Manual: Running scripts manually.
    • Repetitive: Work you do over and over. If you’re solving a novel problem or inventing a new solution, this work is not toil.
    • Automatable: A machine could perform the task just as well, or a better design would remove the need for it.
    • Tactical: Toil is interrupt-driven and reactive, rather than strategy-driven and proactive.
    • No enduring value: If your service remains in the same state after you have finished the task, the task was probably toil.
    • O(n) with service growth: If work scales up linearly with service size, traffic volume, or user count, that task is probably toil.

Engineering:

  • Software engineering: Writing code, tests, and documentation; automation scripts, tools, or frameworks; scalability and reliability work; infrastructure changes.
  • Systems engineering: Configuring production systems as a one-time effort that produces lasting improvements.
  • Toil: Work tied to running a specific service that is manual and repetitive.
  • Overhead: Administrative work and meetings not tied directly to a specific service.

Toil leads to career stagnation and low morale. It also reduces productivity, sets a precedent for more toil, and promotes attrition.

Chapter 6 - Monitoring Distributed Systems

Why Monitor?

When pages occur too frequently, employees second-guess, skim, or even ignore incoming alerts, sometimes even ignoring a "real" page that’s masked by the noise.

Setting Reasonable Expectations for Monitoring

a Google SRE team with 10–12 members typically has one or sometimes two members whose primary assignment is to build and maintain monitoring systems for their service.

Symptoms Versus Causes

Your monitoring system should address two questions: what’s broken, and why?

The "what’s broken" indicates the symptom; the "why" indicates a (possibly intermediate) cause.

Black-Box Versus White-Box

We combine heavy use of white-box monitoring with modest but critical uses of black-box monitoring.

Note that in a multilayered system, one person’s symptom is another person’s cause.
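
To make the distinction concrete, a minimal sketch (the URL, health endpoint, and metric name are placeholders): a black-box check probes the service from outside as a user would, while a white-box check reads metrics the system exposes about its own internals.

import urllib.request

def blackbox_probe(url: str = "https://example.com/healthz") -> bool:
    # Black-box: hit the service externally, the way a user would.
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def whitebox_queue_depth(internal_metrics: dict) -> int:
    # White-box: inspect a metric exported by the system's own internals.
    return internal_metrics.get("rpc_queue_depth", 0)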

My Summary

  • Monitoring

    • White-box: monitoring based on metrics exposed by the system's internals, e.g., JVM stats or HTTP handler counters.
    • Black-box: testing externally visible behavior as a user would see it.

Why?

  • Analyzing long-term trends
  • Comparing over time or between experiment groups
  • Alerting
  • Debugging