Overview of Site Reliability Engineering


by Lukonde Mwila

With the growth of microservices, continuous deployment, and cloud computing, Site Reliability Engineering (SRE) has become key to the uninterrupted delivery of employee- and customer-facing services.

SRE is a concept developed by the Google engineering team that has become a sought-after model in the software operations lifecycle. More specifically, it can be credited to Ben Treynor Sloss, who said, "SRE is what happens when you ask a software engineer to design an operations team."

Traditionally, system administrators manually performed the tasks and processes involved in ensuring system reliability. The goal of SRE is to transform and optimize these processes through software automation and continuous improvement. As a result, companies benefit from automated, repeatable practices that reduce downtime and contribute to the success of DevOps methodologies by bridging the gap between the development and operations teams.

In this article, you'll learn what SRE is, why it's important, how it relates to DevOps and incident management, and what an SRE role looks like in terms of responsibilities.

Exploring SRE

As the name implies, SRE is all about making sure software products are reliable. However, software developers and operations teams tend to prioritize different aspects of the concept of reliability when it comes to the software release cycle.

Naturally, software developers want an expedited release of new features, refactors, or bug fixes. Ops teams are concerned with release processes that ensure end users have a reliable product. However, these processes are often viewed as blockers that "hold back" the release cycle. SRE is the bridge between these two worlds: it uses automated software procedures so that both goals, velocity and quality, can be met.

SRE teams set service level agreements (SLAs) that specify how reliable the system should be, which avoids ambiguous or loosely defined concepts of reliability. For example, an SLA of 99.5 percent uptime per year means there's room, or a budget, for 0.5 percent of downtime in a year. Though SRE's goal is to increase software reliability, a specified margin of error should be set and planned for, as opposed to setting a target of 100 percent uptime in an SLA. This is because 100 percent uptime is accepted as unattainable, while a target close to 100 percent is achievable.
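To make the budget arithmetic concrete, here is a minimal sketch that converts an uptime target into allowed downtime. The function name `error_budget` and the 365-day period are illustrative assumptions, not part of any particular SRE tool.

```python
from datetime import timedelta


def error_budget(sla_percent: float, period: timedelta = timedelta(days=365)) -> timedelta:
    """Return the downtime allowed by an uptime target over a period.

    Hypothetical helper for illustration: sla_percent is the uptime
    target (for example 99.5), period is the window the SLA covers.
    """
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    allowed_fraction = (100 - sla_percent) / 100
    return period * allowed_fraction


budget = error_budget(99.5)
print(f"Allowed downtime per year: {budget}")
```

For the 99.5 percent example above, the budget works out to roughly 43.8 hours of downtime per year. Passing a shorter `period`, such as `timedelta(days=30)`, gives the budget for a monthly window instead.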

SRE Key Practices

Why SRE Is Important

SRE reconciles the objectives of the development team, which looks to push more features, and the ops team, which looks to keep the environment stable. With both teams working to achieve SLA targets and stay within the budgeted margin for error, the traditional divide between them is narrowed, and there's a more cohesive and collaborative workflow.

Monitoring for System Reliability

Monitoring is essential for achieving system reliability and maintaining availability. It's a large and multi-faceted concept that entails a host of automated activities, such as the collection, processing, and aggregation of your systems' real-time data. SRE teams use monitoring to identify different types of system and performance errors.

Achieving system availability is an ongoing engineering process that requires visibility into the software system's infrastructure. Monitoring is the gateway to that visibility, as it gives SRE teams a bird's-eye view of their systems. Monitoring covers both reactive and proactive events, in the sense that it can inform SRE teams of something that's either broken or about to break.

In practice, alerts are triggered when metrics for latency, traffic, errors, or saturation breach their thresholds. Some systems are self-healing, like Kubernetes' declarative system, but in other cases, alerts inform teams, who in turn respond to incidents to determine the root cause and remediate the issue.
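A threshold-based alert check can be sketched in a few lines. The metric names and limits below are hypothetical values chosen for illustration; real monitoring systems define their own rule formats and evaluation semantics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str   # name of the metric in the snapshot
    limit: float  # alert when the observed value exceeds this


# Hypothetical limits, assumed for this example only.
THRESHOLDS = [
    Threshold("latency_ms_p99", 500.0),
    Threshold("requests_per_second", 2000.0),
    Threshold("error_rate", 0.01),
    Threshold("cpu_utilization", 0.85),
]


def breached(snapshot: dict[str, float], thresholds=THRESHOLDS) -> list[str]:
    """Return one alert message per metric that exceeds its limit."""
    alerts = []
    for t in thresholds:
        value = snapshot.get(t.metric)
        # Metrics missing from the snapshot are skipped, not alerted on.
        if value is not None and value > t.limit:
            alerts.append(f"{t.metric}={value} exceeds {t.limit}")
    return alerts


snapshot = {
    "latency_ms_p99": 320.0,
    "requests_per_second": 1500.0,
    "error_rate": 0.05,
    "cpu_utilization": 0.60,
}
for alert in breached(snapshot):
    print("ALERT:", alert)
```

In this snapshot only the error rate is over its limit, so a single alert is raised. A production setup would typically also require the breach to persist for some duration before paging anyone, which this sketch leaves out.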

SRE Incident Management

As mentioned above, monitoring is a crucial step in SRE because it informs the incident management lifecycle. There are four golden signals in monitoring that feed into incident management:

  • Latency: The time taken to serve a request.
  • Traffic: The stress from demand on the system.
  • Errors: The rate of requests that are failing.
  • Saturation: How much of the service's overall capacity is in use.

These golden signals serve as a launch pad for actionable items in incident management.
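As a rough illustration of how the four signals might be derived from raw request data, the sketch below summarizes a window of request samples. The `Request` record, the choice of mean latency, treating 5xx statuses as failures, and measuring saturation as traffic divided by an assumed capacity are all simplifying assumptions for this example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP-style status code


def golden_signals(requests: list[Request], window_seconds: float,
                   capacity_rps: float) -> dict[str, float]:
    """Summarize one observation window into the four golden signals."""
    if not requests:
        raise ValueError("at least one request is required")
    total = len(requests)
    # Assumption: only 5xx responses count as failed requests.
    failures = sum(1 for r in requests if r.status >= 500)
    traffic = total / window_seconds
    return {
        "latency_ms": mean(r.duration_ms for r in requests),
        "traffic_rps": traffic,
        "error_rate": failures / total,
        # Assumption: saturation is demand relative to rated capacity.
        "saturation": traffic / capacity_rps,
    }


samples = [
    Request(100.0, 200),
    Request(200.0, 200),
    Request(300.0, 500),
    Request(400.0, 200),
]
print(golden_signals(samples, window_seconds=2.0, capacity_rps=10.0))
```

Mean latency is used here only to keep the example short. Percentiles such as p95 or p99 are often more informative, since a small number of slow requests can hide behind a healthy average.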

Incident management refers to the process of identifying and rectifying system incidents that pose a threat to or disrupt an organization's software services. The lifecycle of incident management is critical because system disruptions and downtime can be very costly and have a negative impact on attracting new customers and retaining existing customers.

SRE vs. DevOps

There's often a lot of confusion about the difference between DevOps and SRE because of their philosophical, or conceptual, overlap.

DevOps is a combination of philosophies, practices, and tools that support an overarching goal of high velocity, quality control, infrastructure management, and operations in the software release lifecycle. SRE is also concerned with the high velocity and quality of software in the release process, but more with maintaining a service's agreed performance.

While DevOps has a heightened focus on the pipeline methods and workflow in achieving high velocity and quality, SRE is more concerned with operational and reliability problems and using software to identify and remediate system incidents to maintain an agreed error budget (as per an SLA).

Put simply, DevOps deals with the efficiency of the release, whereas SRE deals with the reliability of the release. It is common for the two to be practiced in parallel (i.e., an organization that practices DevOps will also have SREs on the team).

SRE Skills and Responsibilities

The fundamental skills and responsibilities of SREs revolve around the investigation, analysis, and optimized remediation of software systems. In practice, systems vary, so SREs need an appropriate understanding of the technologies and tools relevant to a particular system to investigate, analyze, and remediate problems effectively. Some of the universal roles and responsibilities that SREs should undertake are as follows:

Developing or Building Software to Support Operations (i.e., DevOps and IT Ops)

SREs should be comfortable with coding and scripting in various languages and platforms, such as Java, .NET, Golang, Scala, Node.js, and Python. In many cases, SREs come from a development or an ops background, which serves as a good launchpad. SREs use their programming skills to continuously develop software that helps automate the operational procedures of software systems.

Resolving System Support Issues That Arise

System errors and incidents are an expected occurrence in the life cycle of support and operations. A big part of SRE is being on-call for the respective teams to receive prompts and alerts so they can resolve any issues, even if it's at 2 AM. Because both proactive and reactive events trigger monitoring alerts, it's important for SREs to create automated and proactive workflows. SRE teams have to be innovative in their remediation efforts and automate repetitive tasks so they can dedicate more time to optimizing their workflows.

Conducting Post-Incident Reviews

Incident management doesn’t end when a system degradation is resolved. SREs must give the follow-up task of reviewing incidents the same priority as resolving them. In these reviews, SREs can engage with other teams and stakeholders to identify root issues and causes as a first step toward developing better procedures and solutions to avoid similar occurrences in the future.

Documenting and Communicating Workflow Procedures and Architecture Designs

SRE teams interact and engage with development teams, DevOps engineers, and business stakeholders. They are meant to be cross-functional across departments, and therefore, should lead the way in documentation and knowledge sharing of objectives, processes, architecture designs, and workflows for all interested parties to benefit. SREs can influence system design to ensure performance in production.

Final Thoughts

In this article, you learned what site reliability engineering is, how it came to be, the problem it solves, and why it's important. Furthermore, this post explored how monitoring plays a significant role in the SRE workflow, and how it feeds into the incident management lifecycle. Lastly, we covered the overlap and differences between DevOps and SRE, as well as the main responsibilities and skills that SRE engineers should have.

Monitoring large distributed systems is a complex task that requires various tools to help automate, optimize, and simplify the processes for SRE teams.

Lightstep is a cloud-native reliability platform that can enhance the efforts of your SRE and DevOps teams.
