Defining Services for Incident Response

When implementing and managing an incident response solution for your production systems and services, you will likely need to define the services your team owns. This article provides prescriptive guidance on how to define services for incident response, based on the foundations of ServiceNow’s Common Service Data Model (CSDM), defined as a “standard and shared set of service-related definitions across our products and platform that will enable and support true service level reporting while providing prescriptive guidance on service modeling”.

What Is a Service in Lightstep Incident Response?

The formal definition from CSDM is “A service is a means of delivering value to customers by facilitating outcomes customers want to achieve without the ownership of specific costs and risks.” In the context of incident response, services are the things your business consumes for value: you want them to be available and reliable, and you route and report on alerts and incidents against them. The service also acts as the categorization for the type of issue (application, database, subprod, prod, etc.).

What Should You Track as a Service?

In line with the best practices of the Common Service Data Model, we recommend tracking alerts and incidents against two types of services: technical services and application services.

Technical services

Technical services are associated with service owners. They are the lower-level leaf nodes beneath one or more application services, which they underpin. Keep the technical services themselves flat (one level, not a nested hierarchy) and provider focused: they represent the technology provided for the business and other teams to consume or sell.

Application services

Application services are logical representations of a deployed application stack or system. Unique instances of an application are created per environment (e.g., dev, QA, prod) or per region (e.g., EU, NA, East, West).

Defining which services consume which other services helps with both impact analysis and root cause detection.
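As a rough illustration of why these relationships matter, the sketch below models a small service catalog and walks the "consumes" relationships to find everything affected when one service fails. It is plain Python, not the Lightstep Incident Response data model: the `Service` class, the `impacted_by` helper, and all service names are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class ServiceKind(Enum):
    APPLICATION = "application"
    TECHNICAL = "technical"


@dataclass
class Service:
    name: str
    kind: ServiceKind
    environment: str = "prod"
    # Names of the services this service consumes (depends on).
    consumes: list[str] = field(default_factory=list)


def impacted_by(failed: str, catalog: dict[str, Service]) -> set[str]:
    """Return every service that directly or transitively consumes `failed`."""
    impacted: set[str] = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for svc in catalog.values():
            if current in svc.consumes and svc.name not in impacted:
                impacted.add(svc.name)
                frontier.append(svc.name)
    return impacted


# Hypothetical catalog: one technical service underpinning two application
# service instances, plus an unrelated dev instance.
catalog = {
    s.name: s
    for s in [
        Service("postgres-cluster", ServiceKind.TECHNICAL),
        Service("checkout-prod-eu", ServiceKind.APPLICATION, "prod", ["postgres-cluster"]),
        Service("billing-prod-eu", ServiceKind.APPLICATION, "prod",
                ["postgres-cluster", "checkout-prod-eu"]),
        Service("checkout-dev", ServiceKind.APPLICATION, "dev"),
    ]
}

print(sorted(impacted_by("postgres-cluster", catalog)))
```

Walking the same relationships in the opposite direction (from a failing application service down to what it consumes) gives a list of candidates for root cause.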

Example

Let’s look at an example SaaS healthcare platform that wants to use Lightstep Incident Response (LIR) to manage the availability of its business applications for charting and pharmacy prescriptions. In LIR, they would define and track services for all the application and technical services indicated in orange, as these are consumed by the business applications to provide the service. You want to track response workflow, reporting, and ownership of issues against these underlying application and technical services.

SaaS Healthcare Platform LIR Example

Source: CSDM example series with Mark Bodman.

For smaller teams, or organizations just beginning to identify and define services, a good starting point is to define services at the level of your application services. Any alert or incident that would have been routed to an underlying technical service can instead be tracked against the application service. This is more reasonable for smaller teams where the owners of the underlying technical services are the same as the owners of the application service.

For larger organizations with mature service definitions and more specialized owners, we recommend tracking at the level of the technical service. This gives you better visibility into where issues occur in the underlying technical services, and lets you route alerts and incidents to the separate expert teams that own the different technical services.
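The choice between the two tracking levels can be pictured as a small roll-up step applied when an alert arrives. This is only a sketch under simplifying assumptions: the function name, the `parent_application` mapping, and the service names are hypothetical, and each technical service is assumed to roll up to a single application service.

```python
def resolve_tracked_service(
    alert_source: str,
    parent_application: dict[str, str],
    track_technical: bool,
) -> str:
    """Pick the service an alert should be tracked against.

    With track_technical=True the alert stays on the service that raised it.
    Otherwise a technical service is rolled up to its application service;
    sources with no mapping (already application services) are unchanged.
    """
    if track_technical:
        return alert_source
    return parent_application.get(alert_source, alert_source)


# Hypothetical mapping from technical services to the application service
# they underpin.
parent_application = {
    "postgres-cluster": "checkout-prod-eu",
    "message-queue": "checkout-prod-eu",
}

# Smaller team: everything lands on the application service.
print(resolve_tracked_service("postgres-cluster", parent_application, track_technical=False))

# Larger organization: the alert stays with the technical service's owners.
print(resolve_tracked_service("postgres-cluster", parent_application, track_technical=True))
```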

Define Services within your Incident Response Workflows

In Lightstep Incident Response, we define the service on every alert or incident. There are many benefits associated with this approach. First, using a service-based approach for work improves routing & prioritization. Primary and supporting on-call teams can be defined and updated for a given service, including the list of business stakeholders you may want to inform when incidents occur. The on-call users for these primary and supporting teams will be automatically notified as per defined escalation policies, while the stakeholders can be informed through intentional status updates. When handling many high-priority incidents, you can also use the priority of the service to help you decide what to work on first, ensuring the most valuable services are restored.
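To make the routing and prioritization idea concrete, here is a minimal sketch of service-based triage. It does not use the Lightstep Incident Response API. The `ServiceRouting` and `Incident` classes, the team names, and the convention that priority 1 is the most critical are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRouting:
    name: str
    priority: int  # assumed convention: 1 is the most business-critical
    primary_team: str
    supporting_teams: list[str] = field(default_factory=list)
    stakeholders: list[str] = field(default_factory=list)


@dataclass
class Incident:
    incident_id: str
    service: str


def notification_order(service: ServiceRouting) -> list[str]:
    """On-call teams in the order they would be paged: primary, then supporting."""
    return [service.primary_team, *service.supporting_teams]


def triage(incidents: list[Incident], services: dict[str, ServiceRouting]) -> list[Incident]:
    """Order open incidents so the highest-priority services come first."""
    return sorted(incidents, key=lambda inc: services[inc.service].priority)


services = {
    "paymentAPI": ServiceRouting(
        "paymentAPI", 1, "payments-oncall", ["platform-oncall"], ["finance-leads"]
    ),
    "report-export": ServiceRouting("report-export", 3, "reporting-oncall"),
}
open_incidents = [Incident("INC-2", "report-export"), Incident("INC-1", "paymentAPI")]

for inc in triage(open_incidents, services):
    print(inc.incident_id, notification_order(services[inc.service]))
```

Stakeholders are kept separate from the paging order on purpose: they are informed through status updates and are not paged through the escalation policy.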

LIR paymentAPI-Service Details

When reporting on the volume of incidents and alerts in relation to services, you can identify trends in what systems may need projects to improve reliability. The service construct acts as a dynamic categorization that offers the ability to redefine ownership, without needing to create a new integration, categorization, or routing logic. Using the recommended application and technical service nomenclature will also improve the explanation of what’s affected when working in your environment. The last benefit of services with clear on-call teams is a quick directory of who owns what in your system, along with clear information on who to contact if needed.
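The reporting and ownership points can be sketched in a few lines. The incident records, service names, and team names below are hypothetical sample data, and the snippet is plain Python that does not query Lightstep Incident Response.

```python
from collections import Counter

# Hypothetical incident log: (service, incident id) pairs.
incident_log = [
    ("paymentAPI", "INC-1"),
    ("report-export", "INC-2"),
    ("paymentAPI", "INC-3"),
    ("paymentAPI", "INC-4"),
]

# Volume per service: a simple way to spot where reliability work may pay off.
incidents_per_service = Counter(service for service, _ in incident_log)
print(incidents_per_service.most_common())

# Ownership lives on the service, so reassigning it is a single update;
# the incident records and their service categorization stay untouched.
owners = {"paymentAPI": "payments-oncall", "report-export": "reporting-oncall"}
owners["paymentAPI"] = "platform-oncall"
print(owners["paymentAPI"])
```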

Lightstep Incident Response - In-app Services Charts

With Lightstep Incident Response, you can ingest work from monitoring and observability integrations against your services, as well as ingest changes against them. A service status view and service dashboard are also available to give your team quick insight into service health. You can even attach service-specific links to documentation and playbooks, which appear on alerts and incidents affecting that service to further help your teams. With a defined service structure, you can manage work and improve the resilience of your business services by ensuring the underlying application and technical services are running efficiently.

October 17, 2022

About the author

Darius Koohmarey
