Top PagerDuty Competitors and Alternatives
by Dan Woods
Incident response, the practice of responding to operational problems and resolving them in a complex technology environment, is getting a huge amount of attention. More and more companies build complex systems out of dozens or hundreds of services and then struggle to keep those systems available and running as effectively as possible.
The role of the Site Reliability Engineer (SRE), the job title most closely associated with incident response, has become a huge focus for hiring and organizational development. Google has created the SRE.google site that has tons of information about how to become an SRE, what the role does, and how to manage the SRE function.
The SRE role was created because the job of building, running, and debugging large applications and platforms that are built from a huge number of relatively simple microservices requires a different set of skills, knowledge, and tools than the job of building and running large, complex applications and services, often referred to as monoliths. The SRE skill set has become a crucial part of fully implementing DevOps practices.
In this article, we’re going to look at the tools that are most closely associated with incident response and examine the perspective each tool takes and what users have to say about it.
While PagerDuty is one of the tools most widely used by SRE teams, this article will show how much innovation is taking place to build systems that serve SREs in new ways. The PagerDuty alternatives covered here include Opsgenie, Lightstep Incident Response, xMatters, VictorOps (now Splunk On-Call), Datadog, and FireHydrant.
First, to understand why anyone would look for a PagerDuty alternative, and to provide useful guidance, it is important to have a clear picture of what an incident response tool is expected to do.
The main job of an incident response tool is to help with these functions:
On-call management: Alerting the right person or incident response team via assorted communication channels and then escalating to find someone else if they don’t respond. Making sure adequate resources are available. Calling in extra help if the team is overwhelmed. Supporting self-service scheduling.
Event and alert management and analysis: Helping understand, automatically group, and make sense of a storm of events and alerts from a wide range of monitoring systems so the root cause can be identified and assigned to a team to resolve.
Incident Response: Organizing the process of defining an incident, associating it with events and alerts, constructing the team to resolve it, notifying stakeholders of updates, coordinating the work of the team, monitoring what was done to resolve it, and learning from the process so that root causes can be identified, runbooks can be improved, and opportunities for automation and prevention identified.
Operational reporting and analytics: Creating reporting and analytics dashboards to provide information about team performance and service health. Data from and integrations with application monitoring tools are often crucial to this function.
Runbook creation, execution, and maintenance: Capturing knowledge about how to resolve common incidents, maintaining that knowledge, using it during incident response, and making suggestions to address root causes. Runbooks also advise how to execute processes, when to apply appropriate automations and analytics, and suggest ways to collaborate.
Automation of tasks and processes: Expanding automation to the tasks and processes involved in monitoring, gathering information during analysis of incidents, analyzing the information, and taking action for resolution. Increasingly, AI and ML are being applied in all realms of incident response.
Connectivity, integration, and orchestration of related systems: Connecting to and integrating the data and services from a wide range of systems is crucial to incident response. Such connectivity expands the possibility for automation and analytics.
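To make the on-call management function above more concrete, here is a minimal sketch of an escalation policy written in Python. It does not reflect the API of PagerDuty or any other product covered here. The responder names, the timeouts, and the acknowledged callback are all hypothetical stand-ins for whatever paging and acknowledgement mechanism a real tool provides.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EscalationStep:
    responder: str        # hypothetical person or team name
    timeout_minutes: int  # how long to wait for an acknowledgement


def escalate(
    steps: list[EscalationStep],
    acknowledged: Callable[[str, int], bool],
) -> Optional[str]:
    """Page each responder in order; return the first one who acknowledges."""
    for step in steps:
        # In a real tool this call would send a page and wait up to the timeout.
        if acknowledged(step.responder, step.timeout_minutes):
            return step.responder
    return None  # nobody responded, so the alert is still unowned


policy = [
    EscalationStep("primary-on-call", 5),
    EscalationStep("secondary-on-call", 10),
    EscalationStep("engineering-manager", 15),
]

# Simulate a primary who misses the page and a secondary who answers.
who = escalate(policy, lambda responder, _timeout: responder == "secondary-on-call")
print(who)  # secondary-on-call
```

The point of the sketch is the ordering: each responder gets a bounded window to acknowledge before the alert moves to the next one, and an alert that exhausts the list is flagged rather than silently dropped.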
Based on this definition of what the competing Incident Response products do, let’s take a look at the options.
PagerDuty is a longstanding player in supporting the SRE role and helping manage operations duties needed to ensure uptime. Founded in 2009, PagerDuty went public in 2019, has more than 700 staff, and recorded revenue of $213.6 million in 2020. PagerDuty has product offerings for all the SRE functional food groups, including on-call management, incident response, runbooks (enhanced by the Rundeck acquisition), automation, event management, and operational analytics. Most of the other PagerDuty competitors have focused on a subset of these capabilities.
PagerDuty delivers a reliable and robust capability for scheduling on-call coverage and routing alerts and events from monitoring tools to the right people and then managing escalation. Allowing teams to customize the alerting and event management process and then bring that experience to a mature mobile app makes lots of users happy. So does the large number of integrations and the ability to use the API to connect to more than 300 other systems. The incident response capability includes such SRE best practices as blameless postmortems. Runbooks use machine learning to identify redundant and duplicate events, suggest actions for resolution, and inform everyone related to the incident what is happening.
PagerDuty users are less happy about the price they have to pay and the need to escalate to higher cost licenses to get access to more functionality. The quality of the operational analytics and reporting is often mentioned as a weak spot, and some users argue that the event routing rules of other products are superior. Many users complain about PagerDuty’s user interface, which they report gets the job done but isn’t always easy to work with or clear. This complaint about user experience is common across the PagerDuty alternatives reviewed in this article.
Among all the competitors, PagerDuty is at this point the product most commonly used by SRE teams and those doing the same type of work. The company is seeking to defend its position by expanding and deepening its product portfolio with advanced runbook capabilities (based on the Rundeck acquisition), broad use of AI/ML, and even more integrations. To get all that PagerDuty has to offer, you will need to pay for numerous upgrades and licenses for additional features. This is par for the course for mature enterprise software products.
Founded in 2013, Lightstep Incident Response was acquired by ServiceNow in 2021. Lightstep raised $70 million in five rounds, was valued from $100 million to $500 million in the last round, and had about 100 staff at the time of acquisition. Lightstep’s founders came from Google and were pioneers in distributed tracing. The company has played a leading role in the development of standards for open tracing and open telemetry and is a credible PagerDuty alternative. Initially, Lightstep gained lots of fans for its distributed tracing capabilities, which are seen as a must-have in microservice environments. As the product and number of customers grew, Lightstep expanded its capabilities for observability, adding numerous integrations with monitoring services. In addition, Lightstep added more capabilities for event management and monitoring and for incident response workflows and collaboration.
Lightstep’s vision is that a detailed integration between observability, monitoring, and response capabilities yields great dividends. Lightstep can quickly identify causes for anomalies as well as support distributed tracing with unlimited cardinality, dynamic service maps, and immediate root cause correlation across traces, metrics, and logs from anywhere in your landscape. Once a problem has been identified, Lightstep offers incident response workflow management and collaboration capabilities that bridge silos. The runbooks in Lightstep support capture of activity from a wide variety of environments, including the command line, as well as integration with ServiceNow for automation of various tasks and invocations of workflows. The net impact of Lightstep is a reduction in noise and alert fatigue. Not all alerts are incidents, and Lightstep helps quickly tell the difference. Lightstep’s usage-based pricing is a popular feature.
Lightstep users would like to see more consistency in the UI experience as well as expanded documentation in some areas, especially when setting up satellite instances.
Lightstep users that are focused on microservices architectures report they save a lot of time because Lightstep really helps speed the process of diagnosis and also makes a rapid change process possible. Now that the company is owned by ServiceNow, Lightstep seems to be seeking to get ahead of competitors by broadening its focus from cloud-scale, microservices architectures and expanding its mission of observability, monitoring, distributed tracing, and incident response capabilities to the rest of the IT landscape.
Opsgenie was acquired in 2018 by Atlassian, the Australian company with annual revenues of more than $2 billion in 2021.
Atlassian, with more than 5,000 employees, has a large product line that also includes the Jira family of products and Trello, the Kanban-style project management software. The products are focused on supporting agile software development, supporting and fixing systems once launched, building software, and collaboration. Opsgenie fits into the support and fix category of Atlassian’s portfolio and has long been considered a PagerDuty alternative. More than that of other Opsgenie competitors, the company’s vision of incident response (IR) starts with its event and alert management capability that “ensures you will never miss a critical alert” according to the website. The alerting function can then notify SRE teams through multiple channels, enrich the alerts, take automated actions, set policies for how alerts are handled, and implement heartbeat and monitoring functions. The product also handles on-call management with a mature set of capabilities including routing rules and escalations and on-call reminders. The analytics and reporting capabilities allow alert activity and resolutions to be analyzed and various metrics to be created. Collaboration through Slack, Teams, and Zoom can happen from within Opsgenie via integrations. This sort of integration is becoming popular in a variety of Opsgenie alternatives.
Users like the way Opsgenie allows you to correlate alerts to recent deployment activity so that the problems caused by any new code can be quickly discovered. The generous terms of the free option, which supports five users and unlimited SMS messages, are also a hit. As you would expect, Opsgenie is tightly integrated with other Atlassian products such as the Jira ticketing system and the Confluence wiki, which is often used for capturing knowledge and creating runbooks. Opsgenie also supports creating postmortem reports from templates that can be enhanced with queries and analytics to show exactly what happened when an incident was resolved. The mobile apps for Apple iOS and Android are popular, a feature that is becoming standard across Opsgenie alternatives.
As with many IR products, some users fret about the complexity of the user experience, especially how it can be confusing for beginners. Users report that the documentation could be improved, especially with regard to onboarding and initial setup, and that it is not easy to set rules and policies for handling alerts. Users would like to have more capabilities for orchestration and automation of tasks and responses.
Opsgenie fits nicely into the Atlassian ecosystem and covers most of the basic food groups of IR capabilities required by SRE and technical operations teams. It is easy and relatively inexpensive to get started. It is not clear yet if Opsgenie will be the kind of product that will break new ground or will just keep up with the basics of IR capabilities. For example, many other Opsgenie competitors are enhancing runbooks with a variety of capabilities. While Confluence wikis have been a great way to support knowledge capture, will more be needed to keep up with the growing complexity of IR?
xMatters was founded in 2000 and raised $96 million in eight rounds before being acquired by Everbridge in 2021.
Many xMatters users expressed a desire for more robust integration with ServiceNow, which likely reflects a customer base composed of large-scale traditional IT shops where ServiceNow is ubiquitous. In addition, the Android mobile app was frequently called out as a weak point. Some users desired more flexibility in event and alert routing rules, especially the ability to suppress routing of seemingly high priority events when certain conditions made resolution less urgent. And, as is the case for many xMatters alternatives, users would like a UI that is less confusing.
VictorOps was founded in 2012 and raised $33.7M in funding over four rounds before the company was acquired by Splunk in 2018 and renamed as Splunk On-Call. Splunk On-Call is often mentioned as a PagerDuty alternative.
Splunk On-Call fields events and alerts from a variety of monitoring and alerting tools and then notifies those on an on-call schedule, escalating to backup personnel if needed. The alerts can be grouped and also enhanced by the Transmogrifier, which applies a set of rules that allow annotations and documents to be attached to the alert to help provide guidance about how to resolve. Notifications are sent to mobile apps and other channels. Reporting analyzes how alerts have been handled.
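The idea of enriching alerts through rules can be sketched in a few lines of Python. This is not the Transmogrifier's actual rule syntax or any Splunk On-Call API. The field names, wildcard patterns, and notes below are hypothetical and only illustrate the general technique of matching an alert field and attaching guidance to it.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: match one alert field against a wildcard pattern,
# then attach a note that tells the responder where to start.
RULES = [
    {"field": "service", "pattern": "db-*", "note": "Check replication lag first"},
    {"field": "message", "pattern": "*disk*full*", "note": "Rotate logs, then expand the volume"},
]


def enrich(alert: dict, rules: list = RULES) -> dict:
    """Return a copy of the alert with a note from every matching rule."""
    notes = [
        rule["note"]
        for rule in rules
        # Lowercase the field value so matching is case-insensitive.
        if fnmatchcase(str(alert.get(rule["field"], "")).lower(), rule["pattern"])
    ]
    return {**alert, "notes": notes}


alert = {"service": "db-primary", "message": "Disk full on /var"}
print(enrich(alert)["notes"])
```

Because the rules are plain data, a team could add or adjust them without touching the matching logic, which is the main appeal of a rules-based approach to alert annotation.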
Users like Splunk On-Call’s Twitter-style timeline that allows anyone fielding an alert to see the other alerts that are being processed. Users love the mobile apps, especially the in-app messaging that supports rapid communication. When a longer discussion is needed, Splunk On-Call’s Control Calling feature sets up a conference bridge and invites everyone to join. This sort of integration with various communications channels is becoming the focus of many Splunk On-Call competitors. In general, once the Transmogrifier rules are set up, SREs get alerts that have suggestions for how they should be resolved.
As with several other VictorOps alternatives, oops, I mean Splunk On-Call alternatives, many users are yearning for a less complex and confusing UI. The most common complaint is about the difficulty of overriding an established on-call schedule to accommodate a temporary change in personnel. There is also a desire by some users to implement runbooks inside the product. Past incidents can be hard to reference and access, which makes learning from experience harder. Some users feel that since the acquisition by Splunk the pace of innovation has slowed compared to other alternatives.
Datadog Incident Management was launched in 2020 to add support for incident response to Datadog’s cloud monitoring service.
Datadog Incident Management is focused on incident management and does not offer on-call management capabilities, unlike most of the other Datadog competitors, which makes Datadog only a partial PagerDuty alternative. The point of Datadog Incident Management is to automate as much as possible the process of analyzing alerts and creating incidents and then identifying the team needed for resolution. The product supports collaboration and knowledge activity capture with interactive timelines, and also allows much of the work to take place from within Slack or the mobile app. Datadog’s broad set of integrations allows deep dives into the metrics and alerts as well as automatic creation of tickets and other mechanisms needed for tracking and collaboration. Activity to resolve incidents is automatically harvested to create postmortem reports and also to report on common metrics related to incident response, such as MTTR.
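For readers less familiar with the metric, here is a small sketch of how MTTR can be computed, taking MTTR to mean mean time to resolution (an assumption, since the acronym is also expanded as mean time to recovery or repair). It is not how Datadog calculates the figure, and the incident timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents: list) -> timedelta:
    """Average the opened-to-resolved duration over (opened, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    durations = [resolved - opened for opened, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


# Made-up incidents: one resolved in 30 minutes, one in 90 minutes.
incidents = [
    (datetime(2022, 1, 3, 9, 0), datetime(2022, 1, 3, 9, 30)),
    (datetime(2022, 1, 5, 14, 0), datetime(2022, 1, 5, 15, 30)),
]

mttr = mean_time_to_resolve(incidents)
print(mttr)  # 1:00:00
```

A tool that captures incident timelines automatically has the opened and resolved timestamps on hand, which is why this kind of reporting tends to come along with timeline capture.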
Datadog Incident Management offers interactive and real-time notebooks that support comments and embedded graphics as part of the product, avoiding the need to create runbooks and other key documents in other systems. Users like the tight integration with observability functions that allows the path from the incident to the exploration of metrics to happen in a seamless fashion. The Slack chatbot client also allows quick responses to issues in advance of diving into the product to do detailed exploration.
Compared with other Datadog alternatives, the lack of on-call management is a challenge for some users, although Datadog does integrate with other incident response apps such as PagerDuty and Opsgenie. For Datadog enthusiasts, the solution works well, but users seeking automation of responses and other activities will find that these mostly must take place in systems that Datadog integrates with.
If you have a large, complex environment and are already a Datadog user, Datadog Incident Management may make a lot of sense compared to alternatives. If not, you will not find a one-stop shop for incident response at Datadog, especially if you don’t have a large amount of tooling for IT Service Management, observability, and automation in place.
Based in New York and founded in 2018, FireHydrant has raised $32.5 million in two rounds. The company has more than 50 staff. Seeking to help its clients “manage the mayhem,” FireHydrant focuses on defining, supporting, and automating IR processes. Like Datadog, FireHydrant is not a complete PagerDuty alternative because it does not do on-call management.
FireHydrant seeks to bring best practices to incident management based in part on FEMA’s Incident Commander framework. Like some FireHydrant alternatives, the product allows incidents to be declared and managed from Slack and, through integrations, uses other tools to support functions such as on-call notifications and ticketing. FireHydrant seeks to accelerate the process of resolving incidents by using its service catalog that tracks services, people who own them, observability data, and deployment activity. By monitoring deployments to track changes, FireHydrant hopes to point you to where problems started. Further acceleration is supported by increasing automation in runbooks to boost efficiency and make more time for resolution. The product has some interesting features such as end-user-facing status pages that are automatically updated when services are disrupted. To support retrospectives, a timeline of the activity taken during an incident is automatically captured.
More than with other FireHydrant alternatives, users really like the way that FireHydrant clarifies roles in the incident response process, assigning responsibilities to the incident commander and other supporting roles, which brings consistency to the entire lifecycle. That said, the best practice processes in the product are only a starting point and can be modified in any way to support the way that an SRE team works. This mechanism brings process discipline and accountability as well as providing a structure when introducing new processes. The Slack integration and automation of reports for retrospectives are also popular.
Some FireHydrant users expressed frustration about the lack of native on-call capabilities and the inability to clone runbooks. Others complained that it was not possible to update and correct a retrospective report once it was published. Also, some said that it was not always clear how integrations worked.
This analysis should give you a better sense of what PagerDuty does vs. the alternatives. Whether you are interested in PagerDuty vs. Opsgenie, PagerDuty vs. Lightstep, PagerDuty vs. xMatters, or PagerDuty vs. VictorOps, the sweet spot of each of these products should be clearer.
You can also learn more about IT alerting and why it is important.
You can learn more about Lightstep Incident Response at our Resource Center.