Lightstep adds complete system context to PagerDuty alerts
by Fran Thorpe
“Now developers automatically have PagerDuty on-call details inside of a pull request, alongside system health details, at their fingertips in one screen.” Steve Gross, Sr. Director, Strategic Ecosystem Development at PagerDuty
There is a lot of noise surrounding the term “Observability”. While vendors and pundits debate three pillars, Lightstep has partnered with PagerDuty to ensure software teams can move quickly from the context of an incident to understanding its root cause. Together we’re augmenting incident response solutions for pre-production scenarios.
Today, when a developer gets an early-morning notification and it’s unfortunately a major incident, they immediately want to know the context surrounding that incident. Lightstep adds extensive insights and correlation detail for the production system to PagerDuty's incident response workflow. Given the rich Lightstep and PagerDuty data sets and the context they bring, we saw an opportunity to help developers understand an incident even before opening the runbook.
Both alert and observability context can also provide relevant insights just before developers make an important code change to a production system. For example, when service owners working in GitHub are about to merge a pull request that has passed code review, they are likely missing important information unless they switch between different solutions. They don’t have access to product health context, and they don't necessarily know who is on-call and responsible for the service in production. For Lightstep and PagerDuty, this provides an opportunity to ask and answer “Is the code ready to deploy?”
Recently Lightstep published the Lightstep Pre-Deploy Check GitHub Action, which provides an opinionated view of the health of the whole system inside a pull request, before developers merge their service’s code. Automatically surfacing complementary data from Lightstep and PagerDuty, just before a merge is initiated, helps teams ship quickly and reliably. Developers gain additional context: who owns which service, information about the on-call team, and even an immediate view of system health and performance via a Lightstep Snapshot.
If issues are surfaced by the Action, the developer has what’s needed to investigate before clicking the merge button. This is very different from a production issue or ongoing incident. The Action gives the developer visibility into the grey area where latency might be slightly higher but the customer experience is not yet adversely impacted. The developer now has all the context needed, including the name of the person on-call for the service, before they decide the system is all clear to deploy new code.
Lightstep brings context to services in PagerDuty using the new Change Events API. When the Action detects issues with the production system, it generates a Change Event. In addition to a customized message (e.g., “Lightstep Pre-Deploy Check failed”), the Action attaches metadata: the pull request and a Lightstep Snapshot.
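To make the shape of such a Change Event concrete, here is a minimal Python sketch of the request body, assuming the publicly documented PagerDuty Events API v2 change endpoint. It is an illustration, not the Action's actual implementation: the function names, the `source` value, the `custom_details` field, and the link labels are all hypothetical, and the routing key and URLs are placeholders you would supply yourself.

```python
import json
import urllib.request
from datetime import datetime, timezone

# PagerDuty Events API v2 endpoint for change events.
CHANGE_EVENTS_URL = "https://events.pagerduty.com/v2/change/enqueue"


def build_change_event(routing_key, summary, pull_request_url, snapshot_url,
                       timestamp=None):
    """Build a Change Event body carrying the pull request and Snapshot links.

    The "source" and "custom_details" values are illustrative assumptions,
    not values taken from the Lightstep Action.
    """
    if timestamp is None:
        timestamp = datetime.now(timezone.utc)
    return {
        "routing_key": routing_key,  # integration key of the PagerDuty service
        "payload": {
            "summary": summary,
            "source": "lightstep-pre-deploy-check",  # hypothetical source name
            "timestamp": timestamp.isoformat(),
            "custom_details": {"check_status": "failed"},  # hypothetical field
        },
        "links": [
            {"href": pull_request_url, "text": "Pull request"},
            {"href": snapshot_url, "text": "Lightstep Snapshot"},
        ],
    }


def send_change_event(event):
    """POST the event as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        CHANGE_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling `build_change_event(...)` with a service's integration key, the message “Lightstep Pre-Deploy Check failed”, and the two URLs yields a dictionary that `send_change_event` can submit. Keeping the builder separate from the network call makes the payload easy to inspect or log before anything is sent.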
The Incident Response team now has real-time access to all the telemetry for a production system at the time the code merged: all the traces, metrics, and correlations, presented in an easy-to-consume UI that includes a service diagram. With Lightstep Pre-Deploy Check and the PagerDuty Change Event, developers and Incident Response teams have more control, and a simple and clear way to see all the interactions between what they are developing, deploying, and then investigating when something inevitably goes wrong.