Move fast and know what’s broken: Announcing Service Health for Deployments
by Talia Moyal
Superstition runs through my blood. When I was a product manager, I remember telling my team, “Never launch on a Friday!” The gut-wrenching feeling I had of customers complaining about a new release or running into newly deployed issues didn’t seem so different from what the engineers were feeling about deploying, especially those on call.
At Lightstep, we started to think about this problem. What would give us peace of mind when pushing something into production? We realized we wanted to be able to:
- Quickly catch a performance regression if something (inevitably) went wrong
- Easily isolate the root cause of that regression
We wrestled with how to bring this level of confidence to deployments, and after months of work, we are proud to launch Service Health for Deployments, the first observability tool to automatically show you what impacted your service’s performance during and after a deployment, and to surface why it happened.
With this release, developers no longer need to wonder if a deployment has impacted their service level indicators, such as latency, error rate, or throughput. There’s no reason to get stuck in the cycle of rolling forward, getting alerted, guessing at possible fixes, rolling back, rolling forward again, getting alerted again, guessing at possible fixes again, and so on, when you can get an immediate understanding of service and system health.
How Does Service Health for Deployments Work?
Service Health for Deployments allows developers to proactively monitor deployments or reactively investigate regressions by:
- Comparing historical latency histogram distributions
- Identifying tags that have the biggest impact on latency through Correlations
- Viewing operations or service diagrams to see the lifecycle of a request through a service or system
- Comparing before and after views of a regression
- Viewing the latency, error ratio and throughput for a service, all while seeing when a deployment for that service occurred
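To make the before-and-after comparison concrete, here is a minimal sketch of how latency and error ratio might be summarized on either side of a deployment timestamp. It is an illustration only, not Lightstep's implementation: the `Span` fields, the nearest-rank percentile, and the function names are all assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Span:
    # Hypothetical, simplified span record for this sketch.
    timestamp: float   # seconds since some epoch
    latency_ms: float
    is_error: bool


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(spans: List[Span]) -> Dict[str, float]:
    """Summarize latency and error ratio for a non-empty window of spans."""
    latencies = [s.latency_ms for s in spans]
    return {
        "count": len(spans),
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "error_ratio": sum(s.is_error for s in spans) / len(spans),
    }


def compare_around_deploy(
    spans: List[Span], deploy_ts: float
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Split spans at the deployment time and summarize each side."""
    before = [s for s in spans if s.timestamp < deploy_ts]
    after = [s for s in spans if s.timestamp >= deploy_ts]
    return summarize(before), summarize(after)
```

Calling `compare_around_deploy(spans, deploy_ts)` returns one summary for the baseline window and one for the post-deployment window, so a jump in `p95_ms` or `error_ratio` between the two stands out immediately.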
Service Health in Real Life
Imagine this: It’s Friday afternoon and you are deploying a new version of the inventory service powering your ecommerce platform. There are some risky changes going out, so you want to make sure everything looks healthy before you leave for the weekend. And it’s your partner’s birthday dinner, so you don’t want to be late.
You jump into Lightstep and see that the operation update-inventory in your inventory service is showing a latency increase.
You click on the latency spike and start investigating. You select a baseline an hour before your 2 p.m. deployment in order to compare how performance changed before and after you rolled out.
You start to compare tags to see if anything strongly correlates with this latency. You see that the large_batch:true tag is new and correlated with high latency.
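One simple way to picture this kind of tag comparison is to rank each tag by how much slower the spans carrying it are than the spans without it. The sketch below assumes spans are plain `(latency_ms, tags)` pairs and uses a difference of means; it is a stand-in for illustration and not the algorithm behind Lightstep's Correlations feature. The `region` tag in the usage example is hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

SpanRecord = Tuple[float, Dict[str, str]]  # (latency_ms, tags); hypothetical shape


def rank_tags_by_latency_impact(spans: List[SpanRecord]) -> List[Tuple[str, float]]:
    """Rank key:value tags by mean latency with the tag minus mean latency without it."""
    latencies_by_tag = defaultdict(list)
    for latency_ms, tags in spans:
        for key, value in tags.items():
            latencies_by_tag[f"{key}:{value}"].append(latency_ms)

    total_latency = sum(latency for latency, _ in spans)
    results = []
    for tag, with_tag in latencies_by_tag.items():
        count_without = len(spans) - len(with_tag)
        if count_without == 0:
            continue  # The tag is on every span, so there is nothing to compare against.
        mean_without = (total_latency - sum(with_tag)) / count_without
        results.append((tag, mean(with_tag) - mean_without))

    # Largest positive difference first: these tags go with the slowest spans.
    return sorted(results, key=lambda item: item[1], reverse=True)


spans = [
    (500.0, {"large_batch": "true", "region": "us"}),
    (480.0, {"large_batch": "true", "region": "eu"}),
    (50.0, {"large_batch": "false", "region": "us"}),
    (60.0, {"large_batch": "false", "region": "eu"}),
]
ranked = rank_tags_by_latency_impact(spans)
print(ranked[0][0])  # the tag most associated with high latency in this sample
```

A production system would also need to account for sample size and statistical significance; a difference of means is only the easiest version to read.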
Now you want to identify which operation is taking the most time in the service you just deployed. You use the Operation Diagram to find the critical path within the service. You see that the write-cache operation is contributing the most to latency.
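As a rough sketch of what "contributing the most to latency" can mean, the example below computes each operation's self time: its duration minus the time spent in its direct child spans. This is a simplification that assumes children mostly run one after another, and it is not how the Operation Diagram itself is computed. The `Span` fields and the `read-inventory` and `write-db` operation names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    # Hypothetical, simplified span record for this sketch.
    span_id: str
    parent_id: Optional[str]
    operation: str
    duration_ms: float


def self_time_by_operation(spans: List[Span]) -> Dict[str, float]:
    """Sum, per operation, the time not accounted for by direct child spans."""
    child_total = defaultdict(float)
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] += span.duration_ms

    totals = defaultdict(float)
    for span in spans:
        # Clamp at zero: children running in parallel can add up to more
        # than their parent's duration.
        totals[span.operation] += max(0.0, span.duration_ms - child_total[span.span_id])
    return dict(totals)


trace = [
    Span("1", None, "update-inventory", 900.0),
    Span("2", "1", "read-inventory", 100.0),
    Span("3", "1", "write-cache", 700.0),
    Span("4", "1", "write-db", 50.0),
]
self_times = self_time_by_operation(trace)
slowest = max(self_times, key=self_times.get)
print(slowest)  # the operation with the most self time in this sample trace
```

Self time is a useful first approximation because it points at the operation doing the work, rather than at a parent that is merely waiting on its children.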
You validate this by grouping traces on the large_batch tag to see if there is a marked difference. The trace analysis table confirms that your deploy to submit large batches is resulting in considerably more latency.
Thankfully, this is a quick fix and it’s not even 3 p.m. yet.
Wait, there’s more: Instrumentation Quality Scores
In addition to providing insights into service health, Lightstep can also help ensure that your services are properly instrumented and therefore always ready to provide you the value you need during a fire.
With Instrumentation Quality Scores, your team gets specific advice on how to make improvements to instrumentation, and an easy-to-understand overview of the quality of that instrumentation.
Our hope is to make developers' lives as easy as possible. We recognize the pain that comes with a multi-tab, manual investigation, and we’re changing that with a one-tab solution that automatically surfaces the regressions you should care about and why they happened.
Check out these features for yourself in our interactive sandbox.