Why Observability Is Crucial for Developers
by Karthik Kumar
One of the coolest things about my job is using our own product, Lightstep, to monitor the ongoing development of Lightstep. On the product side, we find opportunities for feature improvements and usability enhancements. On the operational side, we uncover issues, debug incidents, and gain more confidence in the software we’re releasing.
In this post, I’ll go through examples of how observability has been crucial for tasks that developers undertake regularly: decomposing a monolith, deploying new code, debugging CI issues, and triaging production incidents.
“Don’t deploy on Fridays” is a common platitude in software engineering. Our motto is “Don’t deploy on Fridays… without proper observability.” We tested this out a few weeks ago, when we deployed a new version of one of my team’s services. The changeset included unfamiliar changes, so we kept a close eye on a specific operation that we knew was important. Service Health for Deployments indicated an issue with increased error rates.
First, a new version of the service was detected (vertical line). This operation is infrequently triggered, but should never return an error, so seeing a 100% error rate immediately after a deploy was alarming.
Deploying with confidence means quickly understanding what has changed. In this case, we dug into errors that had not appeared before this deployment.
The two time windows selected are the Baseline (blue, before deploy) and Regression (red, after deploy). The windows have different widths because more traces are captured for analysis when an anomaly is detected.
Next, two pieces of information point to the root cause. First, an operation diagram aggregated across the population of traces helps visualize the call stack and directly implicates “sql/select_one” as the source of errors (red halos indicate errors). Second, the table below links to a few traces that may be of interest.
In the trace view, a log message has the answer. A problem with deserializing a column led to errors that propagated up to the user. In this case, we rolled back the deploy and had enough information to solve the problem.
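The baseline-versus-regression comparison described above can be approximated in a few lines. This is only a sketch of the general idea, not how Lightstep implements it: the `(operation, is_error)` span tuples, the function names, the `min_increase` threshold, and the “api/get” operation are all assumptions made for illustration.

```python
from collections import Counter


def error_rates(spans):
    """Map each operation name to the fraction of its spans that errored.

    `spans` is an iterable of (operation, is_error) tuples, a simplified
    stand-in for real trace data.
    """
    totals, errors = Counter(), Counter()
    for operation, is_error in spans:
        totals[operation] += 1
        if is_error:
            errors[operation] += 1
    return {op: errors[op] / totals[op] for op in totals}


def regressed_operations(baseline, regression, min_increase=0.05):
    """List operations whose error rate rose by at least `min_increase`
    between the baseline window and the regression window, worst first."""
    before = error_rates(baseline)
    after = error_rates(regression)
    increases = {op: rate - before.get(op, 0.0) for op, rate in after.items()}
    flagged = [(op, delta) for op, delta in increases.items() if delta >= min_increase]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical data: the regression window is wider, so it holds more spans.
baseline = [("sql/select_one", False)] * 4 + [("api/get", False)] * 4
regression = [("sql/select_one", True)] * 3 + [("api/get", False)] * 6
print(regressed_operations(baseline, regression))
```

Comparing rates rather than raw error counts keeps the result meaningful even though the two windows differ in width.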
Being on-call is stressful. This stress increases with the complexity of the architecture. But there’s hope: having end-to-end observability in your systems is critical to reducing this stress.
During a recent on-call shift, I faced a nightmare scenario: paged in the early morning hours for an unfamiliar alert from an unfamiliar service.
The alert came from a Stream set up for an important operation. An SLO was violated (the error rate was above 5% for the last 10 minutes). The chart showed that latency had also spiked to around 15s.
Digging deeper into this specific operation exposed its layered complexity (each node is an operation, yellow halos indicate latency contribution). Several services were being called, so the surface area of the investigation could potentially be very large. But the operation diagram shown above gave me a hint on where to begin the investigation. Tribal knowledge told me that the “bento” service is responsible for managing calls to our SQL database, so looking there seemed like a promising first step.
The database logs indicated an unexpected crash and restart. At this point, the root cause was identified. Remediation steps were taken, and the Mean Time to Get Back To Bed (MTTGBTB) was drastically reduced by narrowing the investigation to only the relevant service.
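To make the alert condition concrete, here is a minimal sketch of an error-rate check over a trailing window. It assumes a simplified `Span` record and uses the 5% threshold and 10-minute window mentioned above; the type, the function names, and the sample data are hypothetical and do not reflect how Lightstep evaluates SLOs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Span:
    """Hypothetical, simplified record of one completed operation."""
    end_time: datetime
    is_error: bool


def error_rate(spans, now, window=timedelta(minutes=10)):
    """Fraction of spans that errored in the trailing window ending at `now`."""
    recent = [s for s in spans if now - window <= s.end_time <= now]
    if not recent:
        return 0.0  # no traffic in the window, so nothing to alert on
    return sum(s.is_error for s in recent) / len(recent)


def slo_violated(spans, now, threshold=0.05, window=timedelta(minutes=10)):
    """True when the trailing error rate exceeds the threshold."""
    return error_rate(spans, now, window) > threshold


# Hypothetical data: one span per minute, every fourth one is an error.
now = datetime(2020, 1, 1, 3, 0)
spans = [Span(now - timedelta(minutes=i), is_error=(i % 4 == 0)) for i in range(20)]
print(slo_violated(spans, now))
```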
An old manager used the phrase “changing the engine of an airplane while you’re flying it” to describe a migration to microservices. Downtime is taboo in most organizations, so observability is often the best way to safely break apart a monolith.
We recently decomposed a monolithic gRPC service (called “liveview”) by moving a few RPCs to a new service (called “historian”). Lightstep’s Service Directory helped validate that the ingress operations (the ones we care about) were correctly transitioned over to the new service. The “Golden Signals” indicated the following:
- Latency did not change — so historian had proper resource allocation.
- Errors remained non-existent (0%) — so we didn’t introduce any bugs.
- Throughput remained the same in historian and went to 0 in liveview — so we didn’t forget to update a client.
Above: Liveview’s traffic is transitioned to Historian.
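The three checks above can also be written down as an automated comparison. The sketch below is an illustration under assumptions: the `GoldenSignals` record, its field names, the 10% tolerance, and the sample numbers are hypothetical, and in practice the values would come from whatever observability tool you use.

```python
from dataclasses import dataclass


@dataclass
class GoldenSignals:
    """Hypothetical snapshot of the migrated RPCs as served by one service."""
    p99_latency_ms: float
    error_rate: float
    requests_per_sec: float


def migration_findings(old_before, old_after, new_after, tolerance=0.10):
    """Compare the old service before the cutover with both services after it.

    Returns a list of problems; an empty list means the move looks clean.
    """
    findings = []
    if new_after.p99_latency_ms > old_before.p99_latency_ms * (1 + tolerance):
        findings.append("latency regressed: check the new service's resource allocation")
    if new_after.error_rate > old_before.error_rate:
        findings.append("error rate increased: the move may have introduced a bug")
    if old_after.requests_per_sec > 0:
        findings.append("old service still receives traffic: a client was not updated")
    drift = abs(new_after.requests_per_sec - old_before.requests_per_sec)
    if drift > old_before.requests_per_sec * tolerance:
        findings.append("throughput changed: some traffic may be lost or duplicated")
    return findings


# Hypothetical numbers for liveview (before and after) and historian (after).
liveview_before = GoldenSignals(p99_latency_ms=120.0, error_rate=0.0, requests_per_sec=50.0)
liveview_after = GoldenSignals(p99_latency_ms=0.0, error_rate=0.0, requests_per_sec=0.0)
historian_after = GoldenSignals(p99_latency_ms=118.0, error_rate=0.0, requests_per_sec=50.0)
print(migration_findings(liveview_before, liveview_after, historian_after))
```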
We use Cypress for Continuous Integration. Cypress tests run against our staging environment, which can be fairly volatile (frequent deploys from HEAD). A failure triggers a Slack notification like this one:
This usually prompts multiple questions: Was there a back-end issue? Is it transient? Is the test obsolete? Was it caused by a deploy? Which team do I escalate this to?
To provide more visibility into test issues, we created a Stream in Lightstep to track errors in the specific project used by the tests. Here’s an example from a recent test issue:
Front-end engineers triaging the test failures can take a look at this Stream to identify possible root causes. In this case, the issue was scoped to the “liveview” service and the appropriate team was notified for further investigation.
Are you adopting microservices? Do you relate a little too well to the examples above? If you’re interested in seeing how Lightstep works, check out our Interactive Sandbox.