Have you ever been on call when something had a temporary latency spike that was possibly just a network blip? Or maybe the service was unhappy and went into a GC spiral? Sometimes it just goes away on its own, or sometimes you restart the service and things start to look better. It’s all too common these days to feel like the systems we operate are feeling emotions like “unhappiness” rather than behaving deterministically.
We already know that software is deterministic – it does what you program it to do. So, why do we sometimes feel like it’s not doing its thing? Programs don’t get tired, so why does restarting do anything at all, let alone magically fix it? You wouldn’t expect a Fibonacci generator to just randomly give up or slow down, so why do large distributed systems sometimes seem like they do?
Most of the time, the root cause is not a hardware failure; it’s a change in code, configuration, or usage patterns. If things used to work and then stopped working, something definitely changed. When we can’t find or even fathom a reason, we start to use language like “the server is unhappy.” Unsurprisingly, our ability to find the correct explanation for a problem is inversely correlated with the complexity of the system. In a system with thousands of changes per week, dozens of capacity dimensions, and hundreds of quotas up and down the stack, things get quite complex. Complex systems make it harder to reason about why things aren’t working as expected, so our brains explain that gap with magic.
So what should we do? Do we just surrender to the increasing complexity of our distributed systems? Throw our hands up and perpetually play whack-a-mole to keep the site running? That’s definitely a strategy, but it tends to bog down development and cause all sorts of unhappiness in the humans, so I don’t recommend it. Another strategy is to make fewer changes – this might work if you’ve finished your product (if that’s possible). But if you’re trying to stay competitive and constantly improve your product, this anti-change force can be a serious problem.
We’ll spend the rest of this post on our preferred strategy: use observability to demystify these behaviors and gain a deeper understanding of your system, and how you can use that strategy to move faster.
An unhappy server story
A couple of weeks ago we had an issue when one of our services failed to load the latest configuration consecutively for 10 minutes, at which point it exited to avoid running on a stale configuration. Below is a rough sketch of the services involved.
The way this service works is that it periodically connects to our backend which then loads customer info from the database service, extracts the relevant config, and returns it. The backend has a local cache to avoid concurrently and repeatedly loading the config from the database because it’s totally fine to serve data that is a few seconds stale in this case.
The first thing we looked at was the latency of this polling operation. The tail end of the latency was high enough to cause repeated timeouts, so it was possible for an instance to get unlucky several tries in a row and then fail.
In the latency histogram above, latency buckets are on the x-axis, and the y-axis shows how many operations fell into each of those latency buckets. You can see the current data (blue bars) has many more operations falling into the slower buckets compared to the previous week (yellow line).
So the next question we asked: “is this high tail latency expected? Did something dramatically change or did we just exceed the allocated capacity?”
It turns out a bunch of things changed in the preceding several days. We had released a refactor of this polling process that resulted in some more database queries, which showed up as an increase in load. We also saw that some of the other database endpoints were receiving more requests which we hypothesized was due to the retry mechanism after receiving timeouts.
This graph above shows the latency dramatically increasing over the course of the week. There is a corresponding increase in errors (red line at the bottom).
Initially, it looked like latency had just crept up such that the timeout was too aggressive, and occasionally, a server would get unlucky too many times in a row and give up. By looking at the critical path in LightStep [x]PM, we saw that the latency was attributed to our database service.
Because multiple things changed around the same time, and the issue itself (many consecutive timeouts) was not persistent, we couldn’t pinpoint the exact time it started. So we decided to run some experiments to collect more data. At this point, it was easy to disqualify the hypothesis that the service was simply under-provisioned and needed more capacity to support the new load. After we spun up a few more instances, the latency didn’t return to normal unfortunately even after the load per instance returned to levels prior to the start of the problem. Meanwhile, one of our engineers was surprised by the CPU usage of database proxy relative to the QPS it was handling. With some CPU profiling, we found that all the timeouts were causing significant connection reestablishment (which was very expensive because of TLS).
After some spelunking in [x]PM, we found that many of the new requests were from a particular customer and represented a disproportionate volume. It turned out that this particular customer had coincidently misconfigured something around the same time we made some changes, and this misconfiguration resulted in a high volume of invalid requests. The invalid requests were causing errors and our local-cache wasn’t performing negative caching, so it ended up hammering the database with these requests. By adding support for negative caching, the latencies dropped dramatically and the error rate for the polling endpoint recovered.
In this case, by digging deep into why this particular polling endpoint had gotten “unlucky” with consecutive timeouts, and then subsequently digging into the traffic changes on our database service, we found a long-standing bug that had been triggered by a customer change around the same time we made some changes ourselves. Throughout the process, there were several mitigation strategies we could have used like continuing to add capacity, increasing the cache TTL, or rolling back some changes, but they would have all temporarily or permanently masked a real bug that was actually easy to fix. By digging in and getting to the bottom of it, we’re saving money, have better latencies, are more resilient to customer misconfigurations, are more aware of the benefits of negative caching, and have more confidence in operating our services.
Sharing what we learned
Over time, we’ve learned that we can leverage data about what our system is actually doing to can find logical explanations for most problems we observe. Once explained, these mysteries not only make interesting stories but also inspire insight into how our system works and where it can be improved. In the example above, we not only fixed the underlying issue, we hardened other parts of our system. When we look at it that way, it kind of seems obvious: use data to find an explanation and learn, duh!
At LightStep, we’re pragmatic about it – we don’t beat ourselves up if we observe a problem in a poorly instrumented part of the system or fail to find a root cause immediately. We know that trying to be militant would be unsustainable because there is finite time. We focus on getting the most value, and then we move on to the many other things we have to do!
That said, there are a bunch of things we do to incrementally improve our systems and that you can try:
- Prioritize instrumentation, learn your tools, and put the insights to good use. As you incrementally invest in observability, not only does your system become easier to maintain, your organization also develops a deeper understanding, so you can move faster and with more confidence.
- Don’t get discouraged by complexity. Think about the confidence you might get from improving a simple program like a Fibonacci generator; you can’t get all the way there for a large distributed system, but there are definitely degrees of confidence.
- You can do a lot better than “the server is unhappy.” Be on the lookout for language like that. It’s usually a good learning opportunity or at the very least a good time to fill some gaps in instrumentation.
At LightStep, we believe that having a deeper understanding of our system is an essential part of operating with confidence at scale – it’s both in our engineering culture and part of our company mission.
If you’ve got any interesting stories or long-standing server unhappiness, shoot me an email and let’s talk about it.