What Happens When Your Cloud Integration Starts without Observability
by James Burns
Cloud services have changed the way applications are developed. They allow teams to focus on their value proposition, product, and customers. As part of the evaluation for a cloud service, you might talk to friends, look at recent feature additions, or speak with sales about the roadmap. You may be choosing a cloud service to offload the operational burden to someone else, but just because you’ve offloaded it doesn’t mean it can’t fail.
Even if, or perhaps especially if, you’re a small company, understanding what will happen to your customers and to your business if the cloud service fails becomes key. It pays to look at status pages, recent outages, or public post-mortems. However, the true test is when you integrate the new feature using a cloud service, and see how your workload and the performance (or failure) of the cloud service interact.
Delaying this assessment until late in a project creates substantial risk.
Working on a recent project, my team had created what we believed to be a scalable and resilient architecture. We were just starting to use Chaos Engineering to test out our resilience and observability. During the second round of testing I was responsible for determining the experiments to try. I thought it might be interesting to test how we would observe a third-party cloud component (let’s call it PipelineAPI) failure.
The result of the experiment was that we didn’t — and couldn’t — observe PipelineAPI failure.
After a couple of sprints dedicated to closing the instrumentation gap for all cloud services in the project, we started to see significant performance variance during normal operation in one of the cloud services in the critical path, which we’ll call LogsAPI. After dashboard screenshots, discussions with support, and eventually a meeting with the product manager for LogsAPI, it became clear that it was not designed to support our use case. We pivoted to a similar cloud service, which we’ll call BigDataAPI, with all the instrumentation in place from the start. We observed consistent latency and consistent availability of the data in the data store, with no change over time or with increasing amounts of data. With this data we gained confidence that BigDataAPI would be able to support our use case and growth.
To be honest, we got lucky. We hadn’t thought through how we would observe the performance of third-party cloud services as we scaled out the system. We didn’t instrument some of the core functionality of our system because it was not the code we were writing or testing. If we’d launched without the testing and observability, our system would have failed at even 5% of the target traffic level. Instead, we scaled smoothly over 100x in the first two weeks and had a deep understanding of the performance and resilience of our system the entire time.
It is better not to be lucky. Measure the performance and availability of any cloud service you are designing into a solution as soon as possible, and well before full production deployment. By watching the trend in outlier performance, you can more easily see whether the cloud service is keeping up with your testing or canary traffic. Conversations with cloud service vendors about expectations and performance are easier when you have consistent, high-resolution data to support your observations and questions.
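One way to start measuring is to route every call to a third-party service through a thin wrapper that records how long the call took and whether it failed. The Python sketch below is illustrative only and is not the instrumentation from the project described here. The `ServiceStats` class and its methods are hypothetical names, and it keeps samples in memory where a real system would more likely export them to a metrics backend.

```python
import math
import time
from dataclasses import dataclass, field
from typing import Callable, List, TypeVar

T = TypeVar("T")


@dataclass
class ServiceStats:
    """Hypothetical in-memory record of calls to one third-party service."""
    name: str
    latencies: List[float] = field(default_factory=list)  # seconds, one per call
    errors: int = 0

    def observe(self, call: Callable[[], T]) -> T:
        """Run `call`, recording its duration and whether it raised."""
        start = time.perf_counter()
        try:
            return call()
        except Exception:
            self.errors += 1
            raise  # the caller still sees the failure; we only count it
        finally:
            # Record latency for successes and failures alike.
            self.latencies.append(time.perf_counter() - start)

    def percentile(self, pct: float) -> float:
        """Nearest-rank percentile of recorded latencies (pct in 0..100)."""
        if not self.latencies:
            raise ValueError("no calls recorded yet")
        ordered = sorted(self.latencies)
        rank = max(1, math.ceil(pct * len(ordered) / 100))
        return ordered[rank - 1]

    def error_rate(self) -> float:
        """Fraction of recorded calls that raised an exception."""
        if not self.latencies:
            return 0.0
        return self.errors / len(self.latencies)
```

A call such as `stats.observe(lambda: client.put(record))`, where `client` stands in for whatever SDK the service provides, would then feed the record. Comparing a high percentile such as `stats.percentile(99)` against the median over time is one simple way to watch outlier performance, since averages tend to hide the slow calls that customers notice.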
No matter what you are building, the customer has expectations of your system. Closely observing your cloud service integrations is the best and easiest way to make sure you meet those expectations.