Previously we’ve written about having hard conversations with cloud providers. On Sunday June 2nd, Google Cloud Platform had an extended networking-based outage. It significantly disrupted commonly used services like YouTube and Gmail, as well as Google-hosted applications like Snapchat. The incident currently associated with the outage, 19009, indicates a start time of 12:53 US/Pacific and a resolution time of 16:56 US/Pacific. LightStep Research’s ongoing synthetic testing shows that the impact lasted longer than the incident report states, and it provides an example of the type of evidence you can share with a cloud provider when discussing an outage.

Summary of Findings

From 11:48 to 11:53, access from us-east1 to GCS regional buckets in us-east1 was completely disrupted. From 11:48 to 12:10, latency for at least 50% of requests was significantly higher from us-east1 and us-central1 to GCS regional buckets in us-east1, us-central1, and europe-west2. From 12:10 to 14:53, access was significantly slower for 5% or more of requests both inside and outside of the us-east1 region. From 11:48 to 12:03, latency was also elevated for europe-west2 to europe-west2 regional bucket access.

us-east1 Metrics

[Figure: requests from us-east1 to us-east1, LightStep 𝑥 PM]

This screenshot is from the LightStep application’s historical view, Streams. Latency is shown at the top, with lines for the 50th, 95th, 99th, and 99.9th percentiles (p50, p95, p99, p99.9). For the p50 line, this means that 50% of requests took more than the displayed time. Similarly, for p95, 5% of requests took more than the displayed time. Below the latency graph, the request rate is shown. For this test, 50 requests are made every minute, which gives the displayed rate of slightly less than 1 request per second (50/60 ≈ 0.83). At the bottom is the error rate percentage, meaning the number of errors divided by the total number of requests.
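To make the three panels concrete, here is a minimal sketch of how percentiles, request rate, and error rate could be computed from one minute of probe results. It is not LightStep's implementation: the `percentile` and `summarize` helpers, the nearest-rank percentile method, and the sample latencies are all assumptions chosen for illustration.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(latencies_ms, error_count, window_seconds):
    """Summarize one window of probe results.

    latencies_ms   -- latency of every request in the window (made-up data here)
    error_count    -- how many of those requests failed
    window_seconds -- length of the window in seconds
    """
    ordered = sorted(latencies_ms)
    total = len(ordered)
    return {
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "p99.9": percentile(ordered, 99.9),
        "rate_per_s": total / window_seconds,
        "error_pct": 100 * error_count / total,
    }


# Hypothetical minute: 50 requests, mostly fast, with a slow tail.
sample = [40] * 45 + [200] * 3 + [900] * 2
print(summarize(sample, error_count=2, window_seconds=60))
```

With only 50 samples per minute, p99 and p99.9 both land on the slowest request in the window, so the higher percentile lines are best read over longer aggregation windows.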

This graph shows requests from a Google Cloud Function in us-east1 to a Google Cloud Storage regional bucket in us-east1. Following the start of the outage, there is a gap of approximately 5 minutes during which no requests succeed. p50 latency recovers relatively quickly, returning to its previous normal value about 22 minutes after the start of the outage. However, p95 latency does not recover until approximately 2 hours and 43 minutes after the p50 recovery.
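The durations quoted here follow directly from the timestamps in the Summary of Findings. As a quick check, the arithmetic can be sketched as below; the `minutes_between` helper and the variable names are illustrative, and the calculation assumes all four timestamps fall on the same day and in the same time zone.

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes between two same-day 'HH:MM' timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps taken from the Summary of Findings.
outage_start = "11:48"
requests_resume = "11:53"
p50_recovered = "12:10"
p95_recovered = "14:53"

gap_minutes = minutes_between(outage_start, requests_resume)
p50_minutes = minutes_between(outage_start, p50_recovered)
p95_hours, p95_minutes = divmod(minutes_between(p50_recovered, p95_recovered), 60)

print(f"gap with no successful requests: {gap_minutes} min")
print(f"p50 recovery after outage start: {p50_minutes} min")
print(f"p95 recovery after p50 recovery: {p95_hours} h {p95_minutes} min")
```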

[Figure: requests from us-east1 to europe-west2, LightStep 𝑥 PM]

This graph shows a similar sequence of events for requests from Google Cloud Functions in us-east1 to a regional bucket in europe-west2. However, it does not show the gap in requests. This suggests, interestingly, that requests to europe-west2 would have been more likely to succeed than same-region requests.

[Figure: requests from us-east1 to us-central1, LightStep 𝑥 PM]

This graph shows requests from us-east1 to us-central1. The recovery in this case is less clear, and there appears to be a further, though less severe, disruption (affecting only p99 and p99.9) at the end of the displayed time window.

us-central1 Metrics

[Figure: requests from us-central1 to us-central1, LightStep 𝑥 PM]

This graph shows same-region request traffic from us-central1 to us-central1. Though the GCP incident states that the disruption was in the east, traffic within the central region was impacted through most of the outage window.

[Figure: requests from us-central1 to europe-west2, LightStep 𝑥 PM]

Traffic to europe-west2 from us-central1 shows the same pattern.

[Figure: requests from us-central1 to us-east1, LightStep 𝑥 PM]

As expected, impact from us-central1 to us-east1 is more severe in terms of peak latencies. The time frame matches the other observations.

europe-west2 Metrics

[Figure: requests from europe-west2 to europe-west2, LightStep 𝑥 PM]

This graph of same-region requests from europe-west2 to europe-west2 shows that latency was disrupted in an unrelated region, for a duration matching the p50 recovery in other regions. From this, we can see that “high levels of network congestion in the eastern USA” also had a much broader impact than just us-east.

Conclusions

Real-time observability of the performance of cloud service APIs is necessary for a timely understanding of the scope and size of an outage’s impact on your organization. Status page updates will often be delayed by tens of minutes and will not include enough detail to be actionable. Reliable, high-resolution graphs of performance enable you to understand impact beyond what is documented on the status page, and to have the hard conversations you need with your cloud providers, with the data to support your case.
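For readers who want to collect this kind of evidence themselves, the following is a minimal sketch of a synthetic probe. It is not the test harness used for the graphs in this post: `fetch_object`, `run_probe`, and the placeholder URL in the usage comment are hypothetical, and a production probe would also need scheduling, per-region deployment, and export of results to a metrics or tracing backend.

```python
import time
import urllib.request


def fetch_object(url, timeout_s=10.0):
    """Download one object over HTTP(S); raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        response.read()


def run_probe(fetch, count):
    """Call fetch() count times, recording each latency in milliseconds
    and counting the calls that raise an exception."""
    latencies_ms = []
    errors = 0
    for _ in range(count):
        started = time.perf_counter()
        try:
            fetch()
        except Exception:
            # Any failure (timeout, connection reset, HTTP error) counts as an error.
            errors += 1
        latencies_ms.append((time.perf_counter() - started) * 1000)
    return latencies_ms, errors


# Usage (placeholder URL; substitute an object you are allowed to read):
#   url = "https://storage.googleapis.com/<bucket>/<object>"
#   latencies_ms, errors = run_probe(lambda: fetch_object(url), count=50)
```

Passing the fetch function in as a parameter keeps the timing and error counting independent of any one storage API, so the same loop can be pointed at other services or exercised with a stub.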