Using a Mystery Shopper: Discovering Service Interruptions in Monitoring Systems
Many retail stores use mystery shoppers to assess the quality of their customer-facing operations. Mystery shoppers are employees or contractors that pretend to be ordinary shoppers, but ask specific questions or make specific complains and then report on their experiences. These undercover shoppers can act as a powerful tool: not only do organizations get information on their employees’ reactions, they don’t need to depend on ordinary shoppers to ask the right questions.
At Lightstep, we faced a similar problem: we wanted to continuously assess how well our service is monitoring our customers’ applications and to identify cases where they are failing to meet their SLAs (or more properly, their SLOs). However, being an optimistic bunch, we don’t want to rely on our customers applications to continuously fail to meet their SLAs. :) We needed another way to test whether or not Lightstep was noticing when things were going wrong.
To provide some context, Lightstep is a reliability monitoring tool that builds and analyzes traces of distributed or highly concurrent software applications. (A trace tells the story of a single request's journey from mobile app or browser, through all the backend servers, and then back.) As a monitoring service, it's critical that we carefully track our own service levels. Part of our solution is what we call the Sentinel. From the point of view of the rest of Lightstep, the Sentinel looks just like any other customer. Unlike our real customers, however, the Sentinel’s behavior is predictable, and it is designed to trigger specific behaviors in our service. (We named it the “Sentinel” both because it helps keep watch on our service, but also because it creates traces with the intention of finding them later, and so it’s similar in spirit to a sentinel value.)
To understand what the Sentinel does, you’ll first need a crash course on Lightstep: as part of tracing, every component in an application (including mobile apps, browsers, and backend servers) records the durations of important operations along with a log of any important events that occurred during those operations. It then packages this information up as a set of spans and sends it all to Lightstep. There, each trace is assembled by taking all the spans for an end-user request and building a graph that shows the causal relationships between those spans. Of course, assembling every trace would be expensive, so choosing the right set of traces to assemble is an important part of the value that Lightstep provides.
In designing the Sentinel, we first identified two important features of Lightstep: assembling traces based on request latency and alerting our customers when the rate of errors in their applications exceeds a predetermined threshold. To exercise these features, the Sentinel generates two streams of data. The first is a kind of background or ambient signal: a set of spans that represent ordinary, day-to-day application requests. We ensure that the latencies of these spans test the limits of our high-fidelity latency histograms, and, most importantly, we check that the number and content of the assembled traces matches our expectations.
The second stream of spans represents a set of applications errors. This stream periodically starts and stops, and each batch of errors exceeds the SLA threshold and causes an alert to trigger. Moments later, after the batch ends, the alert becomes inactive. On and off, on and off, all day long, these spans trigger alerts, and we verify each one.
The Sentinel has helped us discover incidents that other monitoring tools haven’t and avoids spurious alerts that might have been caused solely by changes in our customers’ behavior. We’ve found the Sentinel to be a particularly powerful technique when used in combination with a load test. While the load test simulates an unruly customer, the Sentinel acts as a well-behaved one. Using them together means that we can verify that one doesn’t interfere with the other.
Why not just use a health-checking tool like Pingdom? Of course, we use tools like those as well, but we’ve found that the Sentinel enables us to test more complicated interactions than off-the-shelf health-checking tools. Assembling traces from complex applications can be… well, complex, since spans from even a single trace can come from different parts of an application and may arrive out of order. No single span has the complete picture of what’s happening: in fact, the point of assembling a trace is to show the complete picture! Another way of saying this is that the correctness condition for trace assembly is defined globally: only by considering many different API requests (and their results) can we say whether or not a trace was assembled correctly.
Isn’t this all just an integration test? In a way, yes, but we see integration testing as a way of validating that our code works, while our online monitoring, including the Sentinel, ensures that our service continues to work. We explicitly decided that we wouldn’t try to use the Sentinel to cover all of Lightstep’s features. While coverage is important for integration testing, we wanted the Sentinel just to test the most important features and components of Lightstep and to test them continuously. Picking a subset of features helps us keep the Sentinel simpler and more robust.
The Sentinel acts a mystery shopper, letting us carefully control the input to Lightstep and validate the results. You might find a similar technique is valuable, especially if the behavior of your service can't be tested with a single API request and where there are complex interactions between requests, including time dependence or the potential for interference with other systems.
For example, if you have a product that includes some form of user notification, you might want to test the following sequence:
- Set up a notification rule
- Send a request that triggers the rule
- Check that the notification is sent
Continuously exercising this sequence can give you confidence that your service is up and running. Just don’t forget to remove the notification rule so that it can be tested again!
As in the case of any testing or monitoring, think about what matters to your users. What features do they depend on most? Just as a retail store manager can hire a mystery shopper to ask the right questions, you should use monitoring tools to verify that your most important features are working to spec.