There’s a quote by Grady Booch - “The function of good software is to make the complex appear simple” - that defines a lot of what we strive for at Lightstep. Many observability tools are designed to give you, as an end user, a lot of freedom and power to query your data and dredge insights from it. This model, while popular, has some flaws - it tends to be rather inegalitarian, requiring not only a high level of comfort with the tool, but also a strong understanding of the system being monitored. In contrast, Lightstep is designed to surface interesting insights using statistical analysis and dynamic sampling, providing unparalleled access to the data that matters while also aggregating traces to provide accurate SLIs for your key operations and services. One frequent question I get, however, is “How?”
Lightstep, fundamentally, is built around collecting 100% of trace data, unsampled, from application services and then using that trace data to drive “root cause analysis” workflows. More simply, we look at traces to answer the question of “what’s changed?” between two arbitrary points in time. In this blog, I’ll summarize a presentation we gave at Performance Summit III about how our correlations and dynamic sampling approaches provide insights into the performance and health of our customers' services.
The key visualization that we use in Lightstep isn’t an icicle graph, but a histogram. Averages and quantiles, even P99, have a tendency to hide distinct behaviors in your application. Histograms allow us to not only provide a visual representation of “what’s happening”, but also “what’s changed” by overlaying histograms from different points in time over each other. Within these histograms, however, there exists a potentially unbounded set of subpopulations that you may care about - for example, a service might be scaled horizontally to hundreds of distinct compute nodes, but only some fraction of those nodes could be contributing to the latency of a request. This is the basic idea behind our correlations functionality - determining what attributes of those requests are important, and surfacing that information to you in an understandable way.
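As a toy illustration (the latency values and crude fixed-width buckets below are invented, not Lightstep's actual bucketing), here's how two samples with an identical slowest request can have completely different shapes that a single percentile would never reveal:

```python
from collections import Counter

def text_histogram(latencies_ms, bucket=100):
    """Crude text histogram: group latencies into fixed-width buckets
    and draw one bar of '#' marks per bucket."""
    counts = Counter((v // bucket) * bucket for v in latencies_ms)
    for lo in sorted(counts):
        print(f"{lo:>5}-{lo + bucket - 1:<5} {'#' * counts[lo]}")

# Two made-up samples sharing the same slowest request (500 ms), but one
# is unimodal and the other bimodal (fast cache-hit path + slow path).
unimodal = [480, 450, 470, 460, 455, 465, 475, 490, 485, 500]
bimodal = [10, 15, 12, 11, 14, 460, 470, 455, 480, 500]
text_histogram(unimodal)
print("---")
text_histogram(bimodal)
```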
Fundamentally, this correlation analysis is driven by calculating the Pearson correlation coefficient for attributes that appear on spans within some set, giving us a value between -1 and 1. This gives us a fairly straightforward score that we display to end users, and allows us to overlay histograms that display the relative contribution of a particular attribute to the overall latency distribution. That said, it isn’t all sunshine and roses - Pearson’s performs poorly when there’s a nonlinear relationship between variables, for example. More specifically, skewed data sets can really throw a wrench in the works, and trace sampling all but guarantees skewed data sets. We’ve tested other correlation algorithms, like Spearman’s, and found that there isn’t a “single best” choice - we err, then, on the side of displaying as many of the moving parts as possible, to help guide our users towards the most relevant information.
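Here's a minimal sketch of that score using made-up span data (the tag and latency values are invented for illustration): the attribute's presence on a span is encoded as 0/1 and correlated against latency.

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation coefficient: cov(X, Y) / (sigma_x * sigma_y)."""
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))

# Hypothetical sample: 1 if a span carries a given tag, 0 otherwise,
# paired with that span's latency in milliseconds.
has_tag = [1, 1, 1, 0, 0, 0, 1, 0]
latency_ms = [420, 390, 450, 80, 95, 70, 410, 90]
r = pearson(has_tag, latency_ms)  # close to +1: tag correlates with slowness
```

A score near +1 like this one would surface the tag as a likely contributor to the slow side of the distribution.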
Related to this, then, is sampling. If we’re able to understand what is “interesting” through correlations, then we’re able to use that to inform our sampling approach. In general, we bias towards capturing the tail behavior of a service and extraordinary (read, ‘error’) cases. We have some constraints - our input is a potentially unbounded stream of trace data, but our output needs to be a representative sample of the system state at some arbitrary point in time. We use the concept of ‘ingress operations’ - the entry point to a service, be it an API endpoint, RPC endpoint, etc. - to guide our decisions. In general, we’d like to bias towards the traces that are most likely to be useful in understanding why things are happening at any given time, and we need to make those decisions without coordination. With all of this in mind, we turned to VarOpt sampling. We’re able to use correlation data, hints about the system shape (such as ingress vs. egress operations), and other semantic information available on our spans to guide the sampling decisions while also controlling for the amount of CPU, memory, network, and storage that we’d like to use to process, analyze, and store those traces.
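Proper VarOpt is more involved than fits here, but the flavor of weight-biased, coordination-free stream sampling can be sketched with a simpler weighted reservoir scheme (Efraimidis-Spirakis A-ES, not VarOpt itself); the trace names and importance weights below are invented:

```python
import heapq
import random

def weighted_reservoir(stream, k):
    """Keep a k-item weighted sample of an unbounded stream with no
    central coordination: each item draws a key u**(1/w) for u ~ U(0,1),
    and the k largest keys survive, so heavier items are favored."""
    heap = []  # min-heap of (key, item)
    for item, weight in stream:
        key = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

# Hypothetical importance weights: every 50th trace is an error or tail
# latency trace and gets 10x the weight of a routine trace.
traces = [(f"trace-{i}", 10.0 if i % 50 == 0 else 1.0) for i in range(1000)]
sample = weighted_reservoir(traces, k=20)
```

Each decision looks only at the item's own weight, which is what lets this run independently at many sampling locations.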
If you’d like to get a better understanding, I highly encourage you to watch the video below, as the talk provides several helpful visualizations to illustrate the points made above. In the future, we hope to continue improving our ability to dynamically sample and provide more accurate correlations by building on the work being done in the OpenTelemetry project to provide detailed semantics for trace data, and to use new and emerging technologies to detect interesting events and outliers within trace data through ML and AI. If you’d like to be a part of this work, we’re hiring - join us! Finally, if you want to try this all out for yourself, sign up for a free developer account and start sending us traces today.
Applying Statistics to Root Cause Analysis
Taras Tsugrii (00:04): Last event, we had an amazing talk from Alex, from Lightstep, who was talking about how performance is not a number. And I guess as a continuation of his talk, Karthik, today, is going to present his amazing work on applying statistics to root cause analysis.
Karthik Kumar (00:26): Hi, my name is Karthik Kumar, and I'm here to talk about Applying Statistics to Root-Cause Analysis. A few quick intros: I'm a software engineer and I'm building root cause analysis tools at a company called Lightstep. At Lightstep, our mission is to build simple observability for deep systems, and we're focused mainly on distributed tracing. Our CEO and co-founder created [Dapper 00:00:52], the distributed tracing tool used at Google. So the topics I want to cover today are two common problems with telemetry data, specifically around tracing. The first is around complexity and how we can maximize the insights we can get from tracing data while minimizing the complexity. We want to ensure that the monitoring software isn't more complex than the application itself. I also want to talk about how we can gather interesting data that we or the user cares about while minimizing the cost or the overhead of such high volume data.
Karthik Kumar (01:32): A PM shared this with me recently and I think it's a good quote for this section. It says, "The function of good software is to make the complex appear to be simple." We want to build good tools to simplify the complexity of analyzing software performance. So I think it's pretty clear that as systems get more complex, traces and other sources of telemetry data become more complex. Traces can provide an end to end view of the request path, and analyzing them in aggregate can be really powerful. In a complex multilayered architecture like this one, where there's requests flowing from some mobile or web client through some microservices or some type of backend, having tracing data can provide a good end to end view of the entire request. It can have rich contextual data, but root cause analysis can still be a little difficult and expensive.
Karthik Kumar (02:28): So that's what this talk is about. How can we maximize the insights and minimize complexity? And my hypothesis is that we can use statistics to make this possible. The first topic I want to talk about in this section is around distributions. At last year's Performance Summit, a coworker of mine talked about modeling performance as a shape and not a number. He advocated for using histograms and not just averages or percentiles to measure performance, and the main reason was that using averages and quantiles, even P99, can hide the distinct behaviors in your application. And they aren't helpful for understanding complex systems that have varied behaviors. The other benefit to modeling this as distributions is that we can visually compare changes in performance, which is harder to do when it's a single number you're looking at.
Karthik Kumar (03:17): I want to motivate this example by showing a few common behaviors with latency distributions. So here's three common patterns that we see in software performance and how we can use histograms and distributions to model that behavior. The first one shows a long tail latency: there's a few large requests towards the right side of the histogram, which represents the slowest requests. The second example has some type of either cache hit or some error code path on the left side and some normal path on the right side, which results in this split bimodal distribution.
Karthik Kumar (03:55): The third example could be something around different classes of requests. Some that are expensive, some that are cheaper, and this also appears as a bimodal distribution. If you notice, between the second and the third example, there's not much difference in the P95. So if that's the number you cared about, you probably wouldn't even notice that there's a difference in the behavior of these two distributions. And it really shows the power of how we can immediately get a sense of how the system is behaving.
Karthik Kumar (04:27): So the other thing we can do with this is to compare distributions, and we can do so visually by just overlaying histograms from different time periods. In this example, if you notice the vertical lines over here, the P95s didn't actually change that much, but the blue line represents something that was happening before the deployment and the yellow bars represent after deployment. But because of the bars that we see closer to one second, this probably indicates a regression and warrants a rollback. And we're able to immediately get this insight by just looking at a distribution, which is not possible by looking at P95.
Karthik Kumar (05:06): So that's all I wanted to say about distributions. My main topic here is around correlations and how we can apply some basic statistics to find associations between behaviors and different subpopulations. And the behaviors that we care about are mainly latency and errors, but this could also be something the user cares about and tells us. And the subpopulations are things like tags on spans, or services and operations - basically, properties of your system. And we can actually automatically identify these subpopulations and surface the ones that have a sufficient correlation to a certain type of behavior. And our goal is also to present this in an understandable way.
Karthik Kumar (05:53): So I want to present this as what we did to explore this project, and my guess is that this can be applied to other similar problems that you may be dealing with at your company. This actually started as a Hackathon project. We used something called the Pearson correlation coefficient to find a simple, linear correlation between two variables, X and Y. And it returns numbers between negative one and one, so values close to negative one mean the two variables are strongly negatively correlated, and positive one is the opposite. And values close to zero indicate things that have low correlation. I've written potentially binary up there because we're going to look at some examples where we encode the properties of traces as binary variables for things like the span has this tag, or this span is in some window. And the images that I've picked show real valued variables, but the idea is pretty similar. It's just an implementation detail.
Karthik Kumar (06:49): So I wanted to next share a few examples of how this works in practice. On the statistics side, we're correlating two variables, X and Y. The X variable is spans with a certain tag, some payment-status-succeeded tag, and the Y is the latency that we're seeing in the sample of data. And how this is presented to the user is, like on the right side, there's a payment status succeeded tag that has a 0.72 correlation coefficient. And this indicates that there is some strong correlation between that tag and slowness in our data. And that's represented visually here with something we saw earlier, which is overlaying histograms.
Karthik Kumar (07:37): So the black bars represent spans that have this tag and the blue bars represent all the spans in the sample. And we can take this data and immediately understand that, since there is a positive correlation, we should expect to see the black bars closer to the right side of the histogram. And that is what we see. But there are some occurrences of this tag on the faster side of the histogram also. Similarly, we can look at an example that's negatively correlated with latency. In this case, spans with a certain tag, like HTTP method GET, are negatively correlated with latency. And that appears as a negative correlation, negative 0.8 here. And we can show this to the user by highlighting the left side of the histogram, where the black bars represent the faster spans that have this tag.
Karthik Kumar (08:32): We can also correlate with errors. And in this example, there's a perfect correlation with errors in the sample. So this is a span with the 400 tag for HTTP status code. We don't really look at this in the context of a histogram because we're looking at errors, but in this case there was a perfect positive correlation between errors and this tag. We can also correlate with user specified behavior. So, so far we've looked at correlating against latency or errors, but we can also look at behavior that the user is interested in. In this case, they can select a region of the histogram - it may not be the slow ones, but it may be something between 500 milliseconds and a second and a half - and they want to know what spans appear in this region. So now the Y becomes spans inside the region that they have selected. And in this example, there was a span with a certain service and operation on the critical path that was contributing positively to latency.
Karthik Kumar (09:36): And it actually works. This is the best type of feedback on Twitter. And we've seen our customers use this feature to actually find issues in their software without having to sift through large numbers of traces. But there's definitely pros and cons to this approach, and first I want to show you the actual math behind what's happening. The Pearson coefficient is pretty straightforward. It's basically computing the covariance of two variables, X and Y, and dividing that by the product of their standard deviations. Because of that denominator, the unit of measurement doesn't actually affect the calculation, which is a big benefit here. We can correlate two variables that are measured in different units, and it would still return a valid value.
Karthik Kumar (10:27): It's also simple to understand and implement, and it works well for most cases. Where it doesn't work well is when the data set has a nonlinear association between the variables, and we'll look at some ways that we've dealt with that. And there's also the possibility of type one and type two errors, since the dataset is a sample of the entire population. But our goal wasn't to be 100% confident in the data, and that's why we expose the correlation coefficient to the user - we want to show them a confidence score to guide them towards hypothesis validation.
Karthik Kumar (11:03): So I talked about how skewed data wouldn't work well with the Pearson correlation coefficient. And so I have an example here of some highly skewed data set, and we can use two techniques to make this data easier to interpret and use statistical analysis on. One is the simple log transformation, where we take each data point and take log base 10, and another is to rank the data points on a scale of zero to 100 - so, basically, find the percentile rank of each data point. And this kind of transformation can be helpful in making the data more interpretable and can allow inferences on highly skewed distributions.
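Both transformations are short to write down; here's a sketch on an invented, heavily right-skewed latency sample (the rank helper assumes distinct values for simplicity):

```python
import math

def log_transform(xs):
    """Compress a heavy right tail by taking log base 10 of each point."""
    return [math.log10(x) for x in xs]

def percentile_rank(xs):
    """Replace each point with its rank on a 0-100 scale
    (assumes distinct values)."""
    order = sorted(xs)
    n = len(xs)
    return [100.0 * order.index(x) / (n - 1) for x in xs]

# Made-up latencies (ms) with one extreme outlier dominating the scale.
latencies = [5, 7, 9, 12, 15, 40, 90, 2000]
print(log_transform(latencies))    # outlier shrinks from 2000 to ~3.3
print(percentile_rank(latencies))  # evenly spread ranks, outlier = 100
```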
Karthik Kumar (11:48): We've also looked at correlating on nonlinear data sets through another algorithm. We used Spearman's to experiment with computing correlations by finding a monotonic relationship between the ranks of two variables. And it actually works well when there's a nonlinear distribution and it's more resistant to outliers, but our testing didn't show much difference between the results of the two algorithms. So we haven't actually implemented this, but I wanted to share this as something that might work for your use case. So with Spearman's, it looks at a distribution of data and it tries to find a correlation with any monotonically increasing function. So in this case, Spearman's returns one, but Pearson doesn't return a perfect correlation here, even though it seems like as X increases, Y also increases.
Karthik Kumar (12:41): We see similar results without outliers - if there's no real correlation in the data, both Spearman and Pearson return similar coefficients. But Spearman's does perform well when there are outliers: Spearman's correlation looks at the data that are grouped together, tries to find a correlation, and is more resistant to those points on the right. We can also correlate with more properties. I've shared a few so far, like spans with a tag, tags on a span, or certain services and operations. But we can correlate against more complex things. Since a subpopulation that we're identifying is just some feature of a trace, we can correlate on certain call patterns, like serial or scatter-gather. We can correlate on logs within spans, or the existence of certain spans up and down the trace - so it's really a platform to build other types of analysis on as we think of them.
Karthik Kumar (13:37): So a few takeaways from this section: we talked about how tracing data is noisy and complex, and using histograms to model performance is really helpful in understanding visually what's happening in your system. And we can use simple statistical analysis to expose patterns in your data. It can guide hypothesis validation, and we can optimize the process of finding the root cause by not having people manually sift through traces one by one, finding which tag appears on which trace and whether that's correlated with anything. So I hope I've convinced you now that you can do some cool analysis with traces, and you might be wondering at what cost. That's the next topic I want to talk about: data quantity.
Karthik Kumar (14:23): We obviously have a lot of trace data and as our systems get more complex, traces get more complex. So we want to figure out what is the best data to capture. And so the topics here I want to cover are around bias and sampling. So the first question to ask is what data does the user even care about? So we have a fire hose of traces coming in, and we want to identify the traces that are relevant to the user. And so our goal should be to focus our sampling budget on the interesting traces. So we define anything that the user cares about as interesting, and these could be real time queries that they're running currently, or queries they've saved in the past and said, "This is something I want to track the performance or behavior of."
Karthik Kumar (15:07): But we also use another technique. The first, looking at just the items the user told us to care about, isn't actually sufficient. We need something that gets a constant source of data for each service that they've instrumented. So we introduced the idea of using ingress operations to identify the entry point of the service, and we gather traces and other statistics for those ingress operations. And you might be wondering why we look at ingress operations. Well, that's usually what's used to report SLAs, so those are the operations that people care about for their service. And we can detect what an ingress operation is by looking at service boundaries and explicit tags that the user can set, saying this is an ingress operation.
Karthik Kumar (15:53): I also want to motivate why we want to bias the data. Bias usually sounds like a negative term, but in this case, we want to be guiding users to the root causes. And we've already shown that it's possible to automatically identify the subpopulations that might be interesting to them. So our goal with sampling is to capture enough traces that exhibit as many different, interesting subpopulations as possible. So if we have a bunch of post-sampling data that just looks like normal behavior, this isn't useful to the user. We want to bias our sampling towards the traces that have abnormal behavior.
Karthik Kumar (16:31): A quick aside here: the tracing architecture at Lightstep looks something like this. It's pretty similar to other platforms and other systems. We have a trace library that's attached to microservices, which are usually the most common source of data, but there's others. And that data gets sent to satellites, which are basically collectors, and our backend SaaS does queries back and forth to the satellites [inaudible 00:16:56] data. And there's three levels of sampling that we can do: at the trace library itself, at the collector, and at the SaaS. And I'm going to focus mainly on the SaaS in this talk.
Karthik Kumar (17:07): So we talked about why we would want to bias the sampling earlier, and the next natural question is, how do we bias the sampling? We want to bias towards error traces and traces across the latency region. Usually when you hear the word sampling, you might think about uniform sampling, which is to randomly pick a point somewhere along the distribution. But we want to bias towards capturing the tail behavior, and we want to make it so that, in a distribution that's skewed like this one, we're equally likely to pick a trace from any point in the distribution.
Karthik Kumar (17:43): And we have a few requirements at the SaaS around sampling. We have an input of some stream of traces of unknown length, and we want to output a representative sample, and we want to use that sample later to answer some questions about the original population. We don't know the questions we'll ask yet, but these could be any type of questions from any type of features that we'd want to build off of this data. We want to make efficient sampling decisions - we don't want to be spending too much time actually flipping the coin to decide whether or not to keep a trace - and we want to make sure this works in a distributed setting. So there's no centralized coordination between different sampling locations. We want to make sure each decision is made on a local basis.
Karthik Kumar (18:25): And we actually found a solution in academia. There's a paper from 2010 called VarOpt Sampling that introduces an online reservoir sampling scheme. So reservoir sampling is the technique where you keep a reservoir of representative items - some number smaller than the entire population - and you use those items to later answer questions about the entire population, which is exactly what we want to do. We can give each input item a certain weight, which we call importance, and this is something we can define ourselves. So this is where we introduce the bias. We want to add more importance to things that have latency, or errors, or version changes, or even any other product feature that we think of in the future. And VarOpt ensures that there's a cap on the items with the highest weights.
Karthik Kumar (19:17): It also produces an adjusted weight that's different from the input weight for each sampled item. So this basically answers the question, how many other traces like this one were present in the population? And VarOpt meets our requirements: it minimizes the average variance over the subsets, and we care about subsets here - this might not be intuitive, but we care about the subsets because subset weight sums can be used to answer the quantile queries that we care about, like, what percentile does this trace belong to in the population? It makes sampling decisions very efficiently, in O of log K time, where K is the number of items we care about keeping. And it works in a distributed setting.
Karthik Kumar (19:59): So I want to present how this works in a visual way. We have a bunch of traces being input into our SaaS - these are traces for ingress operations - and we assign some importance to bias towards the interesting traces. And again, interesting here could be covering the entire latency region, errors, traces from higher traffic satellites, and so on. Those traces go into VarOpt, and VarOpt samples some K number of those items, and it guarantees that it will minimize the average variance of any arbitrary subsets that we want to draw from that data. And it does so in N times log K time. And we can take those traces output from VarOpt and store them and analyze them at some point later. And we can take subsets of those traces and use the adjusted weights to calculate some quantile measurements. So in this case, the adjusted weight of 33 in this first example means that there were 33 other traces that were similar to this one.
Karthik Kumar (21:00): And we can use those numbers to go back to calculating accurate percentiles of the whole population. There's a letter K here that is how many items we want to sample, and productionizing this means deciding how many traces to actually sample. I'm not going to go into too much detail about this, but K just indicates how big we want our reservoir to be - how much money do we want to spend on CPU, memory, network, and storage? And the way we solve this is to look at a rolling window of recent behavior that we're getting from this population of traces, for this ingress operation. So when a certain operation sees a spike in latency or errors or throughput, then we increase our K to appropriately fill up our reservoir with more traces. We increase the size of our reservoir based on the behavior we're seeing.
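As a simplified illustration of how adjusted weights could feed a percentile estimate (this is not Lightstep's implementation, and the sample values are invented): treat each kept trace as standing in for adjusted-weight originals, then walk the cumulative weights.

```python
def weighted_percentile(samples, q):
    """Estimate the q-th percentile of the original population from a
    sample of (latency, adjusted_weight) pairs, where each adjusted
    weight answers "how many traces did this one stand in for?"."""
    ordered = sorted(samples)
    total = sum(w for _, w in ordered)
    threshold = total * q / 100.0
    running = 0.0
    for latency, weight in ordered:
        running += weight
        if running >= threshold:
            return latency
    return ordered[-1][0]

# Hypothetical reservoir: four kept traces standing in for 100 originals.
sample = [(12, 33), (45, 40), (180, 20), (950, 7)]
print(weighted_percentile(sample, 50))  # 45: the bulk of traffic is fast
print(weighted_percentile(sample, 95))  # 950: the rare slow trace owns the tail
```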
Karthik Kumar (21:57): So a few takeaways from the section: tracing data is very data intensive, but it's not all worth analyzing. There's obviously noise in the data, and we have several opportunities to sample, each with different constraints and requirements, but we want to be on the SaaS side, on the backend. We want to bias towards storing and analyzing the interesting traces. That's where we want to spend our budget. And we should be flexible in defining what it means for a trace to be interesting, and so what I presented is a solution that worked for us, which was VarOpt sampling.
Karthik Kumar (22:30): So in summary, I talked about data complexity and minimizing the complexity of this data and maximizing the insights we can get through distributions and correlations. And I talked about tracing data quantity and minimizing the costs and the overhead of this data, while maximizing the relevancy and keeping data that we consider to be interesting through biasing and sampling methods. And that's it. Any questions?
Taras Tsugrii (23:00): Thank you so much, Karthik, you gave us an amazing talk as always. Super informative. It looks like we currently don't have ... I'm going to just go ahead and ask the one that I'm very much interested in. So most of this is with regards to monitoring distributed systems and microservices; I'm wondering if you have any experience with using this for observing performance of mobile applications.
Karthik Kumar (23:38): Yeah, you're right, this is targeted towards more server-side monitoring, but we have customers of Lightstep who use tracers on their mobile devices to send traces on what's happening in the mobile-to-backend communication. And they can use pretty much all of the tools I talked about to analyze that behavior. I don't think we have anything specifically geared towards mobile, but the same ideas should still be relevant.
Taras Tsugrii (24:10): I see. And I assume that you're using OpenTelemetry for instrumenting.
Karthik Kumar (24:16): Yeah.
Taras Tsugrii (24:18): That's why it should be more or less uniform.
Karthik Kumar (24:21): Right.
Taras Tsugrii (24:22): Thank you for working on the OpenTelemetry project and being such an active contributor. I mean, Lightstep is contributing a lot and I really hope to see the entire industry converging on the same standard; that would definitely simplify life for everyone.
Karthik Kumar (24:41): Yeah. Yeah. That's definitely a big priority, we have an entire team dedicated towards contributing to OpenTelemetry.
Taras Tsugrii (24:49): Yeah, that's absolutely awesome. Let's see if there are any questions. [crosstalk 00:25:03] Yeah, yeah, so there is one. So as I understand this, VarOpt sampling helps in analyzing long tails, is that correct?
Karthik Kumar (25:13): So VarOpt sampling helps us decide which traces to keep; it doesn't do the analysis I talked about earlier. That's a separate process that happens after. Basically, VarOpt helps us take this huge stream of data coming from satellites, and we can input our bias on what we consider to be relevant for our later analysis. And VarOpt helps us do that in a way that minimizes the variance - the VarOpt name comes from optimizing variance - and it provides us a way to take this large number of traces, give each one some importance that we decide, and then it tells us how many traces each kept one represents in the entire population. So then we store that, and at some later point, when a user, a performance engineer, is interested in analyzing performance, they can use the traces that we've assembled, that VarOpt decided were relevant to keep. So it's not necessarily analyzing long tails - it's in support of that - but it's more to condense down the amount of data that we need in order to do the same type of analysis I talked about.
Taras Tsugrii (26:27): Awesome. Thank you. There's no more questions, looks like.
Karthik Kumar (26:48): I see one in Slack.
Taras Tsugrii (26:51): Yep, yep. Yeah, just go ahead, I guess.
Karthik Kumar (26:57): Yeah, so the question is, "You talked about tagging traces with features, for example based on the presence of a given event. How long are these traces, and are they gathered with respect to a trigger specific to the latency metric you're optimizing? Wondering how you'd go about engineering more complex features." Yeah, so in terms of the features, we respect tags that appear on the span - things like deployment version or host name or any type of tag that a developer considers important for their application. And these traces can be pretty long. They could get up to, I think, thousands of spans long and several minutes in duration.
Karthik Kumar (27:34): And they're gathered with a bias that we input - this is the VarOpt topic that I spoke about - and they're not necessarily gathered with the correlation that we plan to do in mind. They're gathered with the intention of doing some type of analysis. And then later we can run the correlations engine to use the traces that we've gathered to pick out when a certain tag occurs with [inaudible 00:28:01]. So the engineering of more complex features comes in on the analysis side, where we can think of more ways to correlate behaviors with subpopulations and use the traces that we've already gathered. I hope I answered that, George.
Karthik Kumar (28:22): And Matt had a question: "I didn't quite follow how VarOpt makes decisions without coordination." So the main point that I intended to convey was just that VarOpt doesn't need any type of centralized coordination. It's basically just a piece of code that we've written to say, "Given a trace, how valuable is this?" That's the only input that we're giving it, and this doesn't need any type of coordination. It's mainly looking at aspects of the trace itself - a feature of the trace, how long was it, was it an error trace, and so on. So there's no centralized coordinator that's managing this; it's making these decisions locally, looking at the properties of a trace and assigning it some weight. And it takes a bunch of these in batches and in a streaming fashion, and when it outputs a number of how many traces to keep, we can use that number based on our constraints of how many traces we'd actually want to keep. So the only input we're giving it is just how much weight to give to this trace, so there's no need for any coordination.
Karthik Kumar (29:34): And Rico asked, "I'm a big advocate of distribution based diagnosis. Do you have a sense of how difficult or easy it was for people to get used to seeing these two color distribution overlays? Did they find it natural? What do you do to report up to execs when there are many statistics?" Yeah, this is actually a big concern we have, or a big area of improvement. Distributions aren't inherently understandable, I think. When developers look at charts, they immediately think of it as a time series. And the charts I showed in my presentation are actually on a log/log axis, which also is kind of confusing to users. The thing that we've tried to do is to simplify what is actually being presented and not overwhelm the user with numbers that may or may not be relevant when you're just communicating a behavior.
Karthik Kumar (30:25): And the correlation coefficient actually was also confusing to users, so we went to a different way of showing a range of good to bad, as in green, yellow, red. So these are some ways we've tried to simplify this. And the overlays themselves have also been a little confusing, especially since some of them are lines and some of them are bars, so it's an active area of development, I would say. We definitely haven't figured it out, but I think that it's more of a UX challenge at this point. Since we already have the data, it's about how we can present it in the way that makes the most sense. It's definitely a challenge.
Karthik Kumar (31:08): Oh, I see one in BlueJeans: "Can you talk about the different factors contributing to bias and importance? What makes traces interesting?" So I mentioned this a little bit. The things we consider to make a trace interesting are ... we basically bucket the trace into certain fixed sized buckets of a histogram, and things that appear towards the right side of the histogram - things that are slower - we give more importance. Things that are errors we give a certain importance, and things that are coming from high traffic satellites we give more importance.
Karthik Kumar (31:51): And this is another area that we're iterating on. We recently added the ability to monitor deployments, so when things have certain version markers, or when we detect a change in version markers, we give that more importance. So this is where we can inject any business knowledge that we gain from looking at customer data, and any features that we might want to build. And that's what we can use to decide what is interesting and what we want to keep.
Taras Tsugrii (32:24): Awesome, thank you so much Karthik. It was super informative. I think everyone thoroughly enjoyed your presentation.
Interested in joining our team? See our open positions here.
September 23, 2020
About the author
Austin Parker