Know What’s Normal: Historical Layers

In April, we released real-time latency histograms and explained why performance is a shape, not a number. These latency histograms change the game when it comes to characterizing performance and identifying worrisome behaviors, but those behaviors immediately raise new questions: “Is this normal? Did something change? If so, when?”

Today, we’re announcing historical layers for LightStep [x]PM Live View. Historical layers allow you to compare the up-to-the-second latency histogram against the performance shape from an hour, a day, and a week ago. When you interactively filter the latency histogram to restrict its focus to a specific service, operation, and/or collection of tags, the historical layers will also reflect the specified query criteria. Now with just a glance, you can determine whether performance behavior has improved or degraded for any aspect of your application (and when that change occurred).

LightStep Adds Historical Layers to [x]PM for Performance Management
Historical layers quickly show when performance has improved or degraded for any aspect of your application

We chose the different time intervals deliberately to cover a wide range of scenarios and to account for common cyclical performance variations. This new capability is designed for firefighting time-sensitive issues, investigating latency spikes to isolate root causes, and validating whether application changes are producing the expected outcomes over time. Historical layers make it easy to spot even the most subtle and harmful performance regressions.
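As an illustrative sketch (not [x]PM's actual implementation), comparing a current histogram against a historical layer can be thought of as overlaying logarithmic latency bucket counts from two time windows; all sample data below is invented:

```python
from collections import Counter

# Hypothetical sketch: bucket latencies (in ms) into power-of-two bins so
# the "shape" of two time windows can be compared at a glance.
def histogram(latencies_ms):
    # For a positive integer n, n.bit_length() - 1 equals floor(log2(n)).
    return Counter(ms.bit_length() - 1 for ms in latencies_ms if ms > 0)

now = [12, 14, 15, 240, 17, 13]        # current window (invented data)
hour_ago = [12, 13, 15, 14, 17, 13]    # same query, one hour earlier

current, baseline = histogram(now), histogram(hour_ago)
# Buckets occupied now but not an hour ago flag a new latency mode.
new_modes = set(current) - set(baseline)
print(sorted(new_modes))  # → [7] (the ~240 ms outlier)
```

A real implementation would use finer-grained buckets and percentile statistics, but the idea is the same: comparing bucket occupancy between windows reveals when a new mode of slow requests appeared.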

LightStep [x]PM captures and stores the data and statistics required to produce these historical layers through its unique Satellites. In contrast to traditional time-series data, this information is available automatically and doesn’t require any additional configuration or preparation. You can also filter using high-cardinality tags – so you can view this level of detail for virtually any aspect of your application: from specific services or product versions to individual customers.

We’re extremely proud of these new capabilities and encouraged by the enthusiastic feedback we’ve received from our early beta customers, but we’re certainly not stopping here! In the coming months, we’ll be delivering more unique capabilities that will complement historical layers to make high-fidelity performance management and monitoring even more intuitive and insightful for our customers.

Are you adopting microservices? Contact us to learn more and see exactly how LightStep [x]PM works.

LightStep and OpsGenie Partner to Improve Application Performance and Incident Management

Microservices-based architectures enable software teams to deliver innovations and value to their customers faster. Microservices are often owned by individual engineering teams that are solely responsible for everything from development to deployment. This autonomy reduces cross-team dependencies, but it also often means each development team is solely accountable for the ongoing performance of their own services in production. Using LightStep [x]PM and integrated solutions such as OpsGenie, a leading incident management platform, teams are proactively alerted when potential SLA violations or latency issues occur, and can see the associated end-to-end traces to pinpoint root causes quickly.

LightStep [x]PM is unique because it analyzes completely unsampled trace data and is able to segment this information by extremely high-cardinality key:value tags, such as customer IDs or build numbers. This means [x]PM captures every performance anomaly or failure, no matter how brief or rare the occurrences are. [x]PM is the ideal solution for companies with microservices-based applications, because users can isolate real-time and historical performance data along any dimension and uncover root causes even for complex transactions spanning service boundaries – letting teams focus on the issues they’re responsible for.

LightStep and OpsGenie Partner to Improve Application Performance and Incident Management
LightStep [x]PM with OpsGenie alerts the right people based on on-call schedules when SLA violations or latency issues occur.

Useful information is meaningful only when users receive it when and where they need it. [x]PM has been integrated with complementary DevOps solutions so teams can access their performance data within their preferred, existing workflows. Our customers asked us to integrate with the OpsGenie incident management platform for operating always-on services. When a Service-Level Agreement (SLA) condition is violated or resolved, [x]PM sends a JSON notification to OpsGenie, which automatically creates a custom alert, notifies the right people based on on-call schedules – via email, text message (SMS), phone call, or iOS and Android push notification – and escalates the alert until it is acknowledged or closed.
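To make the flow concrete, here is a hypothetical sketch of the kind of JSON body such a notification might carry. The field names (`message`, `alias`, `details`, `priority`) follow OpsGenie's general alert conventions, but this is not the actual [x]PM or OpsGenie schema:

```python
import json

# Illustrative only: build an OpsGenie-style alert body for a condition
# state change. Field names are assumptions, not the real [x]PM payload.
def build_alert_payload(condition, status, value, threshold):
    return {
        "message": f"[x]PM condition '{condition}' is {status}",
        "alias": condition,  # a stable alias lets OpsGenie deduplicate repeats
        "details": {
            "observed_value": value,
            "threshold": threshold,
        },
        "priority": "P1" if status == "violated" else "P5",
    }

payload = build_alert_payload("checkout p99 latency", "violated", 2.4, 1.0)
print(json.dumps(payload, indent=2))
```

In practice the payload would be POSTed to an OpsGenie integration endpoint, which then handles routing and escalation according to on-call schedules.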

We’re excited about the value of this new integration for our customers. We’ll continue to enhance [x]PM to work well with popular tools and DevOps best practices for adopting, developing, and maintaining microservices-based applications.

Try it out and share your feedback at support@lightstep.com, and let us know what other integrations you’d like to see.

A First in APM, Only Possible with LightStep [x]PM

LightStep [x]PM analyzes 100% of unsampled transaction data for all of our customers’ production systems. This wealth of information can be used to quickly explain root causes and resolve time-sensitive issues like errors and latency spikes.

We’re releasing a set of new features in Live View (formerly Latest Traces) that makes it even easier to discover and categorize the different latency behaviors in any application, and to pinpoint within seconds exactly what’s slow for those categories.

Live latency histogram

This new chart shows the global latency distribution for the full set of spans, or timed operations, that are currently in memory across every [x]PM Satellite within your environment.

Below the histogram, [x]PM shows some example spans from the same dataset to give you a better idea of the transactions currently going through the application. You can use the search box to limit the scope of the histogram and the example spans below to a specific Service, Operation, and/or custom Tags. You can also click and drag along the latency axis of the histogram chart to further narrow the set of example spans to include only those in a specific latency range. These powerful features make it possible to flexibly examine the real-time latency characteristics for an application in its entirety, through monoliths and microservices, or filter by any dimension, no matter how focused or broad.
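The filtering behavior described above can be sketched in a few lines. The span shape and field names here are illustrative assumptions, not the [x]PM data model:

```python
# Invented example spans; a real system would hold these in Satellite memory.
spans = [
    {"service": "api", "operation": "checkout",
     "tags": {"customer": "acme"}, "latency_ms": 420},
    {"service": "api", "operation": "checkout",
     "tags": {"customer": "globex"}, "latency_ms": 35},
    {"service": "auth", "operation": "login",
     "tags": {"customer": "acme"}, "latency_ms": 12},
]

def filter_spans(spans, service=None, operation=None, tags=None, latency_range=None):
    """Keep spans matching every specified criterion (None means 'any')."""
    lo, hi = latency_range or (0, float("inf"))
    return [
        s for s in spans
        if (service is None or s["service"] == service)
        and (operation is None or s["operation"] == operation)
        and all(s["tags"].get(k) == v for k, v in (tags or {}).items())
        and lo <= s["latency_ms"] <= hi
    ]

# Narrow to slow spans for one customer, mirroring a click-and-drag
# selection on the histogram's latency axis.
slow_acme = filter_spans(spans, tags={"customer": "acme"},
                         latency_range=(100, float("inf")))
print(len(slow_acme))  # → 1
```

Every criterion composes with the others, which is what lets a query be as broad (one service) or as narrow (one customer's slow checkouts) as needed.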

LightStep [x]PM Filter Histogram
Flexibly examine real-time latency data to categorize system-wide behaviors and drill down to pinpoint specific problem areas.

Sub-trace preview

[x]PM now also shows key information about the sub-trace for each example span shown in Live View. A sub-trace is the portion of an overall end-to-end trace that consists of the displayed span itself and its descendant spans. The Sub-trace Summary column now shows the total number of spans in each sub-trace. Hovering over the rows in this column will show the most time-consuming operations within that particular sub-trace. Each span’s duration is also now depicted visually as a horizontal bar that varies in length in relation to the other example spans shown, making it much easier to spot outliers and anomalies. These enhancements are designed to help users quickly find spans that are abnormally slow, spans that contain large sub-traces, and spans that may be behaving unusually in other ways.

LightStep [x]PM Latency Breakdown
Sub-trace preview lets you quickly spot outlier spans that are abnormally slow and the operations that are causing the slowdowns.

Search enhancements

Lastly, we made some usability improvements to the search box in Live View. There is now a guided search builder that more clearly categorizes search terms by Service, Operation, and Tags. There is no practical limit to the number of search terms that you can include. When adding a new search term, you can scan the drop-down for a list of suggestions or begin typing to quickly find what you’re looking for. Past searches are saved locally and can be viewed by clicking on the Recent Searches tab.

LightStep [x]PM Recent Searches
Improved search functionality makes it easy to quickly find what you’re looking for.

Learn more

Register for our upcoming webinar, LightStep [x]PM Latency Histogram, airing Wednesday, April 25th at 9:00 am PT, 12:00 pm ET, and see a live demo of these powerful new features.

We’re thrilled to introduce these core features to [x]PM, and believe these new capabilities will dramatically transform how you approach performance management and help you discover and fix reliability issues more quickly and efficiently. We hope you’re just as excited as we are!

LightStep [x]PM Releases Role-based Access Control

We’re excited to release role-based access control for LightStep [x]PM, with support for three different levels of permissions.

Administrators have the highest level of access, with organization management privileges that include creating, editing, and deleting users, projects, and roles.

Members have full access to [x]PM’s core features, which include viewing end-to-end traces and customizing dashboards, alerts, and Streams, but they do not have organization management privileges.

Viewers have read-only access to [x]PM’s monitoring capabilities – perfect for onboarding new users to [x]PM without disrupting existing workflows.

For more detailed information, visit our documentation page and log in.

Role-based access control (RBAC) vs attribute-based access control (ABAC)

When architecting access control for [x]PM, we evaluated existing industry-wide access control standards. The two frameworks we focused on were (1) Role-based Access Control (RBAC) and (2) Attribute-based Access Control (ABAC).

With RBAC, roles are created, sets of permissions for resources are assigned to each role, and users are granted one or more roles to receive access to those resources. For example, a user with the role “Member” might map to a set of permissions such as creating alerts and editing dashboards. ABAC, on the other hand, grants access to resources via attribute-based policies. ABAC policies are rules that evaluate access based on four sets of attributes: subject attributes, resource attributes, action attributes, and environment attributes. For example, you could have a policy like “all users in engineering in San Francisco should have edit access to all dashboards.” Here, the subject attributes are {“group”: “engineering”, “region”: “San Francisco”}, the action attribute is {“action”: “edit”}, and the resource attribute is {“type”: “dashboard”}.
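The contrast between the two models can be sketched directly. The role names, permissions, and policy below are hypothetical, chosen only to mirror the examples in the text:

```python
# RBAC: role -> set of permissions; a user's access is the union over roles.
ROLE_PERMISSIONS = {
    "member": {"create_alert", "edit_dashboard"},
    "viewer": {"view_dashboard"},
}

def rbac_allows(user_roles, permission):
    """RBAC check: is the permission granted by any of the user's roles?"""
    return any(permission in ROLE_PERMISSIONS[r] for r in user_roles)

# ABAC: a policy is a predicate over subject/action/resource attributes
# (environment attributes omitted for brevity).
def abac_policy(subject, action, resource):
    """'All users in engineering in San Francisco may edit any dashboard.'"""
    return (subject.get("group") == "engineering"
            and subject.get("region") == "San Francisco"
            and action == "edit"
            and resource.get("type") == "dashboard")

print(rbac_allows({"member"}, "edit_dashboard"))  # → True
print(abac_policy({"group": "engineering", "region": "San Francisco"},
                  "edit", {"type": "dashboard"}))  # → True
```

Note the structural difference: the RBAC check is a set lookup against a small, enumerable table, while the ABAC check is arbitrary predicate evaluation – which is precisely why auditing "who can do X?" is harder under ABAC.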

We decided RBAC was a better fit for [x]PM for the following reasons. First, the primary advantage of ABAC is its support for extremely fine-grained permissioning, which is overkill for our product: we have no need to control access based on specific user attributes (e.g., only employees in the HR department who are located in New York can access this dashboard). In addition, ABAC makes it extremely difficult, if not impossible, to determine the permissions available to a particular user, whereas with RBAC it’s much easier to audit both the set of users that have a given permission and the list of permissions granted to a particular user. There is also less overall complexity with RBAC, because roles can be described by their names and are easier to reason about than policies. Lastly, the performance of authorization queries is better with RBAC – assuming the number of roles remains reasonable – because we don’t have to look up the data required to evaluate a policy from multiple sources.

The major risk we found with RBAC was potential “role explosion”, which occurs when a huge number of roles are created to accomplish fine-grained authorization, resulting in slow lookup times and higher overall complexity. This risk is mitigated in our case, because currently our product only requires coarser-grained access control. In the end, the benefit of increased granularity provided by ABAC didn’t outweigh the cost of its increased complexity.

Mapping roles to permissions

Once we decided on using RBAC, the next question was how to map the list of available roles to authorization around our API access layer. One option was to assign API endpoints directly to each role, but this option was quickly ruled out because it would have left no room for us to extend our access control or API layer in the future. Tightly coupling the two would mean that changes to either directly impact the other – flexibility we did not want to lose.

We ended up going with a flexible permission model that introduces the concept of a “permission action”, which is an abstract, logical grouping of API endpoints. These permission actions are first-class citizens in our software architecture – all authorization systems communicate using them. Each API endpoint belongs to a permission action, and each role is assigned one or more permission actions.

[x]PM role-based access control data model

The diagram above illustrates our RBAC data model where users have a many-to-many relationship with roles, and roles have a many-to-many relationship with permissions. Note that when performing an authorization check, we also confirm access to the API endpoint is permitted by checking the union of permission actions mapped to that user’s roles. This permissions model gives us the flexibility to support custom roles and extend our API layer in the future.
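A minimal sketch of the authorization check described above, using invented users, roles, and endpoints (the real tables live in our database, not in dictionaries):

```python
# user -> roles (many-to-many)
USER_ROLES = {"alice": {"admin"}, "bob": {"member", "viewer"}}
# role -> permission actions (many-to-many)
ROLE_ACTIONS = {
    "admin": {"manage_org", "edit", "view"},
    "member": {"edit", "view"},
    "viewer": {"view"},
}
# Each API endpoint belongs to exactly one permission action.
ENDPOINT_ACTION = {
    "POST /api/dashboards": "edit",
    "GET /api/dashboards": "view",
    "DELETE /api/users": "manage_org",
}

def authorize(user, endpoint):
    """Allow the call if the endpoint's permission action is in the union
    of the actions granted by any of the user's roles."""
    granted = set().union(
        *(ROLE_ACTIONS[r] for r in USER_ROLES.get(user, set())))
    return ENDPOINT_ACTION[endpoint] in granted

print(authorize("bob", "POST /api/dashboards"))  # → True
print(authorize("bob", "DELETE /api/users"))     # → False
```

The permission-action indirection is what keeps the model extensible: adding a new endpoint only means assigning it to an existing action, and adding a custom role only means choosing a set of actions.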

Try it out and share your feedback with us at support@lightstep.com.

Introducing the LightStep Plugin for Grafana

At LightStep, we’re always looking for ways to make it easier for our customers to monitor and analyze their application performance data in ways that fit seamlessly with their preferred workflows. This includes creating integration points to their DevOps environments.

We’re excited to release the LightStep plugin for Grafana today. Grafana, a popular open source analytics platform, consolidates metrics from different applications and databases into a single dashboard, so teams can more easily collaborate on cross-functional projects. With this new plugin, you can create time-series graphs from LightStep [x]PM’s unique tracing data and insights.

The plugin is open source and available now on GitHub.

Easy to install, our best features integrated into Grafana

The LightStep plugin is easy to install, just like any other Grafana plugin. You can use the grafana-cli tool or install it manually. You can then add [x]PM graphs to your existing Grafana dashboards.

The plugin allows you to monitor latency data derived from [x]PM and display it the way you want in Grafana. When you configure [x]PM graphs, Grafana’s editor will display your existing saved searches and autocomplete as you type your queries, so you can quickly specify the metrics you want to focus on. In addition, you can set any custom latency percentiles for your [x]PM time-series data, so you can tailor your system performance monitoring to your business objectives or customer requirements.

Create graphs in Grafana using your existing saved searches and set custom latency percentiles

When you notice an unexpected error or unusual fluctuation in your performance data within Grafana, you can drill down to specific example traces related to the anomalous behavior in [x]PM. You can then view detailed information from those specific traces, such as log payloads and bottlenecks in the underlying transactions, to quickly identify the root cause of the issue.

The LightStep plugin lets you monitor your application’s performance data using your existing Grafana dashboards

Many of our customers have already told us that this plugin will help them get even more value from [x]PM, because they can now see all of the performance data our product provides in the same context and alongside their other business-critical metrics.

Try it out and share your feedback with us at support@lightstep.com.

Enhancements in [x]PM Monitoring

Today, we’re announcing a suite of new features that make LightStep [x]PM monitoring both more powerful and easier to understand.

Our customers often tell us they’re able to diagnose their systems’ problems much faster with LightStep than with other tools. One key reason for this is LightStep’s ability to quickly and accurately monitor latency statistics and error rates. However, many customers also ran into limits with our SLAs: only one latency SLA per saved search, and one error rate SLA.

We’re happy to announce that we’re removing those limits. SLAs are now called Conditions, and they have powerful new capabilities.

Unlimited conditions

Monitor any number of conditions for each saved search

You can now have any number of conditions in LightStep – create three latency conditions for different latency percentiles, create separate error conditions that notify different groups, or send notifications to different places depending on the severity of the condition. You can write monitoring conditions that are flexible and fit your team’s workflow.

New signals to monitor

Monitor latency, operation rate, or error rate

In addition to monitoring the latency and error rate in your saved searches, you can now monitor the operations-per-second rate for a saved search. This is useful if you need to monitor whether a service is online and whether its load is in line with your expectations.

You can also monitor whether any value is above or below a threshold. This is great for setting bounds on operation rates, so you get notified if traffic is significantly above or below your expectations. Operation rates, error rates, and latency percentiles can all have “greater than” and “less than” conditions.
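A minimal sketch of how such threshold conditions could be evaluated over a window of data; the signal names and configuration fields are illustrative assumptions, not the [x]PM schema:

```python
import statistics

def evaluate(condition, samples):
    """Return True when the condition is violated for this window.

    `samples` is the window's raw data: one entry per operation for rate
    signals, or one latency value (in seconds) per operation for
    latency signals.
    """
    signal = {
        "ops_rate": lambda s: len(s) / condition["window_s"],
        "p99_latency": lambda s: statistics.quantiles(s, n=100)[98],
    }[condition["signal"]](samples)
    if condition["comparison"] == "greater_than":
        return signal > condition["threshold"]
    return signal < condition["threshold"]

# Alert if traffic drops below 1 op/s over a 120 s window (a "less than"
# bound catching a service that has gone quiet).
low_traffic = {"signal": "ops_rate", "comparison": "less_than",
               "threshold": 1.0, "window_s": 120}
print(evaluate(low_traffic, samples=[0.0] * 60))  # → True (0.5 ops/s < 1.0)
```

Pairing a "greater than" and a "less than" condition on the same signal brackets the expected range, which is exactly the traffic-bounds use case described above.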

Monitor over any time range

LightStep [x]PM monitoring has been tuned for quick response to signals, and by default, it monitors the last two minutes of data from your system. We’ve learned from you that sometimes you’d prefer to monitor the last hour, day, or even last week of performance data for some systems. LightStep [x]PM monitoring now lets you control the evaluation window of your conditions, giving you flexibility between extremely sensitive monitoring and the ability to detect more sustained trends.

Unlimited notifications

When a condition is violated, you can now send notifications to as many Slack, PagerDuty, or Webhook destinations as you like, rather than one to each.

Send notifications to PagerDuty, Slack, or Webhooks

Snooze notifications

You can temporarily “snooze” any condition to prevent it from sending notifications. This can be useful when experimenting with new condition settings or during a pager storm. We’ll keep track of who applied the snooze and when, so you’ll always know the answer to “why didn’t I get paged?”

Getting started with monitoring

All your existing SLAs work the same as they did before, and you can find them in the “Monitoring” section of the LightStep navigation bar, represented by the bell. You can create new Conditions for a saved search on the “Saved Search” page by opening the “Conditions” drawer on the right-hand side.

Monitoring is under the Bell icon on the left-hand side

We’re excited to give you improved and more powerful monitoring in LightStep [x]PM, so you can go even faster from an alert to a root cause. Give it a try and share your feedback with us.

Announcing LightStep: A New Approach for a New Software Paradigm

(Image Credit: daneden.me)

Today, LightStep emerged from stealth and announced its first product, LightStep [x]PM, as well as its Series A and Series B funding.

With today’s launch, we’re excited to speak more openly about what we’ve been up to here at LightStep. As a company, we focus on delivering deep insights about every aspect of high-stakes production software. With our first product, LightStep [x]PM, we identify and troubleshoot the most impactful performance and reliability issues. This post is about how we got here and why we’re so excited.

I started thinking about this problem in 2004. It began during an impromptu conversation I had with Sharon Perl, a brilliant research scientist who came to Google in the early days. She was mainly working on an object store (à la S3) at the time but also had a few prototype side projects. We talked through five of them, I believe, but one captured my attention in particular: Dapper.

Dapper circa 2004 was not fully baked, though the idea was magical to me: Google was operating thousands of independently-scalable services (they’d be called “microservices” today), and Dapper was able to automatically follow user requests across service boundaries, giving developers and operators a clear picture of why some requests were slow and others ended with an error message. I was so enamored of the idea that I dropped what I was doing at the time, adopted the (orphaned) Dapper prototype, and built a team to get something production-ready deployed across 100% of Google’s services. What we built was (and is still) essential for long-term performance analysis, but in order to contend with the scale of the systems being monitored, Dapper only centrally recorded 0.01% of the performance data; this meant that it was challenging to apply to certain use cases, such as real-time incident response (i.e., “most firefighting”).

Ten years later, Ben Cronin, Spoons (Daniel Spoonhower), and I co-founded LightStep. Enterprises are in the midst of an architectural transformation, and the systems our customers and prospects build look a lot like the ones I grew up with at Google. We visit with enterprise engineering and ops leaders frequently, and what we see are businesses that live (or die) by their software, yet often struggle to stay in control of it given the overwhelming scale and complexity of their own systems.

We built LightStep to help with this, and we started with LightStep [x]PM to focus on performance and reliability in particular. Our platform is not a reimplementation of Dapper, but an evolution and a broadening of its value prop: with LightStep’s unconventional architecture, we can analyze 100.0% of transaction data rather than 0.01% of it like we did with Dapper. This unique – and technically sophisticated – approach gives our customers the freedom to focus on the performance issues that are most vital to their business and jump to the root cause with detailed end-to-end traces, all in real-time.

For instance, Lyft sends us a vast amount of data – LightStep analyzes 100,000,000,000 microservice calls every day. At first glance, that data is all noise and no signal: overwhelming and uncorrelated. Yet by considering the entirety of it, LightStep can measure how performance affects different aspects of Lyft’s business, then explain issues and anomalies using end-to-end traces that extend from their mobile apps to the bottom of their microservices stack. The story is similar for Twilio, GitHub, Yext, DigitalOcean, and the rest of our customers: they run LightStep 100% of the time, in production, and use it to answer pressing questions about the behavior of their own complex software.

The credit for what LightStep has accomplished goes to our team. We value technical skill and motivation, of course; that said, we also value emotional sensitivity, situational awareness, and the ability to prioritize and leverage our limited resources. LightStep will continue to innovate and grow well into the future, and the people here and their relationships with our inspiring customers are the reason why. The company has also benefited in innumerable ways from early investors Aileen Lee and Michael Dearing, the staff at Heavybit, and of course our board members from Redpoint and Sequoia, Satish Dharmaraj and Aaref Hilaly. Our board brings deep company-building experience as well as a humility and humor that we don’t take for granted.

It’s no secret that software is getting more powerful every day. As it does, it becomes more complex. LightStep exists in order to decipher that complexity, and ultimately to deliver insights and information that let our customers get back to innovating. Nothing gets us more excited than the success stories we hear from our customers. As we continue to build towards our larger vision, we look forward to hearing many more.