LightStep: a short-as-possible overview

LightStep is a monitoring product built specifically for modern distributed systems. It uses distributed tracing to reduce the mean time to resolution for anomalies and errors, both in dev and in production, at any layer of the stack, including mobile and web clients.

Document goals

Document non-goals

  • Describe the source code instrumentation process in any detail
  • Show high-fidelity, detailed walkthroughs of the UI (which is an ever-improving and ever-changing target)

When to use LightStep?

LightStep is great for the following situations:

  • Root-cause analysis for anomalies in dev and production, especially those that involve multiple processes
  • Connecting the dots from mobile and web clients to the services that support them
  • Debugging slow requests from the browser that initiated them (in dev mode)
  • Making sense of queueing problems and asynchronous programming models (e.g., the Node event loop)
  • Exposing the “next level of detail”: LightStep’s underlying technology enables a particularly verbose approach to instrumentation

Important product views

Latest Traces

Select the “Latest Traces” link in the navigation bar at the top-left of the window…

You’ll see a listing of the most recently finished Spans for your project. You can refine the results via parametric queries…

When you follow the link in the “Duration” column, LightStep assembles the entire distributed trace that includes the given Span. Note that the “Latest Traces” data is transient: once the data ages out of the LightStep collectors, it is no longer possible to assemble such traces retroactively.
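As a rough illustration of what "assembling" a trace means, the sketch below gathers every buffered Span that shares a trace ID with the selected one. The `Span` fields and the `assemble_trace` helper are hypothetical names chosen for this example and are not LightStep's schema or API. It also hints at why transience matters: Spans that are no longer in the buffer cannot be matched.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    # Hypothetical record; field names are illustrative, not LightStep's schema.
    trace_id: str
    span_id: str
    operation: str
    start_micros: int
    duration_micros: int


def assemble_trace(buffered: List[Span], selected: Span) -> List[Span]:
    """Collect every buffered span that shares the selected span's trace ID."""
    members = [s for s in buffered if s.trace_id == selected.trace_id]
    # Order by start time so the result reads like a timeline.
    return sorted(members, key=lambda s: s.start_micros)


# Spans arrive interleaved from several traces.
buffered = [
    Span("t1", "a", "GET /checkout", 0, 900),
    Span("t2", "x", "GET /home", 50, 120),
    Span("t1", "b", "auth.verify", 100, 200),
    Span("t1", "c", "db.query", 350, 400),
]

trace = assemble_trace(buffered, buffered[2])
print([s.operation for s in trace])
```

If the other Spans of trace `t1` had already been dropped from `buffered`, the same call would return only the selected Span.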

Viewing a specific trace

After clicking on a specific duration in the Latest Traces view (or elsewhere in LightStep), you’ll be brought to the “trace view”. This is essentially a timeline illustrating the execution of a single distributed trace…

If your instrumentation established parent-child relationships between Spans, you can expand and collapse those relationships in the trace view. Either way, the trace view renders each component of your system in a different color. The individual Span bars can be expanded to show associated logs…
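To make the parent-child idea concrete, here is a minimal sketch of turning a flat list of Spans into a nested outline. The `Span` fields (`span_id`, `parent_id`, `component`, `operation`) and the `render_tree` helper are assumptions made for illustration; they say nothing about how LightStep stores or draws traces. Spans with no recorded parent simply appear at the top level.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Span(NamedTuple):
    # Hypothetical fields, for illustration only.
    span_id: str
    parent_id: Optional[str]  # None when no parent relationship was recorded
    component: str
    operation: str


def render_tree(spans: List[Span]) -> List[str]:
    """Return one indented line per span, with children nested under parents."""
    known_ids = {s.span_id for s in spans}
    children: Dict[Optional[str], List[Span]] = defaultdict(list)
    for s in spans:
        # A span whose parent is missing or unknown is treated as a root.
        key = s.parent_id if s.parent_id in known_ids else None
        children[key].append(s)

    lines: List[str] = []

    def walk(parent: Optional[str], depth: int) -> None:
        for s in children[parent]:
            lines.append("  " * depth + f"[{s.component}] {s.operation}")
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines


spans = [
    Span("a", None, "web-client", "page_load"),
    Span("b", "a", "api-server", "GET /checkout"),
    Span("c", "b", "database", "db.query"),
    Span("d", None, "worker", "send_email"),
]

lines = render_tree(spans)
print("\n".join(lines))
```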

Tracing your current pageview with the LightStep Overlay

When we LightStep developers use LightStep (meta!), we see the “LightStep Overlay” in the bottom-right corner of the screen…

Clicking through on that overlay brings you to a Latest Traces view that shows Spans originating in the overlay’s own browser session. In this way, if you feel like your most recent page view took too long to load, the LightStep Overlay allows you to understand why.

Integrating the LightStep Overlay is a one-liner if you’ve already enabled LightStep+OpenTracing in your web client.

Continuous monitoring via tracked operations

The use cases above focus on specific, cherry-picked Spans from Latest Traces or the active browser session. Production monitoring should be continuous, though, and tracked operations are how LightStep provides that.

To track a new operation, first find it in the “untracked operations” list at the bottom of the Operations page…

Once you refresh the page, the tracked operation will appear, though there’s not yet any history since LightStep wasn’t aware of it previously…

You can get a closer look at any tracked operation by selecting it from the list on the Operations page…

From here, you can select a time range of interest and see a list of traces that (a) fall within the selected time window, and (b) either fall into specific latency buckets or contain Spans that resulted in errors…
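The selection described above amounts to a filter over Spans. The following is a minimal sketch under assumed names: `Span`, `select_spans`, and the half-open `(low, high)` bucket convention are all hypothetical and are not taken from LightStep.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Span:
    # Hypothetical record; field names are illustrative only.
    trace_id: str
    start_ms: int
    duration_ms: int
    error: bool = False


def select_spans(
    spans: List[Span],
    window: Tuple[int, int],
    latency_bucket: Optional[Tuple[int, int]] = None,
    errors_only: bool = False,
) -> List[Span]:
    """Keep spans that start inside `window` and match the optional filters.

    Both `window` and `latency_bucket` are half-open ranges: [low, high).
    If both optional filters are given, a span must satisfy both.
    """
    start, end = window
    selected = []
    for s in spans:
        if not (start <= s.start_ms < end):
            continue
        if errors_only and not s.error:
            continue
        if latency_bucket is not None:
            low, high = latency_bucket
            if not (low <= s.duration_ms < high):
                continue
        selected.append(s)
    return selected


spans = [
    Span("t1", start_ms=1_000, duration_ms=40),
    Span("t2", start_ms=2_000, duration_ms=850),
    Span("t3", start_ms=2_500, duration_ms=30, error=True),
    Span("t4", start_ms=9_000, duration_ms=900),
]
window = (1_500, 5_000)

slow = select_spans(spans, window, latency_bucket=(500, 1_000))
failed = select_spans(spans, window, errors_only=True)
print([s.trace_id for s in slow], [s.trace_id for s in failed])
```

Here `t4` is slow but starts outside the window, so it is excluded from both results.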

You can also define an SLA by selecting the pencil icon near the “SLA” header. Once you’ve defined an SLA, LightStep can notify you about any violations via alerts.

Alerting

You can be alerted to any SLA violation on a tracked operation via LightStep’s integrations with Slack (instructions) and/or PagerDuty (instructions).

LightStep alert notifications are immediately actionable: they contain links to specific examples of traces that violated the SLA (either through high latency or errors)…
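The source does not say how an SLA is modeled, so the sketch below assumes one plausible shape: a latency percentile threshold plus a maximum error rate. `Sla`, `check_sla`, and the nearest-rank percentile are illustrative choices, not LightStep's implementation. The point is the last step: when a threshold is breached, the check returns the trace IDs of concrete offenders, which is what an actionable alert would link to.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    # Hypothetical record; field names are illustrative only.
    trace_id: str
    duration_ms: float
    error: bool = False


@dataclass(frozen=True)
class Sla:
    # Assumed SLA shape; the source does not specify one.
    percentile: float      # e.g. 95.0
    max_latency_ms: float  # threshold for that percentile
    max_error_rate: float  # fraction between 0 and 1


def percentile_latency(spans: List[Span], percentile: float) -> float:
    """Nearest-rank percentile of span durations (spans must be non-empty)."""
    ordered = sorted(s.duration_ms for s in spans)
    rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
    return ordered[rank - 1]


def check_sla(spans: List[Span], sla: Sla) -> List[str]:
    """Return trace IDs of example offenders, or [] if the SLA holds."""
    if not spans:
        return []
    examples: List[str] = []
    if percentile_latency(spans, sla.percentile) > sla.max_latency_ms:
        examples += [s.trace_id for s in spans if s.duration_ms > sla.max_latency_ms]
    error_rate = sum(s.error for s in spans) / len(spans)
    if error_rate > sla.max_error_rate:
        examples += [s.trace_id for s in spans if s.error and s.trace_id not in examples]
    return examples


spans = [Span(f"t{i}", 100.0) for i in range(8)]
spans += [Span("slow", 1500.0), Span("failed", 90.0, error=True)]
sla = Sla(percentile=95.0, max_latency_ms=500.0, max_error_rate=0.05)

offenders = check_sla(spans, sla)
print(offenders)
```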