Understanding Tracer Performance with LightStep Benchmarks

One of my favorite things about LightStep is helping to mentor the next generation of developers through our internship programs. These students bring clear eyes and novel solutions to technically challenging areas of our product. Over the summer, my team had the opportunity to work with one of our interns on an area of the product that’s near and dear to all of our hearts here — benchmarking and profiling for high-performance tracing. As a part of this work, we’ve produced an experimental, high-performance, streaming tracer for Python using native bindings to our C++ libraries along with a benchmarking suite to simulate extremely high-throughput tracing scenarios.

We’ve always had a strong commitment to code quality, but it can be challenging to test high-throughput scenarios for our tracing libraries, as the nature of “high throughput” can differ depending on the exact topology of a system being traced. In order to quantify the changes we were making in the experimental library, we built a new tool which simulates a variety of real-world scenarios: a process handling tens of thousands of requests per second, a CPU under maximum utilization, or even inconsistent network connectivity between a process and a LightStep Satellite. One guiding principle behind developing this benchmark suite was to ensure that we’re testing in production-like conditions.

Creating a benchmarking suite that accurately models ‘production systems’ was instructive not only for Isaac, our intern, but also for us. Having someone with an outside perspective asking questions forces you to re-evaluate your existing preconceptions and to double-check the things you already ‘know’. An early version of the benchmarking tool, for example, used a fairly naive approach –

for i in range(1000):
    # start_active_span requires an operation name in the OpenTracing Python API
    with tracer.start_active_span('benchmark_operation'):
        pass

What’s wrong with this? We’re generating a bunch of spans and seeing how long it takes to process them, right? Well, real spans generally have other data attached to them, such as logs and tags. Moreover, a tight loop like this benefits a great deal from local caching, which artificially speeds up the creation of spans and moves us further away from a production-like scenario. In light of this, the tool was modified to benchmark what effectively became self-contained client processes –

<span style="font-weight: 400;" data-mce-style="font-weight: 400;">while</span><span style="font-weight: 400;" data-mce-style="font-weight: 400;"> curTime &lt; endTime:</span>
    with tracer.start_active_span() as span:
        span.add_logs()
        span.add_tags()
        do_work()
    sleep(rand_interval)

These processes can be run for an arbitrary amount of time with a random sleep, giving us a less synthetic and more realistic set of performance data.

Creating this benchmark suite was an important first step in quantifying the results of our experimental high-performance tracer. As part of our commitment to being a best-in-breed tracing platform, we need to offer tracers suited to a wide range of applications. One such example is our Python tracer. Python is a popular, general-purpose programming language, and it’s increasingly common to see it deployed in high-performance computing scenarios, powering features such as data analysis, recommendation engines, and other machine learning tasks. In these circumstances, it can be worthwhile to sacrifice some of the ease of use of a pure Python tracer and use native bindings to C/C++ libraries for increased performance. We recently shipped a new streaming protocol in our C++ tracer that forwards trace data to multiple Satellites concurrently, using non-blocking writes. Memory management is key to this strategy: we copy each span only a single time during the reporting process, and we forward spans as they’re generated to avoid excessive allocations. The benchmarking suite allowed us not only to view the results of this work, but to compare it against our existing pure Python tracer.

LightStep Spans Per Second Vs Tracer CPU Usage

This chart shows a sample of the results, comparing the experimental tracer with the standard one. At around 5,000 spans per second, the streaming tracer offers a 10x reduction in tracer CPU usage – an impressive performance difference! Keep in mind, however, that this isn’t suitable for all deployment configurations: certain serverless or containerized deployments will have difficulty using the native bindings. You can try it out for yourself by installing the lightstep-streaming package from PyPI and modifying your tracer configuration. Remember, this is still an experimental package!
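To give a rough idea of what trying it out might look like, here’s a sketch; the module name lightstep_streaming and the constructor arguments below are assumptions modeled on the standard lightstep Python tracer, so check the package’s README for the actual configuration options.

# pip install lightstep-streaming
import lightstep_streaming  # assumed module name for the experimental package

# these arguments mirror the standard lightstep.Tracer and are assumptions,
# not the confirmed API of the experimental streaming package
tracer = lightstep_streaming.Tracer(
    component_name='my-python-service',
    access_token='<your project access token>',
)

with tracer.start_active_span('checkout') as scope:
    scope.span.set_tag('experiment', 'streaming-tracer')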

Where do we go from here? Our next step is to continue building on the foundation of the benchmarking package by donating it to the OpenTelemetry project, ensuring that future telemetry instrumentation packages are built to prioritize performance in every language. We’d also love to see OpenTelemetry offer native bindings for Python and other languages where it makes sense, and we’re interested to see how these high-performance options are adopted.

Ultimately, telemetry data should be collected without causing unacceptable levels of overhead for your service, and we’re excited to push the state of the art forward for the observability community as a major contributor to OpenTelemetry. I’d like to thank Isaac for the research that went into this post, and all of our other interns, who continue to make real, customer-impacting improvements to our product. Our 2020 program is accepting applicants now — if you’d like to be part of the team, get in touch!

OpenTelemetry 101: What is an Exporter?

OpenTelemetry is an open-source observability framework for generating, capturing, and collecting telemetry data for cloud-native software. Prior posts in this series have covered the definition of observability, as it applies to OpenTelemetry, and a dive into the tracing and metrics API. There’s a third critical component, though, that you’ll want to understand before you get started using OpenTelemetry, and that’s how to actually get data out of it! In this post, we’ll talk about the OpenTelemetry exporter model and the OpenTelemetry Collector, along with several basic deployment strategies.

Note: Some of the information in this post is subject to change as the specification for OpenTelemetry continues to mature.

To understand how OpenTelemetry’s exporter model works, it helps to understand a little bit about how instrumentation is integrated into service code. Generally, instrumentation can exist at three different points: your service, its library dependencies, and its platform dependencies. Integrating at the service level is fairly straightforward: you declare a dependency on the appropriate OpenTelemetry package in your code and deploy it with your service. Library dependencies are similar, except that your libraries would generally declare a dependency only on the OpenTelemetry API. Platform dependencies are a more unusual case. By ‘platform dependency’, I mean the pieces of software you run to provide services to your service, such as Envoy and Istio. These deploy their own copy of OpenTelemetry, independent of your actions, but will also generally emit trace context that your service will want to participate in.

In every case, the trace and metric data that your service or its dependencies emit are of limited use unless you can actually collect that data somewhere for analysis and alerting. The OpenTelemetry component responsible for batching and transporting telemetry data to a backend system is known as an exporter. The exporter interface is implemented by the OpenTelemetry SDKs and uses a simple plug-in model that allows telemetry data to be translated into whatever format a backend system requires and transmitted to that system. Exporters can be composed and chained together, allowing common functionality (like tagging data before export, or providing a queue to ensure consistent performance) to be shared across multiple protocols.
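As a rough sketch of how the plug-in model looks in code, here’s a minimal custom exporter written against the OpenTelemetry Python SDK as it exists today; the class and method names (SpanExporter, export, SpanExportResult) come from the current SDK and may differ from the specification as it stood when this was written.

from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult

class StdoutJsonExporter(SpanExporter):
    """Translates finished spans into a backend-specific format (here, JSON on stdout)."""

    def export(self, spans):
        for span in spans:
            print(span.to_json())  # the SDK's ReadableSpan provides a JSON representation
        return SpanExportResult.SUCCESS

    def shutdown(self):
        pass  # release any connections or buffers the exporter holds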

To put this in more concrete terms, let’s compare OpenTelemetry to OpenTracing. In OpenTracing, if you wanted to switch which system you were reporting data to, you’d need to replace the entire tracer component with another – for example, swapping out the Jaeger client library for the LightStep client library. In OpenTelemetry, you simply change the exporter component, or even just add a new one and export to multiple backend systems simultaneously. This makes it much easier to try out new analysis tools or to send your telemetry data to different analysis tools in different environments.
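As a sketch of what that configuration change might look like with today’s Python SDK (class names such as BatchSpanProcessor and ConsoleSpanExporter may vary between SDK versions), adding a second destination is just another span processor, reusing the hypothetical StdoutJsonExporter from the sketch above:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
# each exporter gets its own processor; finished spans fan out to every backend
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
provider.add_span_processor(BatchSpanProcessor(StdoutJsonExporter()))
trace.set_tracer_provider(provider)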

While the exporter model is very convenient, there are instances where you don’t have the ability to redeploy a service in order to add a new exporter. In some organizations, there’s a disconnect between the people writing the instrumented code and the people running the observability platform, which can slow down changes to where data goes. In addition, some teams may prefer to abstract the entire exporter model out of their code and into a separate service. This is where the OpenTelemetry Collector comes in. The Collector is a separate process designed to be a ‘sink’ for telemetry data emitted by many processes, which it can then export to backend systems. The Collector supports two deployment strategies, and you’d generally use both: an agent deployed alongside each service (as a separate local process or in a sidecar), and a standalone collector deployed separately as its own application running in a container or virtual machine. Each agent forwards telemetry data to the standalone collector, which can then export it to a variety of backend systems such as LightStep, Jaeger, and Prometheus.
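From the service’s perspective, sending data to a Collector agent is just another exporter configuration. Here’s a sketch that assumes the OTLP exporter package and an agent listening on its default gRPC port (4317) on localhost; the package path, endpoint, and port reflect today’s defaults, and your deployment may differ.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# the local agent receives spans over OTLP/gRPC and forwards them to the
# central Collector, which then exports to backends like LightStep or Jaeger
otlp_exporter = OTLPSpanExporter(endpoint="localhost:4317", insecure=True)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)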

LightStep OpenTelemetry Exporter

Regardless of how you choose to instrument or deploy OpenTelemetry, exporters provide a lot of powerful ways to report telemetry data. You can directly export from your service, you can proxy through the collector, or you can aggregate into standalone collectors – or even a mix of these! Ultimately, what’s important is that you’re getting that telemetry data into an observability platform that can help you analyze and understand what’s going on in your system.

OpenTelemetry 101: What Are Metrics?

OpenTelemetry is an open-source observability framework for generating, capturing, and collecting telemetry data for cloud-native software. In the previous posts in this series, I discussed what observability is, as it relates to OpenTelemetry, and what tracing is. Today, I’d like to cover the second major area of the OpenTelemetry API: metrics.

OpenTelemetry’s metrics API supports reporting diagnostic measurements from a service using the three basic kinds of instruments. These instruments are commonly known as counters, gauges, and measures. Developers use these objects in order to gain visibility into operational metrics about their service.

Most developers are familiar with metrics in some fashion. It’s extremely common, for instance, to monitor metric values such as process memory utilization or error rate, and to create alerts to indicate when a service is violating a predetermined threshold. In addition to these common measurements, metrics events streaming to these instruments can be applied in other unique ways, including by being aggregated and recorded by tracing or logging systems. With that in mind, let’s look at the instruments available through the OpenTelemetry Metrics API and discuss how they can be used.

The API distinguishes between metric instruments by their semantic meaning rather than by the eventual type of the value they export. This is somewhat unconventional, and it stems from the design of OpenTelemetry itself — the separation between the API and the SDK means the SDK ultimately determines what should happen with any specific metric event, and it could potentially implement a given instrument in a non-obvious or non-standard way. If you’re familiar with existing metrics APIs (such as the Prometheus API), this explains why there’s no method to export a histogram or summary distribution: those are considered measures, and the SDK can be configured to export a histogram or a summary from a measure.

From a developer’s point of view, the exact backing instrument is somewhat opaque by design, as the API is built for portability. Developers use the Add, Set, and Record methods, along with optional declarations that place restrictions on a specific instrument (for example, allowing a counter to support both positive and negative values), and the SDK takes care of the rest. That said, which of these should you use for any given scenario?

If you’re trying to record a count of something — such as the sum of errors over time — then you should use the Add method, which is supported by a counter. A counter is, by default, monotonic, which means it only expects positive values. This property allows a counter by default to be interpreted as a rate measurement. Rates are a comparison between two quantities — bytes transferred per second, for example. Non-monotonic counters are also supported, which can be useful for reporting changes in a quantity (like the number of elements in a set as they’re added and removed).

What about circumstances where you need the current value of something rather than a rate? For example, when you’re measuring something over an arbitrary interval, or you don’t care about the rate of change, just the current value? This is when you would use a gauge, by calling Set. You can think of a gauge like a thermometer, or a ratio (such as memory consumption vs. total memory available). As you might expect, the default for these instruments is non-monotonic, but a monotonic option is available as well. Why? To report a sum directly, rather than increments: if you’re trying to report a sum, use a gauge, but if you’re reporting increments, use a counter.

Finally, what should you use when you care about individual measurements? Consider a circumstance where you want to know both the count and the sum of a given event, or where you’re interested in grouping events into quantiles. You would use Record for these values, as it expresses a measure. A common application of measures is to create a histogram or summary of values. You could use these to calculate averages, such as the average size of a particular response or the average duration of HTTP requests.

This is getting rather long, so let’s do a quick review. There are three instruments you can use in OpenTelemetry, each defined by the method you call to send a metric event. They are:

  • Counters, which you Add a value to. These are good for values that you’d like to think of as a rate, or changes in a quantity.
  • Gauges, which you Set the value of. You can think of these as either a car’s odometer (a monotonic gauge: it never decreases) or a car’s speedometer (a non-monotonic gauge: it can go up and down).
  • Measures, to which you Record a value. These are useful to build histograms or summaries, metric projections that let you calculate averages of many values.

The exact mechanism by which you’ll use each of these instruments is a bit language-specific — OpenTelemetry, by design, allows each language SIG to implement the API in a way that is conventional for the language it’s being implemented in. This means the exact details of creating a new metric event may not match the specification precisely; consult the documentation for your particular OpenTelemetry implementation for more information. At a high level, however, here’s how it works.

First, you’ll need to create an instrument of the appropriate kind and give it a descriptive name. Each instrument name needs to be unique within its process. You can also provide label keys, which are optional keys used to optimize the metric export pipeline. You’ll also need to initialize a LabelSet, which is a set of labels (both keys and values) that correspond to attributes being set on your metric events. What does this look like?

// in our init or in middleware...
requestKeys = ["path", "host"] // label keys

// now we initialize instruments for our keys, a Count and a Measure
requestBytes = NewIntCounter("request.bytes", WithKeys(requestKeys), WithUnit(unit.Bytes))
requestLatency = NewFloatMeasure("request.latency", WithKeys(requestKeys), WithUnit(unit.Second))

// then, define the labels for each key (and add more if required)
labels = meter.DefineLabels({"path": "/api/getFoo/{id}", "host": "host.name"})

Again, the specifics will differ slightly for each language, but the basic gist is the same — early on, you create the actual metric instruments by giving them a name and telling them which keys they’ll be seeing. After that, you either explicitly define labels for your instruments or get a blank label set, if appropriate. Once you’ve done this, actually recording metric events is fairly straightforward.

requestBytes.Add(labels, req.bytes)
requestLatency.Record(labels, req.latency)

Another option, for high-performance scenarios, is to use a handle. This effectively skips a step by precomputing the instrument and label combination (after all, you can add arbitrary labels to a label set, and calling an instrument through the label set will record that event for each combination of label and instrument), which is useful if you’re in a performance-critical section of code.

requestBytesHandle = requestBytes.GetHandle(labels)
requestLatencyHandle = requestLatency.GetHandle(labels)

for req in requestBatch {
	...
	requestBytesHandle.Add(req.bytes)
	requestLatencyHandle.Record(req.latency)
	...
}

One thing to remember with handles, however, is that you’re responsible for cleaning up once you’re done by freeing the handle so that its resources can be released!

So, to summarize one more time:

  • Metric events are recorded through instruments.
  • You need to create an instrument for each metric you want to record, and give it a name that’s unique within its process.
  • You can apply additional metadata to your instrument through labels.
  • Recording metric events is performed either by calling the appropriate method on the instrument itself, or by getting a precomputed handle and calling the method there.

Hopefully, this has given you a better understanding of how the OpenTelemetry metrics API functions. It’s a lot to take in! The real power of OpenTelemetry, however, isn’t just that it provides a tracing and metrics API for you to use. In my next post, I’ll cover the exporter model, and the OpenTelemetry Collector, which are the ways you’ll get data out of OpenTelemetry and into analysis systems such as LightStep.

OpenTelemetry 101: How to Start Contributing

The strength and durability of any open source project is, quite often, measured by its contributor base. Projects with a large and active community – not just users, but also individuals who give back by opening issues, fixing bugs, writing documentation, and so forth – are generally ones that produce higher quality software. If you’ve been reading about OpenTelemetry then you may be asking yourself how you can get involved. It’s a great question, and it’s also a great time to dig in and start making pull requests during Hacktoberfest (you can get a free t-shirt!), so let me tell you the best way to get started.

Join The Discussion

Nearly all communication related to OpenTelemetry is public, and if you want to get started contributing, then it’s easy to join in the discussion. Every SIG (Special Interest Group) has regular meetings, all of which are published on a public calendar (links to which are as follows: gCal, iCal, web). Recordings of past SIG meetings can be found on our YouTube channel. Want to just dip your toe in on the community in general? Everyone is encouraged to join in the monthly meetings, held on the second Wednesday of every month. You can also join the mailing list to keep track of discussions and questions. The mailing list will publish announcements, project updates, and other important news in an easy-to-follow format. If you just want to chat, check out our Gitter!

Honestly, if you feel like all you can contribute is your awareness, that’s great! Being aware of what’s happening in the project and making your voice heard about decisions like the design of the API and SDK are valuable – we love hearing from people with experience instrumenting their code for telemetry or running observability systems. Your expertise not only matters, it can help push the state of the art forward.

Join A SIG

If you’re interested in participating more concretely, then you should think about joining a SIG. We organize our community members that want to work on specific aspects of the project into SIGs by language, or area of interest. For example, if you wanted to help implement the OpenTelemetry API specification into code, you could join the Java or .NET SIG. If you want to debate the finer points of the cross-language specification, then the Specification SIG might be more your speed. If you want to support the community by maintaining the website, then there’s a SIG for that too!

There’s nothing special that you need to do to join a SIG – just drop in to a weekly meeting and go from there!

Your First Contribution

If you’re looking to contribute, then the easiest way to jump in is by doing. Find a repository in the OpenTelemetry organization that you’d like to make a commit to, then fork it into your account. Want to know what to work on? Check the issues tab and filter the list by labels such as “up for grabs” or “good first issue” – these are great places to start contributing immediately. Be sure to comment on the issue to let a maintainer know you’re picking it up, so they can update it accordingly.

If you want to add new functionality, it’s generally a good idea to create an issue and discuss it with that SIG’s maintainers first, to make sure that your idea is in line with the overall project goals. You can also create issues in order to report bugs, make suggestions, or give feedback to the SIG maintainers.

You don’t have to limit yourself to contributing code, though. The project needs more examples, sample code, documentation, and other resources to make it as easy to use as possible. You can find issues for building documentation in the website repository that are ready-to-go for Hacktoberfest, with more to come. Finally, you can contribute by communicating about OpenTelemetry itself. Tell your friends! Tell your coworkers! We’re extremely excited about our alpha releases that are coming out over the next few weeks, and would love to get more users to look at them and give their impressions.

So, again, if you want to contribute to OpenTelemetry, there are three things I’d suggest:

  • Subscribe to the mailing list, join the Gitter, and follow/star the project on GitHub.
  • Check out the public calendar and attend SIG meetings, and/or the monthly community meeting.
  • Open an issue, make a PR, write a tweet, or create some samples that demonstrate how to use OpenTelemetry – anything, and everything, is appreciated!

OpenTelemetry 101: What Is Tracing?

OpenTelemetry is an open-source observability framework for generating, capturing, and collecting telemetry data for cloud-native software. In my previous post, I covered what observability means for software, and talked about the three types of telemetry data that comprise observability signals — tracing, metrics, and logs. Today, I’ll be taking a deeper look at the first of these, tracing.

When we refer to tracing in OpenTelemetry, we’re generally referring to distributed tracing (or distributed request tracing). Traditionally, tracing has been a low-level practice used by developers to profile and analyze application code through a combination of specialized debugging tools (such as DTrace on Linux or ETW on Windows) and programming techniques. By contrast, distributed tracing is an application of these techniques to modern, microservice-based architectures.

Microservices introduce significant challenges to tracing a request through an application, thanks to the distributed nature of microservices deployments. Consider a traditional monolithic application: since your code is centralized onto a single host, diagnosing a failure can be as simple as following a stack trace. When your application consists of tens, hundreds, or thousands of services running across many hosts, you can’t rely on a single stack trace – you need something that represents the entire request as it moves from service to service, component to component. Distributed tracing solves this problem, providing powerful capabilities such as anomaly detection, distributed profiling, workload modeling, and diagnosis of steady-state problems.

Much of the terminology and many of the mental models we use to describe distributed tracing originate in systems such as Magpie, X-Trace, and Dapper. Dapper in particular has been highly influential on modern distributed tracing, and much of the terminology OpenTelemetry uses can trace its origin there. The goal of these distributed tracing systems was to profile requests as they moved across processes and hosts, and to generate data about those requests suitable for analysis.

The above diagram represents a sample trace. A trace is a collection of linked spans, which are named and timed operations that represent a unit of work in the request. A span that isn’t the child of any other span is the parent span, or root span, of the trace. The root span, typically, describes the end-to-end latency of the entire trace, with child spans representing sub-operations.

To put this in more concrete terms, let’s consider the request flow of a system you might encounter in the real world, such as a ride-sharing app. When a user requests a ride, multiple actions begin to take place – information is passed between services in order to authenticate and authorize the user, validate their payment information, locate nearby drivers, and dispatch one of them to pick up the rider. A simplified diagram of this system, and a trace of a request through it, appears in the following figure. As you can see, each operation generates a span to represent the work being done during its execution. These spans have implicit parent-child relationships, both with the start of the entire request at the client and within the individual services in the trace. Traces are composable in this way: a valid trace is made up of valid sub-traces.

Each span in OpenTelemetry encapsulates several pieces of information, such as the name of the operation it represents, a start and end timestamp, events and attributes that occurred during the span, links to other spans, and the status of the operation. In the above diagram, the dashed lines connecting spans represent the context of the trace. The context (or trace context) contains several pieces of information that can be passed between functions inside a process or between processes over an RPC. In addition to the span context – the identifiers that represent the parent trace and span – the context can contain other information about the process or request, such as custom labels.

One important feature of spans, as mentioned before, is that they encapsulate other information. Much of this information is required – the operation name and the start and stop timestamps, for example – but some is optional. OpenTelemetry offers two data types, Attribute and Event, which are incredibly valuable as they help contextualize what happened during the execution measured by a single span. Attributes (known as tags in OpenTracing) are key-value pairs that can be freely added to a span to help in analysis of the trace data. You can think of attributes as data that you’d like to eventually aggregate or use to filter your trace data, like a customer identifier, process hostname, or anything else you can imagine. Events (known as logs in OpenTracing) are time-stamped strings that can be attached to a span, with an optional set of Attributes that further describe them. OpenTelemetry additionally provides a set of semantic conventions of reserved attributes and events for operation- or protocol-specific information.
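As a small illustration using the OpenTelemetry Python API (method names may differ slightly in other languages and SDK versions, and the attribute keys and values here are just examples):

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("charge_customer") as span:
    # attributes: key-value pairs you can later filter or aggregate on
    span.set_attribute("customer.id", "cust-1234")
    span.set_attribute("payment.amount", 42.50)
    # events: time-stamped messages, optionally with their own attributes
    span.add_event("card_authorized", {"processor": "example-gateway"})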

Spans in OpenTelemetry are generated by the Tracer, an object that tracks the currently active span and allows you to create (or activate) new spans. Tracers are configured with Propagator objects that support transferring the context across process boundaries. The exact mechanism of creating and registering a tracer is dependent on your implementation and language, but you can generally expect there to be a global Tracer capable of providing a default tracer for your spans, and/or a Tracer provider capable of granting access to the tracer for your component. As spans are created and completed, the tracer dispatches them to the OpenTelemetry SDK’s Exporter, which is responsible for sending your spans to a backend system for analysis.
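Putting those pieces together, here’s a minimal sketch in Python showing a tracer obtained from a globally registered provider producing a parent and a child span; the SDK class names (TracerProvider, SimpleSpanProcessor, ConsoleSpanExporter) reflect the current Python SDK and may differ in other languages or versions.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# register a global provider whose finished spans are handed to an exporter
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("ride-sharing-demo")

# nesting the context managers creates the parent-child relationship
with tracer.start_as_current_span("dispatch_ride"):
    with tracer.start_as_current_span("locate_nearby_drivers"):
        pass  # this span's context points back at "dispatch_ride"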

To recap, let’s summarize:

  • A span is the basic building block of a trace. A trace is a collection of linked spans.
  • Spans are objects that represent a unit of work, which is a named operation such as the execution of a microservice or a function call.
  • A parentless span is known as the root span or parent span of a trace.
  • Spans contain attributes and events, which describe and contextualize the work being done under a span.
  • A tracer is used to create and manage spans inside a process, and across process boundaries, through propagators.

In my next post in this series, I plan to discuss the OpenTelemetry metrics data source, and how it interacts with the traces. Stay tuned!

OpenTelemetry 101: What Is Observability?

By now, you’ve likely heard about OpenTelemetry, an open source observability framework created by the merger of OpenTracing and OpenCensus. You may be asking yourself, however, “what’s observability, anyway?” It’s a valid question – it’s a topic that’s been in the news a lot recently, and it seems that every application monitoring vendor is trying to rebrand as an ‘observability’ vendor. In this series of blog posts, I’ll demystify observability and explain the concepts you need to know in order to understand OpenTelemetry, and why it matters.

Observability as a term stems from control theory, an engineering discipline that concerns itself with how to keep dynamic systems in check. An applied example of control theory can be seen in cruise control for cars – under constant power, your speed would decrease as you drive up a hill. Instead, in order to keep your speed consistent, an algorithm increases the power output of the engine in response to the measured speed. This is also an application of observability – the cruise control subsystem is able to infer the state of the engine by observing the measured output (in this case, the speed of the car).

In software, observability is a bit more prosaic, referring to the telemetry produced by services in an application. This telemetry data can be divided into three major forms:

  • Traces: contextual data about a request through a system.
  • Metrics: quantitative information about processes such as counts and gauges.
  • Logs: specific messages emitted by a process or service.

Historically, these three verticals have been referred to as the “three pillars” of observability. The growing scale and complexity of software have led to changes in this model, however, as practitioners have not only identified the interrelationships between these types of telemetry data, but also coordinated workflows involving them.

For example, time-series metrics dashboards can be used to identify a subset of traces that point to underlying issues or bugs. Log messages associated with those traces can identify the root cause of the issue. When resolving the issue, new metrics can be configured to more proactively identify similar issues before the next incident.

The ultimate goal for OpenTelemetry is to ensure that this telemetry data is a built-in feature of cloud-native software. This means that libraries, frameworks, and SDKs should emit this telemetry data without requiring end-users to proactively instrument their code. To accomplish this, OpenTelemetry is producing a single set of system components and language-specific libraries that can create, collect, and capture these sources of telemetry data and export them to analysis tools through a simple exporter model.

In summary, observability in software is about the integration of multiple forms of telemetry data, which together can help you better understand how your software is operating. It differs from traditional application monitoring in its focus on integrating those forms of telemetry data and on the relationships between them. Observability doesn’t just stop at the capture of telemetry data, however — the most critical aspect of the practice is what you do with the data once it’s been collected. This is where a tool like LightStep comes in handy, providing features such as correlation detection, historical context, and automatic point-in-time snapshots through unparalleled analysis of your telemetry data.

In the next part of this series, we’ll take a deeper dive into telemetry data sources, starting with tracing.

Updates to Developer Mode: New Instrumentation and Debugging Tools for Tracing

Developing modern, microservice-based applications can be challenging. When we launched Developer Mode last month, we were interested in seeing how providing a stream of structured trace data could make it easier to instrument and write applications.

What We Heard

People who have tried Developer Mode really like it! The ability to quickly start tracing your applications — and get real-time feedback on your tracing instrumentation — is super valuable. What we saw and heard from users with more complex applications, though, was that it was very challenging to filter through the firehose of data that multiple services can generate. With that in mind, we’ve made some changes to Developer Mode to support everyone from first-time users to customers with large existing applications.

What’s New?

You’ll now be able to see a list of services reporting to your developer satellite in the sidebar, along with a count of spans that you’ve sent. Clicking on these service names will filter to a view that only contains spans from that service. In addition, we’ve added the ability to search for operations in the event stream — great for narrowing down your view to locate just the spans you’re looking for.

Developer Mode 1.1: New Instrumentation and Debugging Tools for Tracing - screenshot


Ever wanted to have someone else on your team look at a trace in Developer Mode? You now can! The URL for a Developer Mode trace is now shareable with other users in your organization.

Ready To Go?

These new features should make Developer Mode more powerful for everyone, regardless of where you are in your tracing journey. Not already using Developer Mode? Head to LightStep Play to get started! Questions or comments? Drop us a line at hello@lightstep.com, we’ll be in touch!

Distributed Tracing for .NET with LightStep [x]PM

We’re excited to announce the availability of the LightStep Tracer for .NET, now in early access. Application developers who work with Microsoft’s .NET Framework 4.5+ and .NET Core can now use this tracer to instrument and observe their applications. This tracer is compatible with applications written using C# or F#, and it enables developers to use open source integrations for quickly instrumenting popular technologies such as gRPC or ASP .NET Core 2.

Why .NET?

You can find .NET powering many of the world’s top million websites, and ASP.NET Core has grown from its launch just over two years ago to become the fourth most commonly used web framework. .NET Core takes a cloud- and container-first mindset, making it an ideal choice for microservice development and deployment. We’re incredibly excited to take the first step towards offering .NET developers the opportunity to see the benefits of LightStep [x]PM and distributed tracing, such as reduced MTTR, best-in-class observability, and the ability to profile their application performance in production.

LightStep [x]PM - Distributed Tracing for .NET
Monitoring a .NET Core application in LightStep [x]PM

Popular integrations

The tracer fully supports the current OpenTracing API for C#, which means it can be used with community-supported contributions. One good example is the .NET Core Instrumentation for OpenTracing, which extends the .NET DiagnosticSource module with enhanced instrumentation for ASP.NET Core, Entity Framework Core, and .NET base class libraries such as HttpClient, letting you get an existing or new project up to speed quickly with tracing instrumentation. In addition, you can use ASP.NET action filters to quickly build your own ASP.NET tracing with ASP.NET MVC 4 or 5.

Next steps

While this tracer is currently in early access, we’re excited about getting it into the hands of developers and organizations using .NET so we can gather feedback on this new addition. Over the next two months, we’ll be gathering information and analyzing usage to prepare for a full release in early 2019. If you’re already a LightStep [x]PM customer and you’d like to start using the new tracer, please get in touch with us to learn more, or check it out on GitHub.

opentracing.io v2 Released – Learn About Distributed Tracing & Get Involved

This article originally appeared on the OpenTracing blog.

We’re happy to announce that opentracing.io has been updated to version 2.0! This is the first major update to the OpenTracing website since it was created, and I’d like to go over what the changes mean for our community. First, let me extend my warmest thanks to all of the community members who helped with getting v2 launched – we truly could not have done it without you. That said, let’s take a tour of what’s new.

Site walkthrough

The first thing you’ll notice on the new site is that we reorganized our information to make it more approachable and consistent. Overall, one of the biggest goals in the site redesign was to structure information in a more accessible and easily understood way. To that end, we focused on the following four use cases:

    • Quick Starts and Tutorials: Working code that you can drop in to begin using OpenTracing.
    • Overviews: High-level, conceptual overviews of OpenTracing and its components.
    • Best Practices: Practical insights into the application of OpenTracing for real-world design challenges.
    • Guides: In-depth, language-specific usage manuals for OpenTracing that map general concepts to specific, actionable examples.

Let’s talk about what you’ll find, starting with our revamped documentation page.

Docs

OpenTracing - Distributed Tracing: A Mental Model


The main documentation page is the place to start learning about OpenTracing. There’s a quick-start introduction, an in-depth overview of the concepts of Distributed Tracing and its components, and finally a guide to best practices and common use cases for both application and framework developers.

Guides

OpenTracing - Distributed Tracing: Guides

Guides are intended to be a collection of in-depth information about how to use OpenTracing, beyond a simple ‘Hello, World’. We’ve broken these guides down by language, so there’s a single place to discover practical and in-depth information on how to use OpenTracing for its supported platforms. We’d love to have your contributions as part of these docs – check out the opentracing.io repo on GitHub and submit a pull request.

Project information

The OpenTracing specification and related documents now have a home on opentracing.io as well. You can read the spec, understand the governance model and project organization, read about the semantic conventions, and see what’s changed – all without navigating to multiple repositories.

Get involved

Interested in contributing? We’ll tell you how. This section of the site includes information on how to propose additions or changes to the OpenTracing specification, how to join a working group, and how to add your plugin or extension to our list of external contributions.

Easier to contribute

Previously, the website was built on the Jekyll static site generator. While this tooling worked, it could be cumbersome to install the required dependencies, especially for first-time contributors.

OpenTracing - Distributed Tracing - Hugo

By moving to Hugo, we were able to reduce the dependencies for local development down to a single, precompiled hugo executable.

Now, adding new guides or documentation is as straightforward as adding a new markdown page to the repository and opening a pull request. Hugo will automatically add the new content to the correct menu and section. We’re actively looking for more guide content, so head on over and contribute.

Looking forward

We’re especially excited about this new iteration of our documentation and hope that it makes it easier to use and discover information about OpenTracing. This isn’t the end of our work, though; it’s just a new beginning. We’re already starting to work on a searchable registry of tracers and plugins from independent contributors, which we hope to ship soon. While we’ve launched with several guides in a more-or-less complete state, we still have gaps: the C#, Golang, JavaScript, Ruby, and Service Mesh guides all have stub sections that need to be filled out.

Our goal is to become the best resource on understanding and implementing Distributed Tracing, but to do that, we need your help. Check our list of open issues, see if there’s anything you’d be interested in working on, and make a PR. We’ve also got a channel dedicated to documentation on our Gitter, so feel free to ask any questions you might have there.

Happy tracing!

LightStep, OpenTracing, and Hot Soup

The Hot Soup Story

I’ve always been a problem solver, even as a child. A particularly memorable example of this is the time I decided I wanted soup, but my parents weren’t around. I took stock of my situation, and considered the following points:

  1. You make soup by putting it in a metal pan and heating it on the stove.
  2. Under no circumstances was I allowed to use the stove by myself.
  3. The microwave also makes things hot.
LightStep OpenTracing OSS
Don’t try this at home

Now, when confronted by the fact that the metal pan wouldn’t fit in the microwave due to its large handle, some children might have admitted defeat. I was cut from a different cloth, however, and removed the handle from the pot via a screwdriver, so it would fit into the microwave. I didn’t wind up with hot soup, but I did get a good light show out of the deal.

Over my years working with computers, and computer software, I’ve found similar stories crop up. We’re an industry of problem solvers, but we’re very concerned with what’s happening right now and often lack the critical context for our actions. Too often, we also get pretty light shows, but no hot soup from our changes and actions.

LightStep OpenTracing OSS
Context is key to understanding

Why I Joined LightStep

This is, among many other reasons, why I’m absolutely thrilled to join the team at LightStep as an engineer focusing on our open source projects, including OpenTracing. In the rapidly evolving world of modern software, distributed tracing isn’t a “nice to have”; it’s the single most important component of your software.

Prior to joining the team at LightStep, I led the Tools and Infrastructure team at Apprenda, working on our Enterprise Platform as a Service. I’m excited to bring my expertise and knowledge about the challenges facing both modern and traditional software enterprises to the LightStep team, and I’m even more excited to be a part of the OpenTracing project. During my time at Apprenda, I was a major advocate for OSS and led initiatives to make it easier for internal developers to contribute back to open source projects that we used, and publish new projects of our own. I can say that I really believe that the LightStep team “gets it” in terms of OSS – they understand, deeply, the benefit of open standards and interoperability. In addition, they see the value in having a culture steeped in best practices from the OSS world, both from an engineering perspective and also in internal governance and decision-making.

What’s Up with OpenTracing

I’m very excited about the future of both OpenTracing and distributed tracing in general! During my time at Apprenda, I personally witnessed the journey our customers were taking towards microservices and cloud native architectures, and I have no reason to believe that they’re suddenly going to stop. As the sophistication and complexity of our systems increase, we’re looking at a future that will be won or lost on the strength of how quickly and accurately these systems and their relationships can be observed. OpenTracing makes it possible to learn and implement a single API for observability, which makes it easier to instrument your services and components without worrying about having to support a mishmash of different applications and tools. Ultimately, OpenTracing provides the mechanism to deliver the critical context that you need to understand the behaviors of complicated systems.

Come Join Us

Sound fun? Like making soup too? Come talk to us!