One of my favorite things about LightStep is helping to mentor the next generation of developers through our internship programs. These students bring fresh eyes and novel solutions to technically challenging areas of our product. Over the summer, my team had the opportunity to work with one of our interns, Isaac, on an area of the product that’s near and dear to all of our hearts here: benchmarking and profiling for high-performance tracing. As part of this work, we’ve produced an experimental, high-performance streaming tracer for Python that uses native bindings to our C++ libraries, along with a benchmarking suite to simulate extremely high-throughput tracing scenarios.

We’ve always had a strong commitment to code quality, but it can be challenging to test high-throughput scenarios for our tracing libraries, as the nature of “high throughput” can differ depending on the exact topology of a system being traced. In order to quantify the changes we were making in the experimental library, we built a new tool which simulates a variety of real-world scenarios: a process handling tens of thousands of requests per second, a CPU under maximum utilization, or even inconsistent network connectivity between a process and a LightStep Satellite. One guiding principle behind developing this benchmark suite was to ensure that we’re testing in production-like conditions.

Creating a benchmarking suite that accurately models ‘production systems’ was instructive not only for Isaac, but also for us. Having someone with an outside perspective asking questions forces you to re-evaluate your preconceptions and double-check the things you already ‘know’. An early version of the benchmarking tool, for example, used a simpler algorithm:

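# `tracer` is an already-configured, OpenTracing-compatible tracer.
for i in range(1000):
    # Start and immediately finish an empty span; the operation name is a placeholder.
    with tracer.start_active_span("benchmark-span"):
        pass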

What’s wrong with this? We’re generating a bunch of spans and seeing how long it takes to process them, right? Well, real spans generally carry additional data, such as logs and tags. Moreover, a tight loop like this benefits a great deal from local caching, which artificially speeds up span creation and moves us further away from a production-like scenario. Instead, the tool was modified to benchmark what effectively became self-contained client processes:

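while time.time() < end_time:
    # Each iteration creates one span that carries logs and tags, as a real request would.
    with tracer.start_active_span("benchmark-span") as scope:
        scope.span.log_kv({"event": "request-handled"})  # attach a log (placeholder value)
        scope.span.set_tag("benchmark", True)  # attach a tag (placeholder value)
        do_work()  # simulated application workload
    time.sleep(rand_interval)  # pause for a random interval between spans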

These processes can be run for an arbitrary amount of time with a random sleep, giving us a less synthetic and more realistic set of performance data.
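
To make the shape of such a client process concrete, here is a minimal, self-contained sketch. It is not the benchmarking suite itself: `StubTracer`, `run_client`, `do_work`, and the tag and log values are hypothetical names invented for this illustration. The stub only mimics the OpenTracing-style `start_active_span`, `set_tag`, and `log_kv` calls so that the loop runs without any tracing library installed; in a real benchmark you would pass in the tracer under test instead.

```python
import random
import time


class _StubSpan:
    """Hypothetical stand-in for a span; records tags and logs in memory."""

    def __init__(self):
        self.tags = {}
        self.logs = []

    def set_tag(self, key, value):
        self.tags[key] = value
        return self

    def log_kv(self, key_values):
        self.logs.append(key_values)
        return self


class _StubScope:
    """Context manager that exposes `.span`, in the style of an OpenTracing scope."""

    def __init__(self, span, sink):
        self.span = span
        self._sink = sink

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self._sink.append(self.span)  # "finish" the span when the block exits
        return False


class StubTracer:
    """Hypothetical tracer so the sketch runs without a tracing library."""

    def __init__(self):
        self.finished = []

    def start_active_span(self, operation_name):
        return _StubScope(_StubSpan(), self.finished)


def do_work(n=1000):
    """Placeholder for the application work a real request would perform."""
    return sum(i * i for i in range(n))


def run_client(tracer, duration_s=1.0, max_sleep_s=0.005):
    """Emit spans with a tag and a log for `duration_s` seconds.

    Returns (span_count, cpu_seconds) for the whole process.
    """
    span_count = 0
    cpu_start = time.process_time()
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        with tracer.start_active_span("benchmark-request") as scope:
            scope.span.set_tag("scenario", "steady-load")
            scope.span.log_kv({"event": "work-started"})
            do_work()
        span_count += 1
        time.sleep(random.uniform(0, max_sleep_s))  # random pause between spans
    return span_count, time.process_time() - cpu_start
```

`time.process_time()` counts CPU time and excludes time spent sleeping, so the random pauses stay out of the measurement. The figure still includes the cost of `do_work`, so isolating tracer overhead would need a baseline run with a no-op tracer for comparison.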

Creating this benchmark suite was an important first step in quantifying the results of our experimental high-performance tracers. As part of our commitment to being a best-of-breed tracing platform, we need to offer tracers that suit a wide range of applications. One such example is our Python tracer. Python is a popular, general-purpose programming language, and it’s becoming increasingly common to see it deployed in high-performance computing situations, powering features such as data analysis, recommendation engines, and other machine learning tasks. In these circumstances, it can be beneficial to sacrifice some of the ease of use of a pure Python tracer and use native bindings to C/C++ libraries to gain performance. We recently shipped a new streaming protocol in our C++ tracer that forwards trace data to multiple Satellites concurrently, using non-blocking writes. Memory management is key to this strategy: we copy each span only once during the reporting process, and we forward spans as they’re generated to avoid excessive allocations. The benchmarking suite allowed us not only to view the results of this work, but also to compare it to our existing pure Python tracer.
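
The general idea of serializing a span once and forwarding the same buffer to several endpoints with non-blocking writes can be sketched in a few lines of Python. This is only an illustration of the pattern, not the C++ tracer’s implementation or its wire protocol: `StreamingForwarder`, `serialize_span`, and the `key=value` line format are hypothetical, and plain sockets stand in for Satellite connections.

```python
import socket
from collections import deque


def serialize_span(span):
    """Encode a span dict as one line of sorted key=value pairs (made-up format)."""
    body = ";".join(f"{key}={value}" for key, value in sorted(span.items()))
    return (body + "\n").encode("utf-8")


class StreamingForwarder:
    """Forwards every span to all endpoints without blocking the caller."""

    def __init__(self, socks):
        self._pending = {}
        for sock in socks:
            sock.setblocking(False)  # send() now raises instead of waiting
            self._pending[sock] = deque()

    def report(self, span):
        # Serialize exactly once; every endpoint queues a view of the same bytes.
        payload = memoryview(serialize_span(span))
        for queue in self._pending.values():
            queue.append(payload)
        self.flush()

    def flush(self):
        for sock, queue in self._pending.items():
            while queue:
                try:
                    sent = sock.send(queue[0])
                except BlockingIOError:
                    break  # endpoint is not ready; try again on the next flush
                if sent < len(queue[0]):
                    queue[0] = queue[0][sent:]  # slicing a memoryview does not copy
                else:
                    queue.popleft()
```

A slow or unreachable endpoint only causes its own queue to grow; the other endpoints keep receiving spans as they are reported. A fuller version would presumably drive `flush` from an event loop and cap the size of each queue.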

This chart demonstrates a sample of the results, showing the difference between the experimental tracer and the standard one. At around 5,000 spans per second, the streaming tracer offers a 10x reduction in the tracer’s CPU usage, an impressive performance difference! Keep in mind, however, that this isn’t suitable for all deployment configurations: certain serverless or containerized deployments will have difficulty using the native bindings. You can try it out for yourself by installing the lightstep-streaming package from PyPI and modifying your tracer configuration. Remember that this is still an experimental package!

Where do we go from here? Our next step is to continue to build on the foundation of the benchmarking package by donating it to the OpenTelemetry project, ensuring that future telemetry instrumentation packages are built to prioritize performance in all languages. We’d also love to see OpenTelemetry offer native bindings for Python and other languages where it makes sense, and we’re interested in seeing how these high-performance options are adopted.

Ultimately, telemetry data should be collected without causing unacceptable levels of overhead for your service, and we’re excited to push the state of the art forward for the observability community as a major contributor to OpenTelemetry. I’d like to thank Isaac for the research that went into this post, as well as all of our other interns, who continue to make real, customer-impacting improvements to our product. Our 2020 program is accepting applicants now. If you’d like to be part of the team, get in touch!