OpenTelemetry is an open-source observability framework for generating, capturing, and collecting telemetry data for cloud-native software. In the previous posts in this series, I discussed what observability is, as it relates to OpenTelemetry, and what tracing is. Today, I’d like to cover the second major area of the OpenTelemetry API: metrics.

OpenTelemetry’s metrics API supports reporting diagnostic measurements from a service using three basic kinds of instruments, commonly known as counters, gauges, and measures. Developers use these instruments to gain visibility into operational metrics about their service.

Most developers are familiar with metrics in some fashion. It’s extremely common, for instance, to monitor metric values such as process memory utilization or error rate, and to create alerts when a service violates a predetermined threshold. Beyond these common measurements, the metric events streaming through these instruments can be used in other ways as well, such as being aggregated or recorded by tracing and logging systems. With that in mind, let’s look at the instruments available through the OpenTelemetry Metrics API and discuss how they can be used.

The API distinguishes metric instruments by their semantic meaning rather than by the eventual type of the value they export. This is somewhat unconventional, and it stems from the design of OpenTelemetry itself: the separation between the API and the SDK means the SDK ultimately determines what happens with any specific metric event, and it could implement a given instrument in a non-obvious or non-standard way. If you’re familiar with existing metrics APIs (such as the Prometheus API), this explains why there’s no method to export a histogram or summary distribution. These are considered to be measures, and the SDK can be configured to export a histogram or a summary from a measure.

From a developer’s point of view, the exact backing instrument is somewhat opaque by design, as the API is designed for portability. Developers use the Add, Set, and Record methods with optional declarations to set restrictions on the specific measure (for example, to allow a counter to support positive and negative values) and the SDK takes care of the rest. That said, which of these should you use for any given scenario?

If you’re trying to record a count of something – such as the sum of errors over time – then you should use the Add method, which is supported by a counter. A counter is, by default, monotonic, which means it only expects positive values. This property allows a counter by default to be interpreted as a rate measurement. Rates are a comparison between two quantities – bytes transferred per second, for example. Non-monotonic counters are also supported, which can be useful for reporting changes in a quantity (like the number of elements in a set as they’re added and removed).
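To make the counter semantics concrete, here’s a toy sketch in Python. This is purely illustrative and not the actual OpenTelemetry SDK; the class name and fields are invented for this example. The key behavior is that a monotonic counter rejects negative deltas, while a non-monotonic one accepts them:

```python
# Toy illustration of counter semantics -- not the real OpenTelemetry SDK.
class Counter:
    def __init__(self, name, monotonic=True):
        self.name = name
        self.monotonic = monotonic  # monotonic counters only go up
        self.value = 0

    def add(self, delta):
        if self.monotonic and delta < 0:
            raise ValueError(f"{self.name}: monotonic counter cannot decrease")
        self.value += delta

error_count = Counter("request.errors")  # monotonic: interpretable as a rate
error_count.add(1)
error_count.add(3)

queue_depth = Counter("queue.depth", monotonic=False)  # tracks a changing quantity
queue_depth.add(5)
queue_depth.add(-2)
```

Because a monotonic counter can only increase, a backend can safely compute rates from it (e.g. errors per second); the non-monotonic variant is the one to reach for when elements can leave the set being counted.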

What about circumstances where you need a count, but not a quantity? For example, when you’re measuring something over an arbitrary interval, or you only care about the current value rather than the rate of change? This is when you would use a gauge, by calling Set. You can think of a gauge like a thermometer, or a ratio (such as memory consumed vs. total memory available). As you might expect, these instruments are non-monotonic by default, but a monotonic option is available as well. Why? To report a sum, versus a change: if you already have a precomputed sum, Set it on a gauge, but if you’re reporting individual increments, Add them to a counter.
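A gauge sketch in the same illustrative style (again, not the real SDK): the defining behavior is that Set replaces the current value rather than accumulating it, and the optional monotonic mode rejects decreases.

```python
# Toy illustration of gauge semantics -- not the real OpenTelemetry SDK.
class Gauge:
    def __init__(self, name, monotonic=False):
        self.name = name
        self.monotonic = monotonic
        self.value = None  # no value until the first Set

    def set(self, value):
        if self.monotonic and self.value is not None and value < self.value:
            raise ValueError(f"{self.name}: monotonic gauge cannot decrease")
        self.value = value  # replace, don't accumulate

memory_used = Gauge("process.memory.used")  # non-monotonic: rises and falls
memory_used.set(512)
memory_used.set(384)

odometer = Gauge("car.odometer", monotonic=True)  # may only increase
odometer.set(10_000)
odometer.set(10_042)
```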

Finally, what should you use when you care about individual measurements? Consider a circumstance where you want to know both the count and sum of a given event, or where you’re interested in grouping events into quantiles. You would use Record for these values, as it expresses a measure. A common application of measures is to build a histogram or summary of values, which you could use to calculate averages, like the average size of a particular response or the average duration of HTTP requests.
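One more toy sketch, for measures (not the real SDK, and the summary method is an invented stand-in for what an exporter would do): Record captures individual values, and aggregation into a count/sum/average summary or a histogram is left to the SDK.

```python
# Toy illustration of measure semantics -- not the real OpenTelemetry SDK.
class Measure:
    def __init__(self, name):
        self.name = name
        self.values = []  # raw events; aggregation is the SDK's job

    def record(self, value):
        self.values.append(value)

    def summary(self):
        # Stand-in for an exporter-side aggregation (count, sum, average).
        count = len(self.values)
        total = sum(self.values)
        return {"count": count, "sum": total, "avg": total / count}

latency = Measure("request.latency")
for millis in (120, 310, 50):  # individual request latencies, in ms
    latency.record(millis)
```

From those three recorded events, the summary would report a count of 3, a sum of 480, and an average of 160 ms; a histogram exporter would instead bucket the same raw values.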

This is getting rather long, so let’s do a quick review. There are three instruments you can use in OpenTelemetry, each defined by the method you call to send a metric event. They are:

  • Counters, which you Add a value to. These are good for values that you’d like to think of as a rate, or changes in a quantity.
  • Gauges, which you Set the value of. You can think of these as either a car’s odometer (a monotonic gauge, it never decreases) or a car’s speedometer (a non-monotonic gauge, as it can go up and down.)
  • Measures, to which you Record a value. These are useful to build histograms or summaries, metric projections that let you calculate averages of many values.

The exact mechanism by which you’ll use each of these instruments is a bit language-specific – OpenTelemetry, by design, allows each language SIG to implement the API in a way that is conventional for its language. This means the exact details of creating a new metric event may not match the specification precisely; consult the documentation for your particular OpenTelemetry implementation for more information. At a high level, however, here’s how it works.

First, you’ll need to create an instrument of the appropriate kind and give it a descriptive name. Each instrument name needs to be unique inside its process. You can also provide label keys, optional keys that the SDK uses to optimize the metric export pipeline. You’ll also need to initialize a LabelSet, which is a set of labels (both keys and values) that correspond to attributes being set on your metric events. What does this look like?

// in our init or in middleware...
requestKeys = ["path", "host"] // label keys

// now we initialize instruments for our keys, a Count and a Measure
requestBytes = NewIntCounter("request.bytes", WithKeys(requestKeys), WithUnit(unit.Bytes))
requestLatency = NewFloatMeasure("request.latency", WithKeys(requestKeys), WithUnit(unit.Second))

// then, define the labels for each key (and add more if required)
labels = meter.DefineLabels({path: "/api/getFoo/{id}", host: host.name})

Again, the specifics are going to be slightly different for each language, but the basic gist is the same – early on, you create the actual metric instruments by giving them a name and telling them what keys they’ll be seeing. After that, you either explicitly define labels for your instruments or get a blank label set, if appropriate. Once you’ve done this, actually recording metric events is fairly straightforward.

requestBytes.Add(labels, req.bytes)
requestLatency.Record(labels, req.latency)

Another option, for high-performance scenarios, is to use a handle. A handle effectively skips a step by precomputing the combination of instrument and label set (recall that you can add arbitrary labels to a label set, and calling an instrument through that label set adds the event to each combination of label and instrument). This is useful if you’re in a performance-critical section of code.
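Continuing the toy sketches (not the real SDK; the class and method names here are invented), a handle can be modeled as an instrument bound to one precomputed label set, so the hot loop skips the per-call label normalization and lookup:

```python
# Toy illustration of handle semantics -- not the real OpenTelemetry SDK.
class Instrument:
    def __init__(self, name):
        self.name = name
        self.data = {}  # normalized label set -> accumulated value

    def add(self, labels, delta):
        # Direct calling convention: labels are normalized on every call.
        key = tuple(sorted(labels.items()))
        self.data[key] = self.data.get(key, 0) + delta

    def get_handle(self, labels):
        # Normalize the labels once, up front.
        return Handle(self, tuple(sorted(labels.items())))

class Handle:
    def __init__(self, instrument, key):
        self.instrument = instrument
        self.key = key  # precomputed label key

    def add(self, delta):
        # No label sorting or hashing work repeated on this hot path.
        data = self.instrument.data
        data[self.key] = data.get(self.key, 0) + delta

request_bytes = Instrument("request.bytes")
handle = request_bytes.get_handle({"path": "/api/getFoo/{id}", "host": "h1"})
for size in (128, 256, 64):  # a batch of request sizes
    handle.add(size)
```

The design trade-off is the usual one: the direct calling convention is simpler, while the bound handle pays the label-processing cost once instead of per event.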

requestBytesHandle = requestBytes.GetHandle(labels)
requestLatencyHandle = requestLatency.GetHandle(labels)

for req in requestBatch {
    
    requestBytesHandle.Add(req.bytes)
    requestLatencyHandle.Record(req.latency)
    
}

One thing to remember with handles, however, is that you’re responsible for cleaning up after you’re done, by freeing and deallocating the handle!

So, to summarize one more time:

  • Metric events are recorded through instruments.
  • You need to create an instrument for each metric you want to record, and give it a name that is unique within its process.
  • You can apply additional metadata to your instrument through labels.
  • Recording metric events is performed either by calling the appropriate method on the instrument itself, or by getting a precomputed handle and calling the method there.

Hopefully, this has given you a better understanding of how the OpenTelemetry metrics API functions. It’s a lot to take in! The real power of OpenTelemetry, however, isn’t just that it provides a tracing and metrics API for you to use. In my next post, I’ll cover the exporter model, and the OpenTelemetry Collector, which are the ways you’ll get data out of OpenTelemetry and into analysis systems such as LightStep.