OpenTelemetry Best Practices: Using Attributes
In OpenTelemetry, attributes are arbitrary key-value pairs that provide context to a distributed trace. Attributes give teams the raw data to find meaningful correlations, and a clear view of what was involved when performance changes occur. Whether for root-cause analysis or forward-looking performance optimization, attributes are what make it possible to filter, search through, visualize, or otherwise analyze telemetry in the aggregate.
What sort of attributes make a span, log, or metric “useful”? Anything that can help explain a variation in performance.
Here are some best practices for using attributes in OpenTelemetry:
To start, here are several segments of a software system, with examples of useful attributes for each.
User-related attributes provide context about your application’s users. This includes, but is not limited to:
- Customer segment
- Customer ID (no PII!)
- Geo data
- Device type
- OS version
Software-related attributes provide context about your application’s software. This includes, but is not limited to:
- Deployment ID
- Feature flags
Data-related attributes provide context about your application’s data. This includes, but is not limited to:
- Request/response size
- Whether a cache has been used
- Caching timestamps
Infrastructure-related attributes provide context about your application’s infrastructure. This includes, but is not limited to:
- Host ID
- Data center
- Orchestration (cluster, pod, or node IDs)
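Tying these categories together, here is a minimal sketch (plain Python, with hypothetical attribute names) of assembling one attribute from each category; in real instrumentation these would be attached to a span via your OpenTelemetry SDK:

```python
# Illustrative only: hypothetical attribute names drawn from the
# user/software/data/infrastructure categories above.
def build_span_attributes(customer_segment, deployment_id, cache_hit, host_id):
    return {
        # User-related context (a segment, not PII)
        "customer.segment": customer_segment,
        # Software-related context
        "deployment.id": deployment_id,
        # Data-related context
        "cache.hit": cache_hit,
        # Infrastructure-related context
        "host.id": host_id,
    }

attrs = build_span_attributes("enterprise", "deploy-2024-06-01", True, "host-42")
```

One helper per service keeps the same concepts appearing under the same keys on every span.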
Semantic and standardized attributes help ensure efficient root-cause analysis. Make sure your attributes are clear, descriptive, and apply to the entirety of the resource they are describing.
- Use semantic attribute names
- Define namespaces for your attributes
- This is especially important when multiple service teams have their own standard attributes
- Keep attribute names short and sweet
- Set error attributes on error spans
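The naming rules above lend themselves to an automated check. Here is a small sketch (the exact pattern is an assumption, not an official rule) that enforces lowercase, dot-namespaced attribute names:

```python
import re

# Illustrative naming check: names should be short, lowercase, and
# dot-namespaced (e.g. "myteam.customer.segment"). The regex encodes an
# assumed house style, not an official OpenTelemetry requirement.
_NAME_RE = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$")

def is_well_formed(name: str) -> bool:
    return bool(_NAME_RE.match(name))
```

A check like this can run in CI so that un-namespaced or inconsistently cased attributes never ship.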
Here is an example set of general attributes for a network connection:
| Attribute name | Notes and examples |
|---|---|
| `net.transport` | Transport protocol used. |
| `net.peer.ip` | Remote address of the peer (dotted decimal for IPv4 or RFC 5952 for IPv6). |
| `net.peer.port` | Remote port number as an integer. E.g., `80`. |
| `net.peer.name` | Remote hostname or similar. |
| `net.host.name` | Local hostname or similar. |
Another best practice that is tangential to the naming of attributes is creating a shared library of attributes. A library of known attributes helps you catalog the data you care about, and its documentation creates a record of the data that is important to your customers.
When multiple teams are going to be sharing attributes, it is important to standardize them to avoid discrepancies.
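In code, a shared library can be as simple as one module of constants that every team imports. The names below are hypothetical examples, not official conventions:

```python
# Sketch of a shared attribute "library": one module every team imports,
# so the same concept always uses the same key. Names are made up.
CUSTOMER_ID = "myorg.customer.id"
CUSTOMER_SEGMENT = "myorg.customer.segment"
DEPLOYMENT_ID = "myorg.deployment.id"
FEATURE_FLAGS = "myorg.feature.flags"

STANDARD_ATTRIBUTES = {CUSTOMER_ID, CUSTOMER_SEGMENT, DEPLOYMENT_ID, FEATURE_FLAGS}

def check_known(attributes: dict) -> list:
    """Return any attribute keys not found in the shared catalog."""
    return [k for k in attributes if k not in STANDARD_ATTRIBUTES]
```

Centralizing the keys means a typo like `custid` is caught as "unknown" instead of quietly becoming a second, divergent dimension.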
In terms of attributes and telemetry, cardinality is a measure of the number of different dimensions along which you are recording – and later, possibly querying – telemetry data. Concretely, it’s the product of the number of different values for every attribute, so if you have two attributes with five values each, that’s a cardinality of 25; add in a third tag with three values, now you’re up to 75. The more dimensions to your data (that is, the more attributes you are using and the more values those attributes have), the less data can be aggregated, and the more data that must be stored.
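The arithmetic above is just a product over the distinct value counts, which a one-line sketch makes concrete:

```python
from math import prod

# Cardinality is the product of the number of distinct values per attribute.
def cardinality(value_counts):
    return prod(value_counts)

# Two attributes with five values each:
assert cardinality([5, 5]) == 25
# Adding a third attribute with three values:
assert cardinality([5, 5, 3]) == 75
```

Note how quickly this grows: every new attribute multiplies, rather than adds to, the total.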
With traditional metrics tools, developers need to be mindful of the costs associated with highly-dimensional data, as each new dimension adds to cost, and since cardinality can grow quickly, so can that cost. However, OpenTelemetry provides exporters to many sorts of tools, not just metrics solutions. Some, like distributed tracing and other generic observability platforms, use intelligent techniques to aggregate data, meaning that you are free to use whatever attributes you’d like, regardless of the resulting cardinality.
Head over to Dimensionality in Observability for more depth on the topic.
When deciding what attributes to tag your trace data with, remember that your application’s focus is to provide a high-quality software experience to customers. This mission is encoded into your service/application’s Service Level Objectives (SLOs), maybe in the form of a 99.999% uptime expectation. From the SLO, you can narrow down which Service Level Indicators (SLIs) best support or are most likely to threaten achieving SLOs. Your attributes should support your service levels.
For example, if you have latency SLOs that differ between customers, leveraging attributes that provide customer dimensionality, like `customerID`, can help you organize alerts accordingly.
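As a sketch of that idea (thresholds, keys, and span data are invented for illustration), a customer-segment attribute lets one alerting check apply different latency SLOs per customer:

```python
# Hypothetical per-segment latency SLOs, in milliseconds.
SLO_MS = {"enterprise": 200, "free": 1000}

def breaches(spans):
    """Return (customer_id, latency) for spans over their segment's SLO."""
    return [
        (s["customer.id"], s["latency_ms"])
        for s in spans
        if s["latency_ms"] > SLO_MS[s["customer.segment"]]
    ]

spans = [
    {"customer.id": "c-1", "customer.segment": "enterprise", "latency_ms": 350},
    {"customer.id": "c-2", "customer.segment": "free", "latency_ms": 350},
]
```

The same 350 ms span breaches the enterprise SLO but is well within the free-tier one; without the customer attributes, the two are indistinguishable.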
Think of attributes as the root source of pattern-matching in a distributed system. If you want to investigate relationships across and between categories, attributes are the vehicle for sorting and comparing.
Incrementally experiment with different attributes and see what shakes out! Let’s consider an example.
Customers contacting support because of latency? Didn’t ‘Service Z’ roll out a new build this morning? Correlating an attribute like `version` directly against latency would help make any connection between new deploys and a performance regression very clear.
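In its simplest form, that correlation is just grouping latencies by the version attribute and comparing averages. A sketch with invented data:

```python
from statistics import mean
from collections import defaultdict

# Sketch: group span latencies by a version attribute and compare means.
# Attribute names and numbers are illustrative only.
def latency_by_version(spans):
    groups = defaultdict(list)
    for s in spans:
        groups[s["service.version"]].append(s["latency_ms"])
    return {version: mean(latencies) for version, latencies in groups.items()}

spans = [
    {"service.version": "1.4.0", "latency_ms": 120},
    {"service.version": "1.4.0", "latency_ms": 130},
    {"service.version": "1.5.0", "latency_ms": 480},
]
```

If the new version’s average latency is several times the old one’s, the deploy is the obvious first suspect.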
We’ve been focusing on the ‘Do’s’ of attributes, but here is a closer look at some attribute pitfalls to avoid:
- Attributes that are not semantic! Always use semantic attributes
- The attribute you don’t use can’t help you
- Tags with a cardinality of 1 are usually not helpful, though they can be useful if you’re trying to integrate a monitoring system using UUIDs
- Don’t put stack traces, etc. in attributes. Use events for this instead
- Beware tag key duplication (either overwriting a key on the same span, or recording the same value under two different names)
- Beware unset values
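Two of these pitfalls – duplicate keys and unset values – can be caught with a pre-flight check before attributes are attached to a span. A sketch (the audit function and its message format are hypothetical):

```python
# Sketch of a pre-flight audit for duplicate keys and unset values.
# Attributes arrive as (key, value) pairs, since a plain dict would
# already have silently overwritten any duplicates.
def audit(pairs):
    problems = []
    seen = set()
    for key, value in pairs:
        if key in seen:
            problems.append(f"duplicate key: {key}")
        seen.add(key)
        if value is None or value == "":
            problems.append(f"unset value for: {key}")
    return problems
```

Running a check like this in tests or instrumentation middleware surfaces silent overwrites and empty dimensions before they pollute your telemetry.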
Attributes give developers more detailed telemetry. As an opportunity to provide more detail, attributes should be clear, descriptive, and useful.
The most important factor to consider is whether or not the attributes you are tracking and storing can explain variations in performance.