When sending data to Lightstep using an OpenTelemetry Collector in Kubernetes, you can take advantage of a Horizontal Pod Autoscaler to better manage fluctuating workloads.
A Horizontal Pod Autoscaler (HPA) is useful if telemetry traffic to the OpenTelemetry Collector fluctuates. If the volume of telemetry received by the collector spikes, there's a risk of exceeding the configured resource limits, potentially causing the collector pod to crash. Without an autoscaler, handling these spikes requires configuring higher resource limits. This is inefficient because the higher limits are only needed while traffic to the collector is spiking. In an ideal world, you would know in advance when these spikes will occur (e.g., holidays or major events).
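To make that concrete, the sketch below shows where those resource limits live in the OpenTelemetryCollector CRD (hypothetical values): the limits have to be sized for the worst-case spike even though that headroom sits idle most of the time.
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: my-collector          # hypothetical name
spec:
  mode: statefulset
  resources:
    requests:
      cpu: 500m               # typical steady-state load
      memory: 512Mi
    limits:
      cpu: "2"                # provisioned only for the occasional peak
      memory: 2Gi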
With autoscaling, the HPA will monitor the collector pod’s resource metrics and trigger a scale up or down based on the configuration. The added flexibility is convenient because any pods created during a period of increased telemetry load will eventually be removed when they are no longer needed. Additionally, after scaling up, the collector pool will not have a single point of failure when telemetry data is spiking.

The diagram above represents the namespace (in an existing Kubernetes cluster) that contains the Otel Operator and Otel Collector. It shows that the Operator is responsible for creating the initial Otel Collector pod and the HPA. The HPA is then responsible for monitoring the collector pool's average metric utilization and scaling the pool when the collectors exceed the configured threshold.
How to set up the HPA
The HPA can be enabled using the operator's OpenTelemetryCollector Custom Resource Definition (CRD). The OpenTelemetry Operator will create the Collector and the Horizontal Pod Autoscaler (HPA) based on the configuration in the CRD.
In the CRD spec, make sure to set the minReplicas and maxReplicas fields to create the autoscaler. Then, in the autoscaler field, you can further specify scaling behavior:
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: {{ .Release.Name }}-collector
spec:
  mode: statefulset
  image: otel/opentelemetry-collector-contrib:latest
  minReplicas: 1
  maxReplicas: 10
  autoscaler:
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 10
      scaleDown:
        stabilizationWindowSeconds: 15
    targetCPUUtilization: 70
  ...
Once you've made the necessary changes, you can apply the updated config to your cluster.
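For example, assuming the manifest above is saved as collector.yaml (hypothetical file name), it can be applied with:
$ kubectl apply -f collector.yaml -n <your-namespace>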
Confirm it’s working
To verify that the HPA has been created by the operator, run the command:
$ kubectl get hpa -n <your-namespace>
NAME                            REFERENCE                                     TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
lightstep-collector-collector   OpenTelemetryCollector/lightstep-collector    50%/70%   1         40        10         27m
More information about the HPA can be found by running:
$ kubectl describe hpa <hpa-name> -n <your-namespace>
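To see scaling events as they happen, you can also watch the pods in the collector's namespace and observe replicas being added or removed:
$ kubectl get pods -n <your-namespace> -w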
The image below shows a collector pool experiencing a metrics spike and the autoscaler increasing the number of pods to handle the increased load. If metrics are being exported to Lightstep, then it's possible to view your collector metrics while using autoscaling to track performance. After the metrics spike, the autoscaler scales the collector pool back down to a single pod.

This is an example of a Lightstep dashboard that monitors incoming metrics. In the Otel Collector config, metrics are received using the Prometheus Receiver and are exported to Lightstep. I also configured an HPA for my collector and increased the metrics load my collector was experiencing. This verified that more pods are added to the collector pool to handle the metric spike, and that after the spike, the additional pods are removed from the collector pool. The metric displayed is otelcol_process_uptime, grouped by collector_name.
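For reference, a minimal sketch of a collector config along those lines is shown below. It assumes the collector scrapes its own Prometheus metrics endpoint on port 8888 and exports over OTLP/gRPC to Lightstep, with the access token supplied through a hypothetical LS_ACCESS_TOKEN environment variable; adjust the scrape targets and credentials for your setup.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector            # scrape the collector's own metrics
          scrape_interval: 15s
          static_configs:
            - targets: ["0.0.0.0:8888"]
exporters:
  otlp:
    endpoint: ingest.lightstep.com:443
    headers:
      "lightstep-access-token": "${LS_ACCESS_TOKEN}"   # hypothetical env var
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]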
A note on Prometheus Receivers: If you're using a Prometheus receiver as part of your pipeline, the preferred workload manager for the collector pool is mode: statefulset. This allows the user to leverage the Target Allocator to distribute scrape targets among the collectors. Using mode: deployment is not ideal because, without the Target Allocator to load balance, scrape targets might get assigned to only one collector (or multiple collectors would scrape the same targets). If using a deployment is necessary, a Vertical Pod Autoscaler might be preferred to increase/decrease resources for the single pod that is under load.
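As a sketch, enabling the Target Allocator in the same CRD looks roughly like the following (assuming the targetAllocator field of the v1alpha1 spec; field names can change between operator versions):
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: {{ .Release.Name }}-collector
spec:
  mode: statefulset          # the Target Allocator requires statefulset mode
  targetAllocator:
    enabled: true            # the operator deploys a Target Allocator for the pool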
Latest enhancements
In the most recent OpenTelemetry Operator release, v0.66.0, the following enhancements to the autoscaler are available:
Expose Horizontal Pod Autoscaler behavior to modify scaling behavior
Upgrade Horizontal Pod Autoscaler from autoscaling/v1 to simultaneous support of autoscaling/v2 and autoscaling/v2beta2
Customize value for Average CPU Utilization
Upcoming enhancements
As the community continues to work on the HPA, here are a few of the features that are planned:
Expose functionality to scale on memory (currently only CPU is an available resource target)
Expose functionality to scale on custom metrics (supported in autoscaling/v2)
Clean up spec so that all autoscaler related fields are under the autoscaler field in the CRD
Note: The HPA for the Otel Operator is currently under active development, so usage may be limited as more features are added.
Conclusion
While it can be stressful to predict the amount of resources your OTel collector will need ahead of time, it can be even more stressful to try to configure the amount of resources in the middle of a spike. Configure too little CPU or memory and a traffic spike could kill the collector and cause data to drop until the spike is resolved. Configure too much, however, and you’re using resources inefficiently.
The Horizontal Pod Autoscaler eliminates the guesswork, which improves operational efficiency, reduces cost, and helps improve reliability. After the HPA autoscales up, traffic spikes are distributed over several pods which means there is no single point of failure. Autoscaling is yet another valuable feature the OpenTelemetry Operator provides to give you more flexibility.