OpenTelemetry Collector in Kubernetes: Get started with autoscaling

When sending data to Lightstep using an OpenTelemetry Collector in Kubernetes, you can take advantage of a Horizontal Pod Autoscaler to better manage fluctuating workloads.

A Horizontal Pod Autoscaler (HPA) is useful if telemetry traffic to the OpenTelemetry Collector fluctuates. If the volume of telemetry received by the Collector spikes, there's a risk of exceeding the configured resource limits, potentially causing the Collector pod to crash. Without an autoscaler, handling these spikes requires configuring higher resource limits up front. This is inefficient because the higher limits are only needed during the spikes, and only in an ideal world would you know ahead of time when those spikes will occur (e.g. holidays or major events).

With autoscaling, the HPA monitors the Collector pods' resource metrics and triggers a scale-up or scale-down based on the configuration. The added flexibility is convenient because any pods created during a period of increased telemetry load are removed once they are no longer needed. Additionally, after scaling up, the Collector pool no longer has a single point of failure while telemetry data is spiking.

Kubernetes Cluster in Lightstep

The diagram above represents the namespace (in an existing Kubernetes cluster) that contains the OTel Operator and OTel Collector. The Operator is responsible for creating the initial OTel Collector pod and the HPA. The HPA is then responsible for monitoring the Collector pool's average metric utilization and scaling the pool when the collectors exceed the configured threshold.

How to set up the HPA

The HPA can be enabled using the operator's OpenTelemetryCollector Custom Resource Definition (CRD). The OpenTelemetry Operator will create the Collector and the Horizontal Pod Autoscaler (HPA) based on the configuration in the CRD.

In the CRD spec, set the minReplicas and maxReplicas fields to create the autoscaler. Then, in the autoscaler field, you can further specify scaling behavior:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: {{ .Release.Name }}-collector
spec:
  mode: statefulset
  image: otel/opentelemetry-collector-contrib:latest
  minReplicas: 1
  maxReplicas: 10
  autoscaler:
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 10
      scaleDown:
        stabilizationWindowSeconds: 15
    targetCPUUtilization: 70
...
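
One detail worth noting: targetCPUUtilization is a percentage of the CPU the Collector pods request, so the pods need CPU requests defined for the HPA to compute utilization. A minimal sketch using the CRD's resources field (the request and limit values below are placeholders, not recommendations):

spec:
  mode: statefulset
  resources:
    # The HPA compares observed CPU usage against these requests.
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 1Gi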

Once you've made the necessary changes, you can apply the updated config to your cluster.
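
For example, if the CRD above is rendered as part of a Helm chart (as the {{ .Release.Name }} template value suggests), it can be applied with a Helm release; a plain manifest can be applied directly with kubectl. The release, chart, file, and namespace names below are placeholders:

$ helm upgrade --install my-release ./my-collector-chart -n <your-namespace>
$ kubectl apply -f collector.yaml -n <your-namespace>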

Confirm it’s working

To verify that the HPA has been created by the operator, run the command:

$ kubectl get hpa -n <your-namespace>
NAME                            REFERENCE                                    TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
lightstep-collector-collector   OpenTelemetryCollector/lightstep-collector   50%/70%   1         40        10         27m

More information about the HPA can be found by running:

$ kubectl describe hpa <hpa-name> -n <your-namespace>
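
To watch the autoscaler react to load in real time, kubectl's --watch flag streams updates as the replica count changes (the HPA name here matches the earlier output):

$ kubectl get hpa lightstep-collector-collector -n <your-namespace> --watch
$ kubectl get pods -n <your-namespace> --watch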

The image below shows a collector pool experiencing a metrics spike and the autoscaler increasing the number of pods to handle the increased load. If the Collector's metrics are exported to Lightstep, you can view them to track performance while autoscaling. After the metrics spike, the autoscaler scales the collector pool back down to a single pod.

Otelcol process uptime grouped by collector name


This is an example of a Lightstep dashboard that monitors incoming metrics. In the OTel Collector config, metrics are received by the Prometheus receiver and exported to Lightstep. I also configured an HPA for my Collector and then increased the metric load it was receiving. This verified that more pods are added to the Collector pool to handle the spike, and that the additional pods are removed from the pool after the spike ends. The metric displayed is otelcol_process_uptime, grouped by collector_name.
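
As a rough sketch of that pipeline, the config section of the OpenTelemetryCollector CRD could look something like the following. The scrape job, target, and access token are placeholders; the Lightstep endpoint and header reflect its public OTLP ingest documentation, so double-check them against the current docs:

spec:
  config: |
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: 'otel-collector'
              scrape_interval: 30s
              static_configs:
                # The collector's own internal metrics endpoint.
                - targets: ['0.0.0.0:8888']
    exporters:
      otlp:
        endpoint: ingest.lightstep.com:443
        headers:
          "lightstep-access-token": "<your-access-token>"
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [otlp]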

A note on Prometheus receivers: If you're using a Prometheus receiver as part of your pipeline, the preferred workload manager for the collector pool is mode: statefulset. This lets you leverage the Target Allocator to distribute scrape targets among the collectors. Using mode: deployment is not ideal because, without the Target Allocator to balance the load, scrape targets might be assigned to only one collector (or multiple collectors might scrape the same targets). If a deployment is necessary, a Vertical Pod Autoscaler might be preferable for increasing or decreasing the resources of the single pod under load.
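
A sketch of what that looks like in the CRD, assuming the operator's targetAllocator field; the allocator runs as its own pod and hands out scrape targets to the collector replicas:

spec:
  mode: statefulset
  targetAllocator:
    # Distributes Prometheus scrape targets across collector replicas.
    enabled: true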

Latest enhancements

In the most recent OpenTelemetry Operator release, v0.66.0, the following enhancements to the autoscaler are available:

  • Expose Horizontal Pod Autoscaler behavior to modify scaling behavior

  • Upgrade Horizontal Pod Autoscaler from autoscaling/v1 to simultaneous support of autoscaling/v2 and autoscaling/v2beta2

  • Customize value for Average CPU Utilization

Upcoming enhancements

As the community continues to work on the HPA, here are a few of the features that are planned:

  • Expose functionality to scale on memory (currently only CPU is an available resource target)

  • Expose functionality to scale on custom metrics (supported in autoscaling/v2)

  • Clean up spec so that all autoscaler related fields are under the autoscaler field in the CRD

Note: HPA support in the OTel Operator is currently under active development, so functionality may be limited as more features are added.

Conclusion

While it can be stressful to predict the amount of resources your OTel collector will need ahead of time, it can be even more stressful to try to configure the amount of resources in the middle of a spike. Configure too little CPU or memory and a traffic spike could kill the collector and cause data to drop until the spike is resolved. Configure too much, however, and you’re using resources inefficiently.

The Horizontal Pod Autoscaler eliminates the guesswork, which improves operational efficiency, reduces cost, and helps improve reliability. After the HPA autoscales up, traffic spikes are distributed over several pods which means there is no single point of failure. Autoscaling is yet another valuable feature the OpenTelemetry Operator provides to give you more flexibility.

January 6, 2023
5 min read
OpenTelemetry

About the author

Moh Osman
