
From Day 0 to Day 2: Reducing the anxiety of scaling up cloud-native deployments

Kubernetes was specifically built to support massive scale and rapid elasticity, but deploying it at scale can still present worrying challenges for development and operations teams.

We’ve seen a unique ‘potato chip’ pattern in Kubernetes adoption within enterprises: once the first K8s clusters pop, teams can’t stop. Initial success leads them to quickly deploy two, then 20, then 200 clusters, and then what seems like far too many headed for production.

Faced with this increasing complexity, can development and operations teams efficiently scale operations while keeping everything secure and working properly, without burning everyone out in the process?

There’s no good reason for anxiety about scale and speed when so many other organizations have already reaped the benefits of automation and observability for cloud-native delivery.

Dissolving the illusion of mastery

When any game-changing technology paradigm appears, the first instinct of a well-established software delivery organization is to take a measured approach and hold onto well-governed processes and tools they already have to mitigate risk. 

Often, an ‘innovation team’ will experiment with new technologies and alternative use cases in relative isolation, while the rest of the business carries on as usual. This is natural for a space like cloud-native computing, where the rate of technology change outpaces the availability of documentation and learning.

While keeping existing platforms and delivery pipelines may defer some risk, it also defers the real transformation necessary for companies to thrive in a hypercompetitive market – not just for customers, but also in the hunt for development talent.

A new generation of "born in the cloud" developers and engineers is entering the talent market every day, having never known the old ways of doing things. The ability to work in a modern cloud-native shop and participate in open source communities isn’t just a perk or a hobby; it’s something the best talent will expect from an employer.

Growing operational awareness with limited resources

Even with the titans of technology taking this uncertain opportunity to lay off 10% or more of their workforces, there’s not going to be relief for most hiring managers anytime soon. The average medium-to-large-sized company also saw a 25% or more increase in new job requisitions for technical talent last year.

Since developers and engineers with cloud-native experience are in high demand or are starting their own ventures, organizations must upskill from within.

Cloud native asks developers to become operators who understand the details of how their code will interact with infrastructure in deployment as it scales. To an extent, it expects them to become networking and security professionals as well.

Here’s where observability comes into the picture, culling data from clusters, pods, and serverless functions to provide teams with an X-ray view of the internal workings of microservices applications as they work in production. 

OpenTelemetry is great, but not a magical solution for data

One open source project that is taking the cloud-native observability space by storm is OpenTelemetry (or OTel), which arose from the OpenTracing project (with roots in Lightstep) and OpenCensus, now combined in a CNCF-incubated project.

Using OpenTelemetry, engineers can add vendor-agnostic instrumentation to application and infrastructure code, and leverage a rich SDK of tools for generating, collecting, and sharing high-fidelity OTLP data from metrics, logs, and traces that is portable across observability platforms.

OpenTelemetry is exactly the kind of industry-wide advancement we can all get behind. Vendors will merge and change, so we will always want portable, commonly understood telemetry data that will be accepted by existing and future observability tools.

Capturing and collecting rich and reusable data is great, but are we doing it in the right places? And what should we do with the huge volume of OTel data?

One of the most frequent complaints about OpenTelemetry is how difficult it can be to get your pipelines set up exactly right, especially in Kubernetes environments where additional complexities are inevitable. Teams just want telemetry data to be consistent, appropriately tagged, and processed as it enters the observability data pipeline. A strong observability culture and semantic conventions can help teams overcome the uncertainty of data and alert overload, without creating too much data exhaust for everyone.

Rethinking data and alert streams to prevent burnout

Cloud-native applications run and scale on ever-changing ephemeral architectures, and they are configured, networked, orchestrated, and secured using a shifting landscape of CNCF open source projects and commercial packaged offerings. Many of these tools will change significantly or become forgotten and sidelined over the life of an application.

Instrumenting so many moving parts in production would generate a firehose of telemetry data, and as a result, a constant barrage of alerts for dev and ops teams that are already stretched to the limit. Fortunately, there are good ways to reduce the stress and anxiety of cloud-native observability.

Shape incoming telemetry data

If you want less garbage out, start by putting less garbage data in. The OTel Collector can be configured for early, pre-pipeline data normalization and compression before exporting to observability platforms. Vendors with additional in-memory data preparation tools and services can further improve ingress data quality and relevance.
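A Collector configuration for this kind of shaping might look something like the sketch below, which drops a noisy attribute, batches telemetry, and compresses it on export. The endpoint and attribute key are placeholders, not a recommended production setup.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  # Drop noisy or sensitive attributes before they enter the pipeline
  attributes:
    actions:
      - key: http.request.header.cookie
        action: delete
  # Batch telemetry to reduce export volume
  batch:
    send_batch_size: 8192
    timeout: 5s

exporters:
  otlp:
    endpoint: ingest.example.com:4317   # placeholder backend endpoint
    compression: gzip

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [otlp]
```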

Set common SLI (service level indicator) guidelines

If teams agree to set SLIs with consistent performance measurements, error conditions, and common naming conventions across multiple technologies, that will go a long way toward reducing confusion and conflict. 

Aside from the usual ‘golden signals’ of latency, errors, traffic, and saturation on servers (or workloads), common semantics for application events, messages, and transactions will make it far easier for another developer or SRE to step in and get to the bottom of a problem.
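To make the idea concrete, here is a hypothetical sketch of a shared SLI helper that teams could agree on, so every service computes availability the same way and reports golden signals under the same metric names. All names and numbers are invented for illustration.

```python
# Hypothetical shared SLI conventions: one naming scheme and one
# availability calculation used by every team.
from dataclasses import dataclass

GOLDEN_SIGNALS = ("latency", "errors", "traffic", "saturation")


@dataclass
class SliWindow:
    """Request counts and latency for one measurement window."""
    total_requests: int
    error_requests: int
    latency_p99_ms: float

    def availability(self) -> float:
        """Fraction of requests served without error (an availability SLI)."""
        if self.total_requests == 0:
            return 1.0
        return 1.0 - self.error_requests / self.total_requests


def metric_name(service: str, signal: str) -> str:
    """Consistent dotted naming, e.g. 'checkout.errors'."""
    assert signal in GOLDEN_SIGNALS, f"unknown signal: {signal}"
    return f"{service}.{signal}"


window = SliWindow(total_requests=10_000, error_requests=25, latency_p99_ms=180.0)
print(metric_name("checkout", "errors"))
print(window.availability())
```

The payoff is not the code itself but the agreement behind it: when every dashboard and alert uses the same names and formulas, an on-call engineer from another team can read them without a translation layer.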

Separate anomalies from alerts and incidents

A function or API call that generates different data today than it did yesterday may appear to the system as an anomaly, or a ‘flapping error,’ even if it has no actual impact on the SLO (service level objective) in question. Any alerts that do make it through should be filtered for priority and criticality before lighting up everyone’s pagers in the middle of the night. AIOps solutions such as Lightstep Change Intelligence can further filter the incoming alert storm and help track down the exact point in time, and the contributors to the change, that caused an error. Then, first responders to escalated incidents can quickly drop in and resolve the real problems with fewer alert-noise distractions.

The Intellyx Take

The global cloud-native development community writ large is currently facing a reckoning with too many tools, too much telemetry data, and not enough skilled people to make sense of it all. 

If this describes your organization, at least take comfort that you aren’t alone with such growing pains.

What have I learned from interacting with some of the best-performing dev shops? They tend to avoid the attrition of skilled engineering talent by not burning them out with unnecessary toil and nagging alerts as the scope and scale of deployments increases.

In a high-anxiety environment, only the calm survive. So why worry?

© 2023 Intellyx, LLC. At the time of writing, Lightstep from ServiceNow is an Intellyx customer.

March 7, 2023
6 min read

About the author

Jason English
