
How to Operate Cloud Native Applications at Scale

The Importance of Cloud Native Observability

This is the final post in a series about the journey to cloud native, the origin of cloud native observability, and scaling up cloud-native deployments. It explores the challenges of operating cloud-native applications at scale, which in many cases means massive, dynamic scale across geographies and hybrid environments.

Scalable, dynamic applications are the point of cloud-native infrastructure. As enterprises ramp up their cloud-native deployments, in particular, Kubernetes, they quickly jump from a small number of clusters to vast numbers of clusters.

They now require an architecture consisting of clusters of clusters (aka multiclusters), typically scattered across different flavors of Kubernetes in different clouds, as well as on-premises components that may incorporate legacy application assets.

Managing such global hybrid multicluster applications is a Herculean task. Today’s cloud management and observability tools address the challenges of managing modest Kubernetes deployments, but once an organization scales up, such tools struggle to provide the management capabilities those enterprises require.

What’s missing isn’t better tooling, although it’s true that many tools don’t scale well. Rather, the missing piece is more likely cloud-native operations best practices, practices for leveraging management and observability tools following the same cloud-native principles of scale and dynamic behavior as the infrastructure they are managing.

Understanding Cloud Native Operations at Scale

Cloud-native computing is a broad, comprehensive paradigm shift in the way enterprises and web scale companies build and run IT infrastructure. At the heart of this shift is Kubernetes, and at the heart of Kubernetes lies the platform’s approach to elasticity and scale.

Kubernetes’ architecture calls for microservices in containers, containers in pods, pods in clusters, and clusters in their own multiclusters – all running on nodes (generally cloud-based virtual machines), typically with many nodes per cluster.

Each step in this chain reflects a “many to one” relationship, where the Kubernetes infrastructure handles the “many” part – in particular, how it handles many containers per pod and pods per cluster.

Kubernetes – now including a vast ecosystem of open-source and commercial projects and products – handles this autoscaling rapidly and automatically, thus causing ephemeral microservices and other software components to appear and disappear in the blink of an eye.

Operators must manage this scalable and dynamic infrastructure and be on the lookout for potential problems that might impact the performance or availability of the deployed applications.

To accomplish this task, the various components of the cloud-native infrastructure must be observable, typically by generating logs, traces, and metrics following the maturing OpenTelemetry standard.
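As a loose illustration of one such signal, here is a stdlib-only Python sketch of a trace-span record; the field names are illustrative of common OpenTelemetry attribute conventions, not the actual OTLP wire format:

```python
import json
import time
import uuid

def emit_span(service: str, operation: str, duration_ms: float) -> str:
    """Emit a trace span as a structured JSON record (illustrative fields only)."""
    record = {
        "trace_id": uuid.uuid4().hex,        # correlates spans across services
        "span_id": uuid.uuid4().hex[:16],
        "service.name": service,             # resource attribute
        "name": operation,
        "start_unix_ms": int(time.time() * 1000),
        "duration_ms": duration_ms,
    }
    return json.dumps(record)

span = json.loads(emit_span("checkout", "HTTP GET /cart", 12.5))
print(span["service.name"])  # -> checkout
```

Multiply a record like this by millions of containers emitting spans, metrics, and log lines every second, and the telemetry volume problem discussed below becomes concrete.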

Operators also need sophisticated automation to have any hope of keeping up with the ever-changing landscape. Humans simply cannot turn the knobs and dials in the production environment quickly or accurately enough to deal with cloud native applications.

Operating Cloud Native Environments at the Extremes of Scalability

[Figure: Observability landscape]

As organizations scale their cloud native applications across hybrid multicloud environments and globally distributed cloud regions, data centers, and edge locations, deploying perhaps thousands of clusters, the requirements for observability and automation take on new meaning.

The number of containers across such a global application might be in the millions. That number will ebb and flow every second as traffic patterns, network conditions, and other forces demand.

And every container – and pod, node, physical server, network endpoint, etc. – will generate its own never-ending stream of telemetry.

If the operations team already has its hands full with a small number of clusters, it will be overwhelmed by the volume of telemetry. To have any hope of managing such volume, teams must follow a core set of best practices:

  • Prepare to explode – There’s an old joke that when counting your Kubernetes clusters, there are only two numbers: “one” and “many.” The same principle applies to multiclusters, hybrid multiclusters, and global hybrid multiclusters. Even as an organization begins to ramp up, it must plan ahead to deal with such an explosion.

  • Build a consistent observability landscape – Implement an observability-as-code approach that ensures repeatable, consistent, maintainable data and widely accessible insights. Observability-as-code drives the automation of operational tasks via instrumentation of IaC with OpenTelemetry (see the article Observability Mythbusters: Yes, Observability-Landscape-as-Code is a Thing for more information).

  • Follow the Golden Path with good hygiene practices – The “Golden Path” is a term from platform engineering that reflects the best practice approach for achieving the goal in question by leveraging the most straightforward, recommended guidelines that the existing tooling provides. Just as there is a Golden Path for DevOps, there's also one for observability-based operations management at scale.
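To make the "consistent observability landscape" idea concrete, here is a minimal sketch (the service and attribute values are hypothetical, though the attribute keys follow OpenTelemetry semantic conventions) of defining shared telemetry resource attributes once, in code, so every service emits the same labels:

```python
# Shared, version-controlled telemetry attributes: defined once and
# applied to every service, so dashboards and alerts across thousands
# of clusters can rely on consistent labels.
BASE_ATTRIBUTES = {
    "deployment.environment": "production",
    "cloud.region": "us-east-1",
    "telemetry.schema_version": "1.0",
}

def resource_attributes(service_name: str, cluster: str) -> dict:
    """Merge per-service fields into the shared baseline."""
    return {**BASE_ATTRIBUTES,
            "service.name": service_name,
            "k8s.cluster.name": cluster}

attrs = resource_attributes("checkout", "prod-east-01")
print(attrs["deployment.environment"])  # -> production
```

The design point is that consistency is enforced by the code path, not by convention: a new service cannot accidentally omit or misspell a label, which is what makes telemetry queryable across a global fleet.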

While automation is essential for Kubernetes management, automating the configuration, use, and management of the automations themselves is critical for managing cloud-native deployments at massive scale. Following the best practices above will help make such automation a reality.

The Intellyx Take

Choosing a cloud-native observability tool is necessary, but by no means sufficient to achieve effective operations management of global hybrid multicluster-based applications.

Adopting good hygiene by following the Golden Path of best practices is also necessary, and the knowledge and expertise of seasoned operations personnel remain critical to the success of operating global multicluster-based applications.

The application of generative AI to these challenges is right around the corner. Such technologies will simplify the work of operators, enabling them to be more productive in the face of massive scale. But regardless of the power and sophistication of such AI, human knowledge and expertise will always be essential to success.


Intellyx LLC. ServiceNow is an Intellyx customer.

May 15, 2023

About the author

Jason Bloomberg

