The origin of cloud native observability

Before software existed, there was observability. From its earliest inception, observability was about understanding how mathematical systems and scientific models worked based on their observable outputs. 

Despite observability’s current popularity in today’s hypescape, look it up on Wikipedia and software observability is still but a footnote. Technology pundits and vendors have co-opted the term from STEM disciplines such as control theory in mathematics, when in reality, observability for complex software architectures will never reach 100% predictability.

But maybe that’s ok. In highly distributed cloud native deployments, there is clear value to be gained through system-wide observability that cannot be captured from related technologies like software testing, monitoring, or simulation alone.


Growing beyond the scope of monitoring

In earlier days, most of our software infrastructure consisted of proprietary systems. Monitoring production required either using built-in tools provided by a vendor, or painfully sifting through the "data exhaust" of inconsistent output logs or data streams. 

Alerts emerging from opaque boxes afforded little visibility into what was happening in real time. Teams had a slow mean-time-to-discovery (MTTD) for desktop software and centralized systems, and usually responded to issues only after customers reported functional and performance problems.

“Computer software is very different today than it was in the ’90s. Even over the last ten years, there's been a huge shift,” said Austin Parker, Head of Developer Relations at Lightstep. “We moved away from tightly integrated and monolithic applications in data centers – where reserving capacity was up to you – to supporting mobile apps and anywhere access in cloud infrastructure.”

Software architectures started to become more service-based and dependent on on-demand cloud capacity. Common standards and open source components came to the fore, including low-level system metrics and monitoring agents.

Today, open telemetry, cloud native principles, machine learning, and applied statistical analysis have revamped monitoring once more, leading to its reinvention as cloud native observability.


Greater developer expectations

Almost every company that depends on digital capabilities, from scrappy startups to well-established enterprises, is betting on cloud native development for delivering some part of its software estate to meet agility and scalability goals.

For a relatively young movement, such widespread interest and adoption is unprecedented, and cloud native has created a flurry of change in the tools and skillsets needed to build and maintain software.

As software users, we’ve grown accustomed to app stores and SaaS solutions that automatically deliver updates so we’re always on the latest version. As professionals who rely on software, we’re also becoming intolerant of long waterfall delivery cycles with stage gates, code freezes, constant update exercises, and limited release windows.

As developers, and as operations and security teams, we’re expected to stretch outside of our old roles and wear all of these hats.

“We used to hire one group of people to write code, and other people to ship the applications, and other people to patch software vulnerabilities in production,” Parker said. “It’s no longer good enough for a developer to just write good software. Now, we have to run that software in millions of possible infrastructures, we have to release quickly, and make sure we divide workloads into microservices so they can deploy and scale independently.”

In cloud native development, even relatively new applications can have thousands of microservice dependencies and APIs, with highly distributed teams responsible for building and operating the software wherever the containers and Kubernetes pods are running.


Cloud native observability, now more than ever

Gone are the days when we could simply test software in a staging environment on the same server it was going to be deployed to, and look at that server’s system metrics under load to see how it would perform in front of users.

Testing software on a shifting surface of APIs and ephemeral Kubernetes clusters makes visibility into future performance even cloudier for developers. Testers can try to validate the software’s functionality, but service calls with dynamic data cause procedurally captured tests to become brittle and break whenever something changes, costing the dev team unnecessary time.

Cloud native observability needs to model the flow among microservices, so today’s DevSecOps professionals can make sense of a deluge of potential data being streamed by hundreds or possibly thousands of nodes that appear when needed and disappear as soon as they aren't.

Observability data consists of logs, metrics, and traces – collectively referred to as telemetry. Fortunately, pioneering contributors (including Lightstep founding engineers and their peers) are evolving the generally available OpenTelemetry (or OTel) project, currently incubating within the Cloud Native Computing Foundation (CNCF). Open source standards like OTel allow software vendors to generate and collect tool-agnostic telemetry data that is portable across multiple solutions.
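
To illustrate that portability, here is a minimal sketch using the OpenTelemetry Python SDK. The service and span names and the console exporter are placeholder choices for the example, not anything prescribed by the project.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider. Swapping ConsoleSpanExporter for an OTLP
# exporter would ship the same spans to any OTel-compatible backend,
# which is what makes the telemetry tool-agnostic and portable.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

# Emit one trace span around a unit of work and attach attributes to it.
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.items", 3)  # attributes travel with the span
```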

“It’s not about the sources of telemetry – it’s how you use that data,” says Parker. “We want to shift observability left, so when we’re writing application code, we’re also writing monitoring and tracing code that gets used by automated testing and observability tools, so we can use telemetry data to enforce the desired end state service level indicators (SLIs).”
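
One way to read that shift-left point: the telemetry an application emits can also be asserted on in automated tests. The sketch below uses the OpenTelemetry Python SDK’s in-memory exporter; the span name, the empty handler, and the 250 ms latency budget are hypothetical stand-ins for a real service’s SLI.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Capture finished spans in memory so a test can inspect them directly.
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
tracer = trace.get_tracer("checkout-service", tracer_provider=provider)

def test_process_order_meets_latency_sli():
    with tracer.start_as_current_span("process_order"):
        pass  # the real request handler would run here

    (span,) = exporter.get_finished_spans()
    duration_ms = (span.end_time - span.start_time) / 1e6  # span times are in nanoseconds
    assert duration_ms < 250  # hypothetical latency SLI for this flow
```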

Tracing toward success at GitHub

With projects and code for more than 65 million developers and three million organizations on its platform, GitHub’s transition from monoliths to microservices and OpenTelemetry-driven observability seemed like a heavy lift.

However, by standardizing semantic definitions while building out its own library of OpenTelemetry and OpenTracing monitors, GitHub’s globally distributed organization was able to turn the tide on its modernization effort, achieving a common reference for performance signals that every dev group could embed into Ruby code to gain visibility into customer-facing issues.
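
The “standardized semantic definitions” idea is easier to see in code. GitHub’s library was built in Ruby; the sketch below uses Python purely for illustration, and the attribute names are hypothetical rather than GitHub’s actual conventions.

```python
from opentelemetry import trace

# Shared attribute keys, defined once and reused by every team, so traces
# from different services describe the same signal with the same vocabulary.
ATTR_REPO_ID = "vcs.repository.id"       # hypothetical convention
ATTR_QUEUE_DEPTH = "worker.queue.depth"  # hypothetical convention

tracer = trace.get_tracer("pull-request-service")  # hypothetical service name

def merge_pull_request(repo_id: str, queue_depth: int) -> None:
    # Any team reading this trace can query on the shared keys above.
    with tracer.start_as_current_span("merge_pull_request") as span:
        span.set_attribute(ATTR_REPO_ID, repo_id)
        span.set_attribute(ATTR_QUEUE_DEPTH, queue_depth)
        # ... business logic here ...
```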

By looking at OpenTelemetry’s cloud-native-ready monitors, they were able to connect end-to-end requests through streams of traffic data linking their live SaaS application and its containerized infrastructure.

By zeroing in on just the semantically relevant traces, GitHub was able to gain the insight needed to debug and resolve a critical traffic latency issue, while preparing for future optimization on its journey to cloud native observability.

“Eventually this practice becomes motor memory,” said Parker. “You can move tracing to only the most high-importance user flows that you really want to trace. Or focus the effort on application performance and tight monitoring of any place the team makes a change, whether eBPF kernel-level tracing or a global CDN to see how edge caching is working.”


The Intellyx Take

Most organizations can’t afford the risk of waiting on cloud native observability while they modernize toward cloud native development.

While it’s still healthy to test early and often, we must respect the fact that pre-production test servers will never behave exactly the same as production environments. That’s why it makes sense to shift observability left as well.

When done right, cloud native observability and open telemetry standards simply become a daily part of how an organization delivers software with ever-increasing levels of performance and agility. 

© 2023 Intellyx, LLC. At the time of writing, Lightstep from ServiceNow is an Intellyx customer.

January 23, 2023
6 min read
Observability

About the author

Jason English
