
Your system is deeper than you think

Trying to parse my career trajectory from reading my resume is like relying on a pirate’s riddle to find treasure. Instead of “walk four paces from the crooked tree, west of skull rock,” though, it’s “spend eight months dropping frozen chicken tenders into a deep fryer, then move eight hundred miles to write automated QA tests for software.” 

But there’s one constant in all the varied jobs I’ve had: they are defined by systems that are largely outside of our control, but that we’re ultimately responsible for.

The person who prepares your food has about as much control over the process that delivered the ingredients as the on-call developer has over the tangentially related system on the Internet whose failure pages them at 3 a.m. In both cases, though, it’s the people who are ultimately responsible for the immediate crisis, and who bear the burden of fixing it.

Even Simple Systems Are Complex

What these systems all have in common is that they’re more complex than they seem from the outside. Like an iceberg, where a small peak breaking through the waves can hide a massive chunk of floating ice below the waterline (at least until climate change gets really going and we’re all living in Gas Town waiting for our turn in Thunderdome Plus), the systems that we’re responsible for in our professional lives are often more massive and cumbersome than they appear at first glance. Why is this? Some of it is easily explained, some more difficult, but the usual answer is quite simply ‘inertia’. This organizational and technical inertia isn’t something that affects only large technical companies, however. It touches nearly everyone working with software today.

Amazon AWS had a power failure, their backup generators failed, which killed their EBS servers, which took all of our data with it. Then it took them four days to figure this out and tell us about it. Reminder: The cloud is just a computer in Reston with a bad power supply.

— Andy Hunt (@PragmaticAndy) September 3, 2019

One form of this hidden depth is, of course, other people’s computers, better known as ‘the cloud’. Cloud services have revolutionized the software industry, giving us access to infinitely scalable hardware resources, durable managed services, and a whole host of convenient ways to accidentally delete your entire cloud stack because someone fat-fingered a terraform apply. I’ve seen that last one happen: a new team member at a prior job accidentally deleted every resource in our AWS account due to a combination of poor internal documentation, flawed pairing practices, and overly permissive IAM roles. 

But, even if you’ve built an internal system that’s resilient to human beings, how confident are you that every external service you rely on is also so circumspect? You may have a small application with only a handful of services (or even only one service), but every external API you rely on, be it from a cloud provider or some other SaaS product, is a potential point of complexity, failure, and hair-rending frustration when it goes down and takes your application with it.

Hey npm users: left-pad 0.0.3 was unpublished, breaking LOTS of builds. To fix, we are un-un-publishing it at the request of the new owner.

— Laurie Voss (@seldo) March 22, 2016

Your Dependencies, Their Dependencies, and (Lest We Forget) What’s Dependent on Them

You probably didn’t write your own HTTP stack, or networking stack, or even string comparison library. While it’s easy (and cheap) to go after left-pad or other, similar stories (such as Docker migrating their Go packages to a new GitHub organization, before gomod), the biggest threat to the performance and security of your application may simply be bugs in your third-party dependencies. 

Whether through benign logic errors or malicious intent, every module you import is a potential landmine and a source of complexity that you have little control over.

As open source becomes more integral to the art of software development, the potential impact becomes even more widespread — you may vet all of your dependencies, after all, but are the authors of your dependencies taking that same level of care with their dependencies? This, of course, isn’t simply something you need to consider with direct code dependencies — your CI system, your package and container repositories, your deployment pipeline — these are all uncontrolled sources of entropy and complexity that you need to contend with.
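To make the "dependencies of your dependencies" point concrete, here is a minimal sketch that walks a dependency graph and compares what you import directly with what you actually end up running. The graph and every package name in it are hypothetical, made up for illustration; a real project would read this information from its lockfile or package manager instead.

```python
from collections import deque

# Hypothetical dependency graph (assumption, not from the article):
# each key directly depends on the packages in its list.
DIRECT_DEPS = {
    "my-app": ["web-framework", "http-client"],
    "web-framework": ["template-engine", "string-utils"],
    "http-client": ["tls-lib", "string-utils"],
    "template-engine": ["string-utils"],
    "string-utils": [],
    "tls-lib": [],
}


def transitive_deps(graph, root):
    """Return every package reachable from root, excluding root itself."""
    seen = set()
    queue = deque(graph.get(root, []))
    while queue:
        pkg = queue.popleft()
        if pkg in seen:
            continue  # already visited via another path
        seen.add(pkg)
        queue.extend(graph.get(pkg, []))
    seen.discard(root)  # guard against cycles that lead back to root
    return seen


direct = set(DIRECT_DEPS["my-app"])
everything = transitive_deps(DIRECT_DEPS, "my-app")
print(f"direct: {len(direct)}, transitive: {len(everything)}")
print("never chosen directly:", sorted(everything - direct))
```

Even in this toy graph, most of the code the application depends on was picked by someone else.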

We replaced our monolith with micro services so that every outage could be more like a murder mystery.

— Honest Status Page (@honest_update) October 7, 2015

While we often think about our software systems strictly in terms of technical depth and complexity, it’s important to remember the organizational and human systems that underpin them. 

I think most people can relate to the feeling of helplessness that comes from fighting endless battles against organizational dynamics that don’t seem to make a lot of sense or have misaligned priorities. Maybe your organization rewards new features, while maintenance work is seen as “less important”. Perhaps you’re trapped in a ‘sales-driven development’ pattern where your work shifts from project to project, relentlessly adding new checkboxes to a system without a lot of concern for the overall scalability or maintainability of the application? 

Long-term vendor contracts can tie us to particular pieces of technology, forcing hacks and workarounds to development. There are a million other pieces of organizational detritus that float around our teams, regardless of the size or complexity of the actual software we work on. This hidden depth is possibly the most pernicious, as it’s difficult to understand how you can even begin to tackle it, but it’s still a source of complexity as you build and maintain software.

Burning (Out) Man: We’ve All Been There, Some of Us Are Just More Vocal About It

Unfortunately, we don’t have the luxury of sitting back and saying “eh, we’ll fix it tomorrow” when it comes to addressing these issues. You may have a brilliant team of developers, designers, PMs, and more, but you can’t afford the human costs associated with unreliable software. Posts about burnout

The stress induced by trying to debug and analyze failures, especially those that aren’t in services or external systems under your direct control, can contribute to burnout. Teams that are suffering from burnout find themselves in situations that can rapidly escalate out of control, and the consequences can be dire: as failures pile on and multiply, more and more time is spent firefighting rather than dealing with the root cause of failures, which leads to more failures and late-night pages.

The Truth About Deep Systems

We’ve been talking a lot about deep systems recently here at Lightstep, and I’d encourage you to read some of that material and think about it in the context I’ve presented here.

In short, deep systems are architectures where there are at least four layers of stacked, independently operated services, including cloud or SaaS dependencies. Perhaps a better way to think about deep systems is not so much an explicit definition, but rather what they “sound like”:

  • “I’m too busy to update the runbook.”

  • “Where’s Chris? I’m dealing with a P0, and they’re the only one who knows how to debug this.”

  • “We have too many dashboards!”
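For the explicit definition, the "four layers" test can be sketched as the longest chain of calls through your services, with cloud and SaaS dependencies counted as layers too. This is only an illustration under stated assumptions: the service names and the call graph below are hypothetical, and the graph is assumed to have no cycles.

```python
# Hypothetical call graph (assumption, not from the article):
# each service maps to the services it calls directly.
CALLS = {
    "web-frontend": ["orders-api"],
    "orders-api": ["payments-service", "managed-database"],
    "payments-service": ["saas-payment-gateway"],
    "saas-payment-gateway": [],
    "managed-database": [],
}


def depth(service, calls):
    """Count the layers on the longest call chain starting at service.

    Assumes the call graph is acyclic.
    """
    downstream = calls.get(service, [])
    if not downstream:
        return 1  # a leaf is a single layer
    return 1 + max(depth(d, calls) for d in downstream)


layers = depth("web-frontend", CALLS)
print(f"layers: {layers}, deep system: {layers >= 4}")
```

A team that would describe this as "a frontend and two services" still ends up four layers deep once the payment gateway is counted.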

I’ve spoken with a lot of developers who think that their system isn’t ‘deep’ because they’ve only got a handful of services, or because they’re a small team. I’d argue that this isn’t the case at all. As demonstrated above, there are an awful lot of ways your system can have hidden depth that contributes to stress, burnout, and unreliable software. The solution isn’t to despair, but to embrace observability as a core practice of how you build and run software.

This way, when something breaks, you’ll have the context you need to understand what happened, who is responsible, and how to best resolve the issue — even if the regression or error is deep in the stack or the result of a third-party dependency.

Interested in joining our team? See our open positions here.

January 6, 2020

About the author

Austin Parker
