Distributed Tracing: Why It’s Needed and How It Evolved
by Austin Parker
When you think about “tracing,” what do you think of? It’s one of those words in software development that is overloaded with meaning. Some people may think of a “stack trace,” the familiar blob of text issued by an application runtime that shows each function call preceding a point in the code where an exception occurred. Other people may go a step further, and think of the action of tracing — navigating through logs from different services and computers, literally tracing the path of a request as it moves through a system. Conveniently enough, “distributed tracing” is really a combination of both of these processes — you can think of it as the “call stack” for a distributed system, a way to represent a single request as it flows from one computer to another.
Now, if this doesn’t make a ton of sense, that’s normal — I just dropped a lot of terminology on you! This series is going to demystify distributed tracing, starting from the basics. Today, I’m going to talk about distributed systems — what they are, why we use them, and why the rise of distributed systems has made tracing so important.
Twenty years ago, the way we used software was very different from the way we use it today. We didn’t have “the cloud,” and the internet itself was a nascent technology. That said, a new type of software system had been in development since the 1970s: the distributed system. The idea of a distributed system wasn’t new, per se, but by the 1970s computer technology had advanced to the point where such systems were feasible. In a distributed system, computers can act as both clients and servers, allowing tasks to be performed on different machines. These systems can leverage economies of scale, allowing large quantities of messages or data to be stored on a central server, which can be accessed by lightweight clients over a network. The servers take care of the “heavy lifting” of processing the data, while the clients simply make requests for what they need. This basic idea led to more codified forms, such as a three-tier or n-tier architecture, or even peer-to-peer architecture, where an “application” is spread out across more independent services, working in concert with each other to satisfy a user’s request.
A note on architectures, tiers, and layers: Formally, “tiers” and “layers” are not substitutable; a tier refers to a discrete, physical unit, whereas a layer is a logical group of software components. That said, the two terms are often used interchangeably in conversation.
As high-speed internet access became more prevalent throughout the United States and the rest of the world, software architecture changed with it. Rather than specialized client software on home computers, web browsers began to act as an interface to more complex server applications running in remote data centers. These server applications, in turn, began to grapple with a problem — scaling. Not the kind of scaling you do trying to climb a wall — although, I’m sure that many programmers were driven up the wall trying to bring more capacity online! Scaling an application under load can be challenging, depending on how it’s designed. If your application is stateful (as in, it maintains some sort of “user state” in-memory that needs to exist for a long period of time), then it can be extremely difficult to add capacity — especially when you need more memory, storage, or CPUs that can only be obtained by physically buying and installing more servers. These challenges led to changing techniques: creating stateless services, and breaking them into smaller units of functionality. If you’ve heard of a “service-oriented architecture” (or SOA), this is where it came into its own — being able to split up a service into different parts, communicating with each other over a network, made it possible to more easily scale your application in response to demand.
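To make the stateful versus stateless distinction concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the class names, the `add_item` method, and the plain dictionary standing in for an external database or cache are assumptions, not part of any real framework.

```python
class StatefulCartServer:
    """Keeps each user's cart in this one process's memory."""

    def __init__(self):
        self._carts = {}  # user state lives (and dies) with this server

    def add_item(self, user_id, item):
        self._carts.setdefault(user_id, []).append(item)
        return list(self._carts[user_id])


class StatelessCartServer:
    """Holds no user data itself; each request goes to a shared store."""

    def __init__(self, store):
        self._store = store  # stand-in for an external database or cache

    def add_item(self, user_id, item):
        cart = self._store.get(user_id, []) + [item]
        self._store[user_id] = cart
        return cart


# Two stateful servers behind a load balancer: the second server has
# never seen this user, so the first item appears to be lost.
a, b = StatefulCartServer(), StatefulCartServer()
stateful_first = a.add_item("user-1", "book")
stateful_second = b.add_item("user-1", "pen")

# Two stateless servers sharing one store: either can serve the request.
shared_store = {}
c, d = StatelessCartServer(shared_store), StatelessCartServer(shared_store)
stateless_first = c.add_item("user-1", "book")
stateless_second = d.add_item("user-1", "pen")

print(stateful_second)   # ['pen']
print(stateless_second)  # ['book', 'pen']
```

In the stateless version, adding capacity is just a matter of starting another server that points at the same store. The trade-off is that every request now crosses a network to reach that store, which is one more place for things to go wrong.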
It is into this world that distributed tracing found its purchase. If your application is split across many individual servers, you need a way to understand the behavior and performance of that entire system, rather than just its individual parts. The failure mode of your application changes — an individual service crashing may result in unexpected or unexplained behavior in a completely different part of the system. When these failures occur, you need more than just a stack trace logged to the offending machine — you need to be able to see the entire request, from beginning to end. Developers came up with a lot of different solutions to this problem — centralized logging, remote debugging, and a variety of other tools to aid in diagnosing problems with distributed systems. Over time, though, the problems continued to compound. Applications became more complex, more distributed. New deployment platforms and tooling — virtual machines, containers, Kubernetes — made it easier to create more complex applications, with more moving parts. The cloud made it possible to easily provision new infrastructure and scale it around the world, and all of this led to even more complexity and confusion.
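As a rough illustration of what “seeing the entire request” can look like, here is a small Python sketch. It is a toy model under stated assumptions, not how any particular tracing library works: the `Span` class, the `start_span` and `render_trace` helpers, and the service names are all made up for this example. The idea is that every unit of work records a shared trace ID plus the ID of its parent, so the pieces can be reassembled later into something that reads like a call stack.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # None for the span that started the request
    service: str
    operation: str


def start_span(service: str, operation: str,
               parent: Optional[Span] = None) -> Span:
    """Create a span, inheriting the trace ID from its parent if it has one."""
    trace_id = parent.trace_id if parent else uuid.uuid4().hex
    parent_id = parent.span_id if parent else None
    return Span(trace_id, uuid.uuid4().hex[:16], parent_id, service, operation)


def render_trace(spans: List[Span]) -> List[str]:
    """Rebuild the parent/child tree and return one indented line per span."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append("  " * depth + f"{span.service}: {span.operation}")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# One request touching three hypothetical services.
root = start_span("frontend", "GET /checkout")
cart = start_span("cart-service", "load_cart", parent=root)
db = start_span("cart-db", "SELECT items", parent=cart)
pay = start_span("payment-service", "charge", parent=root)

for line in render_trace([root, cart, db, pay]):
    print(line)
```

In a real system each service would run on a different machine, and the trace ID and parent span ID would travel with the request (for example, in a request header) rather than being passed as a Python object. Timing and error information, which this sketch leaves out, would normally be recorded on each span as well.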
Let’s look at this in a bit more detail.
Conway’s Law states, loosely, that a system will mirror the structure of the organization that built it. As software organizations become more complex, their applications naturally will as well. If you work for a very large company, this should be apparent: you may work on a small part of a much larger system that is expected to work in concert with other services written by developers across the country, or even around the world. This makes it challenging to understand how a failure in your service impacts other services, or vice versa.
Developers, broadly speaking, don’t want to be tied down to a single language or technology. Some of this is a result of organizational dynamics (for example, integrating a team or product from an acquired company), and some of it is due to the rapidly changing nature of the software industry. Newer languages like Go and Rust continue to gain adoption and favor with developers who want to write high-performance, maintainable code. TypeScript, Python, and other dynamic languages offer benefits of their own to different developers, especially those working in data science. Web developers have seen rapid growth in the sophistication of their tools as well, as the browser becomes the predominant application runtime. Everyone has different needs and wants to use different tools.
Small teams, too, are feeling the pain of distributed systems. Even with a single language and a relatively small application, the rise of “cloud native” software, built as a collection of smaller services that rely heavily on external deployment systems and APIs for functionality, leaves developers in a pinch: unable to tell what broke and why. This is exacerbated by the velocity of a small team. If you’re deploying new releases of your software multiple times a day, then you need immediate information and live traceability of what’s happening in production.
One solution to this constellation of complexity is distributed tracing, specifically distributed tracing built for cloud native, polyglot applications. But what is distributed tracing? Why do we need it? What, exactly, are the problems caused by these distributed systems? In the next part of this series, I’ll cover the issues that distributed systems can lead to, and why distributed tracing is the backbone of understanding how our systems function.