Serverless Still Runs on Servers

Serverless is important, and we should all care about it, even though it’s a silly buzzword: like NoSQL before it, the term describes “negative space” – what it isn’t rather than what it is. What’s serverless? A cat is serverless, donuts are serverless, and monster trucks are serverless.

What does serverless really mean in practice? It means that your code and the infrastructure it runs on are completely decoupled – if there are VMs or containers or punch card machines somewhere, you don’t need to know about them. There’s no notion of an “instance” or paying for an instance for that matter. It’s Functions as a Service.

In addition to potentially lowering resource costs, serverless is also an attractive alternative to other architectures, as programmers don’t need to worry about provisioning and managing these resources. Some organizations are even considering skipping microservices and moving straight to a pure-serverless architecture.

Of course, these serverless functions still run on servers, and that has real consequences for performance.

Latency numbers every programmer should know (source: https://gist.github.com/hellerbarde/2843375)

In his recipe for becoming a successful programmer, Peter Norvig reminds us that “there is a ‘computer’ in ‘computer science.'” Likewise, there is a “server” in “serverless.” Norvig’s point is that while we can reason abstractly about performance in terms of asymptotic behavior, the constants still play an important role in how fast programs run. He urges programmers to understand how long it takes to lock a mutex, perform a disk seek, or send a packet from California to Europe and back (see figure).

Two numbers about the performance of servers are especially relevant to serverless:

  • A main memory reference is about 100 nanoseconds.
  • A roundtrip RPC (even within a single physical datacenter) is about 500,000 nanoseconds.

This means that reading data that’s already in local memory is roughly 5,000 times faster than fetching it with a remote procedure call to another process. (Thanks to Peter and Jeff Dean for calling these out.)

Serverless is often promoted as a way of implementing stateless services. And while there can be a real opportunity there, even stateless functions need access to data as part of their implementation. If some of that data must be fetched from other services (which happens more often than not with something that’s stateless!), that performance difference can quickly add up.

One solution to this problem is (of course) a cache, but a cache requires some persistent state across requests, and that’s the antithesis of a serverless model. While these services might be stateless in terms of their functional specification and behavior, state can open the door to important optimizations. And when you add state to a serverless function, it starts to look a lot like a microservice.
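To make that concrete, here’s a minimal sketch of a warm-container cache inside a function handler. The handler signature, event shape, and the `fetch_user_from_service` helper are hypothetical stand-ins rather than any particular provider’s API; the point is only that a lookup served from local memory avoids a ~0.5 ms round trip:

```python
import time

# Module-level state: it survives across invocations while the underlying
# container stays warm, but with no guarantees; a cold start begins with
# an empty dict.
_user_cache = {}

def fetch_user_from_service(user_id):
    """Stand-in for a remote lookup (~0.5 ms round trip within a datacenter)."""
    time.sleep(0.0005)  # simulate the RPC latency
    return {"id": user_id, "name": "user-%s" % user_id}

def handler(event, context):
    """Hypothetical FaaS entry point; event/context shapes vary by provider."""
    user_id = event["user_id"]
    user = _user_cache.get(user_id)          # ~100 ns if it's already in memory
    if user is None:
        user = fetch_user_from_service(user_id)
        _user_cache[user_id] = user
    return {"greeting": "Hello, " + user["name"]}
```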

Serverless has its place, especially for offline processing where latency is less of a concern, but often a small amount of context (in the form of a cache) can make all the difference for performance. And while it’s tempting to “skip” ahead to a serverless architecture, this might leave you in a situation where you need to step back to a services-based architecture that is orders of magnitude more performant.

DevOps and Site Reliability Engineering – What’s Different at LightStep

At LightStep, we spend every day helping our customers understand performance behavior in their distributed applications. We’re proud our product is used to diagnose problems in many important software systems. And because our product is a tool for improving performance and reliability in other applications, we must hold it to even higher standards when it comes to those metrics. At the same time, we challenge ourselves to innovate quickly while still meeting (or exceeding) those standards.

As one of the co-founders and the CTO at LightStep, I’d like to share a bit of what it’s like to work on the engineering team, how we collaborate, and our process for bringing ideas to market.

One critical part of running highly available services is determining who is responsible for making sure that those services are available. Two related terms that get tossed around a lot here are DevOps and Site Reliability Engineering (SRE). Unfortunately, neither of these terms is particularly well defined – just Google them and see for yourself!

One of the parts of DevOps that I like best (though certainly not the only part) is that individual teams are responsible for the entire application lifecycle, from design, to coding and testing, to deployment and ongoing maintenance. This gives teams the flexibility to choose the processes and tools that will work best for them. However, that autonomy can lead to fragmentation across the org in how services are managed and duplication of effort across teams.

On the other side, SRE is often used to describe organizations that are laser-focused on product availability, performance, and incident response. While these are all important, these SRE organizations can sometimes build antagonistic relationships with the rest of engineering where SRE is seen as impeding progress for the sake of its own goals.

At LightStep, we believe in a hybrid implementation of these two philosophies, where our engineers are organized into small groups with split responsibilities but shared objectives. SRE at LightStep is responsible in part for building shared infrastructure that is leveraged by the whole organization, but SREs are also embedded within teams to help spread best practices and understand current developer pain points. This structure has enabled our teams to remain agile, to conduct rapid product experiments, and to have the flexibility to quickly adopt new (or discard old) technologies and tools. Retaining the natural and healthy tension between maintaining product stability and accelerating innovation ensures that every decision we make balances the two in a way that ultimately focuses on our customers’ success.

When considering prospective DevOps engineers or SRE (titles don’t really matter much to us at LightStep), we look for engineers who are excited about working side-by-side with the rest of our team. To us, SRE isn’t a separate organization so much as a mindset: we look for engineers who are excited to collaborate and apply a broad set of tools – including traditional operational tools like automation and monitoring as well as robust software development practices – to improve the reliability of our product and increase the velocity of individual teams and of our organization as a whole.

We’re always striving to improve how we do things and looking to new team members to help us on this journey. All of our engineers bring complementary skills and experience from both academia and industry. Above all, we value those who respect differing opinions, communicate clearly, and are empathetic towards their peers.

If you’d like to be part of this journey and would enjoy working on these engineering challenges, we’d love to hear from you!

Using a Mystery Shopper: Discovering Service Interruptions in Monitoring Systems

Many retail stores use mystery shoppers to assess the quality of their customer-facing operations. Mystery shoppers are employees or contractors who pretend to be ordinary shoppers but ask specific questions or make specific complaints and then report on their experiences. These undercover shoppers are a powerful tool: not only do organizations get information on how their employees respond, they also don’t need to depend on ordinary shoppers to ask the right questions.

At LightStep, we faced a similar problem: we wanted to continuously assess how well our service is monitoring our customers’ applications and to identify cases where they are failing to meet their SLAs (or more properly, their SLOs). However, being an optimistic bunch, we didn’t want to rely on our customers’ applications continuously failing to meet their SLAs. 🙂 We needed another way to test whether or not LightStep was noticing when things were going wrong.

Who watches the watcher?

To provide some context, LightStep is a reliability monitoring tool that builds and analyzes traces of distributed or highly concurrent software applications. (A trace tells the story of a single request’s journey from mobile app or browser, through all the backend servers, and then back.) As a monitoring service, it’s critical that we carefully track our own service levels. Part of our solution is what we call the Sentinel. From the point of view of the rest of LightStep, the Sentinel looks just like any other customer. Unlike our real customers, however, the Sentinel’s behavior is predictable, and it is designed to trigger specific behaviors in our service. (We named it the “Sentinel” both because it keeps watch over our service and because it creates traces with the intention of finding them again later, much like a sentinel value.)

Designing the Sentinel

To understand what the Sentinel does, you’ll first need a crash course on LightStep: as part of tracing, every component in an application (including mobile apps, browsers, and backend servers) records the durations of important operations along with a log of any important events that occurred during those operations. It then packages this information up as a set of spans and sends it all to LightStep. There, each trace is assembled by taking all the spans for an end-user request and building a graph that shows the causal relationships between those spans. Of course, assembling every trace would be expensive, so choosing the right set of traces to assemble is an important part of the value that LightStep provides.
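As a rough illustration (not LightStep’s actual client code), here’s what recording one of those spans might look like with the OpenTracing Python API, which LightStep’s tracers implement. The operation name, tags, and request shape below are made up for the example:

```python
import opentracing

# opentracing.tracer is a no-op tracer by default; in a real service you would
# install a concrete tracer (e.g. a LightStep tracer) at startup so that spans
# are actually buffered and reported.
tracer = opentracing.tracer

def charge_card(request):
    # Record the duration of an important operation, plus tags and logs that
    # are packaged up with the span and sent along for trace assembly.
    span = tracer.start_span("charge_card")
    span.set_tag("customer_id", request["customer_id"])
    try:
        # ... the actual work happens here ...
        span.log_kv({"event": "charge_submitted"})
    except Exception as exc:
        span.set_tag("error", True)
        span.log_kv({"event": "error", "message": str(exc)})
        raise
    finally:
        span.finish()
```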

A distributed call graph (showing connections between components) and a trace showing the timing of one of these calls.

In designing the Sentinel, we first identified two important features of LightStep: assembling traces based on request latency and alerting our customers when the rate of errors in their applications exceeds a predetermined threshold. To exercise these features, the Sentinel generates two streams of data. The first is a kind of background or ambient signal: a set of spans that represent ordinary, day-to-day application requests. We ensure that the latencies of these spans test the limits of our high-fidelity latency histograms, and, most importantly, we check that the number and content of the assembled traces match our expectations.

The second stream of spans represents a set of application errors. This stream periodically starts and stops, and each batch of errors exceeds the SLA threshold and causes an alert to trigger. Moments later, after the batch ends, the alert becomes inactive. On and off, on and off, all day long, these spans trigger alerts, and we verify each one.
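Our real Sentinel is more involved than this, but the shape of that second stream is roughly the following sketch (Python, with made-up operation names and rates; `verify_alert` stands in for a check against our alerting API):

```python
import time
import opentracing

tracer = opentracing.tracer  # in practice, a concrete tracer configured at startup

def emit_error_burst(duration_s=60, spans_per_s=10):
    """Emit error spans dense enough to push the error rate over the alert threshold."""
    deadline = time.time() + duration_s
    while time.time() < deadline:
        span = tracer.start_span("sentinel_request")
        span.set_tag("error", True)      # every span in the burst reports an error
        span.set_tag("sentinel", True)   # lets us find and verify these spans later
        span.finish()
        time.sleep(1.0 / spans_per_s)

def verify_alert(expect_active):
    """Hypothetical check against the alerting API: is the alert in the expected state?"""
    ...

while True:
    emit_error_burst()                   # the alert should trigger during the burst...
    verify_alert(expect_active=True)
    time.sleep(300)                      # ...and resolve during the quiet period
    verify_alert(expect_active=False)
```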

The Sentinel has helped us discover incidents that other monitoring tools missed, and it helps us avoid spurious alerts that might otherwise be caused solely by changes in our customers’ behavior. We’ve found the Sentinel to be a particularly powerful technique when used in combination with a load test. While the load test simulates an unruly customer, the Sentinel acts as a well-behaved one. Using them together means that we can verify that one doesn’t interfere with the other.

Comparison to other monitoring techniques

Why not just use a health-checking tool like Pingdom? Of course, we use tools like those as well, but we’ve found that the Sentinel enables us to test more complicated interactions than off-the-shelf health-checking tools. Assembling traces from complex applications can be… well, complex, since spans from even a single trace can come from different parts of an application and may arrive out of order. No single span has the complete picture of what’s happening: in fact, the point of assembling a trace is to show the complete picture! Another way of saying this is that the correctness condition for trace assembly is defined globally: only by considering many different API requests (and their results) can we say whether or not a trace was assembled correctly.

Isn’t this all just an integration test? In a way, yes, but we see integration testing as a way of validating that our code works, while our online monitoring, including the Sentinel, ensures that our service continues to work. We explicitly decided that we wouldn’t try to use the Sentinel to cover all of LightStep’s features. While coverage is important for integration testing, we wanted the Sentinel just to test the most important features and components of LightStep and to test them continuously. Picking a subset of features helps us keep the Sentinel simpler and more robust.

When to use your own mystery shopper

The Sentinel acts as a mystery shopper, letting us carefully control the input to LightStep and validate the results. You might find a similar technique valuable, especially if the behavior of your service can’t be tested with a single API request and there are complex interactions between requests, including time dependence or the potential for interference with other systems.

For example, if you have a product that includes some form of user notification, you might want to test the following sequence:

  1. Set up a notification rule
  2. Send a request that triggers the rule
  3. Check that the notification is sent

Continuously exercising this sequence can give you confidence that your service is up and running. Just don’t forget to remove the notification rule so that it can be tested again!
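As a sketch of what that might look like in practice (Python, with a hypothetical `client` object standing in for your product’s API; none of these method names come from a real SDK):

```python
import time

def mystery_shopper_check(client):
    """One pass through the set-up / trigger / verify / clean-up loop."""
    rule_id = client.create_notification_rule(condition="status == 500")
    try:
        client.send_request("/checkout", force_status=500)   # trigger the rule
        time.sleep(30)                                        # give the pipeline time to fire
        notifications = client.list_notifications(rule_id)
        assert notifications, "rule was triggered but no notification was delivered"
    finally:
        client.delete_notification_rule(rule_id)              # clean up for the next run

# Run it continuously, e.g. from a small scheduler:
#   while True:
#       mystery_shopper_check(client)
#       time.sleep(600)
```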

As in the case of any testing or monitoring, think about what matters to your users. What features do they depend on most? Just as a retail store manager can hire a mystery shopper to ask the right questions, you should use monitoring tools to verify that your most important features are working to spec.

Want to chat about monitoring, mystery shoppers, or SLAs? Reach us at hello@lightstep.com, @lightstephq, or in the comments below.

The End of Microservices

A post from the future, where building reliable and scalable production systems has become as easy as, well, writing any other software. Read on to see what the future looks like…

Back in 2016, people wrote a lot about “microservices,” sort of like how they wrote a lot about the “information superhighway” back in 1996. Just as the phrase “information superhighway” faded away and people got back to building the internet, the “micro” part of microservices was also dropped as services became the standard way of building scalable software systems. Despite the names we’ve used (and left behind), both terms marked a shift in how people thought about and used technology. Using services-based architectures meant that developers focused on the connections between services, and this enabled them to build better software and to build it faster.

The rise and fall of the information superhighway (source)

Since 2016, developers have become more productive by focusing on one service at a time. What’s a “service”? Roughly, it’s the smallest useful piece of software that can be defined simply and deployed independently. Think of a notification service, a login service, or a persistent key-value storage service. A well-built service does just one thing, and it does it well. Developers now move faster because they don’t worry about virtual machines or other low-level infrastructure: services raise the level of abstraction. (Yet another buzzword: this was called serverless computing for a while.) And because the connections between services are explicit, developers are also freed from thinking about the application as a whole and can instead concentrate on their own features and on the services they depend on.

Back in the day, many organizations thought that moving to a microservice architecture just meant “splitting up one binary into 10 smaller ones.” What they found when they did was that they had the same old problems, just repeated 10 times over. Over time, they realized that building a robust application wasn’t just a matter of splitting up their monolith into smaller pieces but instead understanding the connections between these pieces. This was when they started asking the right questions: What services does my service depend on? What happens when a dependency doesn’t respond? What other services make RPCs to my service? How many RPCs do I expect? Answering these questions required a new set of tools and a new mindset.

Tools, tools, tools

Building service-based architectures wouldn’t have been possible without a new toolkit to reconcile the independent yet interconnected nature of services. One set of tools describes services programmatically, defining API boundaries and the relationships between services. They effectively define contracts that govern the interactions of different services. These tools also help document and test services, and generate a lot of the boilerplate that comes with building distributed applications.

Another important set of tools helps deploy and coordinate services: schedulers to map high-level services to the underlying resources they’d consume (and scaling them appropriately), as well as service discovery and load balancers to make sure requests get where they need to go.

Finally, once an application is deployed, a third set of tools helps developers understand how service-based applications behave and helps them isolate where (and why) problems occur. Back in the early days of microservices, developers lost a lot of the visibility they were accustomed to having with monolithic applications. Suddenly it was no longer possible to just grep through a log file and find the root cause: now the answer was split up across log files on 100s of nodes and interleaved with 1000s of other requests. Only with the advent of multi-process tracing, aggregate critical path analysis, and smart fault injection could the behavior of a distributed application really be understood.

Many of these tools existed in 2016, but the ecosystem had yet to take hold. There were few standards, so new tools required significant investment and existing tools didn’t work well together.

A new approach

Services are now an everyday, every-developer way of thinking, in part because of this toolkit. But the revolution really happened when developers started thinking about services first and foremost while building their software. Just as test-driven development meant that developers started thinking about testing before writing their first lines of code, service-driven development meant that service dependencies, performance instrumentation, and RPC contexts became day-one concerns (and not just issues to be papered over later).

Overall, services (“micro” or otherwise) have been a good thing. (We don’t need to say “microservices” anymore since, in retrospect, it was never the size of the services that mattered: it was the connections and the relationships between them.) Services have re-centered the conversation of software development around features and enabled developers to work more quickly and more independently to do what really matters: delivering value to users.

Back in the present now… There’s still a lot of exciting work to be done in building the services ecosystem, and here at LightStep, we are excited to be part of this revolution and to help improve visibility into production systems through tracing! Want to chat more about services, tracing, or visibility in general? Reach us at hello@lightstep.com, @lightstephq, or in the comments below.