Why Working on Monoliths is Bad for Your Career

It’s a Tuesday morning and it’s time for another standup. It’s the third time your company has tried to “become agile.” Your “Scrum Master” (and manager) is assigning tasks for the day and checking on why your team has been slipping the next feature release for the last month. Changing 10 modules across the monolith without breaking everything has turned out to be more challenging than expected (well, at least more challenging than management expected: you’d given your honest estimates and had them discounted by 30% because you were “clearly sandbagging,” right?). You’re starting to wonder about this whole software development thing. Let me help you out: working on monoliths is bad for your career in the following ways.

Working on Monoliths is Exciting in the Wrong Ways And Boring in the Wrong Ones Too

Deploying a new feature in a monolith is an exciting time, not because the feature is great but because who knows what’s really going to happen when customers use it? Sure, you’ve got tens of thousands of tests: unit tests, system tests, acceptance tests, smoke tests. But again and again, “surprising” things have happened, like when you didn’t figure out that database writes were silently failing for four hours, or that invoicing for a third of your customers was broken for a couple of months. So, deploys are exciting (in a dreadful kind of way).

Getting a new feature out to meet a customer need certainly starts off as exciting: You’ve heard what they needed and you want to make sure it makes the next monthly deploy. But then the planning process starts, and you discover that you’ll need to coordinate across three teams, best case, to make sure your changes don’t conflict with changes they’re planning to make — a recipe for code review and merge hell. Then comes actually making the changes… No one’s tried to use the data this way before and the object model dependencies are, to put it nicely, daunting. You miss the first deploy, then the second. The excitement of delivering something the customer needs slowly turns into the day-to-day grind of making another change, finding the tests that break after the test suite finishes an hour later, figuring out whether the issue is the tests or code, making another change, waiting another hour. It becomes boring.

A microservices environment, or even a service-oriented architecture, is the opposite. Deploys are boring (on purpose). You did three today, you’ll do five tomorrow. Maybe you’ll have to roll back one or two, make some fixes, and deploy again, but the process of getting features in front of customers is boring. The exciting thing is that you can hear a customer request, look at the code necessary to make the change, and deploy the change to get feedback in a couple of days, if not hours.

Getting used to and internalizing the monolith cycle will make it a lot harder to even understand the microservices cycle, much less successfully get hired at the companies you’d like to work at.

Working on Monoliths, You’ll Miss Out on Modern Tooling and Techniques

Working on a monolith, you’ll get very familiar with object inheritance, especially whichever approach your current architects favor. You’ll probably learn a lot about debugging from logs. Perhaps enough about databases and SQL to be truly dangerous. You may learn about some cloud APIs and how to use their clients, perhaps a little bit about metrics. You will likely not learn about RPC frameworks, CI/CD, or distributed tracing.

When the question comes, either in a design meeting or an interview, about how to scale a system, you will have neither the context nor the experience to answer it.

Working with microservices, you might end up less familiar with object inheritance or SQL, but you will learn how to rapidly and safely deploy software that spends time communicating with other software over unreliable networks. You’ll learn about thundering herds and circuit breaking. You’ll be able to talk about what’s involved in scaling a feature to 1000x the users. And that will open up many other opportunities not only for success in your career but also for continued growth.

Working on Monoliths Means Working at Businesses That Have Had or Will Have Limited Success

I’m not going to tell you there aren’t successful businesses built on monoliths. I will tell you that being on a monolith has limited, or is going to limit, their success. The rate of change in a monolith is so much slower that if a competitor comes along and is able to create features at (conservatively) 3x the rate you can, there will be business problems. The longer the business has been on a monolith, the more it has embedded how it thinks about what it sells, how it sells it, and how it can be supported into the object model and the development process. Every month, every year, it will be harder to change in order to stay ahead of, or even catch up to, competitors. If or when the decision comes to split the monolith, be ready for years of the exciting/boring cycle and a high likelihood that you’ll end up with a distributed monolith instead of microservices.

Working in a microservices environment with (actual) microservice-sized codebases means that, if you need to, you can literally rewrite your business over the course of a couple years. It’s been done. You’ll move faster than your competitors, and, if you’ve hired a good platform / SRE team, with better availability too. Being able to adapt to changing business conditions, both technologically and organizationally, simply but dramatically increases the likelihood of the business’ success.

Working on Monoliths Limits Your Growth

When you’re working on the same type of thing all the time, when you’re spending most of your day waiting for tests to run instead of understanding whether you’re meeting customer needs — your growth will be limited. As a team lead, maybe you’ll own a module, or, as an architect, an initiative, but you’ll rarely have the opportunity and challenge to take a perceived need and turn it into a working and highly scalable service. You won’t get to feel that sense of ownership, of success, of confidence, knowing that you have accomplished a serious piece of work and that you can do it again.

When you’re working with microservices, there will certainly be many opportunities to lead the development of a completely new service, to conceive of an API, discuss it with users, refine it, implement it, and scale it. You’ll know that you have the skills, that you know the process, that you can teach it to others. You will grow and you will help others grow. You’ll create connections that will lead to the next thing, and the next thing after that. You will grow, because that’s fundamentally the orientation of the organization.

Conclusion

There are many good people working on monoliths. You may be one of them.

Microservices are not a magic dust of success, but, all things being equal, you will grow more, learn more, become a better developer, and be more successful at more successful businesses by choosing to work in environments embracing microservices.

Kubernetes Observability for Contrarians

So, maybe you’ve heard of this Kubernetes thing? In June of 2014, Google launched an ambitious effort to provide a best-in-class, open source system for orchestrating container-based workloads. Over the last five years, that effort has led to collaboration among all major cloud providers and many major technology companies, a feat in itself. As Kubernetes continues to pick up steam as the best platform to run your cloud native applications (or, if you listen to Kelsey Hightower, to build your platform for running your cloud native applications), questions continue to be raised about how to make it observable.

Here’s the dirty secret of Kubernetes observability: it’s exactly the same as (effective, successful) observability for distributed systems on VMs or even on bare metal.

The Truth about the Customer Experience

Whether container, VM, or bare metal, you need to understand how well the users of your service are able to accomplish their goals, whether that’s placing an order for socks, arguing with people on the internet, or just looking at cats. In the era of the monolith, all the information about users’ success was available in a single process or on a single machine (and the load balancer in front of it). In the era of services, that information is spread across many machines, VMs, pods, or functions. To piece together the truth about the customer experience and to place it in business context, you need to use distributed tracing. By tracking the time spent across all of the different places where work can be done, we can answer questions like “why was this order slow?” or “why didn’t the cat picture show up?”

Kubernetes Observability: Context Is King

Observability on Kubernetes is necessarily observability for distributed systems, but that is not particular to Kubernetes; it’s just that those new to running applications this way are forced to confront the “no single machine with answers” problem.

When you look at what is different about observing a distributed system on Kubernetes versus on VMs, the main difference is context. Context is what allows you to correlate failure with causes, often contention for a shared resource: CPU, storage, network, or database connections. With VMs, your context will likely be an instance ID. As you schedule or “bin pack” more services onto a single VM, you need to add to the context so you can answer questions like: is this failing because of an issue with this pod, this VM, this machine, this rack, this region, or some other shared resource like a NAT gateway?

Kubernetes just adds more context to be included, usually as tags, onto your distributed trace, but it does not fundamentally change how or why you observe.
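
To make that concrete, here is a minimal sketch, not tied to any particular tracer backend, of attaching Kubernetes context to a span as tags. The environment variable names are assumptions and would be wired into the pod spec via the Kubernetes Downward API.

# Sketch: attach Kubernetes context to a span as tags.
# Assumes POD_NAME, NODE_NAME, and NAMESPACE are exposed to the container
# through the Downward API (env -> valueFrom -> fieldRef).
import os
import opentracing

tracer = opentracing.global_tracer()

with tracer.start_active_span("handle_request") as scope:
    span = scope.span
    span.set_tag("k8s.pod", os.getenv("POD_NAME", "unknown"))
    span.set_tag("k8s.node", os.getenv("NODE_NAME", "unknown"))
    span.set_tag("k8s.namespace", os.getenv("NAMESPACE", "unknown"))
    # ... handle the request here; the tags travel with the span ...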

Kubernetes and Distributed Tracing

As much as I’d like to tell you that distributed tracing is magic dust that you spread across your applications to see what they’re doing, it’s a bit more complicated than that. The most important thing to know is that transactions through your system need to carry the distributed tracing context everywhere. For the usual HTTP request-based applications, this context is carried in standard headers. To make sure you get visibility throughout your system, even if a particular service doesn’t support distributed tracing yet, these headers need to be passed through. This initially surprises many people: why would applications need to change? Shouldn’t the distributed tracing system just know that they’re the same request by timestamps or whatever?

Thinking about it a bit more, though, the answer becomes clear: when a request goes into a service and a request comes out of a service, the process that handles that request is doing something; that’s why it exists. That something could be making multiple requests for a single inbound request, or it could be retrying a request that failed before. Trying to guess what’s happening in that black box may work when everything is fine, but when things are failing, you want to know for sure that a request leaving a service was associated with a particular request to that service, all the way back to the customer. The point is to understand failure in business context — trying to do that by being lucky is not a strategy.
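
As a rough illustration of what carrying the context means in code, here is a sketch using the OpenTracing API. The handler shape and the downstream URL are hypothetical; the point is that the inbound context is extracted and explicitly injected into the outbound request rather than guessed at.

# Sketch: explicit trace context propagation with OpenTracing.
import opentracing
from opentracing.propagation import Format
import requests

tracer = opentracing.global_tracer()

def handle(inbound_headers):
    # Pick up the trace context the caller sent us.
    parent_ctx = tracer.extract(Format.HTTP_HEADERS, inbound_headers)
    with tracer.start_active_span("do_work", child_of=parent_ctx) as scope:
        outbound_headers = {}
        # Pass the same context on, so the outbound call is tied to this
        # inbound request instead of being correlated by luck.
        tracer.inject(scope.span.context, Format.HTTP_HEADERS, outbound_headers)
        return requests.get("http://downstream.internal/api", headers=outbound_headers)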

Getting Started: K8s Observability

So, how do you observe this newfangled Kubernetes thing? The same way we observe all other distributed systems: through distributed tracing, associating the work done by services with the context needed to understand failure.

Still, this isn’t particularly helpful if you’re new to distributed systems, which is why the Istio project is particularly interesting. It’s a “batteries included” way of running a distributed application on Kubernetes. By checking out the distributed tracing functionality built into Istio, you can start to get a more intuitive sense of how all this works.

Development Time: The Only Cost That Matters

It’s the Thursday before a holiday weekend and you’ve got a cost crisis. Someone in finance has just noticed that this month’s AWS bill is trending 15% higher than last month’s. An all-hands meeting is called, and everyone is asked to shut down as much capacity as they can “safely.” All the work your team has been trying to push out before end of sprint is going to be delayed for days. Chances of an operational outage when someone shuts down something critical? Pretty high…

The impact of lost development time, lost customer feedback on new features, operational issues — these all pale before the long-term impact of making developers scared to develop and deploy.

Delay Is Its Own Cost

When you’re looking for product / market fit (and for a surprising amount of time after), time is the only cost that matters. Most developers have experienced a story like the one above. In the name of frugality, they agonize about whether to deploy another three instances that cost a few hundred dollars a month.

The reality is that the time they’ve spent debating about it, talking with others, and escalating to management has already wasted more than what they could be saving. And that’s just considering the time of the people involved. Once you look at opportunity cost — what could have been done instead — you could be looking at orders of magnitude of lost value.
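
A quick back-of-envelope calculation, with numbers that are purely illustrative assumptions, shows how lopsided this usually is:

# Back-of-envelope: the debate vs. just running the instances.
# Every number here is an assumption for illustration only.
instances = 3
instance_cost_per_month = 100                        # ~$100/month per instance
monthly_spend = instances * instance_cost_per_month  # $300/month

engineers_in_the_debate = 4
hours_spent_debating = 2
loaded_cost_per_hour = 100                           # assumed fully loaded rate
debate_cost = engineers_in_the_debate * hours_spent_debating * loaded_cost_per_hour

print(monthly_spend, debate_cost)  # 300 vs 800: the meeting already cost more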

Cost vs. Waste

What can we do to change this pattern? How can we preserve the ever-important value of development time — to get features in front of customers?

The first step is understanding the difference between cost and waste for cloud resources. Cost is what it takes to scale out a software developer’s time. Software developers are paid well precisely because their work can scale out at a cost less than that of their ongoing attention: instead of each widget being built, or each customer served, by them personally, we can pay for computing resources that do the work following the developers’ (programmed) directions. Waste is cost that is not being used at all. You’d think this wouldn’t happen, but it’s surprisingly common to be paying for a resource, storage or compute, that literally is not doing anything. That’s waste, and it needs to be surfaced and reduced. Cost we can, and should, expect to grow with the business, ideally faster than personnel costs.

Value Your Development Time

The second step is to get developers to stop thinking of these costs as if they were personal costs. Instead of “I wouldn’t spend $1500 of my own money without thinking about it for a long time,” developers should ask, “Will spending $1500 get an answer to a business question a couple of weeks earlier?”

If the answer is yes (and it often is), then there should be no hesitation or pushback on spending that money. You just bought yourself not only two weeks of development time, but also access to the information two weeks earlier. That’s a bargain. If you can get your developers thinking “how can I spend money to make better informed decisions?” you will start operating at a different level. Of course, it’s not just about developers having this mindset, but also their managers and directors. Everyone has to understand the value of information and time to the business, instead of thinking about everything as if it were a personal purchase decision.

Trade Money for Time, Every Time

So, the next time you’re about to declare or respond to a cost crisis, consider whether you’re concerned about cost or waste, whether that’s because you’re looking at it as a business or personal purchase, and what the opportunity cost of the time to address the crisis will be. Help your developers feel empowered to trade money for time. It will transform your business.

Istio Distributed Tracing: How to Get Started with LightStep and Kubernetes

LightStep Tracing is an easy way to start using distributed tracing without deploying your own distributed tracing system. Istio is a “batteries included” set of best practices for deploying and managing containerized software. Istio proxy provides an automatic service mesh, based on Envoy, so that you can understand and control how different services communicate with each other. Envoy and therefore Istio support distributed tracing out of the box.

For this walkthrough, we’ll be using Google Kubernetes Engine (GKE) for our Kubernetes cluster. We’ll assume you have a working gcloud CLI installation that includes kubectl as well (gcloud components install kubectl).

Creating a Cluster

Step 1: Create the cluster

export CLUSTER_NAME=lst-walkthrough   # name of the GKE cluster
export PROJECT_NAME=                  # name of the project to create the cluster in
export ZONE=us-central1-a             # zone to create the cluster in

# n1-standard-2: istio-telemetry failed to schedule with n1-standard-1
# --preemptible: reduces costs for a proof of concept
gcloud container clusters create $CLUSTER_NAME \
  --cluster-version latest \
  --machine-type n1-standard-2 \
  --num-nodes 3 \
  --preemptible \
  --zone $ZONE \
  --project $PROJECT_NAME

Step 2: Store credentials locally for kubectl

gcloud container clusters get-credentials $CLUSTER_NAME --zone $ZONE --project $PROJECT_NAME

Step 3: Grant our user cluster-admin role

kubectl create clusterrolebinding cluster-admin-binding --clusterrole=cluster-admin --user=$(gcloud config get-value core/account)

Step 4: Create the Istio system namespace for the components

kubectl create namespace istio-system

Step 5: Download and unpack an Istio distribution

export ISTIO_VERSION=1.1.5
wget https://github.com/istio/istio/releases/download/$ISTIO_VERSION/istio-$ISTIO_VERSION-osx.tar.gz
tar xzvf istio-$ISTIO_VERSION-osx.tar.gz
cd istio-$ISTIO_VERSION

This is for OS X; if you’re using a different OS, refer to the Istio directions.

Step 6: Initialize Istio system certificates and Custom Resource Definitions (CRDs)

helm template install/kubernetes/helm/istio-init --name istio-init --namespace istio-system | kubectl apply -f -

Step 7: Check for completion of the creation of the CRDs

kubectl get crds | grep 'istio.io\|certmanager.k8s.io' | wc -l

In version 1.1.5, this should print “53” once the CRD creation is complete.

Step 8: Sign up for LightStep Tracing

  1. Navigate to https://go.lightstep.com/tracing.html
  2. Fill out the form
  3. Click on the link in the email to set up your account
  4. Go to Settings and copy the token

Step 9: Set up Istio with Helm

export ACCESS_TOKEN=""   # paste the LightStep token from Step 8
helm template \
  --set pilot.traceSampling=100 \
  --set global.proxy.tracer="lightstep" \
  --set global.tracer.lightstep.address="ingest.lightstep.com:443" \
  --set global.tracer.lightstep.accessToken=$ACCESS_TOKEN \
  --set global.tracer.lightstep.secure=true \
  --set global.tracer.lightstep.cacertPath="/etc/lightstep/cacert.pem" \
  install/kubernetes/helm/istio --name istio --namespace istio-system > $HOME/istio.yaml
kubectl apply -f $HOME/istio.yaml

Step 10: Create a cacert.pem file with the Let’s Encrypt Root CA

curl https://letsencrypt.org/certs/trustid-x3-root.pem.txt -o cacert.pem

The current version of the Istio LightStep integration requires a custom CA cert bundle to be specified. The public LightStep collectors at ingest.lightstep.com use certificates issued by Let’s Encrypt.

Step 11: Add the file as a secret for the LightStep integration to use

kubectl create secret generic lightstep.cacert --from-file=cacert.pem

Step 12: Label your default namespace so that Istio will inject the Istio Proxy sidecar automatically

kubectl label namespace default istio-injection=enabled

Step 13: Wait for all pods to show as running (this can take a few minutes)

kubectl get pods --namespace istio-system

Step 14: Create the example BookInfo app and gateway:

kubectl apply -f samples/bookinfo/platform/kube/bookinfo.yaml
kubectl apply -f samples/bookinfo/networking/bookinfo-gateway.yaml

Step 15: Capture the ingress host and ports needed to access the BookInfo app

export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
export SECURE_INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="https")].port}')

Step 16: Open BookInfo in your browser

open http://$INGRESS_HOST:$INGRESS_PORT/productpage

Viewing Distributed Traces

You can now go to https://app.lightstep.com/ and see traces for the example application.

[Screenshot: a distributed trace of the example application in LightStep]

Yay! A distributed trace! You can now see end-to-end transactions in your system. But, you may ask, how did that actually happen? It’s a common misunderstanding that tracing with a service mesh requires no code changes. In fact, services need to pass through the distributed tracing headers even if they are not otherwise participating in the trace. Let’s walk through those headers for Istio and where that pass-through is implemented in the BookInfo sample application.

The Istio tracing documentation lists the following headers required to be forwarded:

  • x-request-id
  • x-b3-traceid
  • x-b3-spanid
  • x-b3-parentspanid
  • x-b3-sampled
  • x-b3-flags
  • b3

If you are using LightStep, also forward:

  • x-ot-span-context

However, looking at the productpage service source, the only non-B3 header being forwarded is x-request-id. So, how is this working? Istio Proxy’s LightStep integration supports a special tag called guid. Since the x-request-id header is being forwarded, and that header is being tagged as a guid on all the Istio Proxy spans, LightStep is able to infer the ordering and parentage of the spans from that information alone. A change to the productpage source is still needed to properly forward the remaining headers. For the other applications, here is where the headers are captured and forwarded:

  • Details (Ruby): headers captured and forwarded
  • Reviews (Java): headers captured and forwarded

As the list above shows, there may be many headers to forward if you want to support Zipkin/Jaeger B3 headers, OpenTracing headers, and Istio Proxy (Envoy) headers. With the standardization of tracing headers through W3C Trace Context and OpenTelemetry, this should become much simpler in the future.
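
As a hedged sketch of what that pass-through can look like in a Flask service (the route and downstream URL are modeled loosely on the BookInfo sample, not copied from it), forwarding boils down to copying the inbound tracing headers onto every outbound call:

# Sketch: forward tracing headers from the inbound request to downstream calls.
import requests
from flask import Flask, request

app = Flask(__name__)

TRACE_HEADERS = [
    "x-request-id", "x-b3-traceid", "x-b3-spanid", "x-b3-parentspanid",
    "x-b3-sampled", "x-b3-flags", "b3", "x-ot-span-context",
]

def forward_headers(req):
    # Copy only the tracing headers that are actually present.
    return {h: req.headers[h] for h in TRACE_HEADERS if h in req.headers}

@app.route("/productpage")
def productpage():
    details = requests.get("http://details:9080/details/0",
                           headers=forward_headers(request))
    return details.text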

Google’s June 2nd Outage: Their Status Page != Reality

Previously we’ve written about having hard conversations with cloud providers. On Sunday, June 2nd, Google Cloud Platform had an extended networking-based outage. There was significant disruption of commonly used services like YouTube and Gmail, as well as Google-hosted applications like Snapchat. The incident currently associated with the outage, 19009, indicates a start time of 12:53 US/Pacific and a resolution time of 16:56 US/Pacific. LightStep Research’s ongoing synthetic testing shows that the impact lasted longer than the advertised incident window and provides an example of the type of evidence you can share with a cloud provider when discussing an outage.

Summary of Findings

  • From 11:48 to 11:53, access from us-east1 to GCS regional buckets in us-east1 was completely disrupted.
  • From 11:48 to 12:10, latency for at least 50% of requests was significantly higher from us-east1 and us-central1 to GCS regional buckets in us-east1, us-central1, and europe-west2.
  • From 12:10 to 14:53, access was significantly slower for 5% or more of requests, both inside and outside of the us-east1 region.
  • From 11:48 to 12:03, latency was also elevated for europe-west2 to europe-west2 regional bucket access.

us-east1 Metrics

[Screenshot: requests from us-east1 to us-east1]

This screenshot is from the LightStep application’s historical view, Streams. Latency is shown on the top with lines for 50th, 95th, 99th, and 99.9th percentile (p50, p95, p99, p99.9). For the p50 line, this means that 50% of requests took more than the displayed time. Similarly for p95, 5% of requests took more than the displayed time. Below the latency graph, the request rate is shown. For this test, there are 50 requests made every minute, leading to the displayed rate of slightly less than 1 request per second. At the bottom is the error rate percentage, meaning the number of errors divided by the total number of requests.

This graph shows requests from a Google Cloud Function in us-east1 to a Google Cloud Storage regional bucket in us-east1. Following the start of the outage, there is an approximately 5-minute gap where no requests are successfully made. Relatively quickly, about 22 minutes after the start of the outage, p50 latency has recovered to the previous normal value. However, p95 latency does not recover until approximately 2 hours and 43 minutes after the p50 recovery.

[Screenshot: requests from us-east1 to europe-west2]

This graph shows a similar sequence of events for requests from Google Cloud Functions in us-east1 to a regional bucket in europe-west2. However, it does not show the gap in requests, suggesting that requests to europe-west2 were more likely to succeed than same-region requests, an interesting finding.

[Screenshot: requests from us-east1 to us-central1]

This graph shows requests from us-east1 to us-central1. The recovery in this case is less clear, and there appears to be a further, though less severe (affecting only p99 and p99.9), disruption at the end of the displayed time window.

us-central1 Metrics

[Screenshot: requests from us-central1 to us-central1]

This graph shows us-central1 to us-central1 same-region request traffic. Though the GCP incident states that the disruption was in the east, the central region was internally impacted through most of the outage window.

[Screenshot: requests from us-central1 to europe-west2]

Traffic to europe-west2 from us-central1 shows the same pattern.

[Screenshot: requests from us-central1 to us-east1]

As expected, impact from us-central1 to us-east1 is more severe in terms of peak latencies. The time frame matches the other observations.

europe-west2 Metrics

[Screenshot: requests from europe-west2 to europe-west2]

This graph of same-region requests from europe-west2 to europe-west2 shows that latency was disrupted in a seemingly unrelated region, for a duration matching the p50 recovery in the other regions. From this, we can see that “high levels of network congestion in the eastern USA” also had a much broader impact than just us-east.

Conclusions

Real-time observability of the performance of cloud service APIs is necessary for a timely understanding of the range and size of the impact an outage has on your organization. Status page updates will often be delayed by tens of minutes and will not include enough detail to be actionable. Reliable, high-resolution graphs of performance enable you to understand impact beyond what is documented on the status page — and to have the hard conversations you need with your cloud providers (as well as the data to support your case).

Migrating to Microservices? Here’s How to Have Reliable APIs from Day One

Starting the migration from monolith to microservices can be daunting. Still more daunting is to have spent a couple years on it and still not understand “what done looks like.” If you have an ORM-based monolith, there’s a strong temptation to do a data-first migration: to move a model or set of models into a CRUD service and then call it using HTTP instead of the database.

At first it seems like this is the easiest way to get to services and to “break the monolith.” The truth is that most often this path ends with a distributed monolith with tightly coupled models and APIs that, while tolerable, do not bring joy.

You can instead migrate using an API-first approach to create interfaces that you want to work with for years.

We Mean Business (Processes)

So, if not a data model-first migration, how do you start? Begin by asking which business processes are well understood, both in terms of their logic and of the data they transform. For those processes, what methods exist to expose that data for transformation, and what other processes transform the same data? By implementing and testing the business process, and potentially also the data store, as an API, you can see how working with it feels — or whether it’s not quite the right way to look at the process of change in the system.

Keeping clear on which type of service and API you’re creating helps you make better tradeoffs in design and implementation. For example, a business process service concentrates the business logic for a set of transformations to data. A data store service concentrates data that is closely interrelated or generally changes together.
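
As a rough sketch, with hypothetical names, the two kinds of interfaces tend to look like this:

# Sketch: the two kinds of services, with hypothetical names and methods.

class OrderStoreClient:
    """Data store API: narrow, storage-shaped operations on closely related data."""
    def get_order(self, order_id): ...
    def save_order(self, order): ...

class CheckoutService:
    """Business process API: one well-understood transformation, expressed in
    business terms and built on top of the data store API."""
    def __init__(self, order_store: OrderStoreClient):
        self.order_store = order_store

    def place_order(self, cart, payment_token):
        # Validate the cart, charge the payment, persist via the data store API.
        ...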

Fail Faster

You may find yourself creating data store APIs in your monolith because, for you, it’s the easiest way to expose data for transformation by external business process APIs. That’s ok. You may find yourself using data store APIs from the monolith. That’s ok too — as long as the data store API is a designed interface, not just a copy of a model. (Implementing an “anti-corruption layer” will help keep you from depending on assumptions built into the current set of models.)
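
A sketch of such a layer, with hypothetical names, might look like this: the translation function is the only place that knows about the monolith’s ORM model.

# Sketch: an anti-corruption layer between the monolith's ORM model and the
# designed data store API. All names are hypothetical.

class CustomerRecord:
    """The representation the new API promises, independent of the ORM."""
    def __init__(self, customer_id, display_name, email):
        self.customer_id = customer_id
        self.display_name = display_name
        self.email = email

def to_customer_record(orm_customer):
    # Translate the monolith's model (and its legacy assumptions) into the
    # API's representation; callers never see the ORM object directly.
    return CustomerRecord(
        customer_id=str(orm_customer.id),
        display_name=f"{orm_customer.first_name} {orm_customer.last_name}".strip(),
        email=orm_customer.primary_email,
    )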

[Diagram: migrations compared]

Taking an API-first approach forces the choice of what kind of service something will be. Instead of copying models across and then putting CRUD on top of them, you need to decide how to interact with the data first: what scope it has, which transformations you’ll want to expose, and which you’ll want to hide. Then, more importantly, implement the API as quickly as possible and see what it’s like to use from the current clients or the monolith.

I’m a fan of gRPC/Protobuf-driven design for quick prototyping with tools like Truss. Even if you don’t want to use Protobuf as an interface or gRPC as an RPC method, it can help you do API-first testing quickly. It’s always acceptable to “burn the prototype” and start over once you’ve validated the API design.

Figuring out in days — instead of years — that the API you originally designed is unwieldy when doing common activities is career changing. It’s life changing.

Separate and Validate

If the cost to set up a new service in your environment is high, you may end up combining business process and data store APIs on a single service, but I strongly suggest that you keep them distinct from a modeling and implementation standpoint. The business process part of the service should use the data store APIs to do data transformations, instead of going directly against the underlying data store. You want to be continually validating your APIs’ practical usefulness and refactoring if they fall short.

Perfect Is the Enemy of Good

Taking an API-first approach to microservices migrations allows you to learn faster, to develop APIs you want to work with, and gives a clear place to address data model / interface tech debt. Rapid prototyping and validation of assumptions about use and performance will save you months or even years of time.

Don’t try to design the perfect migration plan or even the perfect interface. Prototype, use, change, evolve. Building services this way gives you a taste of the future agility you’ll experience and starts you down that path.

How to Have Tough Conversations with Cloud Providers, Vendors, and Everyone Else You Are Paying

Error rates skyrocket. A critical service slows. SLAs breach by the dozen.

When things break — and one of your vendors is clearly at fault — even the most seasoned engineers can lose their cool.

“What the heck is going on! Why didn’t we know about this? Fix it! Fix it now!”

In situations like this, it’s easy to get upset. But, of course, this won’t help anyone.

 

You turn the screws,
You tear down the bridge,
Flimsy as it is,
It’s business like
— Cake, You Turn the Screws

 

Effective vendor relations are ultimately about the timely resolution of issues. Yet, this is only made possible through strong relationships, data-driven communications, and empathy on both sides.

When you build your business on other companies’ services (cloud storage, APIs, etc.), you get to focus on your unique value, but you also become dependent on other people — not on your team — quickly fixing issues that are breaking your business. In order to do that, they need to trust you, and you need to provide them with accurate information.

Vendors, Vendors Everywhere

The rate of business growth today, particularly in startups, is driven largely by effective use of vendors. A great idea even with great developers can only go so far. There are many other things necessary to make customers aware of a product, to deliver that product, and to understand how that product is used. You might use a cloud platform provider for compute or serverless, managed databases, or APIs for payments, telecommunications, chat, or shipping. Even though these vendors are not part of the core value proposition for a company, they will be critical to the customer experience and the company brand.

Over the past five years I’ve acted as the primary technical contact for many vendors, including cloud, observability, alerting, and auditing companies. I’d like to share some key lessons that can make everyone happier in the relationship (or at least understand what they’re not happy about).

1. Start With Data

Whether your point of contact with a vendor is email support, a web form, chat, or paging a TAM, having data in a consistent format that’s been shown to be effective is the fastest way to get something fixed. Time series charts with annotations often provide significantly faster escalation and resolution.

Many people seem to believe that if the service is down for them, it’s down for everyone and the vendor must clearly know. This is rarely the case. You may be the only company that has business impact or that has the observability tooling to see the issue (or both).

Thus, every initiation of vendor contact will be more successful if it includes:

  1. A screenshot / link / graphical example of the behavior to be discussed
  2. A textual explanation of how that information is being interpreted
  3. What that means to you as a customer, either realized impact or projected short-term risk for impact to your business
  4. A specific question or requested action

[Image: An Example Vendor Communication]

2. Be a Partner

When something is seriously wrong, there is a lot of stress, and it’s often the case that you’re feeling more pain than your vendor. Still, it is generally much more effective to delay a desire to “turn the screws” until well after the incident. Treating the contacts at the vendor as partners (rather than outsiders) who are working together with you to solve a shared issue has worked magic for me. Time and time again. It allows you to easily ask questions like the following and get real answers:

  • What other information can I share with you to help make the problem clearer?
  • Are you seeing the same thing?
  • How are you currently prioritizing this?
  • How can we help?
  • Is there anything that isn’t as impacted?

Vendors have to map what you see onto their systems, which often look quite different from your view. Every opportunity you have to build their belief in, and empathy with, your view of their systems is a way to accelerate resolution.

3. Admit Error

Sometimes it really isn’t the vendor’s systems. The faster you can let them know that you’ve found the issue and that it wasn’t them, the fewer resources they waste and the more they view you as trustworthy — even if you were wrong this time. If you don’t update a vendor in a timely fashion, they can waste many hours trying to find something that doesn’t exist.

Fast forward to the next time you ask for help and request they prioritize resources for you: Your vendor will hesitate. They will likely place someone else’s needs above your own. Don’t let that happen.

A Need for Observability

When you look to interact with any vendor, especially in a real-time scenario, you must think through how you’ll know whether your usage of that vendor’s offering is actually working. You also need to understand how to see that interaction in the context of your key business transactions. Without these two pieces of information, you can’t take a proper data-driven approach, nor can you be a good partner. You will also find yourself unwilling, or even unable, to admit error, since you won’t be able to find the causes of issues.

There is a common theme to all three lessons: without observability that provides consistent, trustworthy, actionable information, you cannot have effective vendor relations.

How to Get Started with Chaos: A Step-by-Step Guide to Gamedays

When you first start deploying applications in the cloud, it can feel amazing. You just tell the system to do something and suddenly your code is available to everyone. A bit later though, you’ll likely experience failure. It could be failure of the instance running the code, networking to clients, networking to databases, or something else.

After a while, that failure can seem like Chaos: uncontrolled, unpredictable, unwelcome.

Enter Chaos

It’s often from this place that you may hear about Chaos Engineering and wonder “why would I ever want to do that?!” Chaos Engineering seeks to actively understand the behavior of systems experiencing failure so that developers can decide on, design, implement, and test resilience strategies. It grows out of knowing that failure will happen, but that you can choose to see it with a clear head at 2 p.m. instead of confused, half awake, and stressed out at 2 a.m.

“Everything fails all the time”
— Werner Vogels, VP & CTO at Amazon

 

Chaos Gamedays

Chaos Gamedays are an ideal way to ease into Chaos Engineering. In a Chaos Gameday, a “Master of Disaster” (MoD) decides, often in secret, what sort of failure the system should undergo. He or she will generally start with something simple like loss of capacity or loss of connectivity. You may find, like we did, that until you can easily and clearly see the simple cases, doing harder or more complex failures is not a good way to build confidence or spend time.

So, with that said, let’s take a look at how to run a gameday.

Chaos Gameday: Planned Failure

With the team gathered in one room (physical or virtual), the MoD declares “start of incident” and then causes the planned failure. One member of the team acts as first on-call and attempts to see, triage, and mitigate whatever failure the MoD has caused. That person is strongly encouraged to “page” other members of the team and bring them in to help understand what’s happening. Ideally the team will find and solve the issue in less than 75% of the allocated time. When that has been done or the time allocated for response has ended, the MoD will reverse the failure and the team will proceed to do a post mortem of the incident.
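
For a simple loss-of-capacity failure, the MoD might use something as small as the sketch below. It assumes the official Kubernetes Python client, cluster access, and an illustrative namespace and label selector; adapt it to a service you actually run.

# Sketch: inject a "loss of capacity" failure by deleting a fraction of a
# service's pods. Namespace and label selector are assumptions.
import random
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod("default", label_selector="app=checkout").items
if not pods:
    raise SystemExit("no pods matched the selector")

victims = random.sample(pods, max(1, len(pods) // 3))  # take out roughly a third
for pod in victims:
    print(f"deleting {pod.metadata.name}")
    v1.delete_namespaced_pod(pod.metadata.name, "default")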

Chaos Gameday: Escalation

It is entirely possible that, when starting out, the team will be unable to find or solve the problem. The Master of Disaster can escalate the failure to make it more visible, because often full outages are the only observable failures. Don’t be too worried if this happens: Observability that hasn’t been tested for failure scenarios often does not show them. Knowing this is the first step in fixing your instrumentation and visualization, and ultimately giving your customers a better experience.

Chaos Gameday: Post Mortem

The post mortem should follow the usual incident process (if there is one) and/or follow best practices like PagerDuty’s. Running effective post mortems is a broad topic, but I’d encourage you to include sharing perspectives, assumptions that were made during the response, and expectations that didn’t reflect the behavior of the system or the observability tooling. Coming out of the post mortem, you should have a set of actions that first fix any gaps in observability for the failure scenario. You will also likely have some ideas about how to improve resilience to that failure.

The key to the Chaos Gameday process is to, at the very least, repeat the failure and validate the specific changes to observability and resilience that were made to the application.

How Chaos Gamedays Can Transform Your Team

If you follow this process regularly, you will see a transformation in your team. Being first on-call for Chaos Gamedays, even though it’s not “real”, builds composure under pressure when doing on-call for production outages. Not only do your developers gain confidence in their understanding of the systems and how they fail, but they also get used to feeling and being ok with pressure.

Some concrete benefits:

  • A more diverse on-call rotation, inclusive of those who do not feel comfortable with a “thrown in the deep end” learning process.
  • Developers encounter failure with up-to-date mental models of how the systems behave, instead of relying on whatever they learned the last time they happened to be on call during a failure.
  • Leaders have confidence that new team members are ready to handle on-call and have clear ways to improve effectiveness.

The transformation in systems is just as dramatic. Because developers regularly experience failure as part of their job, they start designing for failure. They consider how to make every change and every system observable. They carefully choose resilience strategies because the vocabulary of resilience is now something they simply know and speak.

It’s not just that systems become resilient to the specific failures exercised in a Chaos Gameday; they become resilient, by design, to all the scenarios the developers know exist and are likely.

Starting the Journey of Chaos Engineering is as simple as a “sudo halt”. Following the path will grow your team and your systems in ways that are hard to imagine at first, but truly amazing to see become real. If you would like confident on-call, happy developers, and resilient systems, I encourage you to start that journey. We’re happy to help. Feel free to reach out at @1mentat.

What Happens When Your Cloud Integration Starts without Observability

Cloud services have changed the way applications are developed. They allow teams to focus on their value proposition, product, and customers. As part of the evaluation of a cloud service, you might talk to friends, look at recent feature additions, or speak with sales about their roadmap. You may be choosing a cloud service to offload the operational burden to someone else, but just because you’ve offloaded it doesn’t mean it can’t fail.

Even if, or perhaps especially if, you’re a small company, understanding what will happen to your customers and to your business if the cloud service fails becomes key. It pays to look at status pages, recent outages, or public post-mortems. However, the true test is when you integrate the new feature using a cloud service, and see how your workload and the performance (or failure) of the cloud service interact.

Delaying this assessment until late in a project creates substantial risk.

Mind the (Instrumentation) Gap

Working on a recent project, my team had created what we believed to be a scalable and resilient architecture. We were just starting to use Chaos Engineering to test out our resilience and observability. During the second round of testing I was responsible for determining the experiments to try. I thought it might be interesting to test how we would observe a third-party cloud component (let’s call it PipelineAPI) failure.

The result of the experiment was that we didn’t — and couldn’t — observe PipelineAPI failure.

After a couple of sprints dedicated to closing the instrumentation gap for all cloud services in the project, we started to see, during normal operation, significant performance variance in one of the cloud services in the critical path, which we’ll call LogsAPI. After dashboard screenshots, discussions with support, and eventually a meeting with the product manager for LogsAPI, it became clear that it was not designed to support our use case. We pivoted to another, similar cloud service, which we’ll call BigDataAPI, with all the instrumentation in place from the start. We observed consistent latency and consistent availability of the data in the data store, with no change over time or with increasing amounts of data. With this data we gained confidence that BigDataAPI would be able to support our use case and growth.

Instrumentation Is More Than the Code You Write

To be honest, we got lucky. We hadn’t thought through how we would observe the performance of third-party cloud services as we scaled out the system. We didn’t instrument some of the core functionality of our system because it was not the code we were writing or testing. If we’d launched without that testing and observability, our system would have failed at even 5% of the target traffic level. Instead, we scaled smoothly through more than 100x growth in the first two weeks and had a deep understanding of the performance and resilience of our system the entire time.

It is better not to have to be lucky. Measure the performance and availability of any cloud service you are designing into a solution — as soon as possible and well before full production deployment. By watching the trend of outlier performance, it becomes easier to see whether the cloud service is keeping up with your testing or canary traffic. Conversations with cloud service vendors about expectations and performance are easier when you have consistent, high-resolution data to support your observations and questions.
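
As a hedged sketch of that kind of measurement, the probe below times a placeholder call and prints latency percentiles. The call itself, the probe count, and the interval are assumptions to replace with your own service and schedule.

# Sketch: a tiny synthetic probe for a cloud service you depend on.
import time
import statistics

def call_cloud_service():
    ...  # placeholder: e.g., write then read a small object via the real client

latencies_ms, errors = [], 0
for _ in range(50):                       # e.g., 50 probes, one per second
    start = time.monotonic()
    try:
        call_cloud_service()
    except Exception:
        errors += 1
    latencies_ms.append((time.monotonic() - start) * 1000)
    time.sleep(1)

q = statistics.quantiles(latencies_ms, n=100)  # q[49]=p50, q[94]=p95, q[98]=p99
print(f"p50={q[49]:.1f}ms p95={q[94]:.1f}ms p99={q[98]:.1f}ms errors={errors}")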

No matter what you are building, the customer has expectations of your system. Closely observing your cloud service integrations is the best and easiest way to make sure you meet those expectations.

How to Launch a Distributed Tracing MVP with Just 50 Lines of Code

There comes a time in every successful technology company’s life when there’s a realization that it’s not quite clear what’s happening in production, and that the lack of clarity is impacting customers. That point may come with a monolith, a distributed monolith, SOA, microservices, or, often, a mix of them all. Perhaps someone says “let’s do distributed tracing, it should solve all of our observability problems.” However, when you look at the investment involved, you may end up thinking, “that’s hundreds of thousands of dollars in people’s time, not even counting the cost of the service; how can I possibly justify that?”

This post is for people facing the same question I faced three years ago.

Wearing the Customer’s Shoes

Working at Twilio on the Insight Engineering team, I had the opportunity to spend a few months looking at what it would take to “do distributed tracing.” Twilio had hundreds of services in several languages. There was significant “migration fatigue” after OS version, instance generation, and Classic to VPC moves. The appetite for another cross-team effort was low.

At first it seemed like an impossible problem: distributed tracing would require effort across teams working in different languages and different frameworks, all on different schedules. Looked at that way, it was an impossible problem. But somehow I needed to find the MVP for getting started with distributed tracing.

Twilio has a saying, “wear the customer’s shoes.” Reflecting on this, I decided that the best way to do that was to start instrumenting as close to the customer as possible, at the API edge service. By starting there I would see what each customer was experiencing for the entire time our platform was handling the request for each endpoint and method. I could tag each trace after authentication so that we could see a particular customer’s experience. Even better, when we decided to instrument further into the services that handled any given request, we’d always have that “customer’s shoes” context to start with.

Getting to a Root Cause

In the spirit of minimum viable product, I put together a PR of less than 50 lines. For every request received, it would create a span that represented the amount of time it took us to respond to that request. It used a standard prefix indicating it was a public API request, used the standard reference name for the API resource, and tagged the method and customer. I also wrapped every place where the API service acted as a client to other services, tagging those spans with the downstream service and method. After some experiments, including some Chaos testing in staging, I was cleared to deploy a canary to production.
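
The original PR isn’t reproduced here, but a minimal sketch of the same shape, using Flask and the OpenTracing API as stand-ins for whatever framework and tracer you actually use, might look like this (names are hypothetical):

# Sketch of an edge-first tracing MVP: one span per inbound API request,
# plus a wrapper for calls where the edge acts as a client.
import opentracing
from opentracing.ext import tags
from flask import Flask, request, g
import requests

app = Flask(__name__)
tracer = opentracing.global_tracer()  # configure your vendor's tracer here

@app.before_request
def start_request_span():
    g.span = tracer.start_span(f"api.{request.endpoint}")
    g.span.set_tag(tags.HTTP_METHOD, request.method)
    # After authentication you would also tag the customer on g.span.

@app.after_request
def finish_request_span(response):
    g.span.set_tag(tags.HTTP_STATUS_CODE, response.status_code)
    g.span.finish()
    return response

def call_downstream(service, method, url, **kwargs):
    # Wrap every place the edge acts as a client to another service.
    with tracer.start_span(f"{service}.{method}", child_of=g.span) as span:
        span.set_tag(tags.SPAN_KIND, tags.SPAN_KIND_RPC_CLIENT)
        return requests.request(method, url, **kwargs)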

Though we had metrics and simple histograms before, what we could see with this view — especially over time — was a game changer. The canary happened to be deployed during a performance issue with a downstream service. I was able to bring the cause of the issue to both the API and service teams quickly, and they were able to roll back within minutes. With this demonstration of the capability of tracing, there was suddenly interest in removing the API team from the critical path for identifying the root cause of performance regressions or outages.

A Playbook for Launching Your Distributed Tracing MVP

The 50-line PoC turned into a purposeful refactor of the request handling and client code to provide a simple, single point of integration for tracing. Overall, the resulting changes were less than 200 lines of code and a bit more than one week of one API team engineer’s time — substantially less than the 20 or so person-years it had originally appeared to require.

If you’re wondering how to get started with tracing, consider using this pattern as a playbook:

  1. Identify a part of your service that’s as close to your customer as you can get.
  2. Look for patterns in how that service receives requests that enable you to instrument once or at most a handful of times.
  3. Find trends in how that service makes requests to services, SaaS, and databases.
  4. Follow production deployment steps (staging, canaries, or whatever other risk management strategies your company uses) and start getting real data.
  5. Compare trace data with other metrics and understand the potential cause of differences.
  6. Observe the visibility of failure, either “naturally” or induced by Chaos testing.
  7. After you’ve found a key use case, continue to make measured investments driven by observed value.

While this approach is helpful, at some point you will face the challenge of perspective. If you only have the edge’s client perspective, and the server’s perspective differs, you’ll need to figure out where the truth lies between them.

In future posts, I’ll cover mobile- and browser-based perspectives, integrating a service mesh into your tracing, and methods for adding internal services using frameworks or middleware.

If you have any questions about getting started with a Distributed Tracing MVP, you can reach me on Twitter @1mentat.

Why I Joined LightStep

I still remember my first solo on-call experience at Twilio. I’d shadowed on call for two weeks out of the previous eight. I was supposed to be prepared. It was around midnight when the page came in: there was an issue with networking in Australia, and we were down. It was the beginning of the evening rush for critical customers. I represented cloud operations; I was supposed to know our cloud provider inside and out, to be the subject matter expert. Looking at the behavior of the network, I was lost. Gradually other teams who had moved their traffic out of Australia dropped off the call, but I was still there, trying to get resolution from cloud provider support for hours. To this day, I’m not sure what happened, but what I remember most of all was the feeling of helplessness, of uncertainty, of powerlessness.

As time has gone by and I’ve come to understand development and operations better, my passion has grown for making sure that when people, whether developers, operators, or devops, are put in that position of pressure, of needing to find an answer, they have the tools and the training to understand how to take action. They should feel empowered, confident, and able. Through my time at Twilio and then Stitch Fix, I saw the impact that timely insights could provide. I saw that by training with Chaos GameDays instead of incidents, engineers at all levels of experience could become effective at operating their systems under stress. The growth went beyond that: they excelled at designing, developing, and validating systems to be operable.

I’ve seen not only the impact that great observability can have, but also the unique possibilities that LightStep enables with its Satellite Architecture. Teams can quickly understand from a handful of traces what each one needs to do to recover a positive customer experience as quickly as possible during an incident. Developers can see the behavior of their code in production and how it differs from their expectations. Support engineers can directly access and understand their customer’s experience of the service or platform. Everyone can build empathy for their customers and clients, whether internal or external.

I am excited to join LightStep to advocate for and with developers and operators, and to share that there is a better way to experience their systems and how to get there. While I work with LightStep, much of the information and guidance will be helpful whatever part of the observability journey you’re on. In the spirit of OpenTracing, the starting point is to develop a shared understanding of the problem we’re trying to solve, what makes it difficult, and how we can get better together.

I look forward to growing and learning with all of you.