Organization Details
Headquarters: San Francisco, CA
Industry Segment: Computer Software
Employees: 2,000+
Challenges
Experiencing latency with unknown causes
Incomplete system visibility
Mixed naming conventions leading to team confusion
Business Results
Reduced Mean Time To Resolution (MTTR)
Complete visibility into the GitHub Load Balancer (GLB)
Adopted OpenTelemetry and semantic conventions
GitHub helps 65+ million developers and 3+ million organizations to build software, including tracking issues, integrations, code review, security, and more.
GitHub uses Lightstep daily to investigate latency, test hypotheses, and rapidly resolve performance issues across their monolithic and microservice architecture. In combination with OpenTelemetry, GitHub is able to ensure that millions of developers have a seamless experience with their platform, as well as increase the productivity of their internal developer teams.
The challenge
GitHub recently announced the decision to adopt OpenTelemetry and immediately saw the benefits. As a first step, GitHub created an internal repository of semantic conventions using the same tooling as the OpenTelemetry community. Ariel Valentin, Observability Engineer, mentions, "The OpenTelemetry semantic repository will enable our development teams to release faster because they will be able to pull from a standard set of conventions across the organization."
Immediately, GitHub saw an increase in developer productivity by adopting OpenTelemetry SDKs. The SDKs and best practices have saved countless hours of instrumentation when a back-end change is needed. “OpenTelemetry empowers us to build integrated and opinionated solutions for our engineers,” said Wolfgang Hennerbichler, Senior Engineering Manager. “Designing for observability can be at the forefront of our application engineer’s minds because we can make it so rewarding.”
"Lightstep provides me insights that were previously rooted in my limited understanding of a complex system. What I love about Lightstep is that it tells you what is actually happening in your system as opposed to what you think is happening. I cannot imagine doing my job without it."
Ariel Valentin
Observability Engineer
The solution
With Lightstep being a major contributor and co-creator of OpenTelemetry, GitHub was able to get started quickly with the help of the Lightstep team. Working with the Lightstep Product Engineering team, GitHub was able to add features to the OpenTelemetry Ruby SDK, and the Lightstep team was able to improve the interoperability with Lightstep's OpenTracing SDK. Ariel continues, "For GitHub, we needed to make a few changes to the way the OpenTelemetry Ruby SDK worked, so with the help of the Lightstep Product team, we were able to propose changes and improve the SDK for our organization."
The results
Finding the needle (in a stack of needles)
Recently, the GitHub team had difficulty resolving an issue. There was latency within their system with no visibility into why or where it occurred, and to complicate matters further, it appeared to be sporadic. GitHub engineers investigated the issue through their normal methods, which include searching through logs and exception notification systems. Unable to pinpoint the cause, the team decided to see if Lightstep could identify the problem.
From the investigation, the team found that the latency wasn't a part of their microservice but, rather, part of the GitHub Load Balancer (GLB). The GLB manages every interaction in the system and directs all traffic for both internal and external users, which meant this issue was directly affecting all users. “Developers have limited visibility into trying to figure out what the cause of the latency stemmed from,” said Ariel. “We were trying to connect things in a log stream from one system to the load balancer to another. None of those keys matched up because folks were using different logging formats. In all those cases, there was no normalization. It was really hard for them.”
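The mismatch Ariel describes can be illustrated with a small sketch. It assumes each system logs the same request under a different key name and shows how mapping those names onto one shared convention lets records be joined by request. Every key name, alias, and record below is hypothetical and chosen only for illustration; none of them come from GitHub's systems or from the OpenTelemetry semantic conventions.

```python
# Hypothetical aliases: each canonical key lists the spellings that
# different systems might use for it. These names are assumptions.
ALIASES = {
    "request_id": ("request_id", "req_id", "x-request-id", "rid"),
    "duration_ms": ("duration_ms", "elapsed_ms", "latency"),
    "service": ("service", "app", "svc"),
}

# Reverse lookup from any known alias to its canonical key.
LOOKUP = {alias: canon for canon, aliases in ALIASES.items() for alias in aliases}


def normalize(record: dict) -> dict:
    """Return a copy of record with known aliases renamed to canonical keys.

    Alias matching is case-insensitive; unknown keys are kept unchanged.
    """
    return {LOOKUP.get(key.lower(), key): value for key, value in record.items()}


def join_by_request(*streams):
    """Group normalized records from several log streams by request_id."""
    grouped = {}
    for stream in streams:
        for record in stream:
            norm = normalize(record)
            rid = norm.get("request_id")
            if rid is not None:
                grouped.setdefault(rid, []).append(norm)
    return grouped


# Two made-up log streams that describe the same request with different keys.
monolith_logs = [{"req_id": "abc", "app": "monolith", "elapsed_ms": 12}]
balancer_logs = [{"X-Request-Id": "abc", "svc": "glb", "latency": 40}]

joined = join_by_request(monolith_logs, balancer_logs)
for rid, records in joined.items():
    print(rid, records)
```

Without the shared convention, the two records above have no key in common and cannot be correlated; after normalization both carry `request_id`, which is the kind of consistency a standard set of conventions is meant to provide.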
Through their work with OpenTelemetry and Lightstep, the team was able to identify the exact, end-to-end request that came in (first leaving the monolith then trying to make a call to the auth system). Within minutes, they were able to pinpoint the issue and quantify the impact. The sporadic behavior of the error compounded the challenges in finding and diagnosing the problem. “In the 99.8 percentile, we were only seeing 20 milliseconds of latency but in the 99.9 percentile it doubled to 40 milliseconds,” said Ariel. “If it was a consistent problem, it would be easier for us to track it down but it’s incredibly hard to find. 20 to 40 milliseconds of latency across millions of requests is a problem.”
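To make the percentile comparison concrete, here is a minimal sketch that computes tail percentiles from a list of latency samples using the nearest-rank method. The sample data is synthetic and constructed to mirror the 20 ms and 40 ms figures in the quote; it is not GitHub's data, and the function name and the choice of nearest-rank are assumptions made for illustration.

```python
def percentile_nearest_rank(samples, per_mille):
    """Return the nearest-rank percentile of samples.

    per_mille expresses the percentile in tenths of a percent
    (998 means the 99.8th percentile), which keeps the rank
    calculation in exact integer arithmetic.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < per_mille <= 1000:
        raise ValueError("per_mille must be in the range 1..1000")
    ordered = sorted(samples)
    # Ceiling division: smallest rank covering per_mille/1000 of the samples.
    rank = -(-per_mille * len(ordered) // 1000)
    return ordered[rank - 1]


# Synthetic data: 10,000 requests, of which 20 (0.2%) are slow.
latencies_ms = [20.0] * 9980 + [40.0] * 20

p998 = percentile_nearest_rank(latencies_ms, 998)
p999 = percentile_nearest_rank(latencies_ms, 999)
print(f"p99.8 = {p998} ms, p99.9 = {p999} ms")
```

In this synthetic set only 20 of 10,000 requests are slow, so the slowdown appears only past the 99.8th percentile, which is consistent with the difficulty Ariel describes in tracking down a sporadic problem.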
"Lightstep provides me insights that were previously rooted in my limited understanding of a complex system. What I love about Lightstep is that it tells you what is actually happening in your system as opposed to what you think is happening. I cannot imagine doing my job without it."