Lightstep brings observability to Cognite’s industrial data platform
Headquarters: Lysaker, Norway
Industry Segment: Industrial SaaS
Architecture too complex for manual resolutions
Fast growth requiring a scalable observability solution
Isolating meaningful outliers in a sea of data
Reduced logging costs
Faster MTTD and MTTR
Increased developer productivity
Cognite offers industrial software that unites disparate, siloed data under a single operational context. Their Industrial IoT data platform, Cognite Data Fusion, creates new opportunities for machine learning and value creation in heavy-asset industries.
To provide this data in real time to the right people and across systems, Cognite leverages a suite of microservice solutions.
Lightstep enables Cognite to resolve issues faster, proactively optimize system performance, and gain complete visibility into their many service dependencies.
Filtering out the Noise
Cognite provides data management for heavy-asset industries, where downtime has real consequences. To mitigate these personnel and business risks, Cognite built out a microservice architecture that keeps their platform highly available and robust. But this introduced complexity and decreased visibility into the health of the system.
“There are many potential causes of latency,” said Joel Wilsson, senior software engineer at Cognite. “Every request has multiple touch points internally, so it's not simple anymore. We need to understand what's going on across systems.”
“If I didn't have Lightstep, I would only have logs that have been aggregated. I could look at the logs, but there would be too many, and it would take too long to resolve incidents,” said Wilsson.
“Lightstep automatically shows us what we need to see, and it’s by far the best way to understand dependencies,” said Wilsson.
Kubernetes Observability at Scale
Cognite’s customers depend on real-time data management, so the engineers at Cognite need an observability solution that can keep up. With the ability to group, filter, and search with high cardinality, Lightstep allows Cognite developers to surface meaningful insights in seconds, highlighting actionable data before, during, or after an incident.
“Lightstep collects all our K8s events,” said Wilsson. “The scale is really incredible.”
“There are no limits on the amount of data we can analyze or the number of traces we can create. Neither a technological limitation, nor a pricing limitation, since we are able to analyze unlimited amounts of data with no additional cost.”
With an architecture in constant flux as new customers and new tools become part of their system, Cognite uses Lightstep’s Service Diagrams to understand up-to-date, real-time relationships between their many services.
“We are always changing things so quickly that there wasn’t an up-to-date diagram showing dependencies between services,” said Wilsson. “Lightstep shows us how a request is being handled and what services are involved. The service diagrams are powerful tools.”
MTTD and MTTR
As an industrial SaaS company, Cognite has a lot of steady, unremarkable traffic. This makes full latency distributions essential, because statistical sampling would likely obscure the odd, rare high-latency request. Lightstep analyzes 100% of event data, giving Cognite’s engineers access to their complete observability data set.
“If we just did conventional sampling, we would never see everything we need,” said Wilsson. “Lightstep catches outliers that we’d otherwise be unable to detect.”
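To make the sampling point concrete, here is a minimal Python sketch of the arithmetic. It assumes simple uniform random sampling of traces, which is only one possible sampling scheme, and the function name and numbers are hypothetical illustrations rather than anything from Cognite's or Lightstep's systems.

```python
def prob_all_outliers_missed(num_outliers: int, sample_rate: float) -> float:
    """Probability that uniform random sampling keeps none of the outliers.

    Each trace is assumed to be kept independently with probability
    `sample_rate`, so every outlier is dropped with probability
    (1 - sample_rate), and all of them are dropped with that value
    raised to the power `num_outliers`.
    """
    if num_outliers < 0:
        raise ValueError("num_outliers must be non-negative")
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    return (1.0 - sample_rate) ** num_outliers


if __name__ == "__main__":
    # Hypothetical window: 5 slow requests, 1% of traces kept.
    print(round(prob_all_outliers_missed(5, 0.01), 3))
```

With these hypothetical numbers the result is roughly 0.95: a 1% uniform sample would most likely contain none of the five slow requests. At a sample rate of 1.0, which corresponds to analyzing every event, the probability of missing them drops to zero.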
“The Correlation feature is super important to us,” said Wilsson. “I don't want to chase down 10 services and see what's going on in all of them. Correlations show us exactly what we need to look at.”
“No one has the time to look at line after line of aggregated logs.”
Improved System Performance and Reduced Logging Costs
Because Cognite's platform handles large amounts of data, the storage associated with logging was costly and problematic.
“We've been reducing the amount of logging we're doing because it has been expensive for us. Having to remove instrumentation just doesn't feel good. With Lightstep, I don't need to worry in the same way,” said Wilsson.
“With Stackdriver and logging today, you have to find the correct Google Cloud Project in order to find the logs. Lightstep automatically gives us a link to the relevant logs for an investigation. It's the quickest way to find the logs for any request.”
Using Lightstep’s API, Cognite built a service dashboard to help their teams stay true to performance targets.
“My team has used Lightstep for monitoring of response times, error codes, and sudden drops in operations count. With Lightstep's streams, any engineer who notices a change in performance can find relevant spans and use either Correlations or direct comparison between spans before and after a deployment to identify where the problem is,” said Joar Sæther, director of reliability engineering.
“The Reliability team can, without access to code, pinpoint what deployment caused degradation, and escalate to the correct team. We get good numbers from Prometheus as well, but Lightstep streams provides access to the root cause as well as the alerting and reporting.”
“The reports built on Lightstep streams allow us to manage error budgets on our clusters properly,” said Sæther. “By avoiding service credits, handling issues before they create a Major Incident, and minimizing on-call work, the cost savings can be on the order of $50k a month even without any ‘surprises.’ The more customers we get, the more impact Lightstep will have on efficiency and cost saving.”
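For readers unfamiliar with error budgets, the following is a minimal Python sketch of the usual arithmetic. It is not Cognite's reporting code: the function name, the SLO target, and the request counts are all hypothetical, and it assumes a simple request-based availability objective.

```python
def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent for a reporting period.

    1.0 means no budget has been used, 0.0 means it is exhausted,
    and a negative value means the objective has been missed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the objective tolerates.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical period: 99.9% objective, 1,000,000 requests, 250 failures.
    print(round(error_budget_remaining(0.999, 1_000_000, 250), 2))
```

In this hypothetical period the objective tolerates about 1,000 failed requests, so 250 failures leave roughly three quarters of the budget unspent.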
Faster Incident Resolution
In November of 2018, Cognite had an incident involving getConnection latency that they were able to resolve in under an hour with Lightstep.
In this incident, only a relatively small number of requests were slow and causing issues. Troubleshooting with logs or metrics would have been challenging and time-consuming, and would have required new deployments.
“Using logging or other tools, this could have taken the entire day,” said Sæther. “Lightstep immediately shows us what is slow, why it’s slow, and helps us understand what we need to do to resolve any issues.”
Not only did Lightstep help the Cognite team resolve this incident faster, but Lightstep’s Snapshots also captured the incident as it happened, allowing Sæther's team to compare that state of the system to any captured state over the last two years.
“Snapshots are an amazing feature,” said Sæther. “We include them in our post mortems, which allows us to have better context and understanding of issues.”
“It is also worth calling out that the unlimited cardinality of Lightstep, in combination with the Correlation feature, allows us to spot user activity of a type that the system was not designed for,” said Sæther. “You find the API key or tenant generating problematic traffic in 30 seconds instead of processing logs for 2 hours.”
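The idea of tying slow requests to a high-cardinality attribute such as a tenant can be sketched in a few lines of Python. This is an illustration only and not how Lightstep's Correlations feature is implemented: the span structure, the `tenant` tag, and the latency threshold are all assumptions made for the example.

```python
from collections import Counter


def slow_share_by_tag(spans, tag, threshold_ms):
    """Return {tag value: fraction of its spans slower than threshold_ms}.

    `spans` is an iterable of dicts with a "duration_ms" number and a
    "tags" dict (a hypothetical shape chosen for this sketch).
    Spans that lack the tag are ignored.
    """
    totals = Counter()
    slow = Counter()
    for span in spans:
        value = span["tags"].get(tag)
        if value is None:
            continue
        totals[value] += 1
        if span["duration_ms"] > threshold_ms:
            slow[value] += 1
    return {value: slow[value] / totals[value] for value in totals}


if __name__ == "__main__":
    # Hypothetical spans from two tenants.
    spans = [
        {"duration_ms": 10, "tags": {"tenant": "a"}},
        {"duration_ms": 20, "tags": {"tenant": "a"}},
        {"duration_ms": 900, "tags": {"tenant": "b"}},
        {"duration_ms": 15, "tags": {"tenant": "b"}},
    ]
    print(slow_share_by_tag(spans, "tenant", threshold_ms=500))
```

Sorting the result by its values would put the tenant with the largest share of slow requests first, which is the kind of answer the quote describes getting in seconds rather than from hours of log processing.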