Cognite offers industrial software that unites disparate, siloed data under a single operational context. Their Industrial IoT data platform, Cognite Data Fusion, creates new opportunities for machine learning and value creation in heavy-asset industries.
To provide this data in real time to the right people and across systems, Cognite leverages a suite of microservice solutions.
Lightstep enables Cognite to resolve issues faster, proactively optimize system performance, and gain complete visibility into their many service dependencies.
Cognite provides data management for heavy-asset industries, where downtime has real consequences. To mitigate these personnel and business risks, Cognite built out a microservice architecture that keeps their platform highly available and robust. But this introduced complexity and decreased visibility into the health of the system.
“There are many potential causes of latency,” said Joel Wilsson, senior software engineer at Cognite. “Every request has multiple touch points internally, so it's not simple anymore. We need to understand what's going on across systems.”
“If I didn't have Lightstep, I would only have logs that have been aggregated. I could look at the logs, but there would be too many, and it would take too long to resolve incidents,” said Wilsson.
“Lightstep automatically shows us what we need to see, and it’s by far the best way to understand dependencies,” said Wilsson.
Cognite’s customers depend on real-time data management, so the engineers at Cognite need an observability solution that can keep up. With the ability to group, filter, and search with high cardinality, Lightstep allows Cognite developers to surface meaningful insights in seconds, highlighting actionable data before, during, or after an incident.
“Lightstep collects all our K8s events,” said Wilsson. “The scale is really incredible.”
“There are no limits on the amount of data we can analyze or the number of traces we can create. Neither a technological limitation, nor a pricing limitation, since we are able to analyze unlimited amounts of data with no additional cost.”
With an architecture in constant flux as new customers and new tools become part of their system, Cognite uses Lightstep’s Service Diagrams to understand up-to-date, real-time relationships between their many services.
“We are always changing things so quickly that there wasn’t an up-to-date diagram showing dependencies between services,” said Wilsson. “Lightstep shows us how a request is being handled and what services are involved. The service diagrams are powerful tools.”
As an industrial SaaS company, Cognite has a lot of steady, unremarkable traffic. This makes it very important to have full latency distributions available, because statistical sampling would likely miss the rare, odd high-latency requests. Lightstep analyzes 100% of event data, giving Cognite’s engineers access to their complete observability data set.
“If we just did conventional sampling, we would never see everything we need,” said Wilsson. “Lightstep catches outliers that we’d otherwise be unable to detect.”
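To make the sampling point concrete, here is a minimal sketch of the arithmetic. It assumes uniform random sampling with each request kept independently; the 1% sample rate and the count of five slow requests are hypothetical figures chosen for illustration, not numbers from Cognite or Lightstep.

```python
def miss_probability(sample_rate: float, rare_events: int) -> float:
    """Probability that uniform random sampling keeps none of the rare events.

    Assumes each request is kept independently with probability sample_rate.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if rare_events < 0:
        raise ValueError("rare_events must be non-negative")
    return (1.0 - sample_rate) ** rare_events


# Hypothetical scenario: 5 slow requests hidden in steady traffic, 1% sampling.
p = miss_probability(sample_rate=0.01, rare_events=5)
print(f"Chance that no slow request is sampled: {p:.3f}")  # prints 0.951
```

Under these assumptions, a 1% sample would contain none of the five slow requests about 95% of the time, which is why a handful of outliers in otherwise unremarkable traffic is easy to miss.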
“The Correlation feature is super important to us,” said Wilsson. “I don't want to chase down 10 services and see what's going on in all of them. Correlations show us exactly what we need to look at.”
“No one has the time to look at line after line of aggregated logs.”
Because Cognite’s platform handles large amounts of data, the storage associated with logging was costly and problematic.
“We've been reducing the amount of logging we're doing because it has been expensive for us. Having to remove instrumentation just doesn't feel good. With Lightstep, I don't need to worry in the same way,” said Wilsson.
“With Stackdriver and logging today, you have to find the correct Google Cloud Project in order to find the logs. Lightstep automatically gives us a link to the relevant logs for an investigation. It's the quickest way to find the logs for any request.”
Using Lightstep’s API, Cognite built a service dashboard to help their teams stay true to performance targets.
“My team has used Lightstep for monitoring of response times, error codes, and sudden drops in operations count. With Lightstep's streams, any engineer that notices a change in performance can find relevant spans, and use either Correlations or direct comparison between spans before and after a deployment to identify where the problem is,” said Joar Sæther, director of reliability engineering.
“The Reliability team can, without access to code, pinpoint what deployment caused degradation, and escalate to the correct team. We get good numbers from Prometheus as well, but Lightstep streams provide access to the root cause as well as the alerting and reporting.”
“The reports built on Lightstep streams allow us to manage error budgets on our clusters properly,” said Sæther. “By avoiding service credits, handling issues before they create a Major Incident, and minimizing on-call work, the cost savings can be on the order of $50k a month even without any ‘surprises.’ The more customers we get, the more impact Lightstep will have on efficiency and cost saving.”
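For readers unfamiliar with error budgets, the following is a minimal sketch of how one is commonly tracked. It is not Cognite's report or Lightstep's API; the function name, the 99.9% availability target, and the request counts are all hypothetical.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    The budget is the number of failures the SLO tolerates:
    (1 - slo_target) * total_requests.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Hypothetical month: 99.9% target, 1,000,000 requests, 250 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"Error budget remaining: {remaining:.0%}")  # prints 75%
```

In this hypothetical month the target tolerates about 1,000 failed requests, so 250 failures leave roughly three quarters of the budget. A result below zero would mean the budget is exhausted.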
In November of 2018, Cognite had an incident involving getConnection latency that they were able to resolve in under an hour with Lightstep.
In this incident, only a relatively small number of requests were slow and causing issues. Troubleshooting with logs or metrics would have been challenging and time-consuming, and would have required new deployments.
“Using logging or other tools, this could have taken the entire day,” said Sæther. “Lightstep immediately shows us what is slow, why it’s slow, and helps us understand what we need to do to resolve any issues.”
Lightstep not only helped the Cognite team resolve this incident faster; its Snapshots also captured the incident as it happened, allowing the team to compare that state of the system to any captured state over the last two years.
“Snapshots are an amazing feature. We include them in our post mortems, which allows us to have better context and understanding of issues.”
“It is also worth calling out that the unlimited cardinality of Lightstep, in combination with the Correlation feature, allows us to spot user activity of a type that the system was not designed for,” said Sæther. “You find the API-key or tenant generating problematic traffic in 30 seconds instead of processing logs for 2 hours.”
Lightstep has offered a near-seamless experience to the team at Cognite from installation to onboarding to providing major ROI.
“It’s mostly hands off once configured, with zero maintenance,” said Paul Salaberria, senior infrastructure engineer.
“Last year we did a disaster recovery exercise, and set up a new installation and ran a backup from another test. We got Lightstep up easily and didn't have to config. Lightstep picked up the new cluster, and we could immediately start monitoring requests in the new installation.”
“It doesn’t take much effort to get Lightstep to work,” said Salaberria.
“Anytime anything goes wrong in our system, we can easily find the request_id in Lightstep and look at the details of what happened,” said Wilsson.
“We have many different instances of our service, and for large clients we have separate installations. We can immediately find the answers we need in Lightstep. There’s no need to toggle between different environments,” said Wilsson.
“I'm really impressed by the polish that Lightstep brings to features, even new ones,” said Wilsson. “Oftentimes we have a feature request, and Lightstep is already working on it.”
“Lightstep thinks about what I need before I do.”
Whether via Slack, email, conference call, or in person in Lysaker, Norway, the Lightstep Customer Success team and key developers are available virtually 24x7.
“Working with Lightstep has been a pleasure,” said Wilsson. “There hasn’t been a question we can't get an answer to. We are listened to.”
“Every time I log in there is so much potential, and every time I work with the Lightstep team there is so much success.”