Plaid Investigates CI/CD Issues 20x Faster with LightStep

Plaid is a data network powering the fintech tools that millions of people rely on to improve their financial lives.

Since 2012, Plaid has been focused on democratizing financial services through technology. They build beautiful consumer experiences, developer-friendly infrastructure, and intelligent tools that give everyone the ability to create amazing products that solve big problems.

As an engineering organization building a polyglot architecture (primarily Go, Node.js, and Python), Plaid was looking for a distributed tracing solution with high-quality client libraries that could drive their observability, monitoring, and alerting initiatives.

Plaid found what they were looking for with LightStep, which enables the team to spend less time debugging code — and more time releasing features to their customers.

The Challenge

Homegrown Solution vs. LightStep

Before moving to LightStep, Plaid rolled out a homegrown observability prototype backed by their existing logging stack: Elasticsearch, Kibana, and Logstash.

While this tech stack works well for logging, it became apparent almost immediately that tracing required a different solution.

“Our Elasticsearch cluster is large, and when we tried to dump trace data, we realized we ingested far too high of a data volume. It only took 10 minutes before our data infrastructure team noticed,” said Omar Mezenner, Software Engineer at Plaid.

“We realized in order to ingest data — or even to deploy Jaeger or Zipkin with their own separate data store — it would require at least one full engineer for maintenance. That’s when we started looking at hosted solutions and found LightStep,” said Mezenner.

The Solution

Complete Visibility into Distributed Architecture

With LightStep’s Service Diagrams feature, the Plaid team can view and interact with their entire system architecture in real time — as it operates, with no ingestion lag.

“Nothing beats looking at the system in real time in LightStep,” said Mezenner. “Engineers can orient themselves and see what and how services interact with each other.”

Service Diagrams show how services relate to one another, reveal dependencies, and dynamically highlight services and components that contribute to latency or are experiencing errors.

“LightStep makes it painless for new engineers to quickly get up to speed and understand our architecture,” Mezenner said.

Pinpointing the Root Cause

Cross-service visibility is critical to Plaid, but even more so is the ability to dig deep into their microservices architecture and surface exactly what causes an issue.

For example, Plaid’s core system, Scheduler, is responsible for pulling bank data regularly. It uses a MySQL database and, by virtue of that responsibility, is one of Plaid’s highest-load systems.

When Scheduler has performance problems, they can arise from a variety of causes — not just CPU or locking, but also changes in queries, changes in schemas, even changes in utilization from exposed RPCs.

LightStep fundamentally improves and expedites these investigations.

“Rather than having to comb through all the dependencies, LightStep pages me and I can see specifically where the issue lies — down to the specific MySQL query!” said Mezenner.

The Results

Fast Failure Analysis for Integration Tests on Deep Systems

“Imagine you have large backend integration tests touching 20-plus services. Someone has a PR, but it breaks an integration test. How do they debug that?” asked Mezenner.

“We take Snapshots of failures for our integration tests for our CI system.”

LightStep Snapshots include detailed latency histograms that characterize different system behaviors for a service, operation, and/or tag values. Additionally, they provide thousands of relevant traces to help explain the symptoms observed.

Before LightStep, the Plaid team would have to go through logs from each of the services and correct any issues that were identified.

“Now we can pull up LightStep and quickly diagnose where a failure happens,” explained Mezenner.

If a CI test fails, a Snapshot with the specific trace_id is automatically created via the LightStep API.

“As soon as a failure happens, we display the link of the Snapshot to developers, so they can see the failure. If the test fails, you can see every RPC (remote procedure call) made for the test, even if it was the 6th service down. Developers can also see the top-level error reason, and then can look at the related services’ logs.”
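The failure hook described above can be sketched roughly as follows. This is a minimal illustration, not Plaid’s implementation or the LightStep API: the names CITestResult, report_failure, and fake_create_snapshot are hypothetical, and the snapshot URL format is invented. A real hook would replace fake_create_snapshot with an authenticated call to the vendor’s API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CITestResult:
    """Outcome of one integration test (hypothetical shape)."""
    name: str
    passed: bool
    trace_id: str  # ID of the trace recorded while the test ran


# Assumed interface: takes a trace_id and returns a URL for the new Snapshot.
SnapshotCreator = Callable[[str], str]


def report_failure(result: CITestResult,
                   create_snapshot: SnapshotCreator) -> Optional[str]:
    """Return a message linking to a Snapshot for a failed test.

    Returns None when the test passed, so no Snapshot is requested.
    """
    if result.passed:
        return None
    url = create_snapshot(result.trace_id)
    return f"FAILED {result.name}: trace snapshot at {url}"


def fake_create_snapshot(trace_id: str) -> str:
    # Stand-in for the real API call; this URL format is invented.
    return f"https://snapshots.example.invalid/{trace_id}"


if __name__ == "__main__":
    failed = CITestResult(name="test_scheduler_sync", passed=False,
                          trace_id="abc123")
    print(report_failure(failed, fake_create_snapshot))
```

Passing the snapshot creator in as a function keeps the CI logic testable without network access, and the message it returns is the link a developer would see next to the failed test.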

“This has reduced our need to worry about 25 services, all the way down to one,” said Mezenner.

The time savings are huge:

“With LightStep, it takes us three minutes to know which service is affected, where before it required grepping for an hour. This is a 20x reduction in the time to identify the root cause of an integration test failure.”

Working with LightStep

“I love working with the LightStep team,” said Mezenner. “They are attentive and always available on Slack.”

LightStep Customer Success is not only available to help; they are also a highly technical team with decades of experience building and optimizing large-scale systems.

“In general, it’s been great to work with a partner vendor that is so actively engaged in our success. You can get on the phone with LightStep, and they will help you with any issues you have.”

Challenges

  • Incomplete visibility into microservices architecture
  • Manual process for finding and resolving errors
  • High storage and labor costs of building an in-house observability solution

Business Results

  • Reduced the total time to identify the root cause of integration test failures by 20x
  • Removed the need to hire one or more full-time engineers to manage observability efforts
  • Enabled engineers to spend more time building features and less time fixing bugs

Organization Details

  • Headquarters: San Francisco, CA
  • Industry Segment: Financial Technology
  • Employees: 300+
  • Funding: $310M