InVision is the leading product design collaboration platform that powers the world’s best user experiences. With intuitive tools for ideation, design, prototyping, and design management, the InVision platform gives users everything they need for digital product design, all in one place. More than 4 million people at tens of thousands of companies, including 80 percent of the Fortune 100 and brands like Airbnb, Amazon, HBO, Netflix, Slack, Starbucks, and Uber, rely on the InVision platform to make products users love.
InVision is a fully distributed company with employees in more than 20 countries around the world. Like the company, the platform’s architecture is distributed: it consists of more than one hundred microservices owned and operated by dozens of engineering teams.
Market leaders rely on the InVision platform to deliver a great user experience, so application performance and reliability are critical to maintaining customer relationships. With distributed teams and distributed systems, pinpointing the root cause of performance issues was extremely difficult, if not impossible, with InVision’s existing Application Performance Management (APM) solution from New Relic. “We could monitor individual services, but we had no visibility into the communication between the services. During an incident, we’d have a customer reporting an issue with a transaction that incorporated two to three individual microservices and teams. Each team would examine the performance of its microservice and say everything looked fine; the issue must be with another team and service,” said Jeremiah Jenkins, Manager of the Platform Labs and Data Services team, InVision. The InVision engineering team also needed a macro view of interactions across services and clients. “There was a lot of back and forth whenever there was a performance issue, and finding the root cause was often done through the process of elimination, which was extremely time-intensive,” said Jenkins.
InVision needed full visibility across its distributed services to understand the root cause of an issue and engage the right team to resolve it. The team wanted a solution that would analyze 100 percent of the performance data. New Relic APM could not fulfill this need because its architecture forces it to sample at the host. In addition, its pricing model, which is based on hosts and data retention, was cost-prohibitive.
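The concern about sampling can be illustrated with a minimal sketch. It assumes a simplified, deterministic sampler that keeps one request in every ten before the latency is known; real samplers are usually probabilistic, and nothing here describes how any specific vendor's product works. The function names and the latency data are hypothetical.

```python
from typing import List


def head_sample(latencies_ms: List[float], keep_every: int) -> List[float]:
    """Keep one request in every `keep_every`.

    The keep/drop decision depends only on the request's position,
    so it is made before the request's latency is known.
    """
    return [v for i, v in enumerate(latencies_ms) if i % keep_every == 0]


# Hypothetical data: 1,000 requests at 20 ms, plus one slow outlier.
latencies = [20.0] * 1000
latencies[537] = 4000.0

sampled = head_sample(latencies, keep_every=10)

full_max = max(latencies)    # computed over 100 percent of the data
sampled_max = max(sampled)   # computed over the sampled 10 percent

print(f"worst latency, all data: {full_max} ms")
print(f"worst latency, sampled:  {sampled_max} ms")
```

In this sketch the outlier sits at a position the sampler drops, so the sampled view reports a worst case of 20 ms while the full data contains a 4,000 ms request.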
“We knew we needed a solution that could trace the flow of the transaction across service boundaries,” said Jenkins. Digging through logs is a last resort because logs are so noisy. “If we’re looking at logs, something has gone terribly wrong in our process,” said Jenkins. The InVision team briefly considered Zipkin for their tracing needs, but they quickly determined that any solution that sampled the data upfront would not be sufficient.
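Tracing a transaction across service boundaries generally means passing a shared trace identifier from caller to callee so that the spans each service records can be stitched together afterward. The sketch below simulates that idea in a single process. The header names (`x-trace-id`, `x-parent-span-id`), the service names, and the timings are all hypothetical, and this is not the API of LightStep, Zipkin, or any other tracer.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class Span:
    trace_id: str             # shared by every span in one transaction
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # span that called us; None for the root
    service: str
    self_ms: float            # time spent in this service's own work
    total_ms: float           # self time plus all downstream calls


COLLECTED: List[Span] = []

# A downstream call: (service name, own work in ms, that service's own calls).
Call = Tuple[str, float, Sequence]


def handle(service: str, work_ms: float, headers: Dict[str, str],
           downstream: Sequence[Call] = ()) -> float:
    """Simulate one service handling a request and calling others."""
    # Reuse the caller's trace ID if present; otherwise start a new trace.
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    span_id = uuid.uuid4().hex[:16]
    parent_id = headers.get("x-parent-span-id")

    total = work_ms
    for child_service, child_work, grandchildren in downstream:
        # Propagate the trace context across the service boundary.
        child_headers = {"x-trace-id": trace_id, "x-parent-span-id": span_id}
        total += handle(child_service, child_work, child_headers, grandchildren)

    COLLECTED.append(Span(trace_id, span_id, parent_id, service, work_ms, total))
    return total


# Hypothetical transaction touching three services.
total_ms = handle("api-gateway", 5.0, {},
                  [("projects", 10.0, [("assets", 900.0, [])])])

trace_ids = {s.trace_id for s in COLLECTED}
slowest = max(COLLECTED, key=lambda s: s.self_ms)
print(f"{len(COLLECTED)} spans in {len(trace_ids)} trace, total {total_ms} ms")
print(f"most self time: {slowest.service} ({slowest.self_ms} ms)")
```

Because all three spans carry the same trace ID, the slow service can be identified from a single view of the transaction rather than by each team inspecting its own service in isolation.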
InVision chose LightStep because it’s the only solution on the market that analyzes 100 percent of the data, 100 percent of the time with no upfront sampling, so teams never miss a performance outlier. With its unique architecture, LightStep computes advanced statistics based on all of the application performance data going through the system and stores examples of important information forever.
Today, 13 engineering teams at InVision have replaced New Relic APM with LightStep, and they have reduced MTTR and costs as a result. Teams can more quickly identify the root cause of outages and incidents. “We can now identify the root cause definitively versus in the past when it was a process of elimination,” said Jenkins. LightStep has no upfront sampling, so the InVision team never misses a transaction that matters. It also provides the granular level of detail they need to definitively identify the root cause, something they couldn’t achieve with their metrics tool.
The team developed a tool to help customers migrate from one environment to another. Using LightStep, they identified the slowest portions of the transaction and reduced the tool’s runtime from 16 hours to four, a 75 percent improvement. LightStep helps InVision meet its goals of developer efficiency, system performance, and reliability by providing a complete view of the system. This allows engineers to investigate and resolve issues quickly, resulting in significant annual cost savings. In addition to those savings, LightStep doesn’t charge based on local hosts, consumption, or data volume. With LightStep, engineers spend less time on debugging and incident response, so they have more time to focus on delivering features that help drive great customer experiences.
As the InVision team works on new capabilities, LightStep is critical in helping them understand the service-to-service interactions. Prior to LightStep, InVision lacked insight into how those interactions would work in a deployed environment. Now, once new code passes code review, it’s deployed to the test environment, where LightStep is used to confirm that no additional latency has been introduced into the system and that the service interactions operate as expected. LightStep also helps the team understand how legacy or inherited code operates and performs.
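One way to picture a latency check in a test environment is a comparison of a high percentile before and after a change. The sketch below is an illustration only: the source does not describe how InVision performs this check, and the function names, the 99th percentile, the 10 percent tolerance, and the sample data are all assumptions.

```python
import math
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of latencies."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def has_latency_regression(baseline_ms: List[float], candidate_ms: List[float],
                           pct: float = 99.0, tolerance: float = 0.10) -> bool:
    """True if the candidate's percentile exceeds the baseline's by more
    than `tolerance` (a hypothetical threshold chosen for this sketch)."""
    return percentile(candidate_ms, pct) > percentile(baseline_ms, pct) * (1 + tolerance)


# Hypothetical latencies (ms) from the test environment.
baseline = [100.0] * 99 + [200.0]
candidate = [100.0] * 95 + [400.0] * 5  # new code adds a slow path

regressed = has_latency_regression(baseline, candidate)
unchanged = has_latency_regression(baseline, baseline)
print(f"candidate regressed: {regressed}; baseline vs itself: {unchanged}")
```

A percentile is used here rather than an average because a small number of slow requests can leave the mean almost unchanged while noticeably degrading the tail.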
“Building microservices at the scale of InVision leads to complex architectures. No one engineer has a complete map in their head of service-to-service interactions. It’s too complex and is constantly changing. We rely on LightStep to help us visualize how our services are interacting and to pinpoint the root cause of performance issues,” said Brady Kimball, Director of Engineering, InVision.
Words of Advice
“Consistency and standards are incredibly important in a microservices environment. Ensure there is full visibility, not just of the services, but all underlying architecture and pieces in between; often those are the parts that have issues,” said Jenkins.
Goals
- Access comprehensive distributed traces to identify the critical path of transactions
- Observe 100% of the application performance data across the distributed system
- Get a macro view of interactions across services and clients for new and legacy services
- Pinpoint the root cause of performance issues across service boundaries
- Address deficiencies of the legacy APM solution
Results
- Improved performance of key migration tool by 75%
- Reduced MTTR
- Ensured end-to-end performance management
- Lowered costs significantly over New Relic
- Improved developer productivity
About InVision
- Headquarters: New York, NY
- Industry Segment: Design/UX software
- Employees: 500+
- Funding: $235M+