DevOps and Site Reliability Engineering – What’s Different at LightStep
We’re proud that our product is used to diagnose problems in many important software systems. And because it is a tool used to improve the performance and reliability of other applications, we must hold our own product to even higher standards on those same metrics. At the same time, we challenge ourselves to innovate quickly while still meeting (or exceeding) those standards.
As one of the co-founders and the CTO at LightStep, I’d like to share a bit of what it’s like to work on the engineering team, how we collaborate, and our process for bringing ideas to market.
One critical part of running highly available services is determining who is responsible for making sure that those services are available. Two related terms that get tossed around a lot here are DevOps and Site Reliability Engineering (SRE). Unfortunately, neither of these terms is particularly well defined – just Google them and see for yourself!
One of the parts of DevOps that I like best (though certainly not the only part) is that individual teams are responsible for the entire application lifecycle, from design, to coding and testing, to deployment and ongoing maintenance. This gives teams the flexibility to choose the processes and tools that will work best for them. However, that autonomy can lead to fragmentation across the org in how services are managed and duplication of effort across teams.
On the other side, SRE is often used to describe organizations that are laser-focused on product availability, performance, and incident response. While these are all important, these SRE organizations can sometimes build antagonistic relationships with the rest of engineering where SRE is seen as impeding progress for the sake of its own goals.
At LightStep, we believe in a hybrid implementation of these two philosophies, where our engineers are organized into small groups with split responsibilities but shared objectives. SREs at LightStep are responsible in part for building shared infrastructure that is used by the whole organization, but they are also embedded within teams to help spread best practices and understand current developer pain points. This structure has enabled our teams to remain agile, to conduct rapid product experiments, and to have the flexibility to quickly adopt new (or discard old) technologies and tools. Retaining the natural and healthy tension between maintaining product stability and accelerating innovation to market ensures every decision we make is a balance that ultimately focuses on our customers’ success.
When considering prospective DevOps engineers or SREs (titles don’t really matter much to us at LightStep), we look for engineers who are excited about working side-by-side with the rest of our team. To us, SRE isn’t a separate organization so much as a mindset: we look for engineers who are excited to collaborate and apply a broad set of tools – including traditional operational tools like automation and monitoring as well as robust software development practices – to improve the reliability of our product and increase the velocity of individual teams and of our organization as a whole.
We’re always striving to improve how we do things and looking to new team members to help us on this journey. All of our engineers bring complementary skills and experience from both academia and industry. Above all, we value those who respect differing opinions, communicate clearly, and are empathetic towards their peers.
If you’d like to be part of this journey and would enjoy working on these engineering challenges, we’d love to hear from you!
Spoons, CTO and Cofounder
Spoons (or more formally Daniel Spoonhower) has published papers on the performance of parallel programs, garbage collection, and real-time programming. He has a PhD in programming languages but still hasn’t found one he loves.