At the beginning of May, I joined incident.io as the first site reliability engineer (SRE), a very exciting but slightly daunting move.
With only some high-level knowledge of what the company and its systems looked like before I joined, it’s fair to say that I wasn’t certain exactly what I’d be working on or how I’d deliver it.
Once I had settled in over a day or two, my mission became clear: build a roadmap for our infrastructure, and then set out to deliver it.
At this company, reliability is something that’s valued even more than at most others; as providers of the tooling you depend on to pick up when your own systems are broken, we become a critical dependency and need to have our product available whenever you need us.
This helps set some initial context for what might be going into the roadmap: an emphasis on availability and reliability.
But if you find yourself in this kind of position, how do you produce a roadmap starting from zero context?
Here are my tips and advice for breaking down this problem.
If you’ve come from organizations that spent many years considering these kinds of topics, starting with a blank space where you’d usually expect a roadmap is a different challenge, but not an insurmountable one.
Here are a few strategies that might help you build up context, find the problems that really matter and turn these into a plan of action.
Before diving in and making any changes, it’s vital to get a good feel for the current setup within an organization.
Part of an SRE’s job, especially within a smaller team, is to enable and accelerate engineers, so you’ll need to both build empathy around the daily processes that engineers go through and have a sufficient technical understanding to make small changes to the product when required.
Definitely spend some time in “product land.” You should have a fully functioning development environment, a good understanding of the structure of the primary codebase and be able to put this into practice by picking up some smaller changes and delivering them from start to finish.
If your organization has a model like our Product Responders, pairing with them will give you a feel for some of the gnarlier issues.
Going through this process and stepping into the shoes of a product engineer should present some great opportunities to learn not just about the core product, but also about the details of deployments, observability tooling and data stores.
Now this may sound obvious, but learning as much context as possible from those who’ve been living and breathing the systems you’re taking on responsibility for is going to be crucial.
Whenever you’re not working on onboarding or building up technical knowledge, try to fill the gaps with chats over coffee, walks outside or lunches together.
Beyond building relationships, which is important in itself, this is a great opportunity to find out about current pain points, tools your colleagues wish they had and any projects that have been deferred “until we have an SRE.”
It’s especially important not to limit yourself to the grizzled veteran engineers who’ve seen it all. Also talk to the new joiners who may have useful viewpoints, the product and engineering managers who get a good aggregate view from the engineers they work with, and leadership, who will have useful input on longer-term vision and vendor relationships.
Keep an eye out for customers getting in touch about issues that relate to infrastructure or shared concerns, whether by asking the customer support team to keep you in the loop or by monitoring your internal incidents channel.
If the opportunity is there, you should join the discussion and talk with the customer directly, which will allow you to dig into the details further.
These types of interactions may be less frequent than others, but they are very valuable, as they give you an insight into what customers value (such as latency vs. availability) and how they interact with the product, and allow you to start understanding what kinds of trade-offs you can make further down the line.
As an SRE at an early-stage company, there’s a good chance that you’ll need to bring in new platforms, tools and processes. There’s also a reasonable chance that these will look different from where you’ve worked previously.
Perhaps the business needs are different, or it’s just that industry trends have evolved beyond the systems you’ve worked with previously.
As you start to build up ideas for what kinds of changes you’d like to make, you might find that being the sole SRE makes it tricky to know if you’re on the right track. It’s really useful to validate ideas like this with people outside of your organization too.
Is the hot, new container technology you’re looking at not as good as it’s cracked up to be?
Perhaps contacts at similar-sized companies will have some insight. Similarly, if your company has existing relationships with platform vendors, then lean on them by sending your thoughts and proposals over to the account manager. They may be able to tell you whether you’re following best practices and what their recommendations are based on similar-sized orgs.
If you’ve followed some of the suggestions above, then you should now have a good feel for the issues and missing building blocks within the organization and be able to make some informed decisions about the next steps.
You’ll also have a lot of diverse inputs from different stakeholders, so you’ll need to distill all of this into a sensible roadmap.
The key to this will be picking out common themes in the information by applying some grouping, but after that, you’ll need to make some calls about how to tackle the problems.
The insights you’ve gained into the business, customers, engineer workflows and the product should help you out here.
Be careful not to plan too far into the future. Focus on the problems that are causing pain right now and revisit the roadmap in a few months’ time, rather than trying to set out a multiyear strategy from the beginning.
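If it helps to see the grouping step written down, here is a minimal sketch in Python. It assumes you have jotted each conversation down as a short note tagged with a theme. The note entries, the `group_by_theme` and `rank_themes` helpers and the ranking rule (how many different kinds of stakeholder raised a theme) are all hypothetical illustrations, not the process we actually used.

```python
from collections import defaultdict

# Hypothetical notes from conversations. The theme names mirror the roadmap
# themes in this post, but every individual entry here is invented.
notes = [
    {"source": "product engineer", "theme": "Compute", "pain": "deploys are slow"},
    {"source": "engineering manager", "theme": "Observability", "pain": "alerts are noisy"},
    {"source": "new joiner", "theme": "Compute", "pain": "local setup differs from production"},
    {"source": "customer support", "theme": "Database", "pain": "slow queries at peak times"},
    {"source": "product engineer", "theme": "Observability", "pain": "hard to trace a request"},
    {"source": "leadership", "theme": "Compute", "pain": "unclear scaling costs"},
]


def group_by_theme(notes):
    """Collect notes into a dict keyed by theme."""
    grouped = defaultdict(list)
    for note in notes:
        grouped[note["theme"]].append(note)
    return dict(grouped)


def rank_themes(grouped):
    """Order themes by how many distinct sources raised them, then by name."""
    def distinct_sources(theme):
        return len({note["source"] for note in grouped[theme]})

    return sorted(grouped, key=lambda theme: (-distinct_sources(theme), theme))


grouped = group_by_theme(notes)
for theme in rank_themes(grouped):
    pains = "; ".join(note["pain"] for note in grouped[theme])
    print(f"{theme}: {pains}")
```

Counting distinct sources rather than raw mentions is only one possible heuristic. It favors problems felt across the organization, but you’ll still need to weigh the result against what you know about the business and its customers.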
To make this all more concrete, here’s the summarized roadmap that I created.
Theme: Compute — How we deploy and run our core codebase
Theme: Database — How we store the data that powers our product
Theme: Observability — How we monitor the health of our product and systems
Distilling all of this into a document and sharing it with key stakeholders should give you the buy-in to go after the problems you need to tackle.